OpenData, Graphs and do-it-yourself Journalism · 2019. 9. 23.
OpenData, Graphs and do-it-yourself "Journalism"
Sascha Peukert
About Neo4j
• Creators of the Neo4j Graph Platform and Neo4j, the world’s leading open-source graph database
• Company founded 2007 in Sweden
• Today 250+ employees in San Mateo, London, Malmö and remote
• You can join us!
What’s that graph thing again?
Data!
What’s that graph thing again?
Labeled Property Graph Model
( :City {name:"Dresden"} ) <-[ :HAS_SEAT_IN ]- ( :Company {name:"T-Systems"} )
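The pattern above has three parts: two nodes, each with a label and properties, and one typed, directed relationship between them. As a rough illustration only (this is not how Neo4j stores data, and the class names `Node` and `Relationship` are invented for this sketch), the same structure can be modelled in a few lines of Python:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # A node carries any number of labels plus key/value properties.
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    # A relationship always has exactly one type and one direction.
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


# Hypothetical in-memory equivalent of the Cypher pattern on the slide.
dresden = Node({"City"}, {"name": "Dresden"})
t_systems = Node({"Company"}, {"name": "T-Systems"})
has_seat = Relationship(start=t_systems, type="HAS_SEAT_IN", end=dresden)
```

The arrow in the Cypher pattern points from the company to the city, which is why `t_systems` is the start node here.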
Motivation
opendata-illustration by Julie Beck
Agenda
https://pixabay.com/photos/files-paper-office-paperwork-stack-1614223/
https://pixabay.com/photos/image-statue-alive-artist-3895819/
Import all the data
https://neo4j.com/developer/data-import/
● LOAD CSV for Medium Sized Datasets
● Super Fast Batch Importer For Huge Datasets
● Importing JSON Data from a REST API into Neo4j
● Relational to Graph Import Tools
○ LOAD CSV
○ APOC
○ ETL Tool
○ Kettle
○ Other ETL Tools
○ Import Programmatically with Drivers
Import all the data
Cities, postcodes and federal states
CSV file from: https://www.suche-postleitzahl.org/download_files/public/zuordnung_plz_ort.csv
Import all the data: cities, postcodes and states
Indexes!
CREATE INDEX ON :City(name);
CREATE INDEX ON :PostCode(code);
CREATE INDEX ON :State(name);
Import all the data: cities, postcodes and states
Loading the CSV
LOAD CSV WITH HEADERS FROM
'file:///Users/Sascha/Documents/JUG/zuordnung_plz_ort.csv' AS line
Path to file as string
Import all the data: cities, postcodes and states
Loading the CSV & creating the data
LOAD CSV WITH HEADERS FROM
'file:///Users/Sascha/Documents/JUG/zuordnung_plz_ort.csv' AS line
CREATE ( p:PostCode {code:line.plz} )
CREATE ( b:State {name:line.bundesland} )
Import all the data: cities, postcodes and states
Loading the CSV & creating the data
LOAD CSV WITH HEADERS FROM
'file:///Users/Sascha/Documents/JUG/zuordnung_plz_ort.csv' AS line
CREATE ( p:PostCode {code:line.plz} )
// CREATE ( b:State {name:line.bundesland} )   replaced: CREATE would add a new State node for every CSV row
MERGE ( b:State {name:line.bundesland} )
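The switch from CREATE to MERGE for the State node matters because the CSV has one row per postcode, so the same state name appears on many rows. The following Python sketch mimics the two behaviours on a plain list and dictionary; it does not talk to Neo4j, the three sample rows are illustrative, and the function names are made up for this example. Only the column names `plz`, `ort` and `bundesland` come from the slides.

```python
# Illustrative sample rows using the CSV's column names.
rows = [
    {"plz": "01067", "ort": "Dresden", "bundesland": "Sachsen"},
    {"plz": "01069", "ort": "Dresden", "bundesland": "Sachsen"},
    {"plz": "04109", "ort": "Leipzig", "bundesland": "Sachsen"},
]


def import_with_create(rows):
    """CREATE semantics: always add a new State node, one per row."""
    states = []
    for line in rows:
        states.append({"label": "State", "name": line["bundesland"]})
    return states


def import_with_merge(rows):
    """MERGE semantics: reuse the State node if one with that name exists."""
    states = {}
    for line in rows:
        states.setdefault(
            line["bundesland"], {"label": "State", "name": line["bundesland"]}
        )
    return list(states.values())


created = import_with_create(rows)
merged = import_with_merge(rows)
```

With these rows, the CREATE variant ends up with three State nodes named "Sachsen", while the MERGE variant keeps a single one. The lookup that MERGE performs on every row is also why the indexes are created before the import.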
Import all the data: cities, postcodes and states
Loading the CSV & creating the data
LOAD CSV WITH HEADERS FROM
'file:///Users/Sascha/Documents/JUG/zuordnung_plz_ort.csv' AS line
CREATE ( p:PostCode {code:line.plz} )
MERGE ( b:State {name:line.bundesland} )
MERGE ( b )<-[:LOCATED_IN]-( c:City {name:line.ort} )
CREATE ( c )<-[:BELONGS_TO]-( p )
Import all the data: cities, postcodes and states
Loading the CSV & creating the data
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
'file:///Users/Sascha/Documents/JUG/zuordnung_plz_ort.csv' AS line
CREATE ( p:PostCode {code:line.plz} )
MERGE ( b:State {name:line.bundesland} )
MERGE ( b )<-[:LOCATED_IN]-( c:City {name:line.ort} )
CREATE ( c )<-[:BELONGS_TO]-( p )
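USING PERIODIC COMMIT tells LOAD CSV to commit after a fixed number of rows (1000 by default in the Neo4j versions that support this clause) instead of holding the whole file in one transaction. The batching idea itself can be sketched in plain Python; the helper name `batches` and the row count of 2500 are assumptions for illustration, and no database is involved.

```python
from itertools import islice


def batches(rows, size=1000):
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Stand-in for CSV rows; each chunk would be one committed transaction.
committed = []
for chunk in batches(range(2500), size=1000):
    committed.append(len(chunk))
```

Here 2500 rows are split into batches of 1000, 1000 and 500, so memory use is bounded by the batch size rather than by the file size.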
Import all the data
Open register / OpenCorporates
JSONL file from: https://offeneregister.de/
Import all the data: Open register / OpenCorporates
• Simple version from Bert Radke: https://blog.faboo.org/2019/03/handelregister-jsonl/
• Remarks about the data and import:
• 5,305,727 companies & 4,803,514 officers
• Unexpected nulls (some examples)
• “Registered address” is missing on 68.5% of all companies
• “Registered office” is null for one active company...
• 10% of officers don’t have a city set
• Problem: Persons & cities do not have a unique key in the JSON
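Figures such as the share of companies without a registered address can be obtained by streaming the JSONL file line by line and counting records where a field is absent, null or empty. The sketch below runs on a tiny made-up sample; the company names are invented, and the field name `registered_address` is an assumption about the file's schema, not something confirmed by the slides.

```python
import json

# Made-up records; one JSON object per line, as in a JSONL file.
sample_jsonl = """\
{"name": "Alpha GmbH", "registered_address": "Hauptstr. 1, Dresden"}
{"name": "Beta UG", "registered_address": null}
{"name": "Gamma AG"}
{"name": "Delta KG", "registered_address": ""}
"""


def missing_share(lines, field):
    """Return the fraction of records where `field` is absent, null or empty."""
    total = 0
    missing = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        total += 1
        if not record.get(field):
            missing += 1
    return missing / total if total else 0.0


share = missing_share(sample_jsonl.splitlines(), "registered_address")
```

For the real file, passing an open file handle instead of `sample_jsonl.splitlines()` would keep memory use flat, since only one line is parsed at a time.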
Import all the data: Open register / OpenCorporates
Intermediate status
Import all the data
Lobbypedia party donations
Tool: https://lobbypedia.de/wiki/Spezial:Abfrage_ausf%C3%BChren/Parteispenden
JSON files:
2000 - 2010: https://lobbypedia.de/wiki/Spezial:Semantische_Suche/-5B-5BKategorie:Parteispende-5D-5D-20-5B-5BJahr::2000-7C-7C2001-7C-7C2002-7C-7C2003-7C-7C2004-7C-7C2005-7C-7C2006-7C-7C2007-7C-7C2008-7C-7C2009-7C-7C2010-5D-5D/-3FGeldgeber/-3FParteispende-2FKategorie%3DKategorie/-3FBetrag/-3FEmpf%C3%A4nger/-3FJahr-23-2Dn/-3FOrt/-3FBundesland/-3FBranche/-3FSchlagworte/mainlabel%3D/limit%3D10000/order%3Ddescending/sort%3DModification-20date/offset%3D0/format%3Djson/default%3Dkeine-20Ergebnisse-20mit-20der-20aktuellen-20Auswahl
2011 - 2019: https://lobbypedia.de/wiki/Spezial:Semantische_Suche/-5B-5BKategorie:Parteispende-5D-5D-20-5B-5BJahr::2011-7C-7C2012-7C-7C2013-7C-7C2014-7C-7C2015-7C-7C2016-7C-7C2017-7C-7C2018-7C-7C2019-5D-5D/-3FGeldgeber/-3FParteispende-2FKategorie%3DKategorie/-3FBetrag/-3FEmpf%C3%A4nger/-3FJahr-23-2Dn/-3FOrt/-3FBundesland/-3FBranche/-3FSchlagworte/mainlabel%3D/limit%3D10000/order%3Ddescending/sort%3DModification-20date/offset%3D0/format%3Djson/default%3Dkeine-20Ergebnisse-20mit-20der-20aktuellen-20Auswahl
Import all the data: Lobbypedia party donations
Names as “join keys” between datasets are… problematic!
Import all the data: Lobbypedia party donations
My solution:
Adding “index nodes” for persons
and relationships that indicate
context closeness
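The slides do not show how the index nodes are built, so the following is only one possible reading of the idea, with invented sample data and hypothetical helper names. Every person mention is filed under a normalized name (the "index node"), and two mentions under the same index entry are linked as contextually close when they also share a city.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations


def name_key(name):
    """Normalize a name: strip diacritics, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


# Invented person mentions from two hypothetical sources.
mentions = [
    {"id": 1, "source": "register", "name": "Max  Müller", "city": "Dresden"},
    {"id": 2, "source": "donations", "name": "max muller", "city": "Dresden"},
    {"id": 3, "source": "register", "name": "Max Müller", "city": "Hamburg"},
]

# One index entry per normalized name; all spellings end up in the same bucket.
index = defaultdict(list)
for m in mentions:
    index[name_key(m["name"])].append(m)

# Within a bucket, a shared city is treated as a hint of context closeness.
close_pairs = [
    (a["id"], b["id"])
    for group in index.values()
    for a, b in combinations(group, 2)
    if a["city"] == b["city"]
]
```

All three mentions share one index entry, but only mentions 1 and 2 are marked as close because they agree on the city. The name match alone is kept as a weak link rather than treated as proof that the records describe the same person.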
Schema graph
https://b0ef77c6.databases.neo4j.io/browser/
User: party
Password: party
Demo-Disclaimer
• All data is from public and open sources or common knowledge
• I did not change those sources nor do I claim them to be correct
• Due to the imperfect nature of the data, the import cannot be perfectly accurate, so do NOT blindly take the outcomes as fact!
Takeaways
( Graphs ) -[ :ARE ]-> ( Everywhere)
Use indexes
Expect some data wrangling when working with (open) data
Link to full import script
Play with the data at: https://b0ef77c6.databases.neo4j.io/browser/ (user & password: party)
Free O’Reilly Book
neo4j.com/graph-algorithms-book
• Spark & Neo4j Examples
• Machine Learning Chapter
Graph & ML Algorithms in Neo4j +35
• Pathfinding & Search: Finds optimal paths or evaluates route availability and quality
• Centrality / Importance: Determines the importance of distinct nodes in the network
• Community Detection: Detects group clustering or partition options
• Similarity: Evaluates how alike nodes are
• Link Prediction: Estimates the likelihood of nodes forming a future relationship