Visualizing Relationships: Journalistic Problems in a Digital Age
Summary
1. Introduction
2. The problem we are solving
3. Involved issues
4. Problems we found
5. The challenge
Who are we?
Mariano Blejman is technology editor and youth editor at the Argentine newspaper Página/12, and a co-founder of Hacks/Hackers Buenos Aires. @blejman
Marcos Vanetta is a biomedical engineer, a software developer at 3PillarGlobal, and a hacker at Hacks/Hackers Buenos Aires. @malev
The problem
● 1976: A dictatorship began in Argentina.
● 30,000 people were kidnapped and disappeared.
● 1985: The first trials took place in Argentina.
The bad guys were put on trial, but then the trials had to stop.
● 2003: The courts began trying the bad guys again.
● 2012: A large volume of judicial documents has accumulated.
No one can read all of them.
Involved issues
● Semantic analytics
● Ontology
● Data mining
● Social network analysis
● Visualizations
Who else was dealing with documents?
DocumentCloud, Overview, Open Calais, NLTK, GATE
First approach
● Read all the documents
● Software solution based on regular expressions
● Ruby, Padrino and a MySQL database
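require 'tmpdir'
require 'docsplit'

# Extract the plain text of a document with Docsplit (OCR disabled),
# then run it through the class's own clean_text helper.
def self.extract_plain_text(path)
  # File name without directory or extension; Docsplit names its output after it.
  basename = File.basename(path, '.*')
  tmp_dir = Dir.tmpdir
  Docsplit.extract_text(path, :output => tmp_dir, :ocr => false)
  text = File.read(File.join(tmp_dir, "#{basename}.txt"))
  self.clean_text(text)
end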
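The regular-expression approach mentioned above could look roughly like the following. This is a minimal sketch, not the project's code: the method name `extract_candidates`, both patterns, and the sample sentence are assumptions made up for illustration.

```ruby
# Hypothetical sketch: find candidate person names and dates with regular
# expressions. The patterns are illustrative assumptions, not the real rules.

# Two or more consecutive capitalized words. \p{Lu} and \p{Ll} are Unicode
# upper/lowercase classes, so accented Spanish letters are covered.
NAME_PATTERN = /\p{Lu}\p{Ll}+(?:\s+\p{Lu}\p{Ll}+)+/

# Numeric dates written as day/month/year, such as 24/03/1976.
DATE_PATTERN = %r{\b\d{1,2}/\d{1,2}/\d{4}\b}

# Return the unique name and date candidates found in a plain-text string.
def extract_candidates(text)
  {
    names: text.scan(NAME_PATTERN).uniq,
    dates: text.scan(DATE_PATTERN).uniq
  }
end

# Invented sample sentence, for illustration only.
result = extract_candidates('Ana Lopez was detained on 24/03/1976 in Buenos Aires.')
puts result.inspect
```

On this sample the name pattern picks up "Buenos Aires" as well as "Ana Lopez". A capitalization rule cannot tell people from places, which hints at why entity extraction appears among the problems listed next.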
The problems we found
● Convert PDF files to text
● Extract entities from documents
● Parse dates and addresses
● Resolve co-references between names
● How to store relations
● Contextual information about the documents
● Confidence in data on a crowdsourcing platform
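Several of these problems (name co-reference, storing relations, confidence in crowdsourced data) meet in the data model. The sketch below is one possible shape for it, under stated assumptions: the `Relation` fields, the `name_key` and `reliable` helpers, and the sample records are all hypothetical and do not reflect the mapa76 schema.

```ruby
require 'date'

# Hypothetical record for one relation between two entities. It keeps the
# source document and a confidence score so that crowdsourced edits can be
# weighed later. Field names are assumptions for illustration.
Relation = Struct.new(:subject, :predicate, :object, :date, :document_id, :confidence)

# Collapse spelling variants of a name into one key. This is a crude
# stand-in for real co-reference resolution, not a full solution.
def name_key(name)
  name.unicode_normalize(:nfd)  # split accented letters into base + mark
      .gsub(/\p{Mn}/, '')       # drop the combining marks
      .downcase
      .scan(/[a-z]+/)           # keep alphabetic tokens only
      .sort                     # ignore "Surname, Name" vs "Name Surname" order
      .join(' ')
end

# Keep only the relations whose confidence reaches the threshold.
def reliable(relations, threshold)
  relations.select { |r| r.confidence >= threshold }
end

# Invented sample data, for illustration only.
relations = [
  Relation.new(name_key('LOPEZ, Ana'), :mentioned_with, name_key('Luis Gomez'),
               Date.new(1985, 4, 22), 'doc-1', 0.9),
  Relation.new(name_key('Ana  Lopez'), :mentioned_with, name_key('Gomez, L.'),
               nil, 'doc-2', 0.4)
]
puts reliable(relations, 0.7).map(&:document_id).inspect
```

Sorting the tokens makes "LOPEZ, Ana" and "Ana  Lopez" share a key, but an abbreviated form such as "Gomez, L." still gets a key of its own, so initials and nicknames would need a smarter matcher.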
Visualizing relationships over time
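One way to prepare data for a view of relationships over time is to bucket dated links by year, so that a front end can draw one network per time step. This is only a sketch: the `Edge` struct, the `edges_by_year` helper, and the sample edges are hypothetical.

```ruby
require 'date'

# Hypothetical edge: two entity names plus the date the link was documented.
Edge = Struct.new(:source, :target, :date)

# Group edges by year, in chronological order, and reduce each edge to a
# [source, target] pair that a graph-drawing layer could consume.
def edges_by_year(edges)
  edges.group_by { |e| e.date.year }
       .sort
       .to_h
       .transform_values { |es| es.map { |e| [e.source, e.target] } }
end

# Invented sample data, for illustration only.
edges = [
  Edge.new('A', 'B', Date.new(1977, 5, 1)),
  Edge.new('A', 'C', Date.new(1976, 3, 24)),
  Edge.new('B', 'C', Date.new(1977, 8, 9))
]
puts edges_by_year(edges).inspect
```

Yearly buckets are an arbitrary choice here; the same grouping would work with months or with named periods taken from the documents.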
What do we have now?
Prototype for a single (and local) use case: mapa76
Platform for different use cases: analice.me
The #mozfest challenge
Find a big journalistic issue that involves:
● Lots of documents with unstructured data
● Lots of data to find inside them
● Relationships that you want to find