Leetaru, Kalev: lightning talk, The GDELT Project
Uploaded by: reynolds-journalism-institute-rji
Transcript of Leetaru, Kalev: lightning talk, The GDELT Project
Datasets
• NEWS: Worldwide local news coverage in 100 languages (65 live translated) – online news preserved via Internet Archive
• TELEVISION: Collaboration with the Internet Archive to process more than 100 television stations across the US, updating daily
• ACADEMIC LITERATURE: 21 billion words covering 70 years (JSTOR/DTIC/CORE/CITESEER/IA)
• BOOKS: Collaboration with Internet Archive and HathiTrust to process 3.5 million books 1800-2015
• HUMAN RIGHTS: Half century of worldwide human rights reports
• IMAGERY: Large fraction of global news imagery processed via deep learning: objects/activities, OCR, logos, facial sentiment, geolocation
Preserving Online News
• World’s largest initiative to preserve online news – partnership with the Internet Archive
• Only program to focus on worldwide local news in local languages
• 1.5-2% of news articles disappear within 2 weeks
• 5% disappear within a month
• Up to 14% gone after 2 months – half with 404 and half ranging from sustained 500’s to domain removal (popular in some areas of the world)
• Of GDELT-relevant coverage, 140,000 articles published today will be gone in 2 months
• 14 million GDELT monitored articles disappeared over a 6 month period representing 2x the total output of the New York Times over the last half century
• Nepal 2015 Earthquake: preserving coverage of sudden-onset natural disasters requires “always on” preservation – GDELT preserved 667,000 articles – 225,000 non-English, with top being Nepali
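The failure modes listed above (404s, server errors in the 500 range, and domains that vanish entirely) could be tallied along the following lines. This is only a minimal sketch, not GDELT's actual pipeline: the function names `classify_check` and `loss_summary`, the category labels, and the sample data are all hypothetical. It also classifies a single check per URL, whereas "sustained" 500s would in practice need repeated checks over time.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def classify_check(status: Optional[int]) -> str:
    """Map one availability check to a coarse category (hypothetical labels).

    `status` is the HTTP status code returned for an article URL, or None
    when the domain no longer resolves or the connection fails outright.
    """
    if status is None:
        return "domain_gone"
    if status == 404:
        return "not_found"
    if 500 <= status <= 599:
        return "server_error"
    if 200 <= status <= 399:
        return "alive"
    return "other"


def loss_summary(statuses: Iterable[Optional[int]]) -> Tuple[Counter, float]:
    """Count categories and return the fraction of URLs considered lost.

    "Lost" here means not_found, server_error, or domain_gone; the
    "other" category is reported but not counted as lost.
    """
    counts = Counter(classify_check(s) for s in statuses)
    total = sum(counts.values())
    lost = counts["not_found"] + counts["server_error"] + counts["domain_gone"]
    return counts, (lost / total if total else 0.0)


if __name__ == "__main__":
    # Made-up sample of 100 checks, chosen only to illustrate the tally.
    sample = [200] * 86 + [404] * 7 + [503] * 4 + [None] * 3
    counts, fraction_lost = loss_summary(sample)
    print(dict(counts))
    print(f"fraction lost: {fraction_lost:.2f}")
```

Running such a tally at fixed intervals after publication (2 weeks, 1 month, 2 months) is one way the disappearance rates quoted above could be measured.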