Data Visualization in the Newsroom


Page 1: Data Visualization in the Newsroom

Data visualization in the newsroom

{

“presented by”: “carl v. lewis”,

“for”: “the florida times-union”,

“slides”: “bit.ly/NIXkOD”,

“email”:“[email protected]

}

Page 2: Data Visualization in the Newsroom

What is data visualization?

•Data itself is the story; standalone narrative.

•Interactive, communicative, visual.

•Ranges from simple (charts) to complex (database-driven applications).

•Both a technique and a format.

•Both entertaining and factual.

• See: “The Many Words for Visualization”

Page 3: Data Visualization in the Newsroom

The history of data journalism

•Grew out of CAR (computer-assisted reporting) tradition

•John Snow’s 1854 cholera map

•Has coincided with the era of “Big Data”

Page 4: Data Visualization in the Newsroom

On the emergence of the field of data journalism:

•"When information was scarce, most of our efforts were devoted to hunting and gathering. Now that information is abundant, processing is more important." –Philip Meyer, UNC Chapel Hill

Page 5: Data Visualization in the Newsroom

On the growing importance of data-driven journalism:

•“Journalists need to be data-savvy . . . Data-driven journalism is the future.” –Sir Tim Berners-Lee

•“The explosion of Web-based tools and ways of sifting through and sharing data has created something approaching a revolution, and the potential benefits for journalism are only just beginning to reveal themselves.” –Mathew Ingram

Page 6: Data Visualization in the Newsroom

What data journalism is not:

• Simply incorporating public data into your textual narrative

• Infographics

• Illustration

• Resource-intensive

• Just about numbers and programming

• Just about making data flashy

Page 7: Data Visualization in the Newsroom

What data journalism is:

• Visual

• Often evergreen

• Transparent – direct access to primary source

• Credible

• Engaging

• A good business model

Page 8: Data Visualization in the Newsroom

Hans Rosling

http://www.youtube.com/watch?v=jbkSRLYSojo

Page 9: Data Visualization in the Newsroom

Democratization of data journalism

• Free and open-source tools (Google Drive, JavaScript libraries, etc.).

• Open Data laws.

• “Anyone can do it. Data journalism is the new punk.” -Simon Rogers, The Guardian

Page 10: Data Visualization in the Newsroom

The job of the data journalist

• Part statistician, part journalist, part programmer.

• “We're statisticians. We don't program.”

• “We’re programmers. We don’t report.”

• “We’re journalists. We don’t code.”

Page 11: Data Visualization in the Newsroom

Notable examples of data visualization

• “Mapping America: Every City, Every Block,” NYTimes.com.

• “Where Does My Money Go?”, Open Knowledge Foundation.

• “Illinois school report cards,” Chicago Tribune

• “We Feel Fine,” Jonathan Harris

• “Top Secret America,” The Washington Post

Page 12: Data Visualization in the Newsroom

News organizations to follow for innovative data projects

Page 13: Data Visualization in the Newsroom

What are your favorite visualizations?

Page 14: Data Visualization in the Newsroom

When to use data visualization:

• Showing change over time

• Comparing discrete values

• Showing connections and flows

• Showing hierarchy

• Browsing large databases

Page 15: Data Visualization in the Newsroom

When not to use data visualization:

• When text or multimedia tells story better

• When you have very few data points

• When there is no statistical significance

• When a map is not a map

• When a table would do

Page 16: Data Visualization in the Newsroom

Process of data journalism

1. Research – Think of topic and research factors.

2. Find the data – Locate and retrieve relevant public data

3. Analysis and evaluation – Crunch numbers, look for trends or inconsistencies

4. Visualize – Display the data in appropriate manner

Page 17: Data Visualization in the Newsroom

II. Mining public data: Research and retrieval

Page 18: Data Visualization in the Newsroom

Research

1. Think of a topic – what factors influence it?

2. What public data might shed light on those factors?

3. Seek out the data

Page 19: Data Visualization in the Newsroom

Locating public data

• Thousands of public “data dumps” by government bodies and nonprofits.

• Most commonly in delimited spreadsheet format (look for .csv, .xls), sometimes in XML and JSON.

• For geographic data, look for .kml or .shp

• Can be found directly at source or by search engine keyword

Page 20: Data Visualization in the Newsroom

Search tips for data retrieval

• If you don’t know which source to look to for your data, an initial Web search might help.

• After your keywords, type “filetype:XLS”, “filetype:CSV”, or whatever the extension is of the data you’re seeking, and you’ll see only files of that type from across the Web.

• If you get no results, try broadening your search term to locate sources that cover the general discipline (e.g., instead of “malaria deaths,” try “public health data”).
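The filetype: tip above can be scripted when you need to try many keyword and extension combinations. A minimal sketch, assuming only the operator described on this slide; the helper name build_data_query is made up for illustration.

```python
from urllib.parse import quote_plus


def build_data_query(keywords, extension):
    """Append a filetype: restriction to a keyword search."""
    # Accept "CSV", "csv" or ".csv" and normalize to lowercase without the dot.
    ext = extension.lower().lstrip(".")
    return f"{keywords} filetype:{ext}"


# Start narrow, then broaden to the general discipline if nothing turns up.
narrow = build_data_query("malaria deaths", "CSV")
broad = build_data_query("public health data", ".xls")

print(narrow)
print(broad)
# URL-encoded form, ready to paste after a search engine's "?q=" parameter.
print(quote_plus(narrow))
```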

Page 22: Data Visualization in the Newsroom

Florida public data sources

• Florida’s “Sunshine” law requires all state agencies to provide open access to public records, including data.

• Chapter 119 of the Florida Statutes mandates that “any records made or received by any public agency in the course of its official business are available for inspection, unless specifically exempted by the Florida Legislature.”

Page 23: Data Visualization in the Newsroom

Florida public data sources

• Dozens of useful open data sources maintained by Florida government agencies, including TransparencyFlorida.gov, FloridaHasARightToKnow.com and MyFlorida.gov.

• Full list of state-maintained databases by topic here.

• A few state-maintained databases worth mentioning: the Division of Elections’ campaign finance data, the DOE’s test score reports and the Department of Law Enforcement’s arrest and officer reports.

Page 24: Data Visualization in the Newsroom

Florida public data sources

• A number of advocacy groups also maintain useful, downloadable statewide databases:

• FloridaOpenGov.org, which focuses on public employee payroll data.

• FloridaRedistricting.org, which provides demographic data (.csv) and geographic polygons (.shp) for new district boundaries.

• Florida Housing Data Clearinghouse, which provides regularly updated property values, housing data (.xls).

(for even more, see my semi-exhaustive list with descriptions here).

http://www.duvalelections.com/content.aspx?id=235

Page 25: Data Visualization in the Newsroom

Georgia public data sources

• Although Georgia has no law requiring all government agencies to make public data accessible online, many do anyway.

• In 2008, the Transparency in Government Act expanded the public data site, Open.Georgia.gov, to include all three branches of government, regional education service agencies, local boards of education, and transactions made by the General Assembly.

Page 26: Data Visualization in the Newsroom

Georgia public data sources

• A comprehensive list of downloadable databases from state agencies in Georgia can be found here.

• The State Ethics Committee has made all campaign finance reports, lobbyist reports and campaign contributions available in downloadable spreadsheets.

• OASIS provides a set of web-based tools to browse the Georgia Department of Public Health’s Data Warehouse, and download the data yourself if you wish.

Page 27: Data Visualization in the Newsroom

Locating geographic data

• Most geographic data is available as TIGER/Line Shapefile packages (archives containing .shp, .dbf, .prj, .xml, .shx) from the U.S. Census Bureau.

• Google also hosts a directory of .kml files for most geographic boundaries here.

• Alternatively, Florida and Georgia GIS data can be found at FGDL.org, Geoplan and Data.GeorgiaSpatial.org.

Page 28: Data Visualization in the Newsroom

What to look for

• Most numeric spreadsheet data comes as either a comma-separated values (.csv) or Microsoft Excel (.xls) file. Example of .csv structure:

    "Name","Date","Address","Zip","State","Country"

• XML (eXtensible Markup Language) stores data hierarchically for the Web, and is good for building news applications because of its broad interoperability.

    <menu id="file" value="File">
      <popup>
        <menuitem value="New" onclick="CreateNewDoc()" />
        <menuitem value="Open" onclick="OpenDoc()" />
        <menuitem value="Close" onclick="CloseDoc()" />
      </popup>
    </menu>

• JSON (JavaScript Object Notation) – Similar to XML in structure, but with “lighter” punctuation based on JavaScript conventions. May eventually replace XML as the standard.

    {"menu": {
      "id": "file",
      "value": "File",
      "popup": {
        "menuitem": [
          {"value": "New", "onclick": "CreateNewDoc()"},
          {"value": "Open", "onclick": "OpenDoc()"},
          {"value": "Close", "onclick": "CloseDoc()"}
        ]
      }
    }}

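All three formats on this slide can be read with Python's standard library. A minimal sketch: the XML and JSON snippets are the menu example from the slide, while the CSV row ("Jane Doe" and so on) is made up for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# CSV: the first row holds the field names, so DictReader keys each row by them.
csv_text = '"Name","Date","Zip"\n"Jane Doe","2012-07-01","32202"\n'
rows = list(csv.DictReader(io.StringIO(csv_text)))

# XML: walk the hierarchy and read attributes off each <menuitem>.
xml_text = """<menu id="file" value="File">
  <popup>
    <menuitem value="New" onclick="CreateNewDoc()" />
    <menuitem value="Open" onclick="OpenDoc()" />
    <menuitem value="Close" onclick="CloseDoc()" />
  </popup>
</menu>"""
root = ET.fromstring(xml_text)
xml_values = [item.get("value") for item in root.iter("menuitem")]

# JSON: the same structure becomes nested dicts and lists.
json_text = """{"menu": {"id": "file", "value": "File",
  "popup": {"menuitem": [
    {"value": "New", "onclick": "CreateNewDoc()"},
    {"value": "Open", "onclick": "OpenDoc()"},
    {"value": "Close", "onclick": "CloseDoc()"}]}}}"""
menu = json.loads(json_text)["menu"]
json_values = [item["value"] for item in menu["popup"]["menuitem"]]

print(rows[0]["Name"], xml_values, json_values)
```

Note that every CSV value arrives as a string, so numeric columns need converting before any calculation.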
Page 29: Data Visualization in the Newsroom

Scraping other sources

• Scrape data from an HTML table with a simple Google Spreadsheets formula: =ImportHtml("http://the-url-goes-here", "table", 1) (the last argument is the table’s position on the page, counting from 1)

• For database of HTML tables, try Haystax.

• For PDFs, try CometDocs.

• Scrape webpages by running or creating a Python script at ScraperWiki.
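For a sense of what a table-scraping script does, here is a minimal sketch using only Python's built-in html.parser. It is not ScraperWiki's API; the class name TableScraper and the sample table (counties and turnout figures) are invented for illustration, and a real script would first download the page.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample page; the figures are placeholders, not real data.
sample_html = """
<table>
  <tr><th>County</th><th>Turnout</th></tr>
  <tr><td>Duval</td><td>61%</td></tr>
  <tr><td>Clay</td><td>58%</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(sample_html)
header, *records = scraper.rows
print(header)
print(records)
```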

Page 30: Data Visualization in the Newsroom

APIs for data retrieval

• APIs (application programming interfaces) are how many websites and services share content with one another.

• They allow a computer system to fetch, interpret and use data created on another system, even if it uses a different programming language or structure.

• Examples: Twitter Search API, Google Maps API, NYTimes Campaign Finance API.

• Usually returns data as XML, JSON or .txt

• Often requires use of an API key.
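A typical API call boils down to two steps: build a URL that carries your query and key, then parse the structured response. A minimal sketch with a hypothetical endpoint (api.example.com), a hypothetical api_key parameter name and a canned JSON response; real services document their own URLs, parameters and response fields.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.com/v1/contributions"  # hypothetical endpoint
API_KEY = "YOUR-API-KEY"  # placeholder; services issue keys on sign-up


def build_request_url(base_url, api_key, **params):
    """Attach query parameters and the API key to the endpoint URL."""
    query = dict(params, api_key=api_key)  # "api_key" is an assumed parameter name
    return f"{base_url}?{urlencode(sorted(query.items()))}"


url = build_request_url(BASE_URL, API_KEY, state="FL", year=2012)
print(url)

# A live script would fetch the URL, for example:
#   with urllib.request.urlopen(url) as response:
#       payload = json.load(response)
# Here a made-up response stands in so the sketch runs offline.
sample_response = (
    '{"results": [{"candidate": "A", "total": 1500.0},'
    ' {"candidate": "B", "total": 900.5}]}'
)
payload = json.loads(sample_response)
totals = {row["candidate"]: row["total"] for row in payload["results"]}
print(totals)
```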

Page 31: Data Visualization in the Newsroom

II. Analyzing and refining public data

Page 32: Data Visualization in the Newsroom

Manipulating datasets

• Data rarely ready for analysis and visualization out-of-the-box (hence “raw data”).

• Spreadsheet applications most common and easiest way to work with data (Excel, Google Spreadsheets).

• Allow for complex calculations, formulas, sorting.

• Compatible with a variety of file formats (.xls, .ods, .csv, .txt, .tsv).

• Scripts may also be written to automate bulk manipulation (Python).

• R Project (r-project.org)

Page 33: Data Visualization in the Newsroom

Data analysis

• To figure out what your data says, you’ll need to crunch the numbers.

• Statistical significance is the litmus test.

• Skewed or normal distribution? Why?

• Outliers? If so, error or unexplained factor?

Page 34: Data Visualization in the Newsroom

Benchmarks for analysis

• Mean (μ) is the simplest to calculate, but susceptible to distortion by outliers.

• Median is usually a better metric for drawing conclusions, especially with a skewed distribution.

• If mean=mode, no skewness.

• Standard deviation (σ) measures how widely the values in a data set are spread around the mean.

• Z-score = how many standard deviations a value is away from the mean and, thus, its likelihood of being an outlier.

[Figure labels: standard deviation, mean, z-score]

Page 35: Data Visualization in the Newsroom

Calculating values in Excel

• Mean: =AVERAGE(A1:A27)

• Median: =MEDIAN(A1:A27)

• Standard deviation: =STDEV(A1:A27)

• Z-score of a given value: Subtract the mean of the dataset from the value, then divide the result by the standard deviation, e.g. =(A1-AVERAGE(A$1:A$27))/STDEV(A$1:A$27)
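The same four calculations can be done in Python with the built-in statistics module. A minimal sketch on a made-up list of values, where 95 is a deliberate outlier to show how far it drags the mean from the median.

```python
import statistics

# Made-up sample values; 95 is a deliberate outlier.
values = [12, 15, 14, 13, 16, 15, 14, 95]

mean = statistics.mean(values)      # pulled upward by the outlier
median = statistics.median(values)  # barely affected by it
stdev = statistics.stdev(values)    # sample standard deviation, like Excel's STDEV


def z_score(value, mean, stdev):
    """How many standard deviations a value sits from the mean."""
    return (value - mean) / stdev


z_outlier = z_score(95, mean, stdev)

print(f"mean={mean}, median={median}")
print(f"stdev={stdev:.2f}, z-score of 95={z_outlier:.2f}")
```

A common rule of thumb treats values more than two or three standard deviations from the mean as worth a second look; the right threshold depends on the dataset.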

Page 36: Data Visualization in the Newsroom

Other commonly used Excel formulas

• CONCATENATE to merge multiple columns.

• MID to split columns.

• Percent change to display relative change over time: =(new_value-original_value)/ABS(original_value)

• See this guide of helpful Excel tricks for data journalists, compiled by Mary-Jo Webster of St. Paul Pioneer Press: https://docs.google.com/file/d/0ByLyArAQRhaBNDc3NjJjYTUtY2U0Yi00NmIwLThkNTgtYzNlYThmNGE1ZTEz/edit
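For anyone scripting these steps instead, the three formulas above translate to a few lines of Python. A minimal sketch; the function names simply mirror the Excel ones, and the sample name and phone number are made up.

```python
def percent_change(original_value, new_value):
    """Relative change, matching =(new-original)/ABS(original)."""
    return (new_value - original_value) / abs(original_value)


def concatenate(*parts):
    """Merge several column values into one string, like CONCATENATE."""
    return "".join(str(part) for part in parts)


def mid(text, start, length):
    """Extract part of a string, like MID (start counts from 1, as in Excel)."""
    return text[start - 1:start - 1 + length]


change = percent_change(200, 250)
full_name = concatenate("Jane", " ", "Doe")
area_code = mid("904-555-0100", 1, 3)

print(f"{change:.0%}", full_name, area_code)
```

Dividing by the absolute value keeps the sign of the change correct when the original value is negative.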

Page 37: Data Visualization in the Newsroom

Refining and cleaning data

• Sometimes Excel and Google Spreadsheets aren’t enough, especially when working with large datasets.

• Google Refine – free tool that lets you explore, power sort and process data.

• Useful for finding and fixing errors and inconsistencies, “power tool for working with messy data.”

• Facets to sort data

• Cleaning with clusters

• Shan Carter’s Mr. Data Converter to convert spreadsheets to more web-friendly format.
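The idea behind cleaning with clusters is to reduce each value to a normalized "fingerprint" so that spelling variants of the same entry collide. A minimal sketch of that key-collision approach; it approximates the technique rather than reproducing Google Refine's exact algorithm, and the place names are sample data.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a value: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


names = ["Jacksonville, FL", "jacksonville fl", "FL Jacksonville ", "Orlando, FL", "Tampa"]
clusters = cluster(names)
print(clusters)
```

Each cluster is a candidate for merging into one canonical spelling; a person should still review the groups, since distinct entries can occasionally share a fingerprint.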

Page 38: Data Visualization in the Newsroom

Other data analysis tips and tricks

• Put field names in first row.

• Put geographic data in first columns.

• When you have two different datasets, a good tool to merge them is Google Fusion Tables (make sure they share a common attribute).

• Never round until the end of calculations. Round to two decimal places for visualization purposes.

• Cut and paste calculations into a new column as values only.

• Know the principal data types (integer, real, string, boolean), and make sure numeric data is classified as either integer (whole numbers only) or real (any value).
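Two of these tips, merging on a common attribute and rounding only at the end, can be sketched in plain Python. This is an illustration of the idea, not how Fusion Tables works internally; the helper merge_on and all county figures are made up.

```python
# Made-up figures for illustration only.
budgets = [
    {"county": "Duval", "budget": 1000.0},
    {"county": "Clay", "budget": 250.0},
]
populations = [
    {"county": "Duval", "population": 3},
    {"county": "Clay", "population": 7},
    {"county": "Nassau", "population": 2},  # no budget row, so it drops out
]


def merge_on(left, right, key):
    """Join two lists of rows on a shared field, keeping only matching rows."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


merged = merge_on(budgets, populations, "county")

# Keep full precision while calculating...
per_capita = {row["county"]: row["budget"] / row["population"] for row in merged}

# ...and round to two decimal places only for display.
display = {county: round(value, 2) for county, value in per_capita.items()}
print(display)
```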

Page 39: Data Visualization in the Newsroom

III. Visualizing your data

Page 40: Data Visualization in the Newsroom

Planning your visualization

• Identify your key message

• Choose the best data series to illustrate your point

• Consider the number of points in the data

• Think about complementary/supporting datasets you can incorporate, e.g. sanitation with poverty.

• Plan for user interaction, i.e. visual feedback.

• Make numerical changes to raw data to enhance your point, e.g. absolute values vs. percent change

• Brainstorm potential technologies

• Consult experts on topic to back up your interpretation of data

Page 41: Data Visualization in the Newsroom
Page 42: Data Visualization in the Newsroom

Choosing the right type of visualization

• Change of single variable over time: line chart.

• Comparison of single variable among multiple classes: bar chart.

• Two variables: scatter plot, bubble chart.

• Hierarchical data: treemap, bubbletree.

• Area charts for area only

• Makeup of whole: pie chart.

• Distribution: histograms, box-and-whisker plots.

• Geographic data: point, polygon, choropleth and symbol maps.

• Records: searchable database.

• Chronological data: timeline, sparklines.

• Other possibilities: matrices, heatmaps, games, slopegraphs, stepper graphics.

Page 43: Data Visualization in the Newsroom

Visualization design principles

• Typography: clear, consistent, not distracting.

• Use bold, mix of serif/sans-serif to provide emphasis.

• Don’t set type at an angle

• Color: Let color correspond to variable, design for accessibility, choose from same side of color wheel, consider cultural associations but avoid thematic palettes. Use Adobe Kuler or 0to255.com.

• Visual overload, emotional design, skeuomorphism.

No white type on black background

No angled type

Page 44: Data Visualization in the Newsroom

Visualization design principles

• Some guidelines for graphical integrity, according to Edward Tufte in The Visual Display of Quantitative Information:

1. Representation of numbers should be directly proportional to the numerical quantities represented.

2. Clear, detailed labeling throughout.

3. Show data variation, not design variation.

4. Avoid excessive and unnecessary use of graphical effects.

What Edward Tufte calls “the worst visualization ever published.”

Page 45: Data Visualization in the Newsroom

Visualization design principles

• Design for the eye

• User should be able to discern key message visually.

• Design for interaction

• Highlighting and details on demand (example)

• User-driven content selection (example)

Page 46: Data Visualization in the Newsroom

Visualization design principles

Page 47: Data Visualization in the Newsroom

Awful

Bad, but better

Visualization design principles

Page 48: Data Visualization in the Newsroom

Awful, but better

Not bad

Awful

Visualization design principles

Page 49: Data Visualization in the Newsroom

What’s wrong with this infographic?

Visualization design principles

Page 51: Data Visualization in the Newsroom

Wireframing/prototyping

• Follow a structured grid system (e.g., a 12-column, 960px grid – see 960.gs and Subtraction).

• Very selectively, you can break the grid to emphasize a certain visual element.

• Sketch out/prototype your wireframe on paper first (print templates such as this)

Page 52: Data Visualization in the Newsroom

Selecting tools/technologies

• A wealth of free, open-source data visualization tools and libraries exist to shorten development times

• Examples: Google Visualization API, Google Fusion Tables, Highcharts.js, CartoDB, d3.js, Tableau Public.

• For everything else, HTML5 + CSS + JavaScript

Page 53: Data Visualization in the Newsroom

IV. Building a Web app

Page 54: Data Visualization in the Newsroom

Web app anatomy

Three components of a Web app:

1. HTML (structure)

2. CSS (styles)

3. JavaScript (interactivity)

Page 55: Data Visualization in the Newsroom

Parts of an HTML file

An HTML file is made up of:

1. Doctype declaration

2. Head <head>

3. CSS/JavaScript references

4. Title <title>

5. Body <body>

6. A Div container

7. Divs (IDs and classes)
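To see how those parts fit together, here is a minimal sketch that assembles a bare-bones page from a Python string template. The file names style.css and app.js and the div names container, chart and panel are placeholders, not anything prescribed by these slides.

```python
import html

# Doctype, head (title plus CSS/JavaScript references), then a body whose
# container div wraps the individual divs (one ID, one class shown here).
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
  <title>{title}</title>
  <link rel="stylesheet" href="{css_href}">
  <script src="{js_src}"></script>
</head>
<body>
  <div id="container">
    <div id="chart" class="panel"></div>
  </div>
</body>
</html>
"""


def build_page(title, css_href="style.css", js_src="app.js"):
    """Fill in the template, escaping values so they are safe inside HTML."""
    return PAGE_TEMPLATE.format(
        title=html.escape(title),
        css_href=html.escape(css_href, quote=True),
        js_src=html.escape(js_src, quote=True),
    )


page = build_page("Voter turnout <2012>")
print(page)
```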

Page 56: Data Visualization in the Newsroom

Parts of a CSS file

A CSS file is made up of:

1. Container ID

2. Default paragraph (p) style

3. Default H1,H2, etc. styles

4. Default body style

5. Styles for all divs

Page 57: Data Visualization in the Newsroom

V. Maps

Page 58: Data Visualization in the Newsroom

Maps 101

• Interactive maps combine geocoded data – points or polygons – with metadata and/or numeric data.

• KML (Keyhole Markup Language) is quickly becoming a popular file format, but Shapefile (shp.zip) is still the most widely available.

• Geographic data can either be geocoded, downloaded from the Web, or custom-drawn.

• Good purveyor of news maps: The Texas Tribune.
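Because KML is just XML, a list of geocoded points can be turned into a KML file with a short script. A minimal sketch using Python's standard library; the place names and coordinates are illustrative placeholders, and only the basic Placemark/Point elements are used.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def points_to_kml(points):
    """Build a KML document with one Placemark per geocoded point."""
    ET.register_namespace("", KML_NS)  # write KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for point in points:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = point["name"]
        geometry = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML lists longitude first, then latitude.
        coords = f'{point["lon"]},{point["lat"]}'
        ET.SubElement(geometry, f"{{{KML_NS}}}coordinates").text = coords
    return ET.tostring(kml, encoding="unicode")


# Placeholder points; coordinates are approximate and for illustration only.
polling_places = [
    {"name": "Sample precinct 1", "lat": 30.33, "lon": -81.66},
    {"name": "Sample precinct 2", "lat": 30.29, "lon": -81.39},
]

kml_text = points_to_kml(polling_places)
print(kml_text)
```

The longitude-before-latitude order is the detail most likely to trip you up: swapped values put Florida points in Antarctica.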

Page 59: Data Visualization in the Newsroom

Mapping services and libraries

• Google Fusion Tables – Quick, versatile and classic maps that integrate seamlessly with the Google Maps JavaScript API.

• CartoDB – A newer open-source tool much like Fusion Tables, but with a better looking out-of-the-box experience.

• Leaflet – An open-source, client-side mapping library with an API that allows you to achieve a number of advanced features. Plays nicely with Fusion Tables and CartoDB-hosted maps. Part of CloudMade suite.

Page 60: Data Visualization in the Newsroom

Handy desktop mapping software

• QGIS – Free program that supports almost every conceivable map file type, and allows you to add or manipulate vector data, which can then be exported as a KML or Shapefile package.

• TileMill – Map creation and styling software; ideal for those with little programming experience. UTF-grid enabled tilesets only.

Page 61: Data Visualization in the Newsroom

Primary map types

• Choropleth – Colors for each geometry correspond to numeric values of a given variable.

• Point – Locations on a map displayed by geocoded markers.

• Less frequently: proportional maps and geo maps.

Choropleth map of Georgia voter turnout

Point map of Jacksonville polling locations

Page 62: Data Visualization in the Newsroom

Tips and tricks

• If you have street address data, you can use BatchGeocode to convert them to lat-long coordinates.

• For choropleth maps:

• Include no more than five fill colors or “buckets”

• Don’t define an equidistant color ramp; use ColorBrewer instead.

• Use MarkerClusterer when there are too many points for certain zoom levels.

Using ColorBrewer to define an accurate, accessible color ramp.

Using MarkerClusterer to cluster points at further zoom levels.
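One way to put the five-bucket rule into practice is to split the values into five classes and give each a color from a ready-made palette. A minimal sketch: the quantile split is an assumed classification method (others exist), the turnout values are made up, and the hex codes are believed to match ColorBrewer's 5-class "Blues" scheme but should be checked against ColorBrewer before publishing.

```python
import statistics

# Believed to be ColorBrewer's 5-class sequential "Blues" palette; verify before use.
BLUES_5 = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]


def quantile_breaks(values, n_classes=5):
    """Cut points that split the values into n_classes roughly equal-sized groups."""
    return statistics.quantiles(values, n=n_classes, method="inclusive")


def bucket_index(value, breaks):
    """Index of the bucket a value falls in: the number of breaks it exceeds."""
    return sum(value > b for b in breaks)


# Made-up turnout percentages for ten areas.
turnout = [41, 45, 48, 52, 55, 57, 60, 63, 68, 74]
breaks = quantile_breaks(turnout)
fill_colors = {value: BLUES_5[bucket_index(value, breaks)] for value in turnout}

for value, color in fill_colors.items():
    print(value, color)
```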

Page 63: Data Visualization in the Newsroom

Tips and tricks

• To convert Shapefiles so they can be imported into Fusion Tables, either use Shape to Fusion, or export it as KML from CartoDB.

• Before using the embed tool in Fusion Tables or CartoDB, make sure the map is centered where you want it.

• Ensure your map is set to “Public.”

Export a Shapefile as KML in CartoDB.

Making your map public in Fusion Tables

Page 64: Data Visualization in the Newsroom

V. Charts

Page 65: Data Visualization in the Newsroom

Charts

• Basic building block of visualization

• Simple, but also easy to mess up.

• Should always be interactive.

• Should always include data source.

• Should always include a legend.

• Unless necessary, only show labels on mouseover.

Page 66: Data Visualization in the Newsroom

Interactive charting tools

• Out-of-the-box: Google Drive charts, infogr.am.

• More advanced: Google Code Playground.

• Most agile: Highcharts.js.

• Most extensible: Tableau Public

A combo chart made using Highcharts.js

Page 67: Data Visualization in the Newsroom

Charting best practices

• Color: Pick a palette of no more than 3-4 colors from the same side of the color wheel.

• Increments: Use natural increments such as (0,2,4,6...) instead of, say, (0,3,6,9...)

• Scale: Don’t plot two unrelated series with one scale on left and one on right.

• Style: Flat and simple. No 3D effects, shadows, narrow bars or distracting shading.

Don’t plot two different variables on same scale.

Bars too narrow Distracting shading

Misleading 3D effects Pointless shadows

Source: The Wall Street Journal Guide to Information Graphics, Dona M. Wong.

Page 68: Data Visualization in the Newsroom

Charting best practices

• Always set the baseline to zero.

• Always order starting with greatest value

• Use broken bars sparingly

• No more than five slices on pie charts; no “donut” pie charts.

• No more than 3-4 lines on line chart

Wrong order Right order

Wrong baseline Right baseline

No donut-pies

Source: The Wall Street Journal Guide to Information Graphics, Dona M. Wong.
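The ordering and five-slice rules can be applied to the data before it ever reaches a charting tool. A minimal sketch, assuming that the smallest categories may reasonably be folded into an "Other" slice; the helper name top_slices and the category counts are made up.

```python
def top_slices(counts, max_slices=5, other_label="Other"):
    """Order categories from greatest to least and fold the tail into one slice."""
    ordered = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    if len(ordered) <= max_slices:
        return ordered
    head = ordered[:max_slices - 1]
    other_total = sum(value for _, value in ordered[max_slices - 1:])
    # The combined slice goes last, even if it outweighs a named slice.
    return head + [(other_label, other_total)]


# Made-up category counts.
counts = {"A": 40, "B": 25, "C": 15, "D": 8, "E": 6, "F": 4, "G": 2}
slices = top_slices(counts)
print(slices)
```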

Page 69: Data Visualization in the Newsroom

V. Programming and beyond

Page 70: Data Visualization in the Newsroom

Utilizing JavaScript/HTML5 libraries

• Together, JavaScript, HTML5 and jQuery have expanded boundaries of data visualization

• An abundance of open-source libraries and packages means less programming is required to produce unique, interactive visualizations.

• Examples: Timeline.js, Bubbletree.js, Raphael.js, ProPublica tools

Page 71: Data Visualization in the Newsroom

The HTML5 revolution

• Adobe Edge for HTML5 development; end of Flash’s reign

• Platform-agnostic, mobile-first movement

• Forking resources and packages off GitHub

Page 72: Data Visualization in the Newsroom

Pushing the limits

• RaphaelJS for easier manipulation of serialized vector graphics

• Other boundary-pushing data visualization projects: Processing, Gephi, d3.js, IBM’s Many Eyes.

A network map produced using D3.js

Page 73: Data Visualization in the Newsroom

Helpful resources and communities

• Blogs/Tutorials: FlowingData.com, Vis4.net, Driven-by-data.net, Chryswu.com, datavisualization.ch

• Books: The Data Journalism Handbook, O’Reilly Media. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, Nathan Yau. The Wall Street Journal Guide to Information Graphics, Dona M. Wong.

• Communities: visual.ly, Hacks/Hackers, NICAR.

Free data journalism handbook from O’Reilly Media

Page 74: Data Visualization in the Newsroom

For slides and list of links, http://bit.ly/NIXkOD

@carlvlewis