Visualization of Big Data - Chalmers · Visualization of Big Data . DANYEL FISHER, MICROSOFT...

Visualization of Big Data DANYEL FISHER, MICROSOFT RESEARCH

Contents

Big Data & Visualization Overview

Background, and How We Know What We Know

Design Constraints for Visualizing Big Data ◦ There’s Too Much to Process ◦ You’ll Never See It All ◦ The Rules Change ◦ Streaming Adds New Challenges

The Three Vs Volume Velocity Variety

… and I’ll add one more: “Visitation.” This is what we used to call “Exploratory Data Analysis”, but I want to keep up with the “V” thing.

Defining “Big” Volume

“…200,000 magnetic tape reels which represent over 900 billion characters of data”

1975

Exploration is not presentation EXPLORATION:

Learn about the dataset

Explore multiple hypotheses

Manipulate data freely

May be discarded after completion

Examples: Some of Tableau, PowerView, GGPLOT, etc

PRESENTATION:

Communicate a specific view

Constrain interaction

Visual style important

Examples: visual dashboards, data storytelling

Goals Responsive, exploratory visualization

We’re NOT interested in ◦ Pre-cooked datasets and visualizations ◦ Knowing precisely what you plan to look at / do

“the size of the dataset is part of the problem”

Problem Space On one PC, you ◦ Run out of screen to draw each data point [10^6] ◦ Take a long time to look at every data point [10^9] ◦ May not be able to store all the data points [10^12]

Rendering Problem

Scatterplot, axes x and y (at least one pixel per point)

Network Diagram, Parallel Coordinates (individual lines)

II: Hotmap, A Personal Story

One of the most popular spots in the world.

Based on a table with a few billion rows

South Dakota: zoom on the center of the map

How We Know What We Know Building data-based systems for a long time

Interview study with data analysts, published “Interactions with Big Data Analytics”

Building “Big Sky” system (SIGMOD demo as “Stat”) ◦ Integrated Visualizations ◦ Streaming Data Streams

Outline Data Processing Constraints

Data Communication

Data Aggregation

Processing DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA

Solution Space ◦ Work Offline ◦ Index (OLAP, InMems, Nanocubes) ◦ Restrict Data ◦ Sample (or Stream) ◦ Divide & Conquer

ONE-PASS ALGORITHMS Touch each data point once

In a histogram, where does it go? ◦ Categorization is easy. (“Bucket A”) ◦ But … what about other bucketing algorithms?

Database Sketches: one-pass approximations ◦ Standard deviation, mean are fairly easy ◦ “Is the highest value” is very hard ◦ “Falls in the top 10%” isn’t bad
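The "fairly easy" case above, mean and standard deviation in a single pass, can be sketched with Welford's online update. This is a minimal illustration, not code from the talk; the function name `one_pass_mean_std` is an assumption made for this example.

```python
import math
from typing import Iterable, Tuple


def one_pass_mean_std(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample standard deviation), touching each value once.

    Hypothetical helper for illustration; uses Welford's online update so no
    second pass over the data (and no stored copy of it) is needed.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the already-updated mean
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std


if __name__ == "__main__":
    # Works on any iterable, including a generator that is consumed once.
    n, mu, sigma = one_pass_mean_std(float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
    print(n, mu, sigma)
```

The state is three numbers regardless of how many points stream through, which is what makes this kind of statistic cheap compared with order-based questions.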

Two strategies for exploration DIVIDE AND CONQUER

ONLINE QUERY PROCESSING

(Chart: percent complete, up to 100%, against Time; Online vs. Traditional)

The Progressive Pitch

“Trust Me, I'm Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster”
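One way to picture the incremental idea is a running estimate that is reported after every chunk of data, together with an uncertainty band that narrows as more rows are processed. The sketch below is an assumption-laden illustration, not the system described in the paper: the name `progressive_mean`, the chunking scheme, and the normal-approximation 95% interval are all choices made for this example.

```python
import math
import random
from typing import Iterator, Sequence, Tuple


def progressive_mean(
    values: Sequence[float], chunk_size: int, seed: int = 0
) -> Iterator[Tuple[float, float, float]]:
    """Yield (fraction_processed, estimate, ci_half_width) after each chunk.

    Hypothetical sketch: rows are visited in random order so that every prefix
    is a random sample. The interval is a plain normal approximation
    (1.96 * s / sqrt(n)) and ignores the finite-population correction, so it
    is conservative as the fraction approaches 100%.
    """
    order = list(range(len(values)))
    random.Random(seed).shuffle(order)
    n, mean, m2 = 0, 0.0, 0.0
    for start in range(0, len(order), chunk_size):
        for idx in order[start:start + chunk_size]:
            x = values[idx]
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
        yield n / len(values), mean, 1.96 * std / math.sqrt(n)


if __name__ == "__main__":
    data = [float(i % 100) for i in range(10_000)]
    for fraction, estimate, half_width in progressive_mean(data, chunk_size=1_000):
        print(f"{fraction:5.0%}  {estimate:7.3f} +/- {half_width:.3f}")
```

An analyst can stop as soon as the band is narrow enough for the decision at hand, instead of waiting for the full scan.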

Design Constraints for Visualizing Big Data ◦ There’s Too Much to Process ◦ You’ll Never See It All ◦ The Rules Change ◦ Streaming is Hard

Fallback: Reservoir Sampling Streaming sample: keep a sample of k elements of the data such that each element has a k/size chance of being in the reservoir.
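The property stated above (every element ends up in the reservoir with probability k/size) is what the classic "Algorithm R" provides. The following is a minimal sketch of that algorithm; the function name `reservoir_sample` and the optional `rng` parameter are assumptions introduced for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream of unknown length.

    Hypothetical helper for illustration (Algorithm R). After i + 1 items have
    been seen, each of them is in the reservoir with probability k / (i + 1).
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill phase: keep the first k items
        else:
            j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
            if j < k:
                reservoir[j] = item  # replace a current member with probability k / (i + 1)
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(42))
    print(sample)
```

The stream is read once and only k items are ever held in memory, so the same code works whether the data is a file, a cursor, or a live feed.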

Raw data -> Relevant dimensions -> Apply buckets on dimensions -> Filter data -> Aggregate data -> Create shapes -> Assign scales to shapes -> Render to screen

DISPLAY DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA

Solution Space AGGREGATE: One visual point represents multiple data points. SAMPLE: Show only some of the dataset.

THINK AGGREGATION

Bar Chart -> Histogram

Points on a Map, Scatterplot -> 2D Histogram, Heatmap

Line Chart -> Approx Line Chart

Parallel Coordinates -> Area Parallel Coordinates

What about network diagram?
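As an illustration of the scatterplot-to-2D-histogram mapping, the sketch below bins points into a fixed grid so that the amount drawn depends on the grid size rather than on the number of points. The function name `bin_points` and its parameters are assumptions made for this example, and the bounds are taken as known in advance.

```python
from typing import Iterable, List, Tuple


def bin_points(
    points: Iterable[Tuple[float, float]],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    nx: int,
    ny: int,
) -> List[List[int]]:
    """Count points per cell of an ny-by-nx grid (rows index y, columns index x).

    Hypothetical helper for illustration. Points outside the given ranges are
    skipped; points exactly on the upper bound fall into the last cell.
    """
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        col = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        row = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 1.0), (2.0, 2.0)]
    for row in bin_points(pts, (0.0, 1.0), (0.0, 1.0), nx=2, ny=2):
        print(row)
```

Each cell count can then be mapped to a color, which gives the heatmap form; the binning itself is a one-pass operation over the points.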

SAMPLING: You’ll never know it all

TASKS

Find extreme

Compare bars

Bar to constant

Bar to range

Order (top-K)

SAMPLING: Probabilistic Views “Sample-Oriented Task-Driven Visualizations: Allowing Users to Make Better, More Confident Decisions”

Design Goals

Easy to interpret

Consistency across tasks

Spatial Stability

Minimize Visual Noise (overhead)

“Is Bar A > Bar B”
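To make the question concrete: when each bar is an average estimated from a sample, the comparison can be answered with a probability rather than a yes or no. The sketch below uses a simple normal approximation on the difference of two sample means. It is an illustrative assumption, not necessarily the method used in the cited paper, and the name `prob_a_greater_than_b` is hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def prob_a_greater_than_b(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Approximate P(true mean of A > true mean of B) from two samples.

    Hypothetical helper for illustration. Assumes independent samples with at
    least two values each and treats the difference of sample means as
    normally distributed.
    """
    se_a = stdev(sample_a) / math.sqrt(len(sample_a))
    se_b = stdev(sample_b) / math.sqrt(len(sample_b))
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    diff = mean(sample_a) - mean(sample_b)
    if se_diff == 0.0:
        # No observed variation: fall back to a direct comparison.
        return 1.0 if diff > 0 else (0.0 if diff < 0 else 0.5)
    z = diff / se_diff
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z


if __name__ == "__main__":
    bar_a = [10.0, 11.0, 12.0, 13.0]
    bar_b = [9.0, 11.5, 12.5, 10.0]
    print(prob_a_greater_than_b(bar_a, bar_b))
```

A display built on this could show the probability directly next to the pair of bars, so the reader does not have to judge overlapping error bars by eye.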

Other Tasks

Find extreme Compare to value Compare to Range

Design Problems DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA

1 IS VERY DIFFERENT FROM 0 When the Y axis goes all the way to very high values, it’s still very interesting to know which values are possible

STREAMING MEANS YOUR WORLD CAN CHANGE

Categorical -> too many categories! Numerical -> changing bounds Any color map or scale can change

STREAMING, STORING, SENDING

Implications for interaction, for updates Care a lot about changes that are server-side only vs client-only. Change color, change height scale … vs change bucket size. [Research opportunity: what are the tradeoffs of different models?]

Disk -> Data -> Aggregate -> Shapes -> Render -> Screen

Network? (D3)

Network? (Tableau Public)

Network? (SVG)

Hard to Do Research This isn’t the way SQL works today

You don’t want to stand up a Hadoop cluster yourself—and it’s a whole other skillset.

You can approximate: ◦ repeat medium-sized data over and over? ◦ generate data based on a model?

Conclusion Design Constraints for Visualizing Big Data ◦ There’s Too Much to Process ◦ You’ll Never See It All ◦ The Rules Change ◦ Streaming Adds New Challenges
