Visualization of Big Data - Chalmers · Visualization of Big Data . DANYEL FISHER, MICROSOFT...
-
Visualization of Big Data DANYEL FISHER, MICROSOFT RESEARCH
-
Contents
Big Data & Visualization Overview
Background, and How We Know What We Know
Design Constraints for Visualizing Big Data ◦ There’s Too Much to Process ◦ You’ll Never See It All ◦ The Rules Change ◦ Streaming Adds New Challenges
-
The Three Vs Volume Velocity Variety
… and I’ll add one more: “Visitation.” This is what we used to call “Exploratory Data Analysis”, but I want to keep up with the “V” thing.
-
Defining “Big” Volume
“…200,000 magnetic tape reels which represent over 900 billion characters of data”
1975
-
Exploration is not presentation EXPLORATION:
Learn about the dataset
Explore multiple hypotheses
Manipulate data freely
May be discarded after completion
Examples: Some of Tableau, PowerView, GGPLOT, etc
PRESENTATION:
Communicate a specific view
Constrain interaction
Visual style important
Examples: visual dashboards, data storytelling
-
Goals Responsive, exploratory visualization
We’re NOT interested in ◦ Pre-cooked datasets and visualizations ◦ Knowing precisely what you plan to look at / do
-
“the size of the dataset is part of the problem”
-
Problem Space On one PC, you ◦ Run out of screen to draw each data point [10^6] ◦ Take a long time to look at every data point [10^9] ◦ May not be able to store all the data points [10^12]
-
Rendering Problem
Scatterplot (at least one pixel per point)
Network Diagram, Parallel Coordinates (individual lines)
-
II: Hotmap, A Personal Story
-
One of the most popular spots in the world.
Based on a table with a few billion rows
-
South Dakota: zoom on the center of the map
-
How We Know What We Know Building data-based systems for a long time
Interview study with data analysts, published “Interactions with Big Data Analytics”
Building “Big Sky” system (SIGMOD demo as “Stat”) ◦ Integrated Visualizations ◦ Streaming Data Streams
-
Outline Data Processing Constraints
Data Communication
Data Aggregation
-
Processing DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA
-
Solution Space ◦ Work Offline ◦ Index (OLAP, InMems, Nanocubes) ◦ Restrict Data ◦ Sample (or Stream) ◦ Divide & Conquer
-
ONE-PASS ALGORITHMS Touch each data point once
In a histogram, where does it go? ◦ Categorization is easy. (“Bucket A”) ◦ But … what about other bucketing algorithms?
Database Sketches: one-pass approximations ◦ Standard deviation, mean are fairly easy ◦ “Is the highest value” is very hard ◦ “Falls in the top 10%” isn’t bad
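The "mean and standard deviation are fairly easy" point can be illustrated with a minimal sketch of Welford's one-pass update. The class name `RunningStats` and the sample values are assumptions made for illustration; they do not come from the talk.

```python
import math


class RunningStats:
    """One-pass mean and standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        # Each data point is touched exactly once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; reported as 0.0 with fewer than two points.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # illustrative values
    stats.add(value)
print(stats.mean, stats.std())
```

Only three numbers are kept in memory regardless of how many points stream through, which is what makes this kind of statistic cheap at scale.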
-
Two strategies for exploration DIVIDE AND CONQUER
ONLINE QUERY PROCESSING
[Figure: progress toward 100% over time, Online vs. Traditional]
-
The Progressive Pitch
“Trust Me, I’m Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster”
Design Constraints for Visualizing Big Data ◦ There’s Too Much to Process ◦ You’ll Never See It All ◦ The Rules Change ◦ Streaming is Hard
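One way to picture the incremental idea is an estimate that is refreshed after every batch of rows, together with an error bar that narrows as more data arrives. The sketch below is an assumption-laden illustration, not the system described in the paper: it assumes rows arrive in random order, uses a normal approximation for the 95% interval, and runs on synthetic data. The function name `progressive_mean` is hypothetical.

```python
import math
import random


def progressive_mean(rows, batch_size, z=1.96):
    """Yield (rows_seen, estimate, half_width) after each full batch.

    Assumes rows arrive in random order, so the prefix seen so far
    behaves like a random sample of the whole dataset.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in rows:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % batch_size == 0 and n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err


rng = random.Random(1)
data = [rng.gauss(50.0, 10.0) for _ in range(20_000)]  # synthetic stand-in data
updates = list(progressive_mean(data, batch_size=5_000))
for seen, estimate, half_width in updates:
    print(f"{seen:>6} rows: {estimate:.2f} +/- {half_width:.2f}")
```

An analyst can stop as soon as the interval is tight enough for the decision at hand, instead of waiting for the full pass to finish.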
-
Fallback: Reservoir Sampling Streaming sample: keep a sample of k elements of the data such that each element has a k/size chance of being in the reservoir, where size is the number of elements seen so far
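A minimal sketch of reservoir sampling, using the classic "Algorithm R" formulation. The function name, the `rng` parameter, and the example stream are illustrative assumptions; the slide itself does not specify which variant was used.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=10, rng=random.Random(0))
print(sample)
```

The stream is read once and only k items are held in memory, so the sample is always ready to visualize no matter how much data has gone by.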
-
Raw data -> Relevant dimensions -> Apply buckets on dimensions -> Filter data -> Aggregate data -> Create shapes -> Assign scales to shapes -> Render to screen
-
DISPLAY DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA
-
Solution Space
AGGREGATE: One visual point represents multiple data points
SAMPLE: Show only some of the dataset
-
THINK AGGREGATION Bar Chart -> Histogram
Points on a Map, Scatterplot -> 2D Histogram, Heatmap
Line Chart -> Approx Line Chart
Parallel Coordinates -> Area Parallel Coordinates
What about network diagram?
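The scatterplot-to-2D-histogram substitution above can be sketched as a simple grid binning step. The function name `bin_points`, the square grid, and the sample points are assumptions for illustration only.

```python
from collections import Counter


def bin_points(points, x_range, y_range, bins):
    """Aggregate (x, y) points into a bins-by-bins grid of counts."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the plotted area
        # Clamp so that points exactly on the upper edge land in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        counts[(col, row)] += 1
    return counts


points = [(0.1, 0.1), (0.15, 0.2), (0.9, 0.9), (1.0, 1.0), (0.5, 0.5)]
grid = bin_points(points, (0.0, 1.0), (0.0, 1.0), bins=4)
print(dict(grid))
```

The number of shapes to draw is now bounded by the number of grid cells rather than the number of data points, and each cell count can be mapped to a color for a heatmap.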
-
SAMPLING: You’ll never know it all
TASKS
Find extreme
Compare bars
Bar to constant
Bar to range
Order (top-K)
-
SAMPLING: Probabilistic Views “Sample-Oriented Task-Driven Visualizations: Allowing Users to Make Better, More Confident Decisions”
Design Goals
Easy to interpret
Consistency across tasks
Spatial Stability
Minimize Visual Noise (overhead)
-
“Is Bar A > Bar B”
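One hedged way to answer this question from samples is to estimate the probability that the true mean behind Bar A exceeds the one behind Bar B. The sketch below uses a normal approximation on the difference of sample means; this is an assumption chosen for illustration and is not necessarily the method used in the cited paper. The function name and sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def prob_a_greater(sample_a, sample_b):
    """Approximate P(true mean of A > true mean of B) from two samples."""
    se_a = stdev(sample_a) / math.sqrt(len(sample_a))
    se_b = stdev(sample_b) / math.sqrt(len(sample_b))
    diff = mean(sample_a) - mean(sample_b)
    spread = math.sqrt(se_a ** 2 + se_b ** 2)
    if spread == 0:
        # No variation in either sample: the comparison is deterministic.
        return 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)
    return NormalDist().cdf(diff / spread)


sample_a = [10, 12, 11, 13, 12]  # illustrative sampled values for Bar A
sample_b = [9, 10, 8, 11, 9]     # illustrative sampled values for Bar B
p = prob_a_greater(sample_a, sample_b)
print(f"P(A > B) is roughly {p:.3f}")
```

A display built on this number can show the answer to the task directly, rather than asking the viewer to compare two overlapping error bars by eye.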
-
Other Tasks
Find extreme Compare to value Compare to Range
-
Design Problems DESIGN CONSTRAINTS FOR VISUALIZING BIG DATA
-
1 IS VERY DIFFERENT FROM 0 When the Y axis goes all the way to very high values, it’s still very interesting to know which values are possible
-
STREAMING MEANS YOUR WORLD CAN CHANGE
◦ Categorical -> too many categories! ◦ Numerical -> changing bounds ◦ Any color map or scale can change
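The "changing bounds" problem for numerical data can be sketched with a histogram that widens its buckets when a value arrives beyond the current range. This is one possible strategy under stated assumptions (non-negative values, an even number of buckets, a range that only grows upward); the class name `GrowingHistogram` is hypothetical.

```python
class GrowingHistogram:
    """Fixed number of buckets; bucket width doubles when the range is exceeded."""

    def __init__(self, num_buckets=8, width=1.0):
        if num_buckets % 2 != 0:
            raise ValueError("num_buckets must be even so pairs can be merged")
        self.counts = [0] * num_buckets
        self.width = width

    def add(self, x):
        if x < 0:
            raise ValueError("this sketch assumes non-negative values")
        # Widen the buckets until the new value fits inside the covered range.
        while x >= self.width * len(self.counts):
            self._double_width()
        self.counts[int(x // self.width)] += 1

    def _double_width(self):
        n = len(self.counts)
        # Merge adjacent pairs, then pad with empty buckets to keep n buckets.
        merged = [self.counts[i] + self.counts[i + 1] for i in range(0, n, 2)]
        self.counts = merged + [0] * (n - len(merged))
        self.width *= 2


hist = GrowingHistogram(num_buckets=4, width=1.0)
for x in [0.5, 1.5, 2.5, 3.5, 7.0]:
    hist.add(x)
print(hist.width, hist.counts)
```

Every rescale changes the meaning of each bar on screen, which is exactly the kind of shift a streaming visualization has to communicate to the viewer.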
-
STREAMING, STORING, SENDING
Implications for interaction, for updates Care a lot about changes that are server-side only vs client-only. Change color, change height scale … vs change bucket size. [Research opportunity: what are the tradeoffs of different models?]
Disk -> Data -> Aggregate -> Shapes -> Render -> Screen
Network? (D3)
Network? (Tableau Public)
Network? (SVG)
-
Hard to Do Research This isn’t the way SQL works today
You don’t want to stand up a Hadoop cluster yourself—and it’s a whole other skillset.
You can approximate: ◦ repeat medium-sized data over and over? ◦ generate data based on a model?
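The "generate data based on a model" option can be sketched as a lazy generator that produces rows on demand, so a test dataset of any size never has to be stored. The category names, weights, and the log-normal value distribution are all assumptions picked for illustration.

```python
import random


def synthetic_rows(n, seed=0):
    """Lazily yield n (category, value) rows drawn from a simple model."""
    rng = random.Random(seed)
    categories = ["A", "B", "C"]  # hypothetical category labels
    weights = [0.7, 0.2, 0.1]     # skewed on purpose, like much real data
    for _ in range(n):
        category = rng.choices(categories, weights=weights)[0]
        value = rng.lognormvariate(0.0, 1.0)  # long-tailed positive values
        yield category, value


rows = list(synthetic_rows(1000))
print(rows[:3])
```

Because the generator is seeded, the same "big" dataset can be replayed across experiments without keeping it on disk.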
-
Conclusion Design Constraints for Visualizing Big Data ◦ There’s Too Much to Process ◦ You’ll Never See It All ◦ The Rules Change ◦ Streaming Adds New Challenges