CS B :: Oct 2009 Text Visualization · 10/28/2009 · 1 CS448B:: 28 Oct 2009 Text Visualization...

CS448B :: 28 Oct 2009

Text Visualization

Jeffrey Heer Stanford University

Final Project

Project: Visualizing a Century of Change in the American West and Generating Social Sense-Making through Media Partnerships

f d d f d hData: 100 years of Census data, data from iMapData.com, and other sources.

Description: Explore a century of change in the American West by visualizing 100 years of data on the some of most compelling, urgent, contemporary issues — economy, health and health care, environment, education, religion, communications, energy, and transportation. Stimulate public discussion by revealing patterns and processes and generate widespread public participation in social sense-making by publishing the results with our media partners, including The New York Times, Wall Street Journal, Washington Post and Marketplace on National Public Radio Work in a collaborativeWashington Post, and Marketplace on National Public Radio. Work in a collaborative setting with a ready-made team of scholars at the Bill Lane Center for the American West (Jon Christensen, David Kennedy, Michael DeAlessi), the Communication Department's Journalism Program (Ann Grimes, Fred Turner), Knight Journalism Fellowship (Krissy Clark), and Tooling Up for Digital Histories graduate student researchers (Yuankai Ge, Dan Chang). Study the results as a real-world experiment in data visualization, social sense-making, and human-computer interactions.

http://ruralwest.stanford.edu/cgi-bin/RuralWestPop/

Final ProjectDesign a new visualization technique or system

Implementation of new design or systemImplementation of new design or system8-10 page paper in conference paper format2 Project Presentations

ScheduleProject Proposal: Wednesday, November 4 (end of day)Initial Presentation: Monday, November 9 & Wednesday, November 11Poster Presentation Wednesday December 2 (tentative)Poster Presentation: Wednesday, December 2 (tentative)Final Papers: Monday, December 7 (by noon)

LogisticsGroups of up to 3 people, graded individuallyClearly report responsibilities of each member

Assign 3: Interactive Visualization

Create an interactive visualization application. Choose a data domain and select an appropriate visualization techniquedomain and select an appropriate visualization technique.

1. Choose a data set and storyboard your interface2. Implement the interface using tools of your choice3. Submit your application and produce a final write-up

You may work individually or in groups of 2.y y g pDue by end of day on Wednesday, October 28

Text Visualization Why visualize text?

Visualization PopularityService

Many‐Eyes SwivelTag Cloud

Bubble GraphWord TreeBar Chart

MapsNetwork Diagram

TreemapMatrix ChartLine GraphScatterplot

Stacked GraphPie Chart

Over 1/3 of Many-Eyes visualizations use free text

0.0 0.1 0.2 0.3 0.4 0.5

Percentage

0.0 0.1 0.2 0.3 0.4 0.5

Percentage

Pie ChartHistogram

What is text data?

DocumentsA ti l b k d lArticles, books and novelsComputer programsE-mails, web pages, blogsTags, comments

Collection of documentsMessages (e-mail, blogs, tags, comments)Social networks (personal profiles)Academic collaborations (publications)

Challenge: Visualize Dissertations

You have 20 years of university Ph.D. theses:Text, Year, Dept., Author, Advisor, Committee

What questions might you want to answer?What visualizations might help?

A Concrete Example

What would help you gauge…

The topics in the document?Whether or not you should read it?yIts relationship to other documents?

Tag Cloud: Word Counts Word Tree: Word Sequences

PhraseNet: “A the B”MIMIR – Jason Chuang

Topics

Text as DataVisualizing Document ContentEvolving DocumentsVisualizing ConversationDocument Collections

Text as Data

Words are (not) nominal?

High dimensional (10,000+)More than equality testsWords have meanings and relations

Correlations: Hong Kong, San Francisco, Bay Area

Order: April, February, January, June, March, May

M b hiMembership: Tennis, Running, Swimming, Hiking, Piano

Hierarchy, antonyms & synonyms, entities, …

Text Processing Pipeline

Tokenization: segment text into termsSpecial cases? e.g., “San Francisco”, “L’ensemble”, “U.S.A.”Remove stop words? e.g., “a”, “an”, “the”, “to”, “be”?

Stemming: one means of normalizing termsReduce terms to their “root”; Porter’s algorithm for Englishe g automate(s) automatic automation all map to automate.g., automate(s), automatic, automation all map to automatFor visualization, want to reverse stemming for labels

Simple solution: map from stem to the most frequent word

Result: ordered stream of terms

The Bag of Words Model

Ignore ordering relationships within the text

A document ≈ vector of term weightsEach dimension corresponds to a term (10,000+)Each value represents the relevance

For example, simple term counts

Aggregate into a document x term matrixDocument vector space model

Document x Term matrix

Each document is a vector of term weightsSimplest weighting is to just count occurrences

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 157 73 0 0 0 0

Brutus 4 157 0 1 0 0

Caesar 232 227 0 2 1 1

Calpurnia 0 10 0 0 0 0

Cleopatra 57 0 0 0 0 0

mercy 2 0 3 5 5 1

worser 2 0 1 1 1 0

WordCount (Harris 2004)

http://wordcount.org

Weaknesses of Tag Clouds

Sub-optimal visual encoding (size vs. position)Inaccurate size encoding (long words are bigger)May not facilitate comparison (unstable layout)Term frequency may not be meaningfulDoes not show the structure of the text

Keyword Weighting

Term Frequencytf t(t) i dtftd = count(t) in dCan take log frequency: log(1 + tftd)Can normalize to show proportion: tftd / Σt tftd

Keyword Weighting

Term Frequencytf t(t) i dtftd = count(t) in d

TF.IDF: Term Freq by Inverse Document Freqtf.idftd = log(1 + tftd) × log(N/dft)dft = # docs containing t; N = # of docs

Keyword Weighting

Term Frequencytf t(t) i dtftd = count(t) in d

TF.IDF: Term Freq by Inverse Document Freqtf.idftd = log(1 + tftd) × log(N/dft)dft = # docs containing t; N = # of docs

G2: Probability of different word frequencyG2: Probability of different word frequencyE1 = |d| × (tftd + tft(C-d)) / |C|E2 = |C-d| × (tftd + tft(C-d)) / |C|G2 = 2 × (tftd log(tftd/E1) + tft(C-d) log(tft(C-d)/E2))

Doc. Similarity & Clustering

In vector model, compute distance among docsFor TF.IDF, typically cosine distanceSimilarity measure can be used to cluster

Topic modeling approachesAssume documents are a mixture of topicsTopics are (roughly) a set of co-occurring termsLatent Semantic Analysis (LSA): reduce term matrixLatent Dirichlet Allocation (LDA): statistical model

Visualizing Document Content

TileBars [Hearst]

Visual Thesaurus [ThinkMap]

Concordance

What is the common local context of a term?

WordTree (Wattenberg et al)

Filter infrequent runs Recurrent themes in speech

Glimpses of structure

Concordances show local, repeated structureBut what about other types of patterns?

For example Lexical: <A> at <B> Syntactic: <Noun> <Verb> <Object>Syntactic: <Noun> <Verb> <Object>

Phrase Nets [van Ham et al]

Look for specific linking patterns in the text:‘A and B’, ‘A at B’, ‘A of B’, etcCould be output of regexp or parser.

Visualize extracted patterns in a node-link viewOccurrences Node sizePattern position Edge direction

Portrait of the Artist as a Young ManX and Y

Node Grouping The BibleX begat Y

Pride & PrejudiceX at YLexical Parser, < 1sec running time

Pride & PrejudiceX at YSyntactic Parser, > 24 hours running time

18th & 19th Century NovelsX’s Y

X of Y

Evolving Documents

Visualizing Revision History

How to depict contributions over time?

Example: Wikipedia history log

Animated Traces (Ben Fry)

http://benfry.com/traces/

Diff History Flow (Viégas et al)

Wikipedia History Flow (IBM)

Visualizing Conversation

Many dimensions to consider:Who (senders, receivers)What (the content of communication)When (temporal patterns)

Interesting cross-products:What x When Topic “Zeitgeist”Who x Who Social networkWho x Who x What x When Information flow

Usenet Visualization (Viégas & Smith)

Show correspondence patterns in text forumsInitiate vs. reply; size and duration of discussion

Newsgroup crowds / Authorlines

Mountain (Viégas)

Conversation by person over time (who x when).

Themail (Viégas et al)

One person over time, TF.IDF weighted terms

Enron E-Mail Corpus

Washington Lobbyist ?Washington Lobbyist ? Visualizing Document Collections

NewsMap: Google News Treemap (Marcos Weskamp)

10 x 10 News Map (Harris 2004)

Named Entity Recognition

Identify and classify named entities in text:John Smith PERSONSoviet Union COUNTRY353 Serra St ADDRESS(555) 721-4312 PHONE NUMBER

Entity relations: how do the entities relate?Simple approach: do they co-occur in a small window of text?

ThemeRiver (Havre et al 99)

0.0035

Statistical Machine Learning in Pubmed

supervised machine learning

probabilistic reasoning

dimensionality / kernels

l i / i il i

0.0015

0.0025

clustering / similarity

bayesian learning

0.0005

1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008

Track topic strengths over time

Parallel Tag Clouds (Collins et al 09) Natural Science and Engineering(Ellingham 1948)

Manual analysis, ylayout, and labeling

TreemapCentral disciplinesSub-disciplinesTopics

H. J. T. Ellingham 1948Royal Society Scientific Information Conference

Topics

Geography of Science (Small et al. 1985)

Citation analysisHierarchy clusteringLevels of detailsEdge labeling

Henry Small, Eugene Garfield 1985The geography of science, Journal of Information Science

Scholarly Communication (Sandstrom 2001)

Hand-picked authors, citation analysisClustering by authorsManual cluster & dimension labeling

Pamela Sandstrom 2001Scholarly communication as a socio-ecological system, Scientometrics

Melanoma Research (Boyack 2004)

Papers, genes, proteinsLandmark labeling

(manual)

Kevin Boyack, Ketan Mane, Katy Börner 2004Mapping medline papers, genes, and proteins related to

Melanoma research, IV2004 Conference

Scholarly Impact (Rosvall et al. 2007)

M. Rosvall, C. T. Bergstrom 2007Map of random walks on complex network

reveal community structure, PNAS

MIMIR – Jason Chuang

CS B :: Oct 2009 Text Visualization · 10/28/2009 · 1 CS448B:: 28 Oct 2009 Text Visualization...

Documents

Transcript of CS B :: Oct 2009 Text Visualization · 10/28/2009 · 1 CS448B:: 28 Oct 2009 Text Visualization...

Combining automated analysis and visualization …isg.cs.uni-magdeburg.de/visual/files/publications/2009/...Combining automated analysis and visualization techniques for effective

Topological Visualization of Brain Diffusion MRI Datapeople.kyb.tuebingen.mpg.de/tschultz/papers/schultz-dti... · 2009-10-31 · Topological Visualization of Brain Diffusion MRI

CDS 301 Fall, 2009 Vector Visualization Chap. 6 October 7, 2009 Jie Zhang Copyright ©

3RVLWLRQDO 'UXP 1RWDWLRQ - Stanford Universitystanford.edu/~mishel/CS448B/PositionalDrumNotation... · 2019-07-23 · 5hdgdelolw\ ä

CS B Nov 2011 Text Visualization - Stanford University · 1 CS448B:: 17 Nov 2011 Text Visualization Jason Chuang Stanford University Why visualize text? Why visualize text? Understanding–

Visualization and Analysis Activities May 19, 2009

Direct Visualization of Fiber Information by Coherencehamann/HlawitschkaGarthScheu... · 2009. 3. 18. · Direct Visualization of Fiber Information by Coherence Mario Hlawitschka

The Future of Data Visualization - RIAdventure 2009

How much data (bytes) did we produce in 2010?hci.stanford.edu/courses/cs448b/f12/lectures/CS448B...1 CS448B:: 25 Sep 2012 The Value of Visualization Jeffrey Heer Stanford University

Visualization: Basics of Rendering in AutoCAD 2009 - aapp

CDS 301 Fall, 2009 Scalar Visualization Chap. 5 September 24, 2009 Jie Zhang Copyright ©

Sep 2009 The Value of Visualization - Stanford University › courses › cs448b › f09 › ... · Assignment 2: Exploratory Data Analysis (15%) Assignment 3Assignment 3: Creating

NVIDIA Quadro Plex Scalable Visualization Solutions July 2009.

Programming Tutorial: Flare Visualization LibraryProgramming Tutorial: Flare Visualization Library CS448b Data Visualization Jason Chuang January 30, 2008

CS448B Value of Visualization - University of Washingtoncourses.cs.washington.edu/...ValueOfVisualization.pdf · The ability to take data—to be able to understand it, to process

Visualization Update June 18, 2009 Kelly Gaither, GIG Area Director DV.

Visualization (Re-)Design - Stanford HCI Grouphci.stanford.edu/cs448b/w09/lectures/20090114-Discussion.pdf · Visualization (Re-)Design Jeffrey Heer assistant: Jason Chuang Assignment

Parallel Visualization on Leadership Computing Resourcesvis.unl.edu/~yu/homepage/publications/paper/2009... · Parallel Visualization on Leadership Computing Resources One approach

interaction - SNUvplab.snu.ac.kr/lectures/09-2/infovis/notes/interaction.pdf · · 2009-10-27Information Visualization Fall 2009 JinwookSeo SNU CSE ... Information Visualization:

sayito.comsayito.com/files/iluna_portfolio.pdf · 3d model, Visualization, Installation Summer 2010 . Loop House Borden Partnership Visualization Summer 2009 . Caltrans District 7