Visualization of Heterogeneous Data

Mike Cammarano, Xin (Luna) Dong

Bryan Chan, Jeff Klingner, Justin Talbot, Alon Halevy

Pat Hanrahan

Homogeneous data is easy.

Company     Founded   Headquarters       Logo
Microsoft   1975      47.6 N, 122.1 W    (image)
Enron       1985      29.7 N, 95.3 W     (image)
Google      1998      37.4 N, 122.0 W    (image)

[Figure: timeline from 1970 to 2000 with the three companies placed at their founding years: 1975, 1985, and 1998]

Multiple sources?

• Collaborative content

• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848 ...
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]...

DBpedia.org

According to DBpedia.org:

• DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.

• The DBpedia dataset currently provides information about more than 1.95 million “things”, including at least:
  • 80,000 persons
  • 70,000 places
  • 35,000 music albums
  • 12,000 films

Database size

We use a subset of DBpedia, mostly infoboxes and geonames.

• 30 M triples
• 2.5 GB

We currently use an in-memory database.

The hardware is a dual-processor machine with dual-core AMD Opteron 280s and 8 GB of RAM.

A glimpse inside DBpedia

Kerry: dbp:PLACE_OF_BIRTH → dbp:latitude → 39° 41´ 45˝ N

Poe: dbp:birth_place → w3c:owl#sameAs → geonames:latitude → 42.358403

Heterogeneity

• Types
  • Decimal vs. sexagesimal coordinates: 39° 41´ 45˝ N vs. 39.70

• Names
  • PLACE_OF_BIRTH vs. birth_place

• Paths
  • dbp:PLACE_OF_BIRTH → dbp:latitude
    vs. dbp:birth_place → w3c:owl#sameAs → geonames:latitude
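The type mismatch above can be made concrete with a small conversion. This is only an illustrative sketch, not part of the system described in the talk: the function name `sexagesimal_to_decimal` and the parsing rules are assumptions, and real DBpedia values may be formatted in other ways.

```python
import re


def sexagesimal_to_decimal(text: str) -> float:
    """Convert a string such as "39° 41´ 45˝ N" to signed decimal degrees.

    Assumes exactly three numbers (degrees, minutes, seconds) and one
    hemisphere letter; anything else raises ValueError.
    """
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    hemisphere = re.search(r"[NSEW]", text.upper())
    if len(numbers) != 3 or hemisphere is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds = (float(n) for n in numbers)
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by the usual convention.
    return -value if hemisphere.group() in "SW" else value


print(round(sexagesimal_to_decimal("39° 41´ 45˝ N"), 2))
```

For the value on the slide, 39 + 41/60 + 45/3600 ≈ 39.696, which rounds to the 39.70 shown.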

Scenario / Demo

Vision: Self-configuring data

Contributions

• Visualize heterogeneous data represented as a graph of relationships between objects

• Describe inputs to a visualization:
  • Visualization template
  • Set of keywords per attribute

• Find attributes needed for a visualization by searching paths
  • Within an iterative process of search, visualization, and refinement

• Present algorithm for finding and ranking paths based on keywords
  • Efficiently enumerate paths
    • A*
    • Random sampling
  • Rank according to:
    • Keywords
    • Heuristics about graph structure

Integrate searching and visualization

[Figure: a cycle of three steps: search for potentially desirable paths, visualize results in context, refine path selections]

Matching problem

• Find the best path to a number for “state latitude”

[Figure: example graph around the senators Dianne Feinstein and Harry Reid, with edges labeled state, capital, latitude, pop, birthplace, spouse, party, house, leadername, color, governor, and children, ending in literals such as 42.4, 6349000, 39.0, blue, and 4]

Candidate paths:

state.capital.latitude

state.pop

spouse.birth_place.latitude

state.governor.children

Basic algorithm

• Find the best path to a number for “state latitude”

1. Explore graph

2. Find paths ending in a number

3. Score and rank paths using TF/IDF

[Figure: the example graph, with the four paths that end in a number annotated with scores of 0.8, 0.5, 0.6, and 0.5]
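The three steps can be sketched on a toy graph. Everything here is an assumption made for illustration: the node names, the dictionary representation, the tokenization, and the particular TF/IDF weighting are not taken from the talk, so the scores will not match the ones on the slide. Only the edge labels and literal values are borrowed from the example graph.

```python
import math
from collections import Counter

# Hypothetical toy graph: node -> list of (edge label, target).
# Targets that are numbers or unknown strings are treated as literals.
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node"),
                ("party", "party_node")],
    "state_node": [("capital", "capital_node"), ("pop", 6349000),
                   ("governor", "governor_node")],
    "capital_node": [("latitude", 42.4)],
    "spouse_node": [("birth_place", "birthplace_node")],
    "birthplace_node": [("latitude", 39.0)],
    "governor_node": [("children", 4)],
    "party_node": [("color", "blue")],
}


def numeric_paths(graph, start, max_depth=3):
    """Steps 1 and 2: explore the graph, yield (labels, value) for paths ending in a number."""
    stack = [(start, ())]
    while stack:
        node, labels = stack.pop()
        for label, target in graph.get(node, []):
            path = labels + (label,)
            if isinstance(target, (int, float)):
                yield path, target
            elif len(path) < max_depth:
                stack.append((target, path))


def tokens(path):
    """Split edge labels such as birth_place into lowercase terms."""
    return [term for label in path for term in label.lower().split("_")]


def rank_paths(paths, keywords):
    """Step 3: score each path against the keywords with a simple TF/IDF."""
    docs = [tokens(path) for path, _ in paths]
    n = len(docs)

    def idf(term):
        df = sum(term in doc for doc in docs)
        return math.log((1 + n) / (1 + df)) + 1  # smoothed, always positive

    scored = []
    for (path, value), doc in zip(paths, docs):
        tf = Counter(doc)
        score = sum(tf[k] / len(doc) * idf(k) for k in keywords)
        scored.append((score, ".".join(path), value))
    return sorted(scored, reverse=True)


for score, path, value in rank_paths(list(numeric_paths(GRAPH, "senator")),
                                     ["state", "latitude"]):
    print(f"{score:.2f}  {path} = {value}")
```

With these assumptions, state.capital.latitude ranks first for the keywords “state latitude”, because it is the only path that matches both terms.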

Improving execution time

• New pruning techniques since the paper submission
  • A*
  • Bidirectional search on terms
  • Random sampling

Pruning techniques

• Most paths do not correspond to a “state latitude”
  • How can we avoid such bad paths?

[Figure: the example graph with three kinds of bad paths marked: “No mention of latitude”, “Many unrelated terms”, and “No potential paths”]

Pruning techniques / A* Search

• Use a scoring function that penalizes unrelated terms
  • Then an A* search ignores paths with many such terms

[Figure: the example graph with one branch marked “Many unrelated terms”]
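A minimal sketch of penalty-driven pruning follows. It is not the scoring function from the talk: the penalty (one point per edge-label term that is not a keyword), the budget `max_penalty`, and the toy graph are all assumptions. The heuristic is taken to be zero, so this A*-style search reduces to uniform-cost search over the penalty.

```python
import heapq
from itertools import count

# Hypothetical toy graph: node -> list of (edge label, target).
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node"),
                ("party", "party_node")],
    "state_node": [("capital", "capital_node"), ("pop", 6349000),
                   ("governor", "governor_node")],
    "capital_node": [("latitude", 42.4)],
    "spouse_node": [("birth_place", "birthplace_node")],
    "birthplace_node": [("latitude", 39.0)],
    "governor_node": [("children", 4)],
    "party_node": [("color", "blue")],
}


def pruned_numeric_paths(graph, start, keywords, max_penalty=1, max_depth=4):
    """Best-first search that drops paths with too many unrelated terms.

    Returns (results, edges_examined), where results is a sorted list of
    (penalty, path, value) for surviving paths that end in a number.
    """
    keywords = {k.lower() for k in keywords}
    tie = count()  # tie-breaker so the heap never compares nodes
    frontier = [(0, next(tie), start, ())]
    results, examined = [], 0
    while frontier:
        penalty, _, node, labels = heapq.heappop(frontier)
        for label, target in graph.get(node, []):
            examined += 1
            unrelated = sum(t not in keywords for t in label.lower().split("_"))
            new_penalty = penalty + unrelated
            if new_penalty > max_penalty:
                continue  # pruned: too many unrelated terms on this path
            path = labels + (label,)
            if isinstance(target, (int, float)):
                results.append((new_penalty, ".".join(path), target))
            elif len(path) < max_depth:
                heapq.heappush(frontier, (new_penalty, next(tie), target, path))
    return sorted(results), examined


results, examined = pruned_numeric_paths(GRAPH, "senator", ["state", "latitude"])
for penalty, path, value in results:
    print(penalty, path, value)
```

In this toy run the branch through spouse.birth_place is cut as soon as its unrelated terms exceed the budget, so its latitude is never reached. The penalty here only prunes; ranking the surviving paths would still rely on keyword scoring.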

A* pruning results

Senators on map

Average # of edges examined at each depth, full enumeration:

Depth       1      2        3         4
Image      66   5409   134226   1393766
Name       66   5446   168673   5245035
Latitude   66   5408   145549   1009247

Average # of edges examined at each depth, using A*:

Depth       1      2      3      4
Image      66   2049   1615    198
Name       66      9   5092    228
Latitude   66    598   2272   2148

Pruning techniques / Random Sampling

• Do normal A* search for n randomly chosen nodes

• Only search known hits for the remaining nodes
  • Prevents repeatedly checking where there are likely no paths

[Figure: the example graph extended with John Kerry; one branch is marked “No potential paths” and another “A hit!”]
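The sampling idea can be sketched as follows, under stated assumptions. The graph builder, the `full_search` stand-in (a plain enumeration rather than A*), and the function names are all hypothetical; the point is only the two-phase structure: an expensive search on a few seed nodes, then following the label paths that produced hits on every other node.

```python
import random


def make_graph(latitudes):
    """Build a hypothetical graph: senator -> state -> capital -> latitude."""
    graph = {}
    for i, lat in enumerate(latitudes):
        graph[f"senator{i}"] = [("state", f"state{i}"), ("party", f"party{i}")]
        graph[f"state{i}"] = [("capital", f"capital{i}")]
        graph[f"capital{i}"] = [("latitude", lat)]
        graph[f"party{i}"] = [("color", "blue")]
    return graph


def full_search(graph, start, keyword="latitude", max_depth=3):
    """Stand-in for the expensive search: numeric paths whose last edge is keyword."""
    hits, stack = {}, [(start, ())]
    while stack:
        node, labels = stack.pop()
        for label, target in graph.get(node, []):
            path = labels + (label,)
            if isinstance(target, (int, float)):
                if label == keyword:
                    hits[path] = target
            elif len(path) < max_depth:
                stack.append((target, path))
    return hits


def follow(graph, start, labels):
    """Follow one known label path; return the number reached, or None."""
    node = start
    for label in labels:
        targets = [t for l, t in graph.get(node, []) if l == label]
        if not targets:
            return None
        node = targets[0]
    return node if isinstance(node, (int, float)) else None


def sampled_search(graph, nodes, n_seeds, rng):
    """Search n_seeds random nodes fully, then reuse their hits for the rest."""
    seeds = rng.sample(nodes, n_seeds)
    known_hits, results = set(), {}
    for node in seeds:
        results[node] = full_search(graph, node)
        known_hits.update(results[node])
    for node in nodes:
        if node in results:
            continue
        found = {}
        for path in known_hits:
            value = follow(graph, node, path)
            if value is not None:
                found[path] = value
        results[node] = found
    return results


latitudes = [38.6, 39.2, 44.9, 30.3]  # made-up values for the toy graph
graph = make_graph(latitudes)
nodes = [f"senator{i}" for i in range(len(latitudes))]
print(sampled_search(graph, nodes, n_seeds=2, rng=random.Random(0)))
```

The trade-off is that a path which exists only for non-seed nodes is never discovered, so the number of seeds controls how much recall is risked for the saved work.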

Sampling results

Average # edges examined at all depths:

            Seed nodes (10)   Others (89)
Image              920              82
Name                40              35
State              200             175
Latitude          3100             144
Longitude         3100             144
TOTAL             7360             580

Total edges examined:

without sampling   7360 × 99 = 728640

with sampling      7360 × 10 + 580 × 89 = 125220
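The totals above can be checked with a few lines of arithmetic. The dictionary names are made up for this check; the numbers are the ones in the table.

```python
# Average edges examined per node, copied from the table above.
seed_cost = {"Image": 920, "Name": 40, "State": 200,
             "Latitude": 3100, "Longitude": 3100}
other_cost = {"Image": 82, "Name": 35, "State": 175,
              "Latitude": 144, "Longitude": 144}

seed_total = sum(seed_cost.values())    # cost of a full search on one node
other_total = sum(other_cost.values())  # cost of following known hits only

n_seeds, n_others = 10, 89
without_sampling = seed_total * (n_seeds + n_others)
with_sampling = seed_total * n_seeds + other_total * n_others

print(seed_total, other_total)
print(without_sampling, with_sampling)
print(round(without_sampling / with_sampling, 1))
```

The column sums come to 7360 and 580, and the two totals to 728640 and 125220, so sampling examines roughly 5.8 times fewer edges in this example.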

Performance

Runtime for senators’ example (seconds):

State latitude   State longitude   Image   Name    Instances   Total
0.911            0.854             0.542   0.513   0.187       3.007

Runtime for astronauts’ example (seconds):

Mission launch   Mission insignia   Name    Instances   Total
1.109            1.151              0.743   1.102       4.105

Runtime for each field in countries’ example (seconds):

GDP per capita   Inflation   Flag    Name    Instances   Total
1.142            2.228       0.867   1.108   1.136       6.481

• Performance now interactive
  • With new pruning techniques, ~100x faster than reported in paper.

Variations – senators’ flags versus birth places

Timeline of manned spaceflight

Scatterplot of inflation vs. GDP

Precision / Recall

Senators – state latitude:

            Correct   Incorrect
Accepted       64        34
Rejected        1         0

Countries – gdp per capita:

            Correct   Incorrect
Accepted      206        58
Rejected        9         0

Senators – image:

            Correct   Incorrect
Accepted       86         6
Rejected        0         6

Summary

• Visualize heterogeneous data represented as a graph of relationships between objects

• Produce visualizations conforming to templates by searching for needed attributes

• Present algorithm for finding and ranking paths based on keywords
  • Efficiently enumerate paths
  • Rank

• Now fast enough for interactive use

• High precision and recall

Future work

• Improvements
  • UI support for initial discovery and query refinement
  • Robustness of terms / Improved ranking
  • Automatic selection of visualization
  • Visualizing missing data
  • Visualizations that reflect result relevance (selective emphasis)

• Deploy on the web
  • Wikipedia
  • The whole web

Acknowledgements

Funding sources:
• Boeing
• RVAC
• CALO

Tools and data:
• DBpedia
• MIT SIMILE project timeline
• Tom Patterson’s map artwork

The end!

Pruning techniques / Bidirectional Search

• Before A*, search one step back from each literal, following only edges that match keywords

• This saves one step during forward A* search

[Figure: the example graph with one branch marked “No mention of latitude”]
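One way to picture the backward step is sketched below. The toy graph, the function names, and the use of a plain depth-first forward search in place of A* are assumptions for illustration. A real triple store would presumably use an index from literals to incoming edges; scanning all edges here only emulates that.

```python
# Hypothetical toy graph: node -> list of (edge label, target).
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node"),
                ("party", "party_node")],
    "state_node": [("capital", "capital_node"), ("pop", 6349000),
                   ("governor", "governor_node")],
    "capital_node": [("latitude", 42.4)],
    "spouse_node": [("birth_place", "birthplace_node")],
    "birthplace_node": [("latitude", 39.0)],
    "governor_node": [("children", 4)],
    "party_node": [("color", "blue")],
}


def backward_step(graph, keywords):
    """One step back from each numeric literal, keeping keyword-matching edges only.

    Returns {node: [(label, value), ...]} for nodes one edge away from a hit.
    """
    keywords = {k.lower() for k in keywords}
    targets = {}
    for node, edges in graph.items():
        for label, value in edges:
            if isinstance(value, (int, float)) and label.lower() in keywords:
                targets.setdefault(node, []).append((label, value))
    return targets


def forward_to_targets(graph, start, targets, max_depth=3):
    """Forward search that stops one edge early, at nodes found by backward_step."""
    results, stack = [], [(start, ())]
    while stack:
        node, labels = stack.pop()
        for label, value in targets.get(node, []):
            results.append((".".join(labels + (label,)), value))
        if len(labels) < max_depth - 1:  # the last edge is already known
            for label, target in graph.get(node, []):
                if not isinstance(target, (int, float)):
                    stack.append((target, labels + (label,)))
    return sorted(results)


targets = backward_step(GRAPH, ["latitude"])
print(forward_to_targets(GRAPH, "senator", targets))
```

Literals with no keyword-matching incoming edge, such as the population and the number of children in this toy graph, are never visited by the forward search at all.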

Need for multiple paths