HOMME: Ontological Explorer

Post on 04-Jun-2022


HOMME: Hierarchical‐Ontological Mind Map Explorer

Yi‐Shin Chen, Pei‐Ling Hsu, Hsiao‐Shan Hsieh, Li‐Chin Lee, Carlos Argueta

Institute of Information Systems and Applications

National Tsing Hua University

IDEA Lab

Outline

• Introduction to IDEA Lab

• Introduction to HOMME

• Framework

• Experimental Evaluation

• Conclusions and Future Work

Intelligent Data Engineering and Applications (IDEA) Laboratory

Research Focus

[Diagram labels: Query, Mining, Optimization, Storage, Index, DB]

Corresponding Research Issues

[Diagram labels: AI, HCI, Network, Database, Web, Pattern Recognition]

CURRENT PROJECTS

Current Projects

• GoogolPlex

– Web information integration and retrieval

– Topic expansion and integration

– Group “answers” based on topic and sentiment

GoogolPlex Project (Cont.)

• Apply cloud computing to speed up the analysis in large scale and heterogeneous data (Googolplex size)

GoogolPlex Project (Cont.)

• Related research issues

– Automatic ontology construction from heterogeneous data

GoogolPlex Project (Cont.)

• Related research issues

– Sentiment analysis for short articles (e.g., micro‐blogs, social network messages) in multi‐language environments

I hate it when it’s rainy and cold!

Loved today’s trip.

I can’t believe this happened!

GoogolPlex Project (Cont.)

• Related research issues

– Keyword extraction from short articles (e.g., micro‐blogs, social network messages) in multi‐language environments

…task of algorithm analysis consists…

…in a Markov Chain is…

…when sorting is…

GoogolPlex Project (Cont.)

• Related research issues

– Semantic analysis for different purposes, such as geo‐tagging

– TweoLocator: A Non‐Intrusive Geographical Locator System for Twitter

• Identify the location of a particular Twitter user at a given time

– Using exclusively the content of his/her tweets

HOMME Conceptual Finder Demo

HOMME Ontology Builder Demo (cont’d)

TweoLocator: Framework

TweoLocator: Experimental Results

[Figure: two charts, labeled Tweets and Profiles, with count and percentage axes]

Tweets

                          US    GB    CA    AU    IN   Others   Avg Acc
Correct tweets           463   288   353   169   125     23     65.6%
Unrelated tweets         110    55   114    53    41     18     18.1%
Disagreed & reallocated  142   175    22     0    14      0     16.3%
Accuracy                 65%   56%   72%   76%   69%    56%     66%

Profiles

               US    GB    PH    CA   Others   AU    SE
Correct       240    88    39    28     26     22     9
Wrong          24     3     0     2      0      0     0
N/A            44    17     3     6      1      3     1
Disagreement   16     0     0     1      0      0     0
Accuracy      74%   81%   93%   76%    96%    88%   90%
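The accuracy figures reported above are consistent with dividing the number of correct items by the total of each country column. The following is a minimal Python check of that reading for the US column of both result sets; the function name is illustrative and does not come from the slides.

```python
def accuracy(correct, *other_counts):
    """Share of correct items among all items in one country column."""
    total = correct + sum(other_counts)
    return correct / total

# US column, tweet-level results: correct, unrelated, disagreed & reallocated
tweets_us = accuracy(463, 110, 142)

# US column, profile-level results: correct, wrong, N/A, disagreement
profiles_us = accuracy(240, 24, 44, 16)

print(f"{tweets_us:.0%} {profiles_us:.0%}")  # prints "65% 74%"
```

The same division reproduces the other per-country accuracy values listed in the results.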

Current Projects

• GoogolPlex

– Web information integration and retrieval

• iConduct

– Interactive conducting system

iConduct Project

• Analyze the intentions from data streams

• Instantly aggregate user intentions and multimedia data

Current Projects

• GoogolPlex

– Web information integration and retrieval

• iConduct

– Interactive conducting system

• MyMining

– Market analysis

MyMining Project

• Mining market information from

– Stock data (numerical data)

– News, blogs, and micro blogs (text data)

• Find the relationship between Stock Market and social networking sites

Goal

• In this research, our goal is to build a system that can help us to:

– Automatically integrate stock news and identify events.

– Evaluate the influence of events at the industry level and use that information to verify price movements.

MyMining Project

Methodology

Off‐line

On‐line

Current Performance

• Accuracy of four methods:

Methods             Average Accuracy
Pheromone           0.5784574
Adjust regression   0.5323214
Regression          0.5134457
Blind test          0.3045479

PEOPLE IN IDEA LAB

People

• Current students:

– Domestic students: 7

– International students: 8

Nationality

[Pie chart: Taiwan 46%, Honduras 20%, Myanmar 7%, Indonesia 7%, San Lucia 7%, El Salvador 7%, Malaysia 6%]

INTRODUCTION TO HOMME

Humans Generate Knowledge

• Collecting all human knowledge has long been a recurring goal

Internet Era

• The WWW has made collecting all human knowledge possible.

Data Flood

• Redundant

• Scattered

• Mutually complementary

Integration

• It is crucial to integrate heterogeneous data sources.

– Easier access

– Summarization

– Less redundancy

Previous Work (1)

• Web data integration and organization based on expert knowledge or collaboratively‐created (crowd wisdom) data

– Manually

– Semi‐automatic

– Automatic

Previous Work (2)

• Wikipedia: the most successful collaboratively‐created collection of human knowledge on the web

• Unstructured articles

• Structured information (infoboxes)

Previous Work (3)

• Other works used Wikipedia structured data to integrate web data.

– YAGO:

• Wikipedia Categories + WordNet

• http://www.mpi-inf.mpg.de/yago-naga/yago/

– DBpedia:

• Wikipedia infoboxes

• http://dbpedia.org/About

Previous Work (4)

• Other sources of crowd wisdom studied to integrate and organize web data

– Social annotations

– Search logs

Previous Work (5)

• Two approaches to integrate web data:

– External resources to extract relationships

• Relatively small coverage

– Bottom‐up approach to web data integration

• Difficulty in labeling the semantic relationships

HOMME

• Relies on multiple heterogeneous “crowd wisdom” data sources.

• Bottom‐up extraction of semantic relationships present in the web data.

• Presents a mind‐map‐like representation of knowledge for easy navigation

FRAMEWORK

Framework

Data Sources

• Multiple heterogeneous data sources

– Search logs

– Social annotations: Delicious tags

– Web directory: Open Directory Project (ODP)

Framework

Resource Integrator

• Normalize and decompose heterogeneous data into smaller elements with common characteristics.

• We use the notion of word sequences and concept sequences

Word Sequences

• Every query in the search log is considered a word sequence

• Every URL in the search log can be decomposed into a word sequence

– www.mtv.com/music/artist/bowlingforsoupartist.jhtml

<mtv, music, artist, bowling, for, soup, artist>

• All the Delicious tags assigned to a URL are a word sequence

• The ODP title assigned to a URL is a word sequence.

• The ODP category assigned to a URL is turned into a word sequence.

– E.g. air/travel/agent → <air, travel, agent>
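The slides do not say how a URL is decomposed, so the following Python sketch is only one plausible reading: split on punctuation, drop a few assumed noise tokens, and segment run‐together tokens against a word list. `VOCAB`, `NOISE`, and all function names are hypothetical.

```python
import re

# Hypothetical word list; a real system would need a far larger vocabulary.
VOCAB = {"mtv", "music", "artist", "bowling", "for", "soup",
         "air", "travel", "agent"}
# Assumed noise tokens that carry no meaning in a URL.
NOISE = {"http", "https", "www", "com", "html", "htm", "jhtml"}

def segment(token, vocab=VOCAB):
    """Split a run-together token into vocabulary words, longest prefix first.
    Returns [token] unchanged when no complete segmentation is found."""
    if token in vocab:
        return [token]
    for end in range(len(token) - 1, 0, -1):
        head = token[:end]
        if head in vocab:
            rest = segment(token[end:], vocab)
            if all(word in vocab for word in rest):
                return [head] + rest
    return [token]

def url_to_word_sequence(url):
    """Decompose a URL into a word sequence."""
    parts = [p for p in re.split(r"[^a-z0-9]+", url.lower()) if p]
    words = []
    for part in parts:
        if part not in NOISE:
            words.extend(segment(part))
    return words

def category_to_word_sequence(category):
    """Turn an ODP category path into a word sequence."""
    return [p for p in category.lower().split("/") if p]

print(url_to_word_sequence("www.mtv.com/music/artist/bowlingforsoupartist.jhtml"))
print(category_to_word_sequence("air/travel/agent"))
```

With this small vocabulary the two calls reproduce the word sequences shown on the slide; tokens outside the vocabulary are passed through unsplit.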

Concept Sequences

• A sequence of words can represent a concept

Framework

Term Extractor

• For each frequent word sequence, it tries to split the sequence into concepts.

– E.g. Query: “star wars light saber”

Word sequence: <star, wars, light, saber>

Concept sequences: <<star, wars>, <light, saber>>
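The splitting strategy is not described in the slides. As a sketch under that caveat, a greedy longest‐match against a set of already known concepts yields the concept sequence from the example; `KNOWN_CONCEPTS`, `max_len`, and the function name are assumptions made for illustration.

```python
# Hypothetical set of concepts already known to the system.
KNOWN_CONCEPTS = {("star", "wars"), ("light", "saber")}

def split_into_concepts(words, known=KNOWN_CONCEPTS, max_len=4):
    """Greedily group a word sequence into concepts, longest match first.
    Words that match no known concept become single-word concepts."""
    concepts = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if n == 1 or candidate in known:
                concepts.append(candidate)
                i += n
                break
    return concepts

print(split_into_concepts(["star", "wars", "light", "saber"]))
# [('star', 'wars'), ('light', 'saber')]
```

A greedy match is simple but can pick the wrong boundary when concepts overlap; the iterative feedback from the Relationship Finder described later is presumably one way the real system refines such splits.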

Framework

Term Mapper

• Term Mapper uses the output of Term Extractor to build a feature matrix.

1. Classify concepts by ODP category.

2. Use the frequency of tags assigned to queries as features.
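As an illustration of step 2, a feature matrix can be built by counting how often each tag is attached to each query. The input layout (a mapping from query to observed tags) and the sample data are assumptions; the slides do not specify the actual representation.

```python
from collections import Counter

def build_feature_matrix(query_tags):
    """Build one row of tag frequencies per query.
    query_tags maps a query to the list of tags observed for it."""
    tags = sorted({tag for observed in query_tags.values() for tag in observed})
    rows = {}
    for query, observed in query_tags.items():
        counts = Counter(observed)
        rows[query] = [counts[tag] for tag in tags]
    return tags, rows

# Made-up sample data, for illustration only.
example = {
    "star wars": ["movie", "scifi", "movie"],
    "light saber": ["scifi", "toy"],
}
tags, matrix = build_feature_matrix(example)
print(tags)                   # ['movie', 'scifi', 'toy']
print(matrix["star wars"])    # [2, 1, 0]
print(matrix["light saber"])  # [0, 1, 1]
```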

Framework

Relationship Finder

• Input data from Term Extractor: Word sequences

• Goal of Relationship Finder:

– Seeks to find important semantic relationships between word sequences

• Challenges:

– To detect concept candidates in word sequences

– To gather correlated concept candidates

– To name semantic relationships between concept candidates

Relationship Finder

• Solutions:

– Rules for detecting concept candidates from word sequences

• Mapped with existing concepts

• Mapped with dictionaries

• Crowd wisdom

– Frequent queries

– ODP titles

• Word sequences containing “of”

– Considering the contexts among word sequences

– Considering the meanings of relationships

Relationship Finder

• Hierarchical Relationships

– Has‐Subclass

– Is‐A

• Synonymous Relationships

– Is‐Equal‐To

– Has‐Meaning

• Other relationships

– Has‐Data‐About

– Has‐Website

Relationship Finder

• Hierarchical Relationships: common relationships in ontologies

– Has‐Subclass (top down: class to class)

– Is‐A (bottom up: instance to class, class to class)

Has‐Subclass Relationship Finder

• Utilizing ODP categories

• Mapping with crowd wisdom: frequent queries

• For instance

– Query: “travel agent phone”

– ODP category: air/travel/agent

– Output: travel has‐Subclass travel agent
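The exact matching rule is not given on the slide. One reading that reproduces the example is sketched below: when two consecutive levels of the ODP category path appear as adjacent words in a query, the upper level is taken to have the two‐word phrase as a subclass. This rule and the function name are assumptions, not the authors' stated method.

```python
def has_subclass(query, odp_category):
    """Emit (parent, 'has-Subclass', phrase) when two consecutive category
    levels occur as adjacent words in the query (assumed rule)."""
    words = query.lower().split()
    path = [p for p in odp_category.lower().split("/") if p]
    relations = []
    for parent, child in zip(path, path[1:]):
        for i in range(len(words) - 1):
            if words[i] == parent and words[i + 1] == child:
                relations.append((parent, "has-Subclass", f"{parent} {child}"))
                break
    return relations

print(has_subclass("travel agent phone", "air/travel/agent"))
# [('travel', 'has-Subclass', 'travel agent')]
```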

Is‐A Relationship Finder

• Word sequences with crowd wisdom

– Queries, ODP titles

• Hierarchies among word sequences

– Word sequences with “of”

– Additional words for ambiguous words

• For instance

– Query: “apple company”

– Ambiguous word: apple

– Additional words: company

– Output: apple company Is‐A company
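The "additional words for ambiguous words" case can be sketched as follows. How HOMME decides that a word is ambiguous is not described, so the `AMBIGUOUS` set is a hypothetical stand‐in, and the rule covers only queries that start with the ambiguous word.

```python
# Hypothetical list of ambiguous words; the slides do not say how it is obtained.
AMBIGUOUS = {"apple"}

def is_a(query, ambiguous=AMBIGUOUS):
    """If a query starts with an ambiguous word, treat the remaining words
    as the class that disambiguates it (assumed rule)."""
    words = query.lower().split()
    if len(words) >= 2 and words[0] in ambiguous:
        return (" ".join(words), "Is-A", " ".join(words[1:]))
    return None

print(is_a("apple company"))  # ('apple company', 'Is-A', 'company')
```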

Relationship Finder

• Synonymous Relationships: referring to the same concepts

– Is‐Equal‐To

– Has‐Meaning

Synonymous Relationship Finder (1)

• Many word sequences refer to the same concepts.

• Is‐Equal‐To

– <cartoonnetwork> and <cartoon, network>

• Has‐Meaning

– <ae>, <american, eagle>, and <american, eagle, outfitter>

• Finds distinct queries and ODP data referring to the same concepts.

• Steps:

1. Group queries based on navigational intention

– Intention inferred from clicked URLs

– Groups the navigational queries based on the clicked URL

2. Add ODP data to the groups based on their referring URLs.

Synonymous Relationship Finder (2)

• For instance:

– Query: “american eagle”

– Clicked URL: www.ae.com

– ODP title: “american eagle outfitter”

– Output:

• “ae” has‐Meaning “American eagle”

• “American eagle” has‐Meaning “american eagle outfitter”
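The two steps of the synonym finder (group navigational queries by clicked URL, then add the ODP title for that URL) can be sketched as below. Linking group members from shortest to longest is an assumption chosen to match the example output, and "ae" is assumed to occur as a query of its own; neither detail is stated in the slides.

```python
from collections import defaultdict

def synonym_groups(clicks, odp_titles):
    """Step 1: group queries by clicked URL. Step 2: add the ODP title of each URL."""
    groups = defaultdict(set)
    for query, url in clicks:
        groups[url].add(query.lower())
    for url, title in odp_titles.items():
        if url in groups:
            groups[url].add(title.lower())
    return groups

def has_meaning(group):
    """Chain the members of one group from shortest to longest (assumed ordering)."""
    ordered = sorted(group, key=lambda term: (len(term), term))
    return [(a, "has-Meaning", b) for a, b in zip(ordered, ordered[1:])]

# "ae" as a separate query is an assumption made for this illustration.
clicks = [("ae", "www.ae.com"), ("american eagle", "www.ae.com")]
odp_titles = {"www.ae.com": "american eagle outfitter"}

groups = synonym_groups(clicks, odp_titles)
relations = has_meaning(groups["www.ae.com"])
for relation in relations:
    print(relation)
```

Grouping by clicked URL relies on the queries being navigational: users who type different strings but click the same site are assumed to mean the same thing.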

Relationship Finder

• Other relationships

– Has‐Data‐About

– Has‐Website

Has‐Data‐About Relationship Finder

• Some terms in word sequences denote concepts present in a web site.

• Finds frequent matches between query terms and parts of clicked URLs.

• For instance:

– Query: “bowling for soup”

– Clicked URL: www.mtv.com/music/artist/bowlingforsoupartist.jhtml

– Output:

• “mtv” has‐Data‐About “music”

• “mtv” has‐Data‐About “artist”

• “mtv” has‐Data‐About “bowling for soup”
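A rough Python sketch of the query‐to‐URL matching for a single click follows. It treats the first meaningful host label as the site name and checks whether the query, with spaces removed, occurs inside a path segment. The frequency threshold implied by "frequent match" is left out, and the URL is assumed to carry no scheme prefix; all names are illustrative.

```python
import re

# Assumed host labels that do not name the site.
GENERIC_LABELS = {"www", "com", "org", "net"}

def has_data_about(query, clicked_url):
    """Relate the site name to each path segment of the clicked URL, using the
    query text when the squashed query occurs inside a segment (assumed rule)."""
    host, _, path = clicked_url.lower().partition("/")
    labels = [label for label in host.split(".") if label not in GENERIC_LABELS]
    if not labels:
        return []
    site = labels[0]
    squashed = re.sub(r"[^a-z0-9]", "", query.lower())
    relations = []
    for part in path.split("/"):
        stem = part.split(".")[0]  # drop a file extension such as .jhtml
        if not stem:
            continue
        target = query.lower() if squashed and squashed in stem else stem
        relations.append((site, "has-Data-About", target))
    return relations

result = has_data_about(
    "bowling for soup",
    "www.mtv.com/music/artist/bowlingforsoupartist.jhtml",
)
for relation in result:
    print(relation)
```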

Has‐Website Relationship Finder

• Uses word sequences from queries, URLs, and ODP titles

• For instance:

– Query: “online dictionary”

– Clicked URL: www.m-w.com

– ODP title: “merriam‐webster online”

– Output:

• “online dictionary” has‐Website www.m-w.com

• “merriam‐webster online” has‐Website www.m-w.com

Iterative Process

• The extracted relationships are used to improve the term extraction process.

• Constant interaction between the Term Extractor and the Relationship Finder.

Framework

Concept Cluster Finder

• Uses the feature matrix generated by Term Mapper.

• Uses the k‐means algorithm to cluster queries.

• Each cluster is automatically labeled based on its cluster representative.

– Features with highest scores
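To make the clustering and labeling step concrete, here is a small pure‐Python k‐means sketch that labels each cluster with the highest‐scoring feature of its centroid. Using the centroid as the cluster representative, seeding with the first k rows, and the sample data are all assumptions for illustration; the slides give none of these details.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on lists of numbers; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

def label_clusters(centroids, feature_names):
    """Label each cluster with the feature that scores highest in its centroid."""
    return [feature_names[max(range(len(c)), key=c.__getitem__)] for c in centroids]

# Made-up feature matrix: one row per query, one column per feature.
feature_names = ["movie", "travel"]
rows = [[3, 0], [0, 2], [2, 0], [0, 3]]

assignment, centroids = kmeans(rows, k=2)
labels = label_clusters(centroids, feature_names)
print(assignment)  # [0, 1, 0, 1]
print(labels)      # ['movie', 'travel']
```

Seeding with the first k rows keeps the example deterministic; a production setup would more likely use randomized or k-means++ style initialization.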

EXPERIMENTAL EVALUATION

Setup

• Three data sources:

– Search log by MS Live Labs from US users in May 2006

• 1,512,556 navigational queries extracted

– Open Directory Project (ODP)

– Delicious tags crawled from February to May 2010

• Implementation:

– Prototype front end: PHP + JavaScript InfoVis Toolkit

Demonstration

Ontology Builder

Demonstration

Concept Linker (1)

Concept Linker (2)

Experimental Results – Concept Linker

• Our work was compared to other works:

– Single‐link Agglomerative Hierarchical Clustering (AHC)

– DBSCAN

• We want to evaluate the ability to discover query clusters.

• Ground truth: 50 manually labeled queries from each category.

HOMME and AHC

HOMME and DBSCAN

Experimental Results ‐ Relationship Finder

• 11 volunteers checked a sample of output relationships

• Each checked 100 tuples for each relationship type.

– Total 400 output relationships

– All checked the same set

Relationship Finder Evaluated by Human Expert

CONCLUSIONS AND FUTURE WORK

Conclusions

• The proposed approach uses heterogeneous sources to

– Effectively cluster queries related to a concept.

– Extract relationships between concepts automatically.

• The relationships recognized by HOMME are also recognized by humans most of the time.

Future Work

• Improve coverage for Relationship Finder

• Add more relationship types

• Improve execution times for offline part