Deriving Emergent Web Page Semantics

D.V. Sreenath*, W.I. Grosky**, and F. Fotouhi*

*Wayne State University
**University of Michigan-Dearborn

Semantics

The semantics of a web page is potentially richer than can be defined by the page’s author(s)
– Some semantics emerge through context
– A multimedia document has multiple semantics through being placed in multiple contexts

Content-Based Retrieval

Development of feature-based techniques for content-based retrieval is a mature area, at least for images

CBR researchers should now concentrate on extracting semantics from multimedia documents so that retrievals using concept-based queries can be tailored to individual users
– The semantic gap

(Semi)-automated multimedia annotation

Multimedia Annotation(s)

Multimedia annotations should be semantically rich
– Multiple semantics

This can be discovered by placing multimedia information in a natural, context-rich environment
– A social theory based on how multimedia information is used

Context-Rich Environments

Structural context – Author’s contribution
– Document’s author places semantically similar pieces of information close to each other

Dynamic context – User’s contribution
– Short browsing sub-paths are semantically coherent

Context-Rich Environments

The Web is a perfect example of a context-rich environment

Develop multimedia annotations through cross-modal techniques
– Audio
– Images
– Text
– Video

Goal

Derive document semantics based on user browsing behavior
– The same document has multiple semantics
» Different people see different meanings in the same document
– Over short browsing paths, an individual user’s wants and needs are uniform
» The pages visited over these short paths exhibit semantics in congruence with these wants and needs

Questions

How can the semantics of a web page be derived given a set of user browsing paths that end at that page?

How can we characterize the semantics of a user browsing path?

How can web page semantics help us in navigating the web more efficiently?

How can our approach actually be implemented on the real-world web?

Our Approach

We use actual browsing paths to find the latent semantics of web pages
– Textual features
– Image features
– Structural features

We hope to find general concepts comprising various textual and image features which frequently co-occur

Semantic Coherence

We believe that a user’s browsing path exhibits semantic coherence
– The user’s entire path exhibits multiple semantics, especially across pages far from each other on the path
– Neighboring pages, and especially the portions close to the links taken, are semantically close to each other

Semantic Break Points

We would like to characterize the contiguous sub-paths of a user’s browsing path that exhibit similar semantics and detect the semantic break points along the path where the semantics appreciably change
– Collect these sub-paths into a multiset
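The slides do not say how break points are detected. As a minimal sketch, assuming each page is reduced to a bag of textual keywords and that a break is declared when the cosine similarity of two consecutive pages falls below a cutoff, the splitting and the multiset collection might look like this. The function names, the `threshold` value, and the sample keywords are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse keyword-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_at_break_points(path, threshold=0.2):
    """Split a browsing path into contiguous, semantically coherent sub-paths.

    `path` is a list of (url, keywords) pairs in visit order. A break point
    is assumed wherever two consecutive pages have cosine similarity below
    `threshold` (an illustrative cutoff, not a value from the slides).
    """
    sub_paths = []
    current = []
    previous_vector = None
    for url, keywords in path:
        vector = Counter(keywords)
        if previous_vector is not None and cosine(previous_vector, vector) < threshold:
            sub_paths.append(tuple(current))
            current = []
        current.append(url)
        previous_vector = vector
    if current:
        sub_paths.append(tuple(current))
    return sub_paths


# Hypothetical browsing path: two pages on vaudeville, then two on golf.
path = [
    ("url1", ["hope", "vaudeville", "stage"]),
    ("url2", ["vaudeville", "stage", "broadway"]),
    ("url3", ["golf", "tournament"]),
    ("url4", ["golf", "course", "tournament"]),
]
sub_paths = split_at_break_points(path)

# A Counter serves as the multiset M: identical sub-paths accumulate a count.
multiset = Counter(sub_paths)
```

Here the only break falls between url2 and url3, which share no keywords, so the path splits into two sub-paths. A fuller version would compare image and structural keywords as well, and could weight the page portions near the followed links more heavily.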

Web Page Semantics

We categorize the semantics of each web page based on a history of the semantically-coherent browsing paths of all users which end at that page

A browsing path will be represented by a high-dimensional vector

The various positions of the vector correspond to the presence of
– textual keywords
– image features (visual keywords)
– structural features (structural keywords)


From the complete set of web pages under consideration, we extract a set of textual, visual, and structural keywords

For each multiset, M, of sub-paths that we are to analyze, we form three matrices
– term-path matrix
– image-path matrix
– structure-path matrix


The (i,j)th element of each of these matrices is determined by
– Strength of the presence of the ith keyword along the jth browsing path, determined by
» How many times this term occurs on the pages along the path
» How much time the user spends examining these pages
» How close each occurrence of the ith keyword is to both the outgoing and incoming anchor positions
– How many times this browsing path occurs in M
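The slides list the factors that determine a matrix entry but not how they are combined. The sketch below assumes one simple product form: each keyword occurrence is weighted by its distance to the nearest anchor, the weights on a page are scaled by the time spent on it, and the path total is multiplied by the number of times the path occurs in M. The `PageVisit` record, the `1 / (1 + distance)` weighting, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageVisit:
    dwell_seconds: float    # time the user spent examining the page
    anchor_positions: list  # token offsets of the incoming and outgoing anchors
    tokens: list            # page keywords in document order


def occurrence_weight(position, anchor_positions):
    """Weight of one occurrence; larger when it is close to an anchor (assumed form)."""
    if not anchor_positions:
        return 1.0  # no anchor information: fall back to a plain count
    distance = min(abs(position - anchor) for anchor in anchor_positions)
    return 1.0 / (1.0 + distance)


def keyword_strength(keyword, visits):
    """Strength of one keyword along one browsing path."""
    total = 0.0
    for visit in visits:
        proximity = sum(
            occurrence_weight(position, visit.anchor_positions)
            for position, token in enumerate(visit.tokens)
            if token == keyword
        )
        total += visit.dwell_seconds * proximity
    return total


def matrix_entry(keyword, visits, multiplicity):
    """Entry (i, j): keyword strength along path j times the path's count in M."""
    return multiplicity * keyword_strength(keyword, visits)


# Hypothetical two-page path that occurs three times in M.
visits = [
    PageVisit(dwell_seconds=4.0, anchor_positions=[1], tokens=["golf", "hope"]),
    PageVisit(dwell_seconds=2.0, anchor_positions=[1], tokens=["hope", "golf", "golf"]),
]
golf_entry = matrix_entry("golf", visits, multiplicity=3)
troops_entry = matrix_entry("troops", visits, multiplicity=3)
```

The same function could fill the image-path and structure-path matrices if visual and structural keywords are given positions on the page. Other combinations, such as sums or log-scaled counts, fit the description on the slide equally well.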


These matrices may be concatenated together in various ways to produce an overall keyword-path matrix

Perform latent-semantic analysis to get concepts

A page is then represented by a set of concept classes
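A minimal sketch of the concatenation and analysis steps, assuming the three matrices are stacked row-wise so that they share the path columns, and that latent-semantic analysis is carried out as a rank-k truncated singular value decomposition with NumPy. The matrix values and the choice of k are illustrative only, and the final assignment of pages to concept classes is not shown.

```python
import numpy as np


def latent_concepts(keyword_path, k):
    """Rank-k truncated SVD of a keyword-by-path matrix."""
    u, s, vt = np.linalg.svd(keyword_path, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


# Hypothetical matrices: rows are keywords, columns are three browsing paths.
term_path = np.array([[2.0, 1.0, 0.0],
                      [1.0, 2.0, 0.0]])
image_path = np.array([[0.0, 0.0, 3.0]])
structure_path = np.array([[1.0, 1.0, 0.0]])

# One possible concatenation: stack the rows so all keyword types share the paths.
keyword_path = np.vstack([term_path, image_path, structure_path])

u_k, s_k, vt_k = latent_concepts(keyword_path, k=2)

# Column j holds the coordinates of path j in the k-dimensional concept space.
path_concepts = np.diag(s_k) @ vt_k
```

In this toy matrix the first two paths land on the same point in concept space and the third lies along a separate concept. Paths ending at a given page could then be grouped by their concept coordinates to describe that page.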

Architecture

Vantage Points

[Figure: browsing paths path1 through path6 and pages url1 through url4]

Local Iterative Technique

Bob Hope Path – Page 1

Bob Hope Path – Page 8

[Figure labels: Vaudeville, Broadway, Troops]

Experiment 1 – Paths/Paths

[Figure labels: Bob Hope, Broadway, Golf, Movies, Radio, Troops, Vaudeville]

Experiment 2 – Paths/URLs

[Figure labels: Bob Hope, Broadway, Golf, Movies, Radio, Troops, Vaudeville]

Experiment 3 – URLs/URLs

[Figure labels: Bob Hope, Broadway, Golf, Movies, Radio, Troops, Vaudeville]

Issues

Data capture – privacy issues
Compute-intensive SVD updating
Dynamic content
Constantly evolving websites