Deriving Emergent Web Page Semantics
D.V. Sreenath*, W.I. Grosky**, and F. Fotouhi*
*Wayne State University
**University of Michigan-Dearborn
Semantics
The semantics of a web page is potentially richer than can be defined by the page’s author(s)
– Some semantics emerge through context
– A multimedia document has multiple semantics through being placed in multiple contexts
Content-Based Retrieval
Development of feature-based techniques for content-based retrieval is a mature area, at least for images
CBR researchers should now concentrate on extracting semantics from multimedia documents so that retrievals using concept-based queries can be tailored to individual users
– The semantic gap
(Semi)-automated multimedia annotation
Multimedia Annotation(s)
Multimedia annotations should be semantically rich
– Multiple semantics
This can be discovered by placing multimedia information in a natural, context-rich environment
– A social theory based on how multimedia information is used
Context-Rich Environments
Structural context – Author’s contribution
– Document’s author places semantically similar pieces of information close to each other
Dynamic context – User’s contribution
– Short browsing sub-paths are semantically coherent
Context-Rich Environments
The Web is a perfect example of a context-rich environment
Develop multimedia annotations through cross-modal techniques
– Audio
– Images
– Text
– Video
Goal
Derive document semantics based on user browsing behavior
– The same document has multiple semantics
» Different people see different meanings in the same document
– Over short browsing paths, an individual user’s wants and needs are uniform
» The pages visited over these short paths exhibit semantics in congruence with these wants and needs
Questions
How can the semantics of a web page be derived given a set of user browsing paths that end at that page?
How can we characterize the semantics of a user browsing path?
How can web page semantics help us in navigating the web more efficiently?
How can our approach actually be implemented on the real-world web?
Our Approach
We use actual browsing paths to find the latent semantics of web pages
– Textual features
– Image features
– Structural features
We hope to find general concepts comprising various textual and image features which frequently co-occur
Semantic Coherence
We believe that a user’s browsing path exhibits semantic coherence
– While the user’s entire path exhibits multiple semantics, especially across pages far from each other on the path, neighboring pages are semantically close to each other, especially in the portions close to the links taken
Semantic Break Points
We would like to characterize the contiguous sub-paths of a user’s browsing path that exhibit similar semantics and detect the semantic break points along the path where the semantics appreciably change
– Collect these sub-paths into a multiset
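One way to picture break-point detection is sketched below. The slides do not say how semantic change is measured, so the cosine similarity between consecutive pages' keyword counts, the `threshold` value, and all function names here are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse keyword-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_at_break_points(path, threshold=0.3):
    """Split a browsing path into contiguous sub-paths.

    path: list of (page_id, Counter of keywords) pairs, in visit order.
    A break point is placed (assumption) wherever the similarity between
    two consecutive pages falls below `threshold`.
    """
    if not path:
        return []
    sub_paths = [[path[0][0]]]
    for (_, prev_vec), (page_id, vec) in zip(path, path[1:]):
        if cosine(prev_vec, vec) < threshold:
            sub_paths.append([page_id])      # semantics changed: new sub-path
        else:
            sub_paths[-1].append(page_id)    # still coherent: extend sub-path
    return sub_paths


def collect_multiset(paths, threshold=0.3):
    """Collect the sub-paths of many browsing paths into a multiset."""
    multiset = Counter()
    for path in paths:
        for sub in split_at_break_points(path, threshold):
            multiset[tuple(sub)] += 1
    return multiset
```

The multiset is held in a `Counter` keyed by the tuple of page identifiers, so a sub-path followed by several users is stored once with its multiplicity.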
Web Page Semantics
We categorize the semantics of each web page based on a history of the semantically coherent browsing paths of all users which end at that page
A browsing path will be represented by a high-dimensional vector
The various positions of the vector correspond to the presence of
– textual keywords
– image features (visual keywords)
– structural features (structural keywords)
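A minimal sketch of such a vector follows, assuming a simple binary presence encoding. The `text:`, `image:`, and `struct:` prefixes and both function names are hypothetical conventions introduced here, not part of the original approach.

```python
def build_vocabulary(textual, visual, structural):
    """Concatenate the three keyword sets into one ordered vocabulary.

    The prefixes are an illustrative convention that keeps the three
    keyword types distinct inside a single vector.
    """
    return (
        [f"text:{k}" for k in sorted(textual)]
        + [f"image:{k}" for k in sorted(visual)]
        + [f"struct:{k}" for k in sorted(structural)]
    )


def path_vector(observed, vocabulary):
    """Return a 0/1 vector marking which vocabulary keywords occur on the path."""
    observed = set(observed)
    return [1 if keyword in observed else 0 for keyword in vocabulary]
```

For example, with textual keywords `{"jazz", "blues"}`, one visual keyword, and one structural keyword, a path on which only `text:jazz` and the structural keyword appear maps to a four-position vector with ones in those two positions.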
Deriving Emergent Web Page Semantics
From the complete set of web pages under consideration, we extract a set of textual, visual, and structural keywords
For each multiset, M, of sub-paths that we are to analyze, we form three matrices
– term-path matrix
– image-path matrix
– structure-path matrix
Deriving Emergent Web Page Semantics
The (i,j)th element of each of these matrices is determined by
– Strength of the presence of the ith keyword along the jth browsing path
» Determined by
    How many times this term occurs on the pages along the path
    How much time the user spends examining these pages
    How close each occurrence of the ith keyword is to both the outgoing and incoming anchor positions
– How many times this browsing path occurs in M
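The factors listed above could be combined in many ways, and the slides do not give a formula. The sketch below assumes one simple choice: each keyword occurrence contributes the page's dwell time scaled by an inverse-distance proximity to the anchors, and the sum is multiplied by the path's count in M. The `PageVisit` type, the proximity function, and the multiplicative combination are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class PageVisit:
    keyword_positions: dict   # keyword -> list of token offsets on the page
    anchor_positions: list    # offsets of the incoming/outgoing anchors
    dwell_seconds: float      # time the user spent on the page


def proximity(pos, anchors):
    """Assumed proximity: mean of 1 / (1 + distance) over the anchors."""
    if not anchors:
        return 1.0
    return sum(1.0 / (1.0 + abs(pos - a)) for a in anchors) / len(anchors)


def keyword_strength(keyword, visits):
    """Strength of one keyword along one browsing path (list of PageVisit)."""
    total = 0.0
    for visit in visits:
        for pos in visit.keyword_positions.get(keyword, []):
            # Each occurrence counts more on pages examined longer and
            # when it lies closer to the links taken.
            total += visit.dwell_seconds * proximity(pos, visit.anchor_positions)
    return total


def keyword_path_matrix(vocabulary, multiset):
    """Build the matrix with one row per keyword and one column per path.

    multiset: list of (visits, count) pairs, where count is how many
    times that browsing path occurs in M.
    """
    return [
        [count * keyword_strength(keyword, visits) for visits, count in multiset]
        for keyword in vocabulary
    ]
```

The same function can build the term-path, image-path, and structure-path matrices by passing the corresponding keyword list as `vocabulary`.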
Deriving Emergent Web Page Semantics
These matrices may be concatenated together in various ways to produce an overall keyword-path matrix
Perform latent-semantic analysis to get concepts
A page is then represented by a set of concept classes
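Latent-semantic analysis is conventionally carried out with a truncated singular value decomposition. As a reminder of that standard formulation (the symbols A, m, n, and k are introduced here for illustration; the slides do not fix a notation or a value of k), let A be the keyword-path matrix with m keywords and n browsing paths:

```latex
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
A \in \mathbb{R}^{m \times n},\quad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{n \times k}
```

Under this reading, each column of U_k describes one concept as a weighted combination of keywords, each row of V_k gives one browsing path's coordinates in the k-dimensional concept space, and the diagonal of Σ_k holds the k largest singular values.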