Innovation through Understanding of the Data and the Human Behaviour June 12, 2008 Natasa Milic-Frayling Microsoft Research Cambridge Presentation at Jozef Stefan Institute, Ljubljana, Slovenia – June 12’08


Web site structure analysis

InSite Live!

Support for activity management

Research Desktop

Web Site Structure Analysis: Concepts, Algorithms and Evaluation Issues

Eduarda Mendes Rodrigues†, Natasa Milic-Frayling†

Martin Hicks, Blaz Fortuna‡

† Microsoft Research, Cambridge, UK

‡ Institute Jožef Stefan, Slovenia

Outline

– Research in Web navigation

– Objectives and overview of the LSG approach

– Concepts supported by the user study

– Definition and application of the LSG model for Web site structure analysis

– LSG method for partitioning Web sites into subsites

– Identification of subsite entry pages

– Challenges in the evaluation of subsites

– Evaluation methodology, issues and guidelines

Part I: Navigation support; Site structure model

Part II: Detection of subsites; Evaluation issues

Supporting Search and Navigation

Users often use navigation as a complement to search:

preference for navigation over search

information need is clear to the user, but queries are not formulated appropriately (short and ambiguous queries, user’s inexperience with search, wrong terminology)

information need is vague or ambiguous – navigation is used for exploration of content and refining the need

Site structure representation:

navigation aid

context for search results

Objectives

Represent and analyse the navigational and content structure of individual Web sites

Identify fine-grained site boundaries and define the scope of sub-sites for a particular application

Characterize the relationship between site structure, content and usage of the Web sites

Web Link Graph

[Figure: Web link graph of page p1, which links to pages p2, p3, ..., pk and pk+1, pk+2, ..., pn; legend: structure links, content links]

– Nodes represent Web pages

– Types of links and association of links are not represented

Web Link Graph

[Figure: page p1 contains two link blocks, g1 (structural block) and g2 (content block), each pointing to its own set of target pages among p2, p3, ..., pk and pk+1, pk+2, ..., pn]

– Nodes represent Web pages

– Types of links and association of links are not represented

LSG – Link Structure Graph

[Figure: the same links drawn as an LSG, showing the container pages and the target pages of g1 and of g2]

– Nodes of the LSG are link blocks and the edges represent a containment relationship

– LSG captures page-level organization of links and the overall link structure

Concept Validation: Exploratory User Study

Objective: to identify notions that Web users may have about sites, organization of pages and functions of hyperlinks

Motivation: related work on Web page analysis has included user evaluations of algorithms, but very few studies have explored how users perceive Web pages and the hyperlinks that connect them

We focused on three aspects:

– Do Web users perceive and understand different types of links?

– Are Web users able to detect associations between links present on the page?

– Do Web users consider some links to be more important than others?

Study design

We recruited 14 participants (9 male, 5 female); all confirmed that they regularly use the Internet, with an average reported usage of 25 hours/week

[Figure: number of sites per size range (<1000, 1000-10000, 10001-100000, >1000000), as estimated by Google, MSN and Yahoo!]

TOPICS       SUB-TOPICS
Arts         directory, literature, television
Computers    internet, software, graphics
Health       conditions, diseases, occupational & safety
News         newspapers, weather
Reference    libraries, education
Science      institution, math, earth sciences
Society      issues, government, law

A sample of 21 sites was selected from 7 top-level topic categories of the ODP directory (http://dmoz.org); the sample included sites from four different domains (.com, .edu, .gov, .org) and of heterogeneous sizes

Study design (contd.)

Session 1

– Participants freely navigated 3 Web sites (approx. 5 minutes each)

– For each site, participants were prompted to elaborate on site organization, the importance of links shown on the page, and to estimate the size of the site

Session 2 (this session was video recorded)

– Participants were shown 2 printed pages from each of the same sites and were asked to consider if the links on the pages could be grouped, e.g. by content, functionality, etc.

– They were encouraged to discuss their impressions of the pages and to group and label links on the printed pages

Session 3

– Participants were shown the same pages as in Session 2 via a computer-based application, which detects links and highlights link blocks on the page, and were prompted for feedback on the detected link blocks and their prominence on the page

Study design (contd.)

Page analyser application used in Session 3 of the study

• For each page, the program prompted users to:

– judge if links formed a coherent group, a menu

– indicate if some of the links had been missed out

– rank the prominence of links on the page

Session 1 observations

Based on their initial impressions and limited browsing, participants:

– Rated 24 out of 42 visited sites as providing well organised content

– Considered 30 out of 42 visited sites as having links of varying importance in terms of content and navigation; an analysis of participants’ written comments revealed their opinions to be influenced by page layout (presentation of information, presence or absence of sidebars, screen clutter)

– Correctly estimated size ranges for 23 of 42 sites (based on the size estimates obtained from the three search engines: Google, MSN and Yahoo!)

Session 2 observations

[Figure: number of pages (0 to 70) for each type of link named by participants: content or topic; navigation; administrative; other (general purpose, housekeeping, internal/external)]

Analysis of the users’ comments about the types of links on the pages revealed that participants characterised them as:

– content or topic (relating to the content of the site)

– navigation (for moving to other parts of the site or to external sites)

– administrative (referring to company information, privacy policy, sitemaps)

– general purpose, ‘housekeeping’ links, internal/external

Issues: Variability in Terminology and Mental Models

While grouping links, participants showed individual differences in how they were influenced by the layout and information presented to them, e.g.

– some participants used different terms for describing the same type of links: links at the bottom of a page (company information, privacy policy, site maps) were independently referred to as ‘administration’, ‘bureaucracy’, and ‘footnotes’

– two participants revealed that they ignored some content on the right side of the page; one elaborated: ‘I expect the most important links to be shown on the left as I naturally read left to right’

• This participant grouped the links on the page according to particular categories

• While the labelling suggests the links are content links, the participant regarded these as navigation links

• This participant referred to links at the top of the page as a ‘main menu’ and wondered whether the PBS logo represented a link to the home page

• He further commented that the sidebar panes ‘were not very good’; he was not able to tell what function they served or whether they were related

• He also highlighted links at the bottom of the page as ‘technology’ related links, which included links to news RSS feeds and podcasts, and also ‘smallprint & legalise’ links

• The same participant did not immediately recognise that this page was from the same site, due to its different appearance from the previous page (the home page)

• He noted that the only correlation between the two pages was the menu, and commented that the main content links on the page appeared very similar, with no main headings

Issues: Consistency Across Sessions

Cross-session analysis revealed both correlation and discrepancy in participants’ comments across sessions, e.g.:

– One participant noted the importance of links on a page for one site during sessions 1 and 2 but did not rate them as prominent on the page analyzer application in session 3

– Conversely, another participant referred to the importance of ‘content’ and ‘navigation’ links for 2 sites during session 2 and rated these links as prominent when using the page analyzer in session 3

– Five participants ranked certain links on the page as more important than others during sessions 1 and 2. Page analyzer logs revealed that they had also rated these links as prominent or very prominent, and in some cases, assigned the same attributes to the groups of links

Summary of Findings

Perceived importance of links: users could articulate and elaborate on different functions and importance of links on pages

Structure of the page: layout, organization of content, and location of links influence users’ perception of the function of links and of site usability

Structure of links: participants could outline groupings of links and refer to their functions, and in some instances, consistently assigned importance rating to the groups of links

Link categories and differences in terminology: some common links were referred to using different terminology, although findings also show commonality in the links identified

Web site size estimates: although only briefly exposed to site content, users were able to provide estimates of size based on various visual and organizational aspects of the pages they visited


Site Structure Representation

Basic concepts we use for link structure analysis:

– Structural link blocks: organizational and navigational link blocks typically repeated across pages with the same layout and underpinning the organization of the site

– Content link blocks: expected to be grouped by content associations, unlikely to be repeated across pages and point to information resources

– Isolated links: links that are not part of a link group and may be only loosely related with each other

[Figure: a news homepage shown first as a Web graph and then as an LSG, marking its s-node, its content link blocks (c-nodes), the container page(s) and the target pages]

Link Structure Graph (LSG)

We defined a Link Structure Graph (LSG) that captures both:

– the organization of links at the page level

– the overall hyperlink structure of the site

The graph includes 3 types of nodes:

– s-nodes (structural link blocks)

– c-nodes (content link blocks)

– i-nodes (bag of isolated links on the page)

LSG Algorithm

Step 1

– Analyse the layout of individual pages by parsing the HTML Document Object Model (DOM) structure

– Based on the DOM paths, identify candidate link blocks and the remaining isolated nodes

Step 2

– Classify the link blocks into s-nodes and c-nodes based on their re-usability across pages

Step 3

– Connect the LSG nodes with directed edges, each of which represents a containment relationship between target pages of the source node and the destination link block

– The edges are weighted to completely preserve the information about in- and out-links of individual pages
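Steps 2 and 3 can be illustrated with a small Python sketch on toy data. It is not the implementation used in the talk: it assumes Step 1 has already produced, for each page, a list of link blocks (each block a tuple of target URLs), it uses a hypothetical reuse threshold `min_pages` to separate s-nodes from c-nodes, and it weights an edge by the number of target pages that contain the destination block. Isolated links (i-nodes) are left out, and all URLs are made up.

```python
from collections import defaultdict

def build_lsg(pages, min_pages=2):
    """pages: dict page URL -> list of link blocks (tuples of target URLs)."""
    containers = defaultdict(set)              # block -> pages that contain it
    for page, blocks in pages.items():
        for block in blocks:
            containers[block].add(page)

    # Step 2 (assumed rule): a block reused on >= min_pages pages is structural.
    kind = {block: ("s-node" if len(where) >= min_pages else "c-node")
            for block, where in containers.items()}

    # Step 3: edge g -> h when a target page of g contains block h.
    # Weight = number of such target pages (illustrative weighting only).
    edges = defaultdict(int)
    for g in containers:
        for target in set(g):
            for h in pages.get(target, []):
                edges[(g, h)] += 1
    return kind, dict(edges)

# Hypothetical three-page site sharing one navigation menu.
menu = ("/news", "/sport")
top_stories = ("/story1", "/story2")
news_block = ("/story3",)
pages = {
    "/": [menu, top_stories],
    "/news": [menu, news_block],
    "/sport": [menu],
}
kind, edges = build_lsg(pages)
```

In this toy site the menu is classified as an s-node because it recurs on every page, while the two story blocks are c-nodes; the menu gains an edge to itself because its target pages contain it again.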

LSG Applications: Site Structure Analysis

• Analyse the structural properties of the sample sites selected for the user study

– variability with size of site, topic and domain

– possible correlation with users’ comments

• Analyse the incremental generation of LSGs using different crawling strategies

Characterization of the Sample Sites

Despite differences in size, topic and domain, for most sites:

– more than 50% of the pages reside up to 3 directory levels down from the root directory

[Figure: ratio of sites with at least p% of pages below a fixed directory level, against directory level (0 to 9+), for p = 25%, 50% and 75%]

– more than 50% of the pages are 3 to 5 clicks away from the home page

[Figure: ratio of sites with at least p% of pages at a fixed depth, against depth level (0 to 9+), for p = 25%, 50% and 75%]
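The two measures used here, directory level and click depth, could be computed along the following lines. This is a sketch under stated assumptions: the directory level is taken from the URL path, treating the last path segment as a file name unless the path ends with a slash, and the click depth is the shortest link path from the home page found by breadth-first search. The function names and the example site are hypothetical.

```python
from collections import deque
from urllib.parse import urlparse

def directory_level(url):
    """Number of directories between the site root and the page."""
    path = urlparse(url).path
    parts = [p for p in path.split("/") if p]
    # Assumption: the last segment is a file name unless the path ends with "/".
    if parts and not path.endswith("/"):
        parts = parts[:-1]
    return len(parts)

def click_depths(links, home):
    """Minimum number of clicks from the home page to each reachable page."""
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

# Hypothetical link structure: page -> pages it links to.
links = {"/": ["/a", "/b"], "/a": ["/a/x"], "/a/x": ["/"]}
depths = click_depths(links, "/")
```

The distribution plotted per site would then follow from counting pages per level or per depth value.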

Analysis of LSG Properties

[Figure: ratio of pages accessible from link blocks (vertical axis, 0 to 1) against ratio of pages that contain link blocks (horizontal axis, 0 to 1), plotted for s-nodes and c-nodes, with the annotations below]

Template covers a larger structure

Content link blocks found in many pages

Few content blocks point to many pages (e.g. sitemap)

Sites with very little structure

Same template spread across many pages, but touching few pages

Analysis of LSG Properties (contd.)

• Strongly Connected Components (SCCs) of the LSG:

– sub-graphs of the LSG

– there is a path between every pair of nodes

– e.g. navigation menus which are part of the same template

[Figure: directed LSG with nodes A to G; SCC = {A,B,C,D}]

[Figure: undirected LSG with nodes A to G; SCC = {A,B,C,D,E,F,G}]
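The difference between the directed and the undirected case can be reproduced with a short sketch of Kosaraju's algorithm. The edges of the slide's example graph are not recoverable from the text, so the edge list below is a hypothetical one chosen to give the same components as the figure: {A,B,C,D} when directions are respected, and all seven nodes when they are ignored.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps node -> list of successor nodes."""
    nodes = set(graph) | {v for succ in graph.values() for v in succ}
    order, seen = [], set()

    def visit(u):                       # first pass: record finishing order
        seen.add(u)
        for v in graph.get(u, []):
            if v not in seen:
                visit(v)
        order.append(u)

    for u in sorted(nodes):
        if u not in seen:
            visit(u)

    reverse = {u: [] for u in nodes}    # same graph with every edge reversed
    for u, succ in graph.items():
        for v in succ:
            reverse[v].append(u)

    components, assigned = [], set()

    def collect(u, comp):               # second pass on the reversed graph
        assigned.add(u)
        comp.add(u)
        for v in reverse[u]:
            if v not in assigned:
                collect(v, comp)

    for u in reversed(order):
        if u not in assigned:
            comp = set()
            collect(u, comp)
            components.append(comp)
    return components

# Hypothetical edges: a cycle A-B-C-D plus a chain D-E-F-G.
directed = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A", "E"],
            "E": ["F"], "F": ["G"]}
undirected = {}
for u, succ in directed.items():
    for v in succ:
        undirected.setdefault(u, []).append(v)
        undirected.setdefault(v, []).append(u)

directed_sccs = strongly_connected_components(directed)
undirected_sccs = strongly_connected_components(undirected)
```

The recursive passes are fine for a toy graph; a site-scale LSG would need an iterative variant to avoid Python's recursion limit.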

User Comments vs. LSG properties

• Comments:

‘It is not obvious how I can get at the content I want by hierarchically navigating menus’

‘I think this is a website where you need to know what you are looking for. There is a lot of work in reading the text to find out how you need to make the next step in finding what you want’

• LSG properties: very few pages of this site contain and are targeted by s-node and c-node link blocks

User Comments vs. LSG properties

• Comments:

‘I got very confused in which part I was in. The breadcrumbs said I was in ‘about us’ and I was looking at project information. I think I would use the search bar to get the information that I want.’

• LSG properties: this site has many s-node disconnected components. There are only 10 SCCs with more than 5 s-nodes each and those components only touch 4% of the pages

User Comments vs. LSG properties

• Comments:

‘Easy to navigate the required information, again through the use of grouping and in this case dual menu structure which effects the links available within the side bar which is useful’

‘The high-level topics and search bar are always available. More specific subtopics can be navigated with a panel that changes, suitably to the context’

• LSG properties: 80% of pages of this site contain s-nodes and about 20% of pages of the site are accessible through s-nodes. There are 17 SCCs with more than 5 s-nodes each and collectively touching around 5% of the pages (>1300 pages) of the site.


Identification of Subsites

Sites are often organized into several units of content, referring to a particular topic or function

Structure can be presented in terms of subsites

Connected structural link blocks expose the intrinsic organization of subsite content

Identification of Subsites

1. How to define the scope of a subsite?

Set of pages with a shared navigation mechanism, that are likely to present a consistent page style

LSG Strongly Connected Components (SCCs):

Navigating a subsite through a sequence of clicks on distinct link blocks implies a path of connected s-nodes in the LSG.

Identifying SCCs isolates nodes that are contained in pages of the same subsite

Subsite pages

Identification of Subsites

2. How to identify entry pages for a subsite?

Web page(s) that facilitate navigation around the subsite and are representative of the subsite content

Define appropriate subsite page scores

Select pages with the highest score

[Figure legend: entry page; subsite pages]

Page and Block Rank Scores

PageRank:

\[ PR(p_i) = \frac{1-k}{|V(G)|} + k \sum_{p_j \in N(p_i)} \frac{PR(p_j)}{d(p_j)} \]

Probability that a user will navigate to a given page when randomly surfing the Web.

LSG block rank:

\[ BR(g_i) = \frac{1-k}{|V(LSG)|} + k \sum_{g_j \in N(g_i)} \frac{BR(g_j)}{D(g_j)} \]

Probability that a user will see a link block on a page if randomly navigating the pages using only LSG link blocks.
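Reading both scores as the standard damped random walk, a single power-iteration routine can illustrate how either one might be computed, over pages for PageRank or over LSG nodes for the block rank. The sketch makes several assumptions that the slides do not state: the neighbourhood N(·) is the set of nodes linking to a node, the denominator is the out-degree of the linking node, k = 0.85, and a node without out-links spreads its score evenly over all nodes. The graph is a made-up example.

```python
def rank(out_links, k=0.85, iterations=50):
    """Power iteration for R(i) = (1 - k)/|V| + k * sum_j R(j)/outdeg(j)."""
    nodes = set(out_links) | {v for t in out_links.values() for v in t}
    n = len(nodes)
    score = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: (1.0 - k) / n for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = k * score[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Assumption: a dangling node spreads its score over all nodes.
                for v in nodes:
                    new[v] += k * score[u] / n
        score = new
    return score

# Hypothetical graph: a home page linking to two pages that link back.
graph = {"home": ["a", "b"], "a": ["home"], "b": ["home"]}
scores = rank(graph)
```

With the dangling-node rule above, the scores always sum to 1, so they can be read as probabilities.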

Entry Page Score

\[ EPR(p_i) = \alpha \, PR_{site}(p_i) + \beta \, PR_{subsite}(p_i) + \gamma \, BR_{site}(g_i) \]

g_i is the s-node with the highest BR that is contained in page p_i

Experiments suggested \(\alpha = 3\), \(\beta = 2\) and \(\gamma = 1\)
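Selecting an entry page then amounts to scoring every page of a subsite and taking the maximum. The sketch below reads the entry page score as a weighted sum of the three component scores with weights 3, 2 and 1; the additive form and the parameter names `alpha`, `beta` and `gamma` are assumptions, and the page URLs and score values are invented for illustration.

```python
def entry_page(pages, alpha=3.0, beta=2.0, gamma=1.0):
    """pages: dict page -> (PR_site, PR_subsite, BR_site of its best s-node)."""
    def epr(scores):
        pr_site, pr_subsite, br_site = scores
        return alpha * pr_site + beta * pr_subsite + gamma * br_site
    # The page with the highest score is proposed as the subsite entry page.
    return max(pages, key=lambda page: epr(pages[page]))

# Hypothetical scores for the pages of one detected subsite.
subsite = {
    "/research/": (0.02, 0.30, 0.05),
    "/research/people": (0.01, 0.20, 0.05),
    "/research/pubs": (0.01, 0.10, 0.02),
}
best = entry_page(subsite)
```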

Using LSG for Site Structure Analysis

Data set: 20 Web sites from DMOZ*, heterogeneous in

– topic: covering 7 top-level DMOZ topic categories

– size: ranging from ~250 pages to ~40000 pages

Link block reach and spread:

– s-node reach reveals the coverage of content through structural links

– s-node spread reveals how widespread the use of a particular template is across site pages

In-link degree distribution

High variability across sites

* Open directory: http://dmoz.org
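Reach and spread can be made concrete with a small sketch. It assumes that spread is the ratio of site pages that contain at least one block of the given kind, and that reach is the ratio of site pages that are targets of such blocks; the function name, the block labels and the four-page site are hypothetical.

```python
def spread_and_reach(pages, block_kind, kind="s-node"):
    """pages: dict page -> list of link blocks (tuples of target URLs)."""
    total = len(pages)
    # Spread: pages that contain at least one block of the requested kind.
    containing = {page for page, blocks in pages.items()
                  if any(block_kind.get(b) == kind for b in blocks)}
    # Reach: site pages that are targets of a block of the requested kind.
    reached = {target
               for blocks in pages.values()
               for b in blocks if block_kind.get(b) == kind
               for target in b if target in pages}
    return len(containing) / total, len(reached) / total

menu = ("/news", "/sport")
stories = ("/story1",)
pages = {"/": [menu, stories], "/news": [menu], "/sport": [menu], "/story1": []}
block_kind = {menu: "s-node", stories: "c-node"}

s_spread, s_reach = spread_and_reach(pages, block_kind, "s-node")
c_spread, c_reach = spread_and_reach(pages, block_kind, "c-node")
```

Here the menu is widespread (three of four pages contain it) but reaches only half of the site, the pattern described as a template spread across many pages while touching few.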

Evaluation Issues

Main issues with the evaluation of the detected subsites:

Evaluation of subsites and entry pages requires manual inspection of all the pages of the site, which is impractical

Representative set of web sites should be used to evaluate the algorithms

Pilot study with 2 of the sites from our sample to gain further insight into the complexity of the evaluation task

Proposed evaluation methodology:

Pooling method to obtain candidate entry pages from multiple systems

Engage human assessors to browse each site and decide if the pages from the pool were entry pages of a subsite or not
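The pooling step can be sketched as a union of the candidate lists that keeps track of which source proposed each page. The source names and URLs below are hypothetical.

```python
def build_pool(candidates_by_source):
    """Union of candidate entry pages, remembering which sources proposed each."""
    pool = {}
    for source, candidates in candidates_by_source.items():
        for page in candidates:
            pool.setdefault(page, set()).add(source)
    return pool

# Hypothetical candidate lists from two sources.
pool = build_pool({
    "index-pages": ["/", "/about/index.html"],
    "epr": ["/", "/projects/"],
})
```

Each page in the pool is shown to the assessors once, however many sources proposed it.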

Pool of Entry Pages for Assessment

(diagonal: number of pages from each source; off-diagonal: pages common to both sources)

Site A: www.artifice.com

      A    B    C    D    E
A     5    1    0    3    1
B          2    0    0    0
C               4    1    1
D                   24    6
E                        10

Site B: www.sigmaxi.org

      A    B    C    D    E
A     6    0    6    2    0
B          2    2    0    0
C             114   10   15
D                  125   41
E                        70

A. Entry pages manually selected by experts

B. Pages from the Web site included in the DMOZ directory

C. Index pages such as ‘index.*’ or ‘default.*’

D. First target page of all s-node link blocks

E. Top ranked page, according to the EPR score, for each subsite detected by the LSG decomposition into strongly connected components

Entry Page Assessments

                   Site A: www.artifice.com    Site B: www.sigmaxi.org
                   Assessor J1                 Assessor J1
Assessor J2        Yes    No    Total          Yes    No    Total
Yes                  1     4      5              7    17     24
No                   5    24     29             43   179    222
Total                6    28     34             50   196    246

Pilot study included 2 human assessors (J1 and J2) who evaluated all entry pages from the pool

Simple GUI to display entry pages and input assessments (yes/no and confidence level in the assessment)

Good agreement on negative assessments, but not so good on the positive ones

Confidence levels on the assessments generally higher on site B
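The agreement pattern can be quantified from the two contingency tables. The statistics below (positive and negative specific agreement, and Cohen's kappa) are standard measures that were not reported in the talk; they are added here only as an illustration, and the function and argument names are hypothetical.

```python
def agreement(both_yes, j2_only, j1_only, both_no):
    """Agreement statistics for two assessors from a 2x2 table of counts."""
    n = both_yes + j2_only + j1_only + both_no
    j1_yes = both_yes + j1_only
    j2_yes = both_yes + j2_only
    observed = (both_yes + both_no) / n
    # Chance agreement expected from each assessor's marginal yes/no rates.
    expected = (j1_yes * j2_yes + (n - j1_yes) * (n - j2_yes)) / n ** 2
    return {
        "positive": 2 * both_yes / (j1_yes + j2_yes),
        "negative": 2 * both_no / ((n - j1_yes) + (n - j2_yes)),
        "kappa": (observed - expected) / (1 - expected),
    }

# Counts taken from the assessment table (rows: J2, columns: J1).
site_a = agreement(both_yes=1, j2_only=4, j1_only=5, both_no=24)
site_b = agreement(both_yes=7, j2_only=17, j1_only=43, both_no=179)
```

On both sites the negative agreement is about 0.84 to 0.86 while the positive agreement is below 0.2, which matches the observation that the assessors agreed on what is not an entry page far more often than on what is.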

Results of the Entry Page Assessment

Site A: www.artifice.com

              A (manual)       B (DMOZ)         C (index*)       D (s-node)       E (EPR)
Assessor J1   P: 20%, R: 17%   P: 100%, R: 33%  P: 25%, R: 17%   P: 4%, R: 17%    P: 20%, R: 33%
Assessor J2   P: 20%, R: 11%   P: 100%, R: 22%  P: 25%, R: 11%   P: 21%, R: 56%   P: 20%, R: 22%
Total pages   5                2                4                24               10

Site B: www.sigmaxi.org

              A (manual)       B (DMOZ)         C (index*)       D (s-node)       E (EPR)
Assessor J1   P: 83%, R: 8%    P: 100%, R: 3%   P: 49%, R: 93%   P: 10%, R: 20%   P: 20%, R: 22%
Assessor J2   P: 67%, R: 17%   P: 100%, R: 8%   P: 19%, R: 92%   P: 4%, R: 21%    P: 13%, R: 38%
Total pages   6                12               114              125              70

(P: set precision, R: set recall – relative to own individual assessments)
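Set precision and set recall for one source against one assessor can be computed as below. The page identifiers are hypothetical; they are sized to mirror source A on Site A for assessor J1 (5 candidate pages, 6 pages judged to be entry pages, 1 page in common), which is consistent with the reported P: 20% and R: 17%.

```python
def precision_recall(candidates, judged_entry_pages):
    """Set precision and recall of candidate pages against one assessor."""
    candidates, relevant = set(candidates), set(judged_entry_pages)
    hits = len(candidates & relevant)
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical identifiers: only "a1" is both proposed and judged positive.
candidates = ["a1", "a2", "a3", "a4", "a5"]
judged = ["a1", "x1", "x2", "x3", "x4", "x5"]
p, r = precision_recall(candidates, judged)
```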

Guidelines for Evaluation Support

Provide quick access to the pages in the vicinity of a given page (i.e., the parent, child and sibling nodes)

Provide visual cues such as page thumbnails of flexible size

Make the relationship between the URL and the links on the parent page explicit

Provide easy access to pages that have already been visited during evaluation (e.g. present a navigation trail)

Enable the assessors to customize presentation of candidate entry pages, i.e., as a sorted list, graph, etc.

Concluding Remarks

LSG model – enables in-depth analysis of Web sites and identification of subsites

Proposed a pooling method for gathering relevance judgements

Defined an evaluation methodology and presented guidelines to assist in the creation of a test data set

Future work: large scale evaluation of algorithms for subsite and entry page detection

Subsite structure of the Microsoft Research Web Site – selected pages (recently most visited)

Research Desktop: Activity Based Computing

Eduarda Mendes Rodrigues, Natasa Milic-Frayling

Gabriella Kazai, Gavin Smyth

Rachel Jones

Gerard Oleksik

Research Desktop

Integrated Systems

Background

• Scholars, researchers

• Range of activities in order to accomplish tasks

– Gathering relevant sources of information

– Reading through the material

– Annotating and note taking

– Analyzing the material

– Communicating findings to colleagues

– Authoring publications

• Workflows of different styles

– Structured and un-structured

– Short-lived projects and life-long work

Research Desktop

• Research Desktop augments the standard desktop environment with concepts and designs that enable new ways of working and managing resources

• It provides support in four key areas:

– Activities

– Tools

– Library

– Notes

Research Desktop Activities

• Activity-centric content access

• Label (tag) related resources

• Activate a task or switch between multiple tasks

• Resume work

• Preserved state

• Activity monitor

• Toolbar plug-in

Library and Notes

• Dedicated information spaces:

– Personal Library

– Notes

Tools

• Tools and services used in various contexts

• Brings tools to the user

• Examples:

– Document analysis

– Co-author network

– Trends discovery

Thank you!

Contact Natasa Milic-Frayling

[email protected]

Integrated Systems

http://research.microsoft.com/is