Adaptive Web Sites: Automatically Synthesizing Web Pages

Mike Perkowitz and Oren Etzioni
www.cs.washington.edu/homes/map/adaptive/


Page 1: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Adaptive Web Sites: Automatically Synthesizing Web Pages

Mike Perkowitz and Oren Etzioni

www.cs.washington.edu/homes/map/adaptive/

Page 2: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Adaptive Web Sites

Web sites that automatically reconfigure their organization and presentation by learning from user access patterns.

(Perkowitz & Etzioni, IJCAI’97)

Page 3: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Adaptive Web Sites

• Individual Customization: site learns you like sports

• Group Transformation: site learns most sports lovers also read “Tank McNamara” and cross-links them

Page 4: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Group Transformations

• Our approach: history-based

• Previously: Simple transformations (Perkowitz & Etzioni, WWW6)

• Goal: change in view

Page 5: Adaptive Web Sites: Automatically Synthesizing  Web Pages

machines.hyperreal.org

Page 6: Adaptive Web Sites: Automatically Synthesizing  Web Pages


Drum Machine Samples

Page 7: Adaptive Web Sites: Automatically Synthesizing  Web Pages


Page 8: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Index Page Synthesis

Find groups of related documents at the site and create new pages linking to those documents.

• Input: web site, access log
• Output: pages of links to related pages

Page 9: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Questions

• What links are on the index page?
• How are the contents ordered?
• What is the title?
• How are links labeled?
• How do we make the index comprehensive?

Page 10: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Outline

• Motivation
• Plausible approaches
  – Clustering
  – Frequent sets
• Our approach: Cluster Mining
  – Algorithm: PageGather
• Evaluation

Page 11: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Clustering

Voorhees-86, Willett-88, Rasmussen-92

• Similarity metric over documents
• Cluster: items close together, far from others

Algorithms:
• Hierarchical Agglomerative Clustering (HAC)
• K-means clustering

Page 12: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Clustering

Visit: set of pages accessed by an individual

• Document = page
• Similarity = co-occurrence in visits
• Cluster → index page contents
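
The mapping above (document = page, similarity = co-occurrence in visits) can be sketched as a pair count over visits. This is a minimal illustration, not the authors' code: the function name `cooccurrence_counts` and the toy page paths are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(visits):
    """Count, for each unordered pair of pages, how many visits contain both."""
    counts = Counter()
    for visit in visits:
        # Sort so each pair is always stored in the same (a, b) order.
        for a, b in combinations(sorted(set(visit)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical visits: each one is the set of pages a single visitor accessed.
visits = [
    {"/a", "/b", "/c"},
    {"/a", "/b"},
    {"/b", "/c"},
    {"/d"},
]
counts = cooccurrence_counts(visits)
```

A clustering algorithm such as HAC or K-means would then treat these counts, suitably normalized, as the similarity between two pages.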

Page 13: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Clustering: Problems

• Clustering induces a partition over data

• Clustering can be slow

Page 14: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Frequent Sets

Agrawal, Imielinski, & Swami-93

• Set of transactions: “basket” of items
• Find all frequently-occurring itemsets

Algorithm:
• A priori

Page 15: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Frequent Sets

Visit: set of pages accessed by an individual

• Item = page
• Transaction = visit
• Frequent set → index page contents
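
With item = page and transaction = visit, a level-wise frequent-set search might look like the sketch below. It follows the general A priori idea (grow candidates only from sets that are already frequent); the function name `frequent_itemsets`, the `min_support` parameter, and the toy visits are assumptions for illustration, not details from the slides.

```python
from itertools import combinations


def frequent_itemsets(visits, min_support):
    """Return {itemset: support} for every itemset found in >= min_support visits."""
    visits = [frozenset(v) for v in visits]
    # Level 1 candidates: every single page seen in any visit.
    candidates = {frozenset([p]) for v in visits for p in v}
    result = {}
    k = 1
    while candidates:
        # Keep only the candidates that are frequent enough.
        level = {}
        for cand in candidates:
            support = sum(1 for v in visits if cand <= v)
            if support >= min_support:
                level[cand] = support
        result.update(level)
        # Join frequent k-sets into (k+1)-set candidates, pruning any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return result


# Hypothetical visits, one set of pages per visitor.
visits = [
    {"/a", "/b", "/c"},
    {"/a", "/b"},
    {"/b", "/c"},
    {"/d"},
]
itemsets = frequent_itemsets(visits, min_support=2)
```

Even on this toy input the result contains both single pages and the pairs built from them, which hints at why the output tends to include many near-identical sets.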

Page 16: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Frequent Sets: Problems

• “Frequent Item Problem”

• Finds many similar itemsets

• Low minimum frequency → high running time

Page 17: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Idea: Cluster Mining

• Find only high-quality clusters

• Not a partition

• Clusters may overlap

Page 18: Adaptive Web Sites: Automatically Synthesizing  Web Pages

The PageGather Algorithm

• Graph-based representation
  – Nodes: pages
  – Edges: if P(P1|P2) and P(P2|P1) are both high

• Fast and accurate
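
The graph construction can be sketched as follows: estimate P(P1|P2) and P(P2|P1) from visit counts and add an edge only when both exceed a cutoff. The slide does not say what counts as "high", so the `threshold` parameter, the function name `build_graph`, and the toy visits are assumptions.

```python
from collections import Counter
from itertools import combinations


def build_graph(visits, threshold):
    """Build an undirected page graph as {page: set of neighbouring pages}."""
    page_count = Counter()   # number of visits containing each page
    pair_count = Counter()   # number of visits containing both pages of a pair
    for visit in visits:
        pages = sorted(set(visit))
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))

    graph = {page: set() for page in page_count}
    for (a, b), both in pair_count.items():
        p_a_given_b = both / page_count[b]
        p_b_given_a = both / page_count[a]
        # Require BOTH conditional probabilities to reach the cutoff.
        if min(p_a_given_b, p_b_given_a) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical visits, one set of pages per visitor.
visits = [
    {"/a", "/b", "/c"},
    {"/a", "/b"},
    {"/b", "/c"},
    {"/d"},
]
graph = build_graph(visits, threshold=0.6)
```

Taking the minimum of the two probabilities keeps a very popular page from being linked to everything that merely co-occurs with it.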

Page 19: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Log (excerpt):

www.hyperreal.com|crawl3.atext.com|GET /robots.txt HTTP/1.0|text/html|301|1997/07/03-23:59:08|-|188|-|-|-|ArchitextSpider
www.apache.org|blizzard-ext.wise.edt.ericsson.se|GET /related_projects.html HTTP/1.0|text/html|200|1997/07/03-23:59:09|-|5047|-|-|http://www.apache.org/|Mozilla/3.01Gold (X11; I; SunOS 5.5.1 sun4u) via Harvest Cache version 3.0pl5-Solaris
www.hyperreal.org|md27-001.mun.compuserve.com|GET /music/labels/recycle_or_die/ralf_hildenbeutel.gif HTTP/1.0|image/gif|304|1997/07/03-23:59:09|-|-|-|-|http://www.hyperreal.org/music/labels/recycle_or_die/|Mozilla/2.02E [de]-Beta2 (Win95; I; 16bit)

[Figure: Log → Visits → Co-occurrence → Graph → Clique/CC → New Page. Example pages shown: /97/Winter/Final/, /97/Spring/Final/, /96/Autumn/Final/, /97/Spring/Midterm/, /96/Autumn/Midterm/]
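
The first step of this pipeline, turning log entries into visits, might be sketched as below. The field layout (second field = client host, third field = request line) is inferred from the sample entries on the slide, and grouping by client host alone is a simplifying assumption; the slides do not say how visits were actually delimited. `parse_visits` is a hypothetical name.

```python
from collections import defaultdict


def parse_visits(lines):
    """Group requested paths by client host: {client: set of paths}."""
    visits = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 3:
            continue  # skip truncated or malformed entries
        client, request = fields[1], fields[2]
        parts = request.split()  # e.g. ["GET", "/robots.txt", "HTTP/1.0"]
        if len(parts) < 2:
            continue
        visits[client].add(parts[1])
    return dict(visits)


# Three entries taken from the log sample on the slide.
sample = [
    "www.hyperreal.com|crawl3.atext.com|GET /robots.txt HTTP/1.0|text/html|301|"
    "1997/07/03-23:59:08|-|188|-|-|-|ArchitextSpider",
    "www.apache.org|blizzard-ext.wise.edt.ericsson.se|GET /related_projects.html "
    "HTTP/1.0|text/html|200|1997/07/03-23:59:09|-|5047|-|-|http://www.apache.org/|"
    "Mozilla/3.01Gold (X11; I; SunOS 5.5.1 sun4u) via Harvest Cache version 3.0pl5-Solaris",
    "www.apache.org|blizzard-ext.wise.edt.ericsson.se|GET /images/apache_sub.gif "
    "HTTP/1.0|image/gif|200|1997/07/03-23:59:10|-|6083|-|-|"
    "http://www.apache.org/related_projects.html|"
    "Mozilla/3.01Gold (X11; I; SunOS 5.5.1 sun4u) via Harvest Cache version 3.0pl5-Solaris",
]
log_visits = parse_visits(sample)
```

A real preprocessing step would probably also drop image requests and crawler traffic, and split one client's requests into separate visits by time; none of that is shown here.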

Page 20: Adaptive Web Sites: Automatically Synthesizing  Web Pages

PageGather

• Implement with cliques or connected components (CCs)
  – Find all candidates, return best
  – Clique: maximal cliques of size k

• Clique and CC versions comparable in time and performance
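
The two candidate generators can be sketched on a small adjacency-set graph. The clique finder below is plain Bron-Kerbosch without pivoting, chosen for brevity; the slides do not say which clique algorithm was used. The size bound k and the ranking step ("return best") are left out, and the graph is a made-up example.

```python
def connected_components(graph):
    """Return the connected components of {node: set of neighbours} as sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        components.append(comp)
    return components


def maximal_cliques(graph):
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: nodes that can still extend it, x: excluded nodes
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical page graph: a triangle a-b-c, a tail c-d, and an isolated node e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
components = connected_components(graph)
cliques = maximal_cliques(graph)
```

On this graph the connected component {a, b, c, d} is larger and looser than the cliques {a, b, c} and {c, d} inside it, and the two cliques overlap at c, matching the idea that mined clusters need not form a partition.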

Page 21: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Experiments

machines.hyperreal.org

• Site gets ~1200 visitors/day (10k hits)
• Site contains ~2500 distinct documents

• Training: a month of access data
• Testing: ten days of data

Page 22: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Performance Metric

Are index pages helpful to users?

• How well do clusters predict user navigation?

• Q(C) = Given that a user visits one page in cluster C, how likely is she to visit any other?
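
One possible reading of Q(C) is: among test visits that touch at least one page of C, the fraction that touch at least two. The sketch below implements that reading only; the exact estimator used in the experiments is not given on the slide, and `cluster_quality` and the toy data are assumptions.

```python
def cluster_quality(cluster, visits):
    """Estimate Q(C): of the visits hitting C at all, the share hitting 2+ pages of C."""
    cluster = set(cluster)
    touching = [v for v in visits if cluster & set(v)]
    if not touching:
        return 0.0  # no test visit reached the cluster
    multiple = sum(1 for v in touching if len(cluster & set(v)) >= 2)
    return multiple / len(touching)


# Hypothetical test visits, one set of pages per visitor.
test_visits = [
    {"/a", "/b", "/c"},
    {"/a", "/b"},
    {"/b", "/c"},
    {"/d"},
]
q_ab = cluster_quality({"/a", "/b"}, test_visits)
q_d = cluster_quality({"/d"}, test_visits)
```

Computing Q on held-out visits, rather than on the training month, is what lets the score say something about how well a cluster predicts navigation.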

Page 23: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Cluster Mining vs. Clustering

PageGather using
• Clique    10 clusters   1:05 min
• HAC       10 clusters   48+ hours
• K-means   10 clusters   3:35 min

Page 24: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Cluster Mining vs. Clustering

PageGather using
• Clique    10 clusters   1:05 min
• HAC       10 clusters   48+ hours
• K-means   10 clusters   3:35 min
• HAC*      8 clusters    21:55 min (threshold, less data, mining)

Page 25: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Cluster Mining vs. Clustering

PageGather using
• Clique    10 clusters   1:05 min
• HAC       10 clusters   48+ hours
• K-means   10 clusters   3:35 min
• HAC*      7 clusters    293:08 min (threshold, less data, mining)

Page 26: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Cluster Mining vs. Clustering

[Chart: Q (y-axis, 0 to 1) for the top 10 clusters (x-axis, 1 to 10); series: Clique, K-Means]

Page 27: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Cluster Mining vs. Clustering

[Chart: Q (y-axis, 0 to 1) for the top 10 clusters (x-axis, 1 to 10); series: Clique, K-Means, HAC*]

Page 28: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Cluster Mining vs. Clustering

[Chart: Q (y-axis, 0 to 1) for the top 10 clusters (x-axis, 1 to 10); series: Clique, K-Means, HAC*]

Page 29: Adaptive Web Sites: Automatically Synthesizing  Web Pages

PageGather vs. Frequent Sets

• PG/Clique   10 clusters        1:05 min
• A priori    10 frequent sets   1:41 min

Page 30: Adaptive Web Sites: Automatically Synthesizing  Web Pages

PageGather vs. Frequent Sets

[Chart: Q (y-axis, 0 to 1) for the top 10 clusters (x-axis, 1 to 10); series: PageGather, Frequent Sets]

Page 31: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Contributions

• Motivating problem: Web page synthesis

• Method: Cluster mining
  – well suited for discovery of coherent sets
  – comparison to clustering, frequent sets

• Algorithm: PageGather
  – graph-based, fast and accurate

Page 32: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Clique vs. Conn-component

[Chart: Q (y-axis, 0 to 1) for the top 10 clusters (x-axis, 1 to 10); series: Connected, Clique]

Page 33: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Clique vs. Conn-component

• Comparable accuracy
• Clique finds fewer, smaller clusters than CC
• Clique: more accurate (at first)
• Comparable running time (in practice)

Page 34: Adaptive Web Sites: Automatically Synthesizing  Web Pages

Future Directions

• Meta-Information to improve coherence

• Conceptual clustering
  – Improve coherence
  – Naming pages

• Cluster mining to generate association rules