Adaptive Web Sites: Automatically Synthesizing Web Pages
Mike Perkowitz and Oren Etzioni
www.cs.washington.edu/homes/map/adaptive/
2
Adaptive Web Sites
Web sites that automatically reconfigure their organization and presentation by learning from user access patterns.
(Perkowitz & Etzioni, IJCAI’97)
3
Adaptive Web Sites
• Individual Customization: site learns you like sports
• Group Transformation: site learns most sports lovers also read “Tank McNamara” and cross-links them
4
Group Transformations
• Our approach: history-based
• Previously: Simple transformations (Perkowitz & Etzioni, WWW6)
• Goal: change in view
5
machines.hyperreal.org
6
Drum Machine Samples
7
8
Index Page Synthesis
Find groups of related documents at the site and create new pages linking to those documents.
• Input: web site, access log
• Output: pages of links to related pages
9
Questions
• What links are on the index page?
• How are the contents ordered?
• What is the title?
• How are links labeled?
• How do we make the index comprehensive?
10
Outline
• Motivation
• Plausible approaches
– Clustering
– Frequent sets
• Our approach: Cluster Mining
– Algorithm: PageGather
• Evaluation
11
Clustering
Voorhees-86, Willett-88, Rasmussen-92
• Similarity metric over documents
• Cluster: items close together, far from others
Algorithms:
• Hierarchical Agglomerative Clustering (HAC)
• K-means clustering
12
Clustering
Visit: set of pages accessed by an individual
• Document = page
• Similarity = co-occurrence in visits
• Cluster → index page contents
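As a rough illustration of "similarity = co-occurrence in visits", the sketch below counts how many visits contain each pair of pages. The visit data and the function name `cooccurrence_counts` are made up for this example, and the slides do not say how the raw counts are normalized into a similarity metric.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(visits):
    """Count, for every unordered pair of pages, the visits containing both."""
    counts = Counter()
    for visit in visits:
        # sorted() gives each pair a canonical (a, b) order with a < b
        for a, b in combinations(sorted(set(visit)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical visits: each is the set of pages one visitor accessed
visits = [
    {"/drums/", "/samples/", "/effects/"},
    {"/drums/", "/samples/"},
    {"/effects/", "/manuals/"},
]
counts = cooccurrence_counts(visits)
print(counts[("/drums/", "/samples/")])  # number of visits containing both pages
```

A clustering algorithm such as HAC or K-means would then operate on a similarity derived from these counts.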
13
Clustering: Problems
• Clustering induces a partition over data
• Clustering can be slow
14
Frequent Sets
Agrawal, Imielinski, & Swami-93
• Set of transactions: “basket” of items
• Find all frequently-occurring itemsets
Algorithm:
• A priori
15
Frequent Sets
Visit: set of pages accessed by an individual
• Item = page
• Transaction = visit
• Frequent set → index page contents
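To make the mapping concrete, here is a minimal sketch of a level-wise frequent-itemset search over visits. It follows the spirit of A priori (grow itemsets one page at a time, keeping only the frequent ones) but omits its candidate-pruning optimizations. The function name, the sample visits, and the `min_support` value are assumptions for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    result = {}
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        # Join pairs of frequent k-sets that differ in one item into (k+1)-sets,
        # then keep only the candidates that are themselves frequent.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == len(a) + 1}
        current = [c for c in candidates if support(c) >= min_support]
    return result

# Hypothetical visits (transactions); pages are the items
visits = [
    {"/drums/", "/samples/", "/effects/"},
    {"/drums/", "/samples/"},
    {"/drums/", "/effects/"},
    {"/samples/", "/effects/"},
    {"/drums/", "/samples/", "/effects/"},
]
result = frequent_itemsets(visits, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each frequent set found this way would be a candidate for the contents of one index page.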
16
Frequent Sets: Problems
• “Frequent Item Problem”
• Finds many similar itemsets
• Low minimum frequency → high running time
17
Idea: Cluster Mining
• Find only high-quality clusters
• Not a partition
• Clusters may overlap
18
The PageGather Algorithm
• Graph-based representation
– Nodes: pages
– Edges: if P(P1|P2) and P(P2|P1) are both high
• Fast and accurate
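The edge rule above can be sketched as follows: estimate both conditional probabilities from visit counts and link two pages only when the smaller of the two exceeds a cutoff. The function name `build_graph`, the sample visits, and the threshold of 0.5 are assumptions for illustration; the slides do not state what counts as "high".

```python
from itertools import combinations

def build_graph(visits, threshold):
    """Adjacency sets: p and q are linked when P(p|q) and P(q|p) both exceed threshold."""
    page_count = {}
    pair_count = {}
    for visit in visits:
        pages = sorted(set(visit))
        for p in pages:
            page_count[p] = page_count.get(p, 0) + 1
        for p, q in combinations(pages, 2):
            pair_count[(p, q)] = pair_count.get((p, q), 0) + 1

    graph = {p: set() for p in page_count}
    for (p, q), both in pair_count.items():
        p_given_q = both / page_count[q]  # P(p | q): visits with both / visits with q
        q_given_p = both / page_count[p]  # P(q | p): visits with both / visits with p
        if min(p_given_q, q_given_p) > threshold:
            graph[p].add(q)
            graph[q].add(p)
    return graph

# Hypothetical visits
visits = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"c"}]
graph = build_graph(visits, threshold=0.5)
print(graph)
```

Requiring both directions to be high keeps a very popular page from being linked to everything that merely co-occurs with it.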
19
[Figure: Log → Visits → Co-occurrence → Graph → Clique/CC → New Page]
Log:
www.hyperreal.com|crawl3.atext.com|GET /robots.txt HTTP/1.0|text/html|301|1997/07/03-23:59:08|-|188|-|-|-|ArchitextSpider
www.apache.org|blizzard-ext.wise.edt.ericsson.se|GET /related_projects.html HTTP/1.0|text/html|200|1997/07/03-23:59:09|-|5047|-|-|http://www.apache.org/|Mozilla/3.01Gold (X11; I; SunOS 5.5.1 sun4u) via Harvest Cache version 3.0pl5-Solaris
www.hyperreal.org|md27-001.mun.compuserve.com|GET /music/labels/recycle_or_die/ralf_hildenbeutel.gif HTTP/1.0|image/gif|304|1997/07/03-23:59:09|-|-|-|-|http://www.hyperreal.org/music/labels/recycle_or_die/|Mozilla/2.02E [de]-Beta2 (Win95; I; 16bit)
www.hyperreal.org|ras87.brunnet.net|GET /raves/media/cyberia/link.gif HTTP/1.0|image/gif|200|1997/07/03-23:59:09|-|415|-|-|http://www.hyperreal.org/raves/media/cyberia/|Mozilla/4.01 [en] (Win95; I)
www.apache.org|blizzard-ext.wise.edt.ericsson.se|GET /images/apache_sub.gif HTTP/1.0|image/gif|200|1997/07/03-23:59:10|-|6083|-|-|http://www.apache.org/related_projects.html|Mozilla/3.01Gold (X11; I; SunOS 5.5.1 sun4u) via Harvest Cache version 3.0pl5-Solaris
www.apache.org|210.140.143.27|GET /images/apache_pb.gif HTTP/1.0|image/gif|304|1997/07/03-23:59:10|-|-|-|-|http://www.apache.org/|Mozilla/3.01 [ja] (Win95; I)
www.apache.org|r2d2.dd.dk|GET /docs/ HTTP/1.0|text/html|200|1997/07/03-23:59:11|-|2207|-|-|http://www.apache.org/|Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)
www.hyperreal.org|md27-001.mun.compuserve.com|GET /music/labels/recycle_or_die/oliver_lieb.gif HTTP/1.0|image/gif|304|1997/07/03-23:59:11|-|-|-|-|http://www.hyperreal.org/music/labels/recycle_or_die/|Mozilla/2.02E [de]-Beta2 (Win95; I; 16bit)
www.hyperreal.org|du5-ts1.lascruces.com|GET /~wally/epsilon.gif HTTP/1.0|image/gif|200|1997/07/03-23:59:11|-|4002|-|-|http://www.hyperreal.org/music/artists/fsol/www/|Mozilla/2.0 (compatible; MSIE 3.02; Update a; Windows 95)
www.hyperreal.org|du5-ts1.lascruces.com|GET /~wally/hyperreal.gif HTTP/1.0|image/gif|200|1997/07/03-23:59:11|-|2525|-|-|http://www.hyperreal.org/music/artists/fsol/www/|Mozilla/2.0 (compatible; MSIE 3.02; Update a; Windows 95)
www.hyperreal.org|md27-001.mun.compuserve.com|GET /music/labels/recycle_or_die/baked_beans.gif HTTP/1.0|image/gif|304|1997/07/03-23:59:11|-|-|-|-|http://www.hyperreal.org/music/labels/recycle_or_die/|Mozilla/2.02E [de]-Beta2 (Win95; I; 16bit)
www.hyperreal.org|cc6145d.comm.sfu.ca|GET /music/machines/categories/effects/ HTTP/1.0|text/html|200|1997/07/03-23:59:12|-|3844|-|-|http://www.hyperreal.org/music/machines/categories/|Mozilla/2.02 (Macintosh; I
New Page:
/97/Winter/Final/
/97/Spring/Final/
/96/Autumn/Final/
/97/Spring/Midterm/
/96/Autumn/Midterm/
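The step from log to visits shown on this slide could look roughly like the sketch below. The field layout (server, client host, request, content type, then further fields) is inferred from the sample records, not documented in the slides. Treating every request from one client host as a single visit and keeping only `text/html` responses are simplifying assumptions, and `visits_from_log` is a hypothetical name.

```python
from collections import defaultdict

def visits_from_log(lines):
    """Group requested HTML pages by client host (one 'visit' per client)."""
    visits = defaultdict(set)
    for line in lines:
        fields = line.split("|")
        if len(fields) < 6:
            continue  # skip malformed or truncated records
        server, client, request, content_type = fields[:4]
        if content_type != "text/html":
            continue  # assumption: images and other assets are not pages
        parts = request.split()
        if len(parts) < 2:
            continue  # request line without a path
        visits[client].add(server + parts[1])
    return dict(visits)

# Three records taken from the log shown on the slide
sample = [
    "www.apache.org|blizzard-ext.wise.edt.ericsson.se|GET /related_projects.html HTTP/1.0|text/html|200|1997/07/03-23:59:09|-|5047|-|-|http://www.apache.org/|Mozilla/3.01Gold (X11; I; SunOS 5.5.1 sun4u) via Harvest Cache version 3.0pl5-Solaris",
    "www.apache.org|blizzard-ext.wise.edt.ericsson.se|GET /images/apache_sub.gif HTTP/1.0|image/gif|200|1997/07/03-23:59:10|-|6083|-|-|http://www.apache.org/related_projects.html|Mozilla/3.01Gold (X11; I; SunOS 5.5.1 sun4u) via Harvest Cache version 3.0pl5-Solaris",
    "www.apache.org|r2d2.dd.dk|GET /docs/ HTTP/1.0|text/html|200|1997/07/03-23:59:11|-|2207|-|-|http://www.apache.org/|Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)",
]
visits = visits_from_log(sample)
for client, pages in visits.items():
    print(client, sorted(pages))
```

A real preprocessing step would probably also split one client's requests into separate visits by time, which this sketch does not attempt.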
20
PageGather
• Implement with cliques or connected components (CCs)
– Find all candidates, return best
– Clique: maximal cliques of size k
• Clique and CC versions comparable in time and performance
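The two ways of finding candidate clusters can be sketched on a small page graph. The code below uses a depth-first search for connected components and the basic Bron-Kerbosch recursion for maximal cliques. The graph is invented, the size bound k and the "return best" ranking step are left out, and nothing here is claimed to match the authors' implementation.

```python
def connected_components(graph):
    """Return the connected components of an undirected graph given as adjacency sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

def maximal_cliques(graph):
    """Return all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(clique)  # nothing can extend this clique
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return cliques

# Hypothetical page graph: a triangle a-b-c, a tail c-d, and an isolated page e
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
components = sorted(sorted(c) for c in connected_components(graph))
cliques = sorted(sorted(c) for c in maximal_cliques(graph))
print(components)
print(cliques)
```

On this graph the connected component containing the triangle also absorbs page d, while the cliques stay smaller and may overlap (page c appears in two of them).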
21
Experiments
machines.hyperreal.org
• Site gets ~1200 visitors/day (10k hits)
• Site contains ~2500 distinct documents
• Training: a month of access data
• Testing: ten days of data
22
Performance Metric
Are index pages helpful to users?
• How well do clusters predict user navigation?
• Q(C) = Given that a user visits one page in cluster C, how likely is she to visit another page in C?
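One plausible reading of Q(C) is: among test visits that touch at least one page of C, the fraction that touch at least two. The sketch below computes that estimate. The function name `cluster_quality`, the sample data, and this exact formulation are assumptions; the slide gives only the informal definition.

```python
def cluster_quality(cluster, visits):
    """Fraction of visits touching the cluster that include two or more of its pages."""
    cluster = set(cluster)
    hits = [len(cluster & set(visit)) for visit in visits]
    touched = sum(1 for h in hits if h >= 1)
    if touched == 0:
        return 0.0  # no test visit reached the cluster
    return sum(1 for h in hits if h >= 2) / touched

# Hypothetical cluster and test visits
cluster = {"a", "b", "c"}
test_visits = [{"a", "b"}, {"a"}, {"c", "d"}, {"d"}, {"a", "b", "c"}]
q = cluster_quality(cluster, test_visits)
print(q)
```

Here four visits touch the cluster and two of them include a second cluster page, so the estimate is one half.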
23
Cluster Mining vs. Clustering
PageGather using:
• Clique: 10 clusters, 1:05 min
• HAC: 10 clusters, 48+ hours
• K-means: 10 clusters, 3:35 min
24
Cluster Mining vs. Clustering
PageGather using:
• Clique: 10 clusters, 1:05 min
• HAC: 10 clusters, 48+ hours
• K-means: 10 clusters, 3:35 min
• HAC*: 8 clusters, 21:55 min (threshold, less data, mining)
25
Cluster Mining vs. Clustering
PageGather using:
• Clique: 10 clusters, 1:05 min
• HAC: 10 clusters, 48+ hours
• K-means: 10 clusters, 3:35 min
• HAC*: 7 clusters, 293:08 min (threshold, less data, mining)
26
Cluster Mining vs. Clustering
[Chart: Q (0 to 1) for the top 10 clusters. Series: Clique, K-Means]
27
Cluster Mining vs. Clustering
[Chart: Q (0 to 1) for the top 10 clusters. Series: Clique, K-Means, HAC*]
28
Cluster Mining vs. Clustering
[Chart: Q (0 to 1) for the top 10 clusters. Series: Clique, K-Means, HAC*]
29
PageGather vs. Frequent Sets
• PG/Clique: 10 clusters, 1:05 min
• A priori: 10 frequent sets, 1:41 min
30
PageGather vs. Frequent Sets
[Chart: Q (0 to 1) for the top 10 clusters. Series: PageGather, Frequent Sets]
31
Contributions
• Motivating problem: Web page synthesis
• Method: Cluster mining
– well suited for discovery of coherent sets
– comparison to clustering, frequent sets
• Algorithm: PageGather
– graph-based, fast and accurate
32
Clique vs. Connected Component
[Chart: Q (0 to 1) for the top 10 clusters. Series: Connected, Clique]
33
Clique vs. Connected Component
• Comparable accuracy
• Clique finds fewer, smaller clusters than CC
• Clique: more accurate (at first)
• Comparable running time (in practice)
34
Future Directions
• Meta-Information to improve coherence
• Conceptual clustering
– Improve coherence
– Naming pages
• Cluster mining to generate association rules