Data Miing and Knowledge Discvoery - Web Data Mining
Transcript of Data Miing and Knowledge Discvoery - Web Data Mining
Web Mining Research at the Center for Web Intelligence
Bamshad MobasherCenter for Web Intelligence
School of Computer Science, Telecommunication, and Information SystemsDePaul University, Chicago
Bamshad MobasherCenter for Web Intelligence
School of Computer Science, Telecommunication, and Information SystemsDePaul University, Chicago
2
What is Web Mining
i From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident4Web mining is the collection of technologies to fulfill this potential.
Web Mining Definition
application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.
application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.
3
A Taxonomy for Web Mining
Web Mining
DatabaseApproachDatabaseApproach
Web ContentMining
IR / IAApproach
Intelligent Search ToolsInfo. Filtering/CategorizationPersonalized Info. AgentsCo-Citation AnalysisContent-Based Filtering
Web Query SystemsConcept HierarchiesStructure DiscoveryOntology LearningWeb Info. Integration
E-Commerce Data AnalysisSite Evaluation/optimizationE-Metrics and Web AnalyticsUser Modeling/ProfilingWeb PersonalizationWeb Communities
Web Structure Mining
Web Structure Mining
Web UsageMining
Web UsageMining
4
Web Usage Mining
i The Problem4 Analyze Web navigational data to find how the Web site is used by Web
usersi Current Approaches
4 Statistical reports4 OLAP4 Data mining/Machine Learning techniques
i Applications4 Web server caching4 Link prediction and analysis4 Web personalization4 Web site adaptation4 Web site evaluation and reorganization
i Challenge4 Quantitatively capture Web users’ common interests and characterize their
underlying tasks
5
Web Usage Mining as a Process
Web &ApplicationServer Logs
Data CleaningPageview Identification
SessionizationData Integration
Data Transformation
Data Preprocessing
UserTransactionDatabase
User/Session ClusteringPageview ClusteringCorrelation Analysis
Association Rule MiningSequential Pattern Mining
Usage Mining
Patterns
Pattern FilteringAggregation
Characterization
Pattern Analysis
Site Content& Structure
Domain Knowledge
AggregateUser Models
Data Preparation Phase Pattern Discovery Phase
6
Common Data Mining Tasks
iClustering:4 Automatically group together users with similar purchasing or navigational
patternshUser / customer segments
4 Automatically group together items based on co-occurrence in user sessions4 Automatic creation of concept or functional hierarchies for the site
iClassification4 categorizing pages or items into a concept hierarchy 4 classifying users into behavioral groups (e.g., browser, likely to purchase,
loyal customer, etc.)
7
Common Data Mining Tasks
iAssociation Rules4 Associating presence of a set of items with other sets of items
hX Y, where X and Y are sets of itemshSupport of the itemset X ∪ Y: Pr(X,Y); Confidence of rule: Pr(Y|X)
4 Examples:h30% of clients who accessed /special-offer.html, placed an online order in
/products/software/
iSequential Patterns / Path Analysis4 Finding common sequences of events/items appearing frequently in transactions
h “x% of the time, when ABC appears in a transaction, C appears within z transactions”)
h40% of people who bought the book “How to cheat IRS” booked a flight to South America 6 months later
h15% of visitors followed the path home > * > software > * > shopping cart > checkout
8
Example: Path Analysis for Ecommerce
Visit
Search(64% successful)
No Search
Last Search SucceededLast Search Failed
10%90%
Avg sale per visit: 2.2X
Avg sale per visit: $X
Avg sale per visit: 2.8XAvg sale per visit: 0.9X
70% 30%
9
Example: Association Analysis for Ecommerce
Product Association Lift Confidence
WebsiteRecommended Products
J Jasper Towels
FullyReversibleMats
456 41%Egyptian CottonTowels
White CottonT-Shirt Bra
PlungeT-Shirt Bra 246 25%
Black embroidered underwired bra
Confidence 1.4%
Confidence 1%
i Confidence: 41% who purchased Fully Reversible Mats also purchased Egyptian Cotton Towelsi Lift: People who purchased Fully Reversible Mats were 456 times more likely to purchase the
Egyptian Cotton Towels compared to the general population
10
Web Usage Mining: clustering example
i Transaction Clusters: 4 Clustering similar user transactions and using centroid of each cluster as
a usage profile (representative for a user segment)
Current Degree Descriptions 2002/programs/0.88
M.S. in Distributed Systems program description
/programs/2002/gradds2002.asp0.82
SE 450 course description in SE program
/programs/courses.asp?depcode=96&deptmne=se&courseid=450
0.85
Web page of a lecturer who thought the above course
/people/facultyinfo.asp?id=2900.97
SE 450 Object-Oriented Development class syllabus
/courses/syllabus.asp?course=450-96-303&q=3&y=2002&id=290
1.00Pageview DescriptionURLSupport
Sample cluster centroid from dept. Web site (cluster size =330)
11
Latent Variable Models
i Assume the existence of a set of latent (unobserved) variables (or factors) which “explain” the underlying relationships between two sets of observed variables.
Probabilistic Latent Semantic Analysis (PLSA) :
Probabilistically determine the association between each latent factor and pages, or between each factor and users.
Latent factors correspond to distinguishable patterns usually associated with performing certain navigational task.
p1 p2 pn
u1 u2 um
p3
u3
z1 z2 zk
Pageviews
Latent Factors
User sessions
12
“Task-Oriented” User TrackingExample from CTI Data
0.3994Task 25
0.0489Task 20
Online application - finishDepartment main page
Top factors given this user
Online application – step 2
0.4527Task 3
Application – status check
A real user session (page listed in the order of being visited)
Admission main pageWelcome information – Chinese versionAdmission info for international students
Admission - requirementsAdmission – mail request
Admission – orientation infoAdmission – F1 visa and I20 info
Online application - startOnline application – step 1
0.0458Task 26
ProgramsOnline application – step 1
PageNameDepartment main page
Admission requirementsAdmission main page
Admission costs
Admission – international students…
…
PageNameOnline application – startOnline application – step1Online application – step2Online application - finish
Department main page
Online application - submit
13
Web Personalization
i The Problem4 dynamically serve customized content (pages, products, recommendations,
etc.) to users based on their profiles, preferences, or expected interests
i Current Approaches4 collaborative filtering
hgive recommendations to a user based on ratings of “similar” usershmost successful and wide-spread approachhbut, suffers from scalability problems given many items and users
4 content-based filteringh track pages the user visits and recommend other pages with similar contenthmay miss relationships among objects not based on content similarity
4 rule-based filteringhprovide content to users based on predefined rules (e.g., “if user has
clicked on A and the user’s zip code is 90210, then add a link to C”)h relies on static profiles for users obtained through explicit feedback
14
Web Mining Approach to Personalization
i Basic Idea4 generate aggregate user models (usage profiles) by discovering user
access patterns through Web usage mining (offline process)
4 match a user’s active session against the discovered models to provide dynamic content (online process)
i Advantages4 no explicit user ratings or interaction with users4 helps preserve user privacy, by making effective use of anonymous data4 enhance the effectiveness and scalability of collaborative filtering4 more accurate and broader recommendations than content-only approaches
15
Automatic Web Personalization:
Recommendation EngineRecommendation Engine
Web Server Client BrowserActive Session
RecommendationsIntegrated User Profile
AggregateUsage Profiles
<user,item1,item2,…>
Stored User Profile
Domain Knowledge
16
Quantitative Evaluation of Recommendation Effectiveness
i Two important factors in evaluating recommendations4 Precision: measures the ratio of “correct” recommendations to all
recommendations produced by the system
hlow precision would result in angry or frustrated users
4 Coverage: measures the ratio of “correct” recommendations to all pages/items that will be accessed by user
hlow coverage would inhibit the ability of the system to give relevant recommendations at critical points in user navigation
i Transactions Divided into Training & Evaluation Setshtraining set is used to build models (generation of aggregate profiles,
neighborhood formation)
hevaluation set is used to measure precision & coverage
h10-Fold Cross Validation generally used in the experiments
17
Problems with Web Usage Mining
i New item problem4 Patterns will not capture new items recently added4 Bad for dynamic Web sites, and in the context of collaborative filtering
i Poor machine interpretability4 Hard to generalize and reason about patterns
hNo domain knowledge used to enhance resultshE.g., knowing a student is interested in a particular program, we could
recommend the prerequisites, core or popular courses in this program, related courses in other programs, etc.
i Poor insight into the patterns themselves 4 The nature of the relationships among items or users in a pattern is not directly
available
i Solution: Integrate Semantic Knowledge with Web Usage Mining
18
Ontology-Based Usage Mining
i Approach 1: Ontology-Enhanced Transactions4 Initial transaction vector: t = <weight(p1,t), …, weight(Pn,t)>4 Transform into content-enhanced transaction:
t = <weight(o1,t), …, weight(or,t)>4 The structured objects o1, …, or are instances on ontology entities
extracted from pages p1, …, pn in the transaction4 Now mining tasks can be performed based on ontological similarity among
user transactions
i Approach 2: Ontology-Enhanced Patterns4 Discover usage patterns in the standard way4 Transform patterns by creating an aggregate representation of the
patterns based on the ontologyhRequires the categorization of similar objects into ontology classeshAlso requires the specification of different aggregation/combination
function for each attribute of each class in the ontology
19
Ontology-Based Pattern Aggregation
Year
{2000}
{1999}
{2002}
{S: 0.6; W: 0.4}
{S: 0.5; T:0.5}
{S: 0.7; T: 0.2; U: 0.1}
Name
Genre-All
Romance Comedy
Romantic Comedy
Kid&Family
Romance
Genre-All
Comedy
Genre-All
[1999, 2002]
Year
{S: 0.58; T: 0.27; W:0.09; U: 0.05}
{A: 0.5; B: 0.35; C: 0.15}
ActorGenreName
Genre-All
Romance Comedy
Object Extraction
Ontology-Based
Aggregation
Comedy
Genre ActorUsage profile
{A}Movie 1:0.50 Movie1.html0.35 Movie2.html0.15 Movie3.html
{B}Movie 2:
{C}Movie 3:
Semantic Usage pattern
20
Example: Semantically Enhanced Collaborative Filtering
i Basic Idea:4 Extend item-based collaborative filtering to incorporate both similarity
based on ratings (or usage) as well as semantic similarity based on domain knowledge
i Structured semantic knowledge about items4 Extracted automatically from the Web based on domain-specific reference
ontologies4 Used in conjunction with user-item mappings to create a combined
similarity measure for item comparisons4 Singular value decomposition used to reduce noise in the semantic data
i Semantic combination threshold4 Used to determine the proportion of semantic and rating (or usage)
similarities in the combined measure
21
Web Content Mining
i Discovery of meaningful patterns from page content and meta data across the Web or in a particular Web site
i Examples of Techniques Used4 text mining4 document categorization and clustering4 semantic analysis
i Applications4 content-based filtering4 Information extraction4 intelligent interface agents4 discovery of user profiles4 ontology acquisition and learning4 recommendation systems4 opinion extraction from the Web4 contextual information access
22
Contextual Information Access
i Integrating multiple sources of evidence4 Short-term user context; Long-term user profiles; Domain knowledge
23
ARCH Background
i ARCH - Adaptive Agent for Retrieval Based on Concept Hierarchies
4 Essence of the system is to incorporate domain-specific concept hierarchies with interactive query formulation
4 Query enhancement in ARCH uses two mutually-supporting techniques:
hSemantic – using a concept hierarchy to interactively disambiguate and expand querieshBehavioral – observing user’s past browsing behavior for user profiling
and automatic query enhancement
25
Overview of the System
i The system consists of an offline and an online component
iOffline component:4 Handles the learning of the concept hierarchy4 Handles the learning of the user profiles
iOnline component:4 Displays the concept hierarchy to the user4 Allows the user to select/deselect nodes4 Generates the enhanced query based on the user’s interaction with the
concept hierarchy
27
Example from Yahoo Hierarchy
Term Vector for "Genres:"
music: 1.000blue: 0.15new: 0.14artist: 0.13jazz: 0.12review: 0.12band: 0.11polka: 0.10festiv: 0.10celtic: 0.10freestyl: 0.10
Term Vector for "Genres:"
music: 1.000blue: 0.15new: 0.14artist: 0.13jazz: 0.12review: 0.12band: 0.11polka: 0.10festiv: 0.10celtic: 0.10freestyl: 0.10
28
Aggregate Representation of Nodes in the Hierarchy
i A node is represented as a weighted term vector:4centroid of all documents and subcategories indexed under the node
n = node in the concept hierarchyDn = collection of individual documents Sn = subcategories under nTd = weighted term vector for document d indexed under node nTs = the term vector for subcategory s of node n
(( ) / | | / | | 1n n
n d n s nd D s S
T T D T S∈ ∈
= + +
∑ ∑
29
Online Component – User Interaction with the Hierarchy
i The initial user query is mapped to the relevant portions of hierarchy4 user enters a keyword query4 system matches the term vectors representing each node in the hierarchy
with the keyword query4 nodes which exceed a similarity threshold are displayed to the user, along
with other adjacent nodes.
i Semi-automatic derivation of user context4 ambiguous keyword might cause the system to display several different
portions of the hierarchy 4 user selects categories which are relevant to the intended query, and
deselects categories which are not
30
Generating the Enhanced Query
i Based on an adaptation of Rocchio's method for relevance feedback4 Using the selected and deselected nodes, the system produces a
refined query Q2:
4 each Tsel is a term vector for one of the nodes selected by the user, 4 each Tdesel is a term vector for one of the deselected nodes4 factors α, β, and γ are tuning parameters representing the relative
weights associated with the initial query, positive feedback, and negative feedback, respectively such that α + β - γ = 1.
2 1 sel deselQ Q T Tα β γ= ⋅ + ⋅ − ⋅∑ ∑
31
An Example
-
MusicMusic
GenresGenresArtistsArtists New ReleasesNew Releases
BluesBlues JazzJazz New AgeNew Age . . .
DixielandDixieland
+
+
+ . . .
Initial Query “music, jazz”
Selected Categories“Music”, “jazz”, “Dixieland”
Deselected Category“Blues”
Portion of the resulting term vector:
music: 1.00, jazz: 0.44, dixieland: 0.20, tradition: 0.11,band: 0.10, inform: 0.10, new: 0.07, artist: 0.06
music: 1.00, jazz: 0.44, dixieland: 0.20, tradition: 0.11,band: 0.10, inform: 0.10, new: 0.07, artist: 0.06
32
Another Example – ARCH Interface
i Initial query = pythoni Intent for search = python as a snakei User selects Pythons under Reptilesi User deselects Python under
Programming and Development and Monty Python under Entertainment
i Enhanced query:
33
Experimental Evaluation
i Performed testing using a small subset of the Yahoo hierarchy4 Approximately 50 nodes4 Domains: Business and Economy, Computers and Internet,
Entertainment, Science
i Search scenarios4 Single keyword: python, bat, hardware, bug, mouse4 Two keywords: python snake, bat mammal, baseball bat, computer
hardware, home hardware, bug surveillance, computer mouse4 Selection and deselection of nodes based on specific search intent
i Ten signal documents and ten noise documents collected for each search scenarios4 indexed using tf.idf weights
34
Experiment Evaluation, cont.
i Two types of search
4 Simple Query SearchhUsed the user’s original query
4 Enhanced Query SearchhUsed the enhanced query that was generated by ARCHhSelection and deselection of nodes based on search scenarios
iMatching - cosine similarity measure
i Comparison - precision and recall
35
Results
Average precision for Enhanced Query versus Simple Query Search using two keywords
Enhanced ARCH Query vs. Simple Query Search
0.00.20.40.60.81.01.2
0 10 20 30 40 50 60 70 80 90 100
Similarity Threshold (%)
Pre
cisi
on EnhancedSimple Precision = proportion of retrieved
documents actually relevant
Average recall for Enhanced Query versus Simple Query Search using two keywordsEnhanced ARCH Query vs. Simple Query Search
0.00.20.40.60.81.01.2
0 10 20 30 40 50 60 70 80 90 100
Similarity Threshold (%)
Rec
all Enhanced
Simple Recall = proportion of relevant documents actually retrieved
36
User Profiles & Information Context
i Can user profiles replace the need for user interaction?4 Instead of explicit user feedback, the user profiles are used for the
selection and deselection of concepts4 Each individual profile is compared to the original user query for
similarity4 Those profiles which satisfy a similarity threshold are then compared to
the matching nodes in the concept hierarchyhmatching nodes include those that exceeded a similarity threshold
when compared to the user’s original keyword query. 4 The node with the highest similarity score is used for automatic
selection; nodes with relatively low similarity scores are used for automatic deselection
37
Generation of User Profiles
i Profile Generation Component of ARCH4 passively observe user’s browsing behavior4 use heuristics to determine which pages user finds “interesting”
htime spent on the page (or similar pages)hfrequency of visit to the page or the sitehother factors, e.g., bookmarking a page, etc.
4 implemented as a client-side proxy server
i Clustering of “Interesting” Documents4 ARCH extracts feature vectors for each profile document4 documents are clustered into semantically related categories
hwe use a clustering algorithm that supports overlapping categories to capture relationships across clustershalgorithms: overlapping version of k-means; hypergraph partitioning
4 profiles are the significant features in the centroid of each cluster
38
Results Based on User Profiles
Simple vs. Enhanced Query Search
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
0 10 20 30 40 50 60 70 80 90 100Threshold (%)
Re
call
Simple Query Single Keyword
Simple Query Two Keywords
Enhanced Query with User Profiles
Simple vs. Enhanced Query Search
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
0 10 20 30 40 50 60 70 80 90 100
Threshold (%)
Pre
cisi
on
Simple Query Single KeywordSimple Query Two KeywordsEnhanced Query with User Profiles
39
Future Work on ARCH
i Integrate the system with a specific search engine on the World Wide Web4Weighted term vector queries must be translated into Boolean queries
i Experiment using other concept hierarchies4 Specific domains (e.g., medical)4 Design is intended to be independent of specific concept hierarchy
i Fully integrate and evaluate the User Profile Modulei Additional information4 http://www.ahusieg.com/arch
40
Some Ongoing Projects at CWI
iWeb usage mining and Web analytics4 Intelligent techniques for sessionization and pageview identification4 Integration of usage, content, and structure4 Ontology-based Web usage mining (Semantic Web Usage Mining)
iWeb personalization4 Data mining techniques for personalization4 Integrating domain knowledge (ontologies) in user modeling4 Hybrid recommendation systems4 Secure Personalization
iWeb content mining / Information Agents4 Automatically learning domain-specific ontologies from Web sites4 Integrating domain ontologies, learned user profiles, and short-term user
interests for “contextual information access” on the WebhThe ARCH project