CIKM Tutorial 2008

Web Search Log Analysis and User Behavior Modeling: A Tutorial. ACM 17th Conference on Information and Knowledge Management, Napa Valley, CA, October 26, 2008. Peiling Wang and Lei Wu (UT), Dietmar Wolfram (UMW)

Transcript of CIKM Tutorial 2008

Page 1: CIKM Tutorial 2008

Web Search Log Analysis and User Behavior Modeling: A Tutorial

ACM 17th Conference on Information and Knowledge Management

Napa Valley, CA, October 26, 2008

Peiling Wang and Lei Wu (UT)

Dietmar Wolfram (UMW)

Page 2: CIKM Tutorial 2008

Successful Funding

UT Interdisciplinary Research Grant: $14,600 (2001)

UT Research Grant: $5,000 (2003)

OCLC/ALISE Research Grant Award: $15,000 (2005)

IMLS National Leadership on Research: $200,000 (2005-2008)

Page 3: CIKM Tutorial 2008

PART I. DATA MODEL & WEB SEARCH BEHAVIOR MODELING

Page 4: CIKM Tutorial 2008

Three Query Corpora

Academic site:

4,597,478 queries (10/2002—01/2005)

[example log files]

Health information site:

377,701 queries (2005)

Excite search engine:

435,627 (1999); 450,199 (2001)

Page 5: CIKM Tutorial 2008

Server and Web Engine Logs

ACCESS.LOG

IPNum Date Time Query&Sites Machine

207.203.188.xx [06/Aug/2003:10:35:42 -0400] "GET query.html?col=utc&col=utia&…&tsi&qt=where+is+tca+law+on+release+of+lien%…" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 9"

QUERY.LOG

Date Time Hits Sites Query

2003/08/06 10:35:42 481856 utc,utia,…,tsi u'where is tca law on release of lien?'

CLICK.LOG

Date Time Action Query Sites URL Rank

2003/08/06 10:36:33 click u'where is tca law on release of lien?' utk,utia, …tsi http://web.utk.edu/~ereagan1/TCA Final Exam Notes.doc 3
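A QUERY.LOG line such as the one shown above could be split into its fields (Date, Time, Hits, Sites, Query) roughly as follows. This is a minimal sketch: the regular expression, the function name `parse_query_line`, and the shortened site list in the sample line are assumptions for illustration, since the slides do not define the log grammar.

```python
import re
from datetime import datetime

# Assumed layout: date, time, hit count, comma-separated sites, u'query'
QUERY_LINE = re.compile(
    r"^(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<hits>\d+) (?P<sites>\S+) u'(?P<query>.*)'$"
)

def parse_query_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = QUERY_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "timestamp": datetime.strptime(
            m["date"] + " " + m["time"], "%Y/%m/%d %H:%M:%S"
        ),
        "hits": int(m["hits"]),
        "sites": m["sites"].split(","),
        "query": m["query"],
    }

# Sample line; the site list is shortened here (the slide elides it with "…").
SAMPLE = "2003/08/06 10:35:42 481856 utc,utia,tsi u'where is tca law on release of lien?'"
record = parse_query_line(SAMPLE)
```

Lines that do not fit the assumed layout return `None`, so they can be counted and inspected rather than silently dropped.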

Page 6: CIKM Tutorial 2008

Methodological Notes

natural data

analogy to astronomer’s work

new hypotheses along the way

knowledge discovery through mining data

a bottom up approach (requires a good data model)

Page 7: CIKM Tutorial 2008

Logical data model (Relational)

Several models used in query analysis:

Question-oriented: Wolfram (2006); Baeza-Yates (2006)

Data-driven: Jansen (2005); Wang, Berry & Yang (2003)

Granularity (low - high)

Page 8: CIKM Tutorial 2008

Click: *QID, UID, Year, Month, Day, Time, TimeS, Rank

Query: *QID, Year, Month, Day, Time, TimeS, Hit, NumSite, IP, query_raw, groupID, QID_uniq

Query_Token_uniq: QID_uniq, String, Position

Query_uniq: QID_uniq, query, NumWord, NumChar, Freq_query_raw

Token_uniq: String, Length, Freq_query, Freq_word

WebPage: UID, URL, Freq

Figure 1 Data Model

Lexicon tool
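The tables of Figure 1 could be declared in a relational database along these lines. This is only a sketch using SQLite from Python: the table and column names come from the figure, while the column types, the reading of TimeS as a time in seconds, the choice of keys, and the foreign-key links are assumptions.

```python
import sqlite3

# Types, keys, and REFERENCES clauses are assumed, not given in the figure.
SCHEMA = '''
CREATE TABLE "Query_uniq" (
    QID_uniq INTEGER PRIMARY KEY, query TEXT,
    NumWord INTEGER, NumChar INTEGER, Freq_query_raw INTEGER);
CREATE TABLE "Token_uniq" (
    "String" TEXT PRIMARY KEY, "Length" INTEGER,
    Freq_query INTEGER, Freq_word INTEGER);
CREATE TABLE "Query_Token_uniq" (
    QID_uniq INTEGER REFERENCES "Query_uniq"(QID_uniq),
    "String" TEXT REFERENCES "Token_uniq"("String"),
    "Position" INTEGER);
CREATE TABLE "WebPage" (
    UID INTEGER PRIMARY KEY, URL TEXT, Freq INTEGER);
CREATE TABLE "Query" (
    QID INTEGER PRIMARY KEY,
    "Year" INTEGER, "Month" INTEGER, "Day" INTEGER,
    "Time" TEXT, TimeS INTEGER,
    Hit INTEGER, NumSite INTEGER, IP TEXT, query_raw TEXT,
    groupID INTEGER,
    QID_uniq INTEGER REFERENCES "Query_uniq"(QID_uniq));
CREATE TABLE "Click" (
    QID INTEGER REFERENCES "Query"(QID),
    UID INTEGER REFERENCES "WebPage"(UID),
    "Year" INTEGER, "Month" INTEGER, "Day" INTEGER,
    "Time" TEXT, TimeS INTEGER, "Rank" INTEGER);
'''

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
```

Separating raw queries (Query) from unique queries (Query_uniq) and unique tokens (Token_uniq) lets frequency counts be stored once per distinct string rather than once per log line.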

Page 9: CIKM Tutorial 2008
Page 10: CIKM Tutorial 2008

sense: *word, meaningID

word: *word, num_meaningID

synset: *meaningID, definition

morphref: word, morph, pos

Stop_word: S_Word

WordNet implemented in Relational Database

Figure 2 Lexicon Tool

Page 11: CIKM Tutorial 2008

Modeling behaviors

Corpora-based (site users as a whole)
  popular queries
  information needs
  document access (clicked URLs)
  query characteristics
  words co-occurrence

Session-based (individual searches)
  interactions (reiteration of queries …)
  search topics
  clustering sessions to identify patterns

Page 12: CIKM Tutorial 2008

Top Queries

UTK                         HealthLink                      Excite
Query               Count   Query                    Count  Query       Count
BlackBoard          74316   urinary+tract+infection  3947   yahoo       2523
Enter search terms  45564   pregnancy                2788   sex         2258
housing             45270   breast+cancer            2443   horoscopes  1249
circle park         32552   diet                     2297   hotmail     1121
registrar           28284   interstitial+cystitis    2271   maps        1100
tuition             26312   [missing]                2144   weather     963
career services     21327   blood+clots              1952   games       943
bookstore           20507   breast+self-exam         1867   ebay        918
timetable           20207   shingles                 1844   porn        861
transcripts         19436   bulemia                  1740   las+vegas   840

Page 13: CIKM Tutorial 2008

Enter search terms

Go

A problem identified

Page 14: CIKM Tutorial 2008

Seasonal Information Needs

"football" Related Query Distribution

[Figure 3: x-axis: Month (0 to 14); y-axis: Percentage (0 to 0.045); series: 2003, 2004]

Page 15: CIKM Tutorial 2008

Top 10 Words

UTK               HealthLink        Excite
Word     Count    Word      Count   Word      Count
of       54669    of        4629    and       97967
and      35547    and       3096    of        26434
for      26849    in        2813    the       17711
the      18986    the       2058    in        15368
in       16900    for       1893    free      15136
student  15604    disease   1706    for       10411
ut       14201    cancer    1315    to        7024
2003     12975    blood     1259    pictures  7000
2004     12683    to        1255    new       6991
to       11410    syndrome  1235    nude      5951

Page 16: CIKM Tutorial 2008

What can a search session tell?

Information needs (queries)
  representation of such needs
  cognitive (knowledge structure)
  topics searched
  linguistic features

Interactions (http://aquamarine.sis.utk.edu/top/session_ex.htm)
  moves from an initial query to subsequent queries
  clicks (viewing results or accessing documents)
  search strategies

Page 17: CIKM Tutorial 2008
Page 18: CIKM Tutorial 2008
Page 19: CIKM Tutorial 2008

Information Needs and Knowledge Structure

2004 University of Tennessee Football Schedule

Figure 4. Drawing a Semantic Network of a Single Search Session

[Figure 4: nodes: 2004, football, Schedule, game, UT; link weights shown: 9, 6, 7, 3, 5, 1, 11]

Page 20: CIKM Tutorial 2008

Building Semantic Networks

Corpus-based or session-based

Word co-occurrence for links

Use an algorithm to set a threshold for the boundaries of the semantic network

The threshold may vary for different clusters
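The co-occurrence links described above can be sketched as follows. This is an illustration under stated assumptions: words are linked when they appear in the same query, link weight is the number of queries in which the pair co-occurs, and a fixed `threshold` stands in for whatever algorithm sets the network boundary (the slides do not specify it). The session queries are invented for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(queries, threshold=2):
    """Return {(word_a, word_b): weight} for pairs co-occurring in at
    least `threshold` queries. Each pair is stored in sorted order."""
    counts = Counter()
    for q in queries:
        words = sorted(set(q.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= threshold}

# Hypothetical session, loosely modeled on the football example.
session = [
    "ut football",
    "ut football schedule",
    "2004 football schedule",
    "ut football game",
]
edges = cooccurrence_edges(session, threshold=2)
```

Raising the threshold prunes weak links and shrinks the network; a per-cluster threshold, as the slide suggests, would replace the single `threshold` argument.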

Page 21: CIKM Tutorial 2008

PART II. QUANTITATIVE CLUSTERING & QUALITATIVE CLUSTERING

Page 22: CIKM Tutorial 2008

Challenge: Identifying Web Search Sessions

Server-side transaction logs

Queries from different IPs interweaving

Dynamic IP address

Shared computers

Session boundaries are unidentifiable

Sessions for analysis of interactions

Page 23: CIKM Tutorial 2008

Defining Search Sessions

An artificial boundary

A set of consecutive queries submitted from the same identifier (IP address, cookie, user account) within a reasonable time interval (cutoff value)

Session boundaries depend on the cutoff value (threshold)

What is a reasonable cutoff?
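The session definition above can be sketched in a few lines. The function name `split_sessions`, the record layout, and the 1800-second cutoff in the example are assumptions chosen for illustration; the slides treat the cutoff as an open parameter.

```python
from collections import defaultdict

def split_sessions(records, cutoff_seconds):
    """records: iterable of (identifier, timestamp_in_seconds, query).
    A new session starts whenever the gap between two consecutive
    queries from the same identifier exceeds cutoff_seconds."""
    by_id = defaultdict(list)
    for ident, ts, query in records:
        by_id[ident].append((ts, query))

    sessions = []
    for ident, items in by_id.items():
        items.sort(key=lambda item: item[0])
        current = [items[0]]
        for prev, nxt in zip(items, items[1:]):
            if nxt[0] - prev[0] > cutoff_seconds:
                sessions.append((ident, current))
                current = []
            current.append(nxt)
        sessions.append((ident, current))
    return sessions

# Hypothetical interleaved log from two identifiers.
records = [
    ("id-A", 0, "housing"),
    ("id-B", 10, "tuition"),
    ("id-A", 30, "housing application"),
    ("id-A", 4000, "bookstore"),
]
sessions = split_sessions(records, cutoff_seconds=1800)
```

Grouping by identifier first handles interleaved queries from different IPs; it does not address dynamic IP addresses or shared computers, which remain a limitation of server-side logs.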

Page 24: CIKM Tutorial 2008

Experimenting Cutoff

Query interval (∆ti) is the time difference (also called time lag) between two consecutive queries from the same identifier:

$\Delta t_i = T(q_{i+1}) - T(q_i)$, where $T(q_i)$ is the timestamp of the $i$th query

Page 25: CIKM Tutorial 2008

Experimenting Different Cutoffs (Healthlink dataset)

Figure 5

Page 26: CIKM Tutorial 2008

Session variables (Means)

Length (size): Number of queries

Query length: Number of terms

Term popularity: corpus-based term F

Query interval: time lag between two consecutive queries

(Duration: time lag between first and last queries)

Term reuse: session-based term f
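One way the session variables listed above could be computed for a single session is sketched below. The aggregation choices are assumptions: term popularity is taken as the mean corpus frequency F over all term occurrences in the session, and term reuse as the mean within-session frequency f over distinct terms. The corpus frequencies and queries are invented, and every query is assumed to contain at least one term.

```python
from collections import Counter
from statistics import mean

def session_variables(session, corpus_freq):
    """session: list of (timestamp_in_seconds, query), sorted by time.
    corpus_freq: dict mapping a term to its corpus-wide frequency F."""
    times = [t for t, _ in session]
    term_lists = [q.lower().split() for _, q in session]
    all_terms = [w for terms in term_lists for w in terms]
    session_freq = Counter(all_terms)  # within-session frequency f
    intervals = [b - a for a, b in zip(times, times[1:])]
    return {
        "length": len(session),
        "query_length": mean(len(terms) for terms in term_lists),
        "term_popularity": mean(corpus_freq.get(w, 0) for w in all_terms),
        "query_interval": mean(intervals) if intervals else 0,
        "duration": times[-1] - times[0],
        "term_reuse": mean(session_freq.values()),
    }

# Hypothetical two-query session and corpus frequencies.
example_session = [(0, "football schedule"), (40, "ut football schedule 2004")]
example_corpus = {"football": 10, "schedule": 20, "ut": 30, "2004": 40}
variables = session_variables(example_session, example_corpus)
```

Each session then becomes one record of numeric variables, which is the form the clustering step expects.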

Page 27: CIKM Tutorial 2008

Figure 6 Visualize Clusters

Page 28: CIKM Tutorial 2008

Figure 6 interpreted

C1 “hit and run”: brief sessions, short query intervals, few terms, less popular terms

C2 “focused search”: long queries; popular vocabulary

C3 “struggling search”: long sessions, long query intervals, re-use of terms in subsequent queries

Page 29: CIKM Tutorial 2008

Clustering 2-steps Method

1. Session variables (see above) are exported from the database as a delimited file. Each session is represented as a record.

2. Raw data are imported into SPSS.

3. TwoStep cluster analysis is run (with standardized data).
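The export and standardization steps can be sketched outside SPSS as follows. This is not the SPSS TwoStep procedure itself: the sketch only writes session records to a tab-delimited text and applies z-score standardization. The use of the population standard deviation, the field names, and the sample records are assumptions.

```python
import csv
import io
from statistics import mean, pstdev

def to_delimited(records, fields, delimiter="\t"):
    """Serialize one record per session as delimited text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, delimiter=delimiter)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def standardize(records, fields):
    """Return copies of records with the given fields z-scored."""
    stats = {
        f: (mean(r[f] for r in records), pstdev(r[f] for r in records))
        for f in fields
    }
    out = []
    for r in records:
        row = dict(r)
        for f in fields:
            mu, sd = stats[f]
            row[f] = (r[f] - mu) / sd if sd else 0.0
        out.append(row)
    return out

# Hypothetical session records.
raw = [
    {"session": 1, "length": 1, "query_interval": 0},
    {"session": 2, "length": 3, "query_interval": 60},
    {"session": 3, "length": 5, "query_interval": 120},
]
text = to_delimited(raw, ["session", "length", "query_interval"])
z = standardize(raw, ["length", "query_interval"])
```

Standardizing puts variables measured in different units (counts, seconds, frequencies) on a common scale, so no single variable dominates the distance computation during clustering.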

Page 30: CIKM Tutorial 2008

Session Raw Data and Normalization

Page 31: CIKM Tutorial 2008

Clusters validation

Divide each dataset into two or more subsets of sessions to determine if similar clustering outcomes are produced

Longitudinal samples: each academic year consists of three quarters: Fall, Spring, Summer

Page 32: CIKM Tutorial 2008

Beyond Quantitative Clustering: Conceptual Analysis

Conceptual level
  synonyms
  association (mutual information)

Semantic level
  hyper-hypo relationship

May also include structure level
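Word association by mutual information could be estimated from query co-occurrence as sketched below. The slides do not say which variant is used; this sketch assumes pointwise mutual information, log2(P(a,b) / (P(a) P(b))), with probabilities estimated as the fraction of queries containing a word or a word pair. The queries are invented.

```python
import math
from collections import Counter
from itertools import combinations

def pointwise_mutual_information(queries):
    """Return {(word_a, word_b): PMI in bits} for every pair of words
    that co-occurs in at least one query."""
    n = len(queries)
    word_count = Counter()
    pair_count = Counter()
    for q in queries:
        words = sorted(set(q.lower().split()))
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    return {
        (a, b): math.log2((c / n) / ((word_count[a] / n) * (word_count[b] / n)))
        for (a, b), c in pair_count.items()
    }

# Hypothetical query sample.
sample_queries = [
    "class schedule",
    "class timetable",
    "course timetable",
    "football schedule",
]
pmi = pointwise_mutual_information(sample_queries)
```

A positive value means the pair co-occurs more often than chance would predict; values like these would need normalizing before being used as association scores at the conceptual level.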

Page 33: CIKM Tutorial 2008

Clustering User Queries

different queries may represent the same or similar information needs

a set of queries may look for the same information

clustering based on similarity (distance)
  word level (symbolic, morph)
  concept level (synonym, association)
  semantic level (hierarchical relationship)

Page 34: CIKM Tutorial 2008

Similarity scores

Simw = ∑ TokenScore / Max(Q1(UniqueWords), Q2(UniqueWords))

Simc = ∑ ConceptScore / Max(Q1(UniqueWords), Q2(UniqueWords))

Sims = ∑ SemanticScore / Max(Q1(UniqueWords), Q2(UniqueWords))

Page 35: CIKM Tutorial 2008

Wq1         Wq2         Token   Concept   Semantic
class       course      0       1         1
class       timetable   0       0         0
schedule    course      0       0         0
schedule    timetable   0       0         0.5
Similarity              0       0.5       0.75

Similarity at Three Levels

Q1: class schedule
Q2: timetable of courses

At the conceptual level, we may use synonyms as well as normalized word association values; thus the three pairs (class, timetable; schedule, course; schedule, timetable) need not be scored "0".
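The three scores for this example can be reproduced from the similarity formulas with a short sketch. The pair scores are copied from the table; treating "of" as a stop word is an assumption, made because it gives both queries two unique words and matches the table's results of 0, 0.5, and 0.75.

```python
STOP_WORDS = {"of"}  # assumption: stop words are excluded from the word count

# (word from Q1, word from Q2) -> (token, concept, semantic) scores from the table
PAIR_SCORES = {
    ("class", "course"): (0, 1, 1),
    ("class", "timetable"): (0, 0, 0),
    ("schedule", "course"): (0, 0, 0),
    ("schedule", "timetable"): (0, 0, 0.5),
}

def unique_words(query):
    """Unique non-stop words of a query."""
    return {w for w in query.lower().split() if w not in STOP_WORDS}

def similarity(q1, q2, pair_scores):
    """Return (Sim_w, Sim_c, Sim_s): summed pair scores at each level,
    divided by the larger of the two unique-word counts."""
    denom = max(len(unique_words(q1)), len(unique_words(q2)))
    totals = [0.0, 0.0, 0.0]
    for scores in pair_scores.values():
        for level, score in enumerate(scores):
            totals[level] += score
    return tuple(total / denom for total in totals)

sim_w, sim_c, sim_s = similarity(
    "class schedule", "timetable of courses", PAIR_SCORES
)
```

The two queries share no tokens, so the word-level score is 0, yet the concept and semantic levels still register them as related, which is the point of scoring at three levels.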

Page 36: CIKM Tutorial 2008
Page 37: CIKM Tutorial 2008
Page 38: CIKM Tutorial 2008
Page 39: CIKM Tutorial 2008

WordNet and beyond

A useful tool with limitations

Expansion of vocabulary is needed to include local vocabulary

Hierarchical relationship needs improvement

Incorporate associative relationship

Page 40: CIKM Tutorial 2008

Thank you!

Questions?