Mining web search behaviors: Strategies and techniques for data modeling and analysis

Peiling Wang (corresponding author), School of Information Sciences, The University of Tennessee at Knoxville, Knoxville, TN 37996-0341 [email protected]

Dietmar Wolfram, School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee, WI 53201 [email protected]

Jin Zhang, School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee, WI 53201

Ningning Hong, School of Information Sciences, The University of Tennessee at Knoxville, Knoxville, TN 37996-0341

Lei Wu, School of Information Sciences, The University of Tennessee at Knoxville, Knoxville, TN 37996-0341

Craig Canevit, School of Information Sciences, The University of Tennessee at Knoxville, Knoxville, TN 37996-0341

Daniel Redmon, School of Information Sciences, The University of Tennessee at Knoxville, Knoxville, TN 37996-0341

There is a growing interest in modeling Web searching behaviors using query log data. In this project, we identified some gaps in current research. We propose to model Web search behaviors along three dimensions: interaction, linguistic, and cognitive behaviors. We propose the Web search session as a vitally important concept for studying interactive behaviors using query logs. A highly granular, comprehensive relational model is presented for data extraction and transformation, along with strategies and methods for session identification. To facilitate analysis, we developed an interactive Web tool for exploring different session thresholds. We demonstrate statistically that the 80-20 empirical rule shows promise for setting session boundaries. In addition, we recommend that session boundary thresholds be determined based on specific query corpus characteristics, such as the type and size of the database searched and the type of searchers who submit the queries. Our approach is based on the fact that data mining researchers do not always know at the outset all the hypotheses that the data can answer, and that log data are diverse across environments due to the lack of standardization. This model maximizes transactional data inclusion, is flexible in handling data content, and can be extended easily to incorporate new hypotheses and new data elements as mining progresses.

Introduction

This paper presents part of phase one of a two-year IMLS-funded research project, entitled Modeling Web Searching Behaviors and Designing New

Effective Interactions for Digital Libraries. User studies have revealed that users have difficulties in searching traditional information systems such as

online public catalogs (Borgman, 1996); Web search environments differ from traditional IR and information seeking behaviors on the Web also deviate

from those expected in traditional IR environments (Yang, 2005). The purpose of this project is to develop a new interaction model that is less complex

than the traditional retrieval systems and more effective than simplistic Web search engines. To develop this model, we use a data mining approach to

analyze Web query corpora collected from three different types of search engines and websites: general Web search (Excite), academic Website (UTK),

and subject Website (HealthLink). The methods and tools also can be used to analyze other Web search logs.

There is a growing interest in studying Web search behaviors using various methods to collect data, one of which is transaction logs of real searches of

Websites or search engines. Although there is a body of literature on empirical studies of Web search transaction logs, few provide detailed

methodological clarifications on data models used and the underlying rationales for these models. Methodologies not only ensure valid synthesis of

empirical results across studies conducted in different environments, but also are important for sustained longitudinal research to observe trends over a

substantial period of time. This project provides a comprehensive model of high granularity with rationales and recommendations on important

decisions and potential elements to allow extensions for new hypotheses.

Related Studies

Research on Web information behaviors has been conducted in information science, computer science and other social sciences. The nature of Web

searching has made observations of individual Web users much harder to conduct than those of OPAC or traditional IR systems in earlier user studies. As

Jansen & Pooch (2001) pointed out, there has been a lack of standardization in terminology and methods in Web user studies. As a result, it is difficult

to synthesize the body of literature and compare results from different studies. A remarkable effort has been made to examine the fast growing body of

literature to bring major empirical results into a framework at a macro level (Spink & Jansen, 2004).

At a micro level, various analytical models and methods have been applied to mining search logs. Jansen (2006) provides a review of methodologies

and techniques for transaction log analysis (TLA). He defines TLA as a three-stage research process: collecting log data, preparing the data, and

analyzing the prepared data. The first two stages can have a big impact on the third stage and the results. Although the W3C has made an effort to

recommend a log file format (<http://www.w3.org/TR/WD-logfile.html>), there is still no standard or recommendation for the elements in query logs.

The most frequently extracted data by Web search studies include: query (the search statement), date/time, searcher or machine identification (IP

address, user-side cookie), click-through (date/time, URL, rank). The derived data include the token (term), search operator (Boolean, field, etc), and

statistics such as frequencies (query, token, etc). The second stage, preparing data, is performed to model and cleanse the logged data. This stage

comprises the most time-consuming and labor intensive work. Most published studies focused on reporting results, thus data models and methods were

not reported at a level enabling other researchers to validate results or replicate findings (Jansen, 2006). To date, we found four published relevant

relational models, which we compare below. The relational data model (Entity-Relationship or ER diagram) in Wang, Berry & Yang (2003, p.

745) has clear semantics including four entities (tables) and four binary relationships connecting these entities. The four entities are Original Query,

Cleaned Query, Token, and Word Pair. This model does not include user identification data, which was not available in the log file. Jansen's relational

data model (2006, p. 414) includes three entities and two relationships (one unary and one ternary). The three entities are "searching_episode,"

"terms," and "cooc" (term pairs). The strength of this model is the normalization of the consecutive queries that are identical as a result of the searcher

viewing documents from search results. The subsequent queries were not submitted by the searcher, but were being logged again in the log file when

the searcher viewed a result page (Excite displays only a certain number of records per results page). Baeza-Yates et al. (2005) present an analysis-based relational

model (in Section 2.2). This model defines four entities and four relationships. Several entities are similar to the two models above. The unique entity is

the QuerySession, which is a derived entity by certain criteria (to be addressed below). Wolfram (2006) provides a simplified relational model for

analyzing transaction logs including three tables: query, term, and token; token being the relationship between query and term. This model was used as

a template for deriving various frequency distribution data from the tables using SQL. Although not intended to serve as a comprehensive model for

transaction log analysis, he demonstrated the utility of relational database tools for informetric data processing.

The above four models, implemented in two DBMSs (MS ACCESS and MySQL), were developed to be either data-driven (Jansen, 2006; Wang, Berry &

Yang, 2003) or specific question-oriented (Baeza-Yates et al., 2005; Wolfram, 2006). The data in Wang, Berry & Yang (2003) were longitudinal across

four years while the other three models covered single days or three months. Some terminological confusion exists across these models. For example,

Jansen defined session interaction as an atomic action such as submitting a query, clicking a hyperlink, etc.; and searching episode as a series of

searching interactions, which is different from the session defined by Baeza-Yates et al. (2005, see below). Jansen (2006) defined search_url as "the

query terms as entered by the searcher" (p. 413) while Baeza-Yates et al. (2005) used Query instance, and both Wang et al. and Wolfram used Query to

name the same attribute. In addition, term, token and word have also been used interchangeably or differently. Standardization in terminology and

method is needed to advance this area of research and to consolidate results.

One of the challenges in analyzing server-side search logs is to identify the boundaries of a single session of interactions (Silverstein et al., 1999).

Typical Web query log files may include interleaved searches from different searchers. Search logs by Internet providers may have subscriber's IDs in the

logs, but it is still difficult to disambiguate sessions if the searcher conducted multiple sessions for different topics or different searchers shared the

same account, which is typical in home networking settings. The treatment of sessions is the most diverse among all the studies. Spink and Jansen

(2004) have concluded most Web search sessions lasted about 15 minutes, with a substantial percentage lasting less than five minutes (p. 121).

Similarly, Göker and He (2002) suggested that an optimal session boundary interval was 11 to 15 minutes. Buzikashvili and Jansen (2006)

explored several thresholds. Baeza-Yates et al. (2005) used several criteria for sessions: excluding empty query instances, excluding queries without

document selection, and using a threshold value of 15 minutes to define a session; that is, if a user (IP) submitted a query 15 minutes after the last click,

a new session started. Murray, Lin, and Chowdhury (2006) assume a minimum of 20 queries per identifier, from which large gaps in inter-query times

(i.e., a long period of time between queries submitted by the same identifier) are located to indicate session boundaries. Based on the number of

queries associated with each identifier data for two datasets, the average number of queries associated with each identifier was far below this

threshold, with fewer than one percent of queries by a given identifier qualifying.

An alternative approach, based on subject analysis of adjacent queries, may detect a user session, as illustrated in Wang, Berry, and Yang (2003). This

approach was mostly carried out manually and is difficult to perform automatically. It is impractical for large datasets, unreliable for short queries, and

does not take into account that users may engage in multiple search topics in a given session. As this paper is being revised, a new study (Jansen et al.,

2007) has compared three methods for identifying sessions: IP address and cookie; IP address, cookie, and temporal limit; and IP address, cookie and

query patterns. The authors also defined two concepts about a session: Session Length as the number of queries and Session Duration as the period

between first query and last query.

Data Model and Rationales

Designing a data model for a data mining application is different from designing a data model for an operational database. The latter focuses on

operational efficiency and minimum redundancy. The former involves data extraction, transformation, and new data loading to discover hidden trends

and patterns. For the most part, transaction log researchers work like astronomers, who are concerned with acquiring data, which involves building and

maintaining instruments, as well as processing the results. Often new hypotheses are generated during the process of answering current hypotheses.

This is a major difference from typical user studies that define research questions first and collect data to answer just these questions. What the

transaction logs can offer goes beyond what we know that we do not know. Therefore, we must build an efficient instrument that can be shared among

researchers. A database or data warehouse is the right tool for handling vast amounts of Web log data.

Setting the goals of sharing research data and making longitudinal systematic observations, we present a data model that has high granularity and is

extensible to accommodate new hypotheses. This model will allow us to explore thresholds for session boundaries at the time of data analysis rather

than determining a threshold at the time of data processing (see Related Studies). The basic assumption we make is that a session consists of a series

of queries and clicks from a searcher in order to find information to satisfy a specific information need. Identifying sessions is thus

important for subsequent analysis that models users' interaction behaviors. Queries submitted in a session characterize the interactions carried out by the

searcher. Session-based analysis can provide an understanding of three dimensions of user behaviors: (1) interaction behaviors by analyzing the length

of a search session, the number of reiterations, and the manipulation of results; (2) linguistic behaviors by analyzing the queries representing needs, the

subsequent queries revising the original query, and structural variations of the queries; (3) cognitive behaviors such as cognitive structures underlying

the search behaviors, which moves analysis beyond the linguistic level to identify concepts and conceptual moves in the session. The following are the

strategies and methods proposed.

The model is depicted in Figure 1 as an ER diagram, which includes four entities and three relationships between these entities. In the data warehouse,

only six tables are used to implement the model: the four entities are implemented as four tables Query, Query_uniq, Token_uniq, and WebPage; the

relationship between Query and Query_uniq is many to one, thus it is sufficient to map the primary key of the Query_uniq table into the Query table as a

foreign key; the relationship between Query_uniq and Token_uniq is many to many, which is implemented as the table Query_Token_uniq; the

relationship between Query and WebPage is many to many, which is implemented as the table Click.

For the purpose of this paper, several tables less central to the model are not depicted here for simplicity. For example, the table for individual sites to

which a query was submitted has been omitted in the diagram but is linked to Query. The derived entities such as term pairs with mutual information

and the introduced entities from WordNet (lexicon tool) have also been omitted in Figure 1. These can be connected to the entities above as needed. The

database was built using SQL Server 2005 running on Windows Server 2003. Brief descriptions along with rationales of the data elements are given as

follows:

1. Query table, in the middle of the diagram, stores data extracted from the log files. Each logged transaction is assigned a unique identifier, QID, as the primary key. The timestamp is parsed into Y (year), M (month), D (day), T (time in the format hh:mm:ss), and TimeS (time in seconds), which is derived using the formula (hh * 60 + mm) * 60 + ss. TimeS is needed for processing efficiency; for example, the inter-query time difference is calculated using this field (Figure 6). Hit is the number of WebPages retrieved. NumSite is the number of websites to which a query is submitted in the case of federated search. IP is an anonymous number replacing the real IP address (see Figure 6). Query_raw stores the original query without cleansing. The attribute groupID is generated by a procedure to identify all queries submitted from a specific IP address on a specific day. This attribute is important for identifying a potential search session and is further illustrated in the following sections. Qid_uniq is a foreign key referencing the primary key of the table Query_uniq.

2. Query_uniq is a derived table that stores unique queries by grouping identical queries. In the entire query corpus, identical queries were submitted repeatedly, either by the same searcher or by different searchers. To remove the duplicates, the SQL clause "group by" was used to identify unique queries and populate this table, along with three derived attributes: two structural features, the number of tokens (NumWord) and the number of characters including intervening spaces (NumChar), and the number of occurrences of the query in the corpus (Freq_query_raw), which measures the popularity of a query.

The rationale for introducing this table is that unique queries are the linguistic expressions of searchers' information needs and should not be

influenced by the popularity or frequency of the query occurrences. In linguistic analysis, using the Query_uniq table is computationally more

efficient than using the Query table.

Figure 1. Data model for Web query mining

3. Token_uniq is a table derived by tokenizing (parsing) the unique queries. It includes valid words as well as misspelled words and non-ASCII characters. The token is stored in the field String, which is also the primary key (35 characters, based on the corpus). Traditional database design would use a short numeric ID (auto number) as the primary key. We chose the string itself to minimize the need for joining tables in querying and to avoid duplicates that arise when identical strings are assigned different numeric IDs. Two types of frequencies are recorded: occurrences in the corpus (Freq_word) and occurrences in queries (Freq_query).

4. Query_Token_uniq is a table for the relationship between Query_uniq and Token_uniq. To preserve the original query structure, the attribute position specifies where the token was located in the query. For example, the query "football games schedule" with QID "494" is parsed into three tokens, giving three records in this table: (494, football, 1), (494, games, 2), and (494, schedule, 3). Because a token may be repeated in the same query, position is also part of the key.

5. Click is a table for the relationship between Query and WebPage. In other words, it stores the data on click-through actions: when the searcher clicked, and on which webpage, after a query returned results. The timestamp (Y, M, D, T) of the click lags behind the query's timestamp (Y, M, D, T). Rank in this table is the position of the clicked document on the output list, which is ordered by relevance scores. The Click table is linked to the Query table.

6. WebPage table assigns an identification number to each long URL. The frequency of the clicked webpage, as a measure of popularity, is stored as the derived attribute Freq.

This model includes the data elements from three log files: the access log (by the server), and the query log and click log (both by the search engine). Raw data

were cleaned and parsed into the above six tables with minimum redundancy. Derived attributes are stored to improve computational efficiency.
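As a rough illustration of how the six tables fit together, the following sketch builds a reduced version of the schema in SQLite through Python's sqlite3 module. The project itself used SQL Server 2005; the use of SQLite, the column types, the column names Query_text and PageID, the table name QueryLog (standing in for Query), the lowercase whitespace tokenizer, and the sample rows are all assumptions made only for this example. Several attributes (Y, M, D, Hit, NumSite, groupID, and the frequency fields of Token_uniq) are left out for brevity.

```python
import sqlite3

def time_s(hhmmss: str) -> int:
    """TimeS as defined in the text: (hh * 60 + mm) * 60 + ss."""
    hh, mm, ss = (int(part) for part in hhmmss.split(":"))
    return (hh * 60 + mm) * 60 + ss

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Query_uniq (
    Qid_uniq       INTEGER PRIMARY KEY,
    Query_text     TEXT UNIQUE,        -- hypothetical column name
    NumWord        INTEGER,
    NumChar        INTEGER,
    Freq_query_raw INTEGER
);
-- The paper's Query table; renamed here only to keep the name unambiguous.
CREATE TABLE QueryLog (
    QID       INTEGER PRIMARY KEY,
    IP        INTEGER,                 -- anonymized number, not a real address
    T         TEXT,
    TimeS     INTEGER,
    Query_raw TEXT,
    Qid_uniq  INTEGER REFERENCES Query_uniq(Qid_uniq)
);
CREATE TABLE Token_uniq (String TEXT PRIMARY KEY);
CREATE TABLE Query_Token_uniq (
    Qid_uniq INTEGER REFERENCES Query_uniq(Qid_uniq),
    String   TEXT REFERENCES Token_uniq(String),
    Position INTEGER,
    PRIMARY KEY (Qid_uniq, String, Position)
);
CREATE TABLE WebPage (PageID INTEGER PRIMARY KEY, URL TEXT UNIQUE, Freq INTEGER);
CREATE TABLE Click (
    QID    INTEGER REFERENCES QueryLog(QID),
    PageID INTEGER REFERENCES WebPage(PageID),
    T      TEXT,
    Rank   INTEGER
);
""")

# Invented sample transactions: (anonymized IP, time, raw query).
raw = [(1, "19:14:20", "football games schedule"),
       (2, "19:15:58", "football games schedule"),
       (1, "19:17:52", "football schedule")]
conn.executemany(
    "INSERT INTO QueryLog (IP, T, TimeS, Query_raw) VALUES (?, ?, ?, ?)",
    [(ip, t, time_s(t), q) for ip, t, q in raw])

# Derive unique queries with GROUP BY, then link each logged query back to them.
conn.execute("""
INSERT INTO Query_uniq (Query_text, NumChar, Freq_query_raw)
SELECT Query_raw, LENGTH(Query_raw), COUNT(*) FROM QueryLog GROUP BY Query_raw
""")
conn.execute("""
UPDATE QueryLog SET Qid_uniq =
    (SELECT u.Qid_uniq FROM Query_uniq AS u WHERE u.Query_text = QueryLog.Query_raw)
""")

# Tokenize each unique query, keeping the 1-based position of every token.
unique_queries = conn.execute("SELECT Qid_uniq, Query_text FROM Query_uniq").fetchall()
for qid, text in unique_queries:
    tokens = text.lower().split()
    conn.execute("UPDATE Query_uniq SET NumWord = ? WHERE Qid_uniq = ?", (len(tokens), qid))
    for pos, tok in enumerate(tokens, start=1):
        conn.execute("INSERT OR IGNORE INTO Token_uniq (String) VALUES (?)", (tok,))
        conn.execute("INSERT INTO Query_Token_uniq VALUES (?, ?, ?)", (qid, tok, pos))

rows = conn.execute(
    "SELECT Query_text, NumWord, NumChar, Freq_query_raw "
    "FROM Query_uniq ORDER BY Query_text").fetchall()
```

Because the token string itself is the key of Token_uniq, the relationship table can be read without a join to recover the tokens of a query, which is the trade-off the text describes.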

Due to space limitations, we do not include parsing algorithms in this paper. We simply share our experiences on the difficulties we have encountered.

For the Excite query corpora used, the data were obtained on a single day, thus there was no log format change. The UTK query corpus, as a contrast,

includes the log transactions across multiple years, during which several changes occurred in either format or elements (the software was upgraded and

enhanced). When changes occur in log files, the parsing programs need to be revised and the data model modified to include new data elements that

are potentially useful for new discovery.

Techniques for Identifying Sessions

First, several important concepts must be defined for conducting analysis based on sessions. Second, the algorithms are provided for identifying and

implementing sessions. The application of the 80-20 rule for identifying reasonable threshold values to set session boundaries across the three query

corpora is illustrated.

Concepts

Group is an artificial boundary. Each group includes a set of consecutive queries from the same IP address (or other machine identifier) on the same

day. A group is identified by a groupID (Figure 1).

Session is an artificial boundary that defines a set of one or more consecutive queries attributed to the same identifier where the time between adjacent

queries does not exceed a cutoff value (see below). During the experimental stage, sessionID is not stored because sessions depend on the

threshold value.

Cutoff is a chosen value for marking session boundaries. A cutoff value is a threshold value for query intervals (see below). Two queries belong to the

same session if the query interval value is less than the cutoff value. Otherwise, the two queries belong to two adjacent sessions.

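Query interval ($\Delta t_i$) is the time difference (also called time lag) between two consecutive queries in a group. It is calculated using the formula $\Delta t_{i+1} = T(q_{i+1}) - T(q_i)$, where $T(q_i)$ is the timestamp (TimeS) of the $i$-th query in the group.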

Algorithms

Three algorithms are used to process data for subsequent session identifications.

Algorithm 1. Grouping

    for each day
        group queries by IP number
        assign a groupID to all the queries from the same IP
        sort the queries within each group by the timestamp (TimeS)

The original IP addresses have been mapped to anonymous numbers in the database to avoid identifying the searchers. Two kinds of IPs exist: static

and dynamic. A problem associated with the dynamic IP scheme is that identifying a single searcher is difficult. However, after looking closely at the

queries in each group, we observed some similarity in content and regularity in query intervals. The following strategies enabled us to move forward to

identify sessions.
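The grouping step of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the QueryRecord type, its field names, the date format, and the sample records are hypothetical, and group numbers are simply assigned in sorted (day, IP) order rather than by whatever procedure the project used.

```python
from collections import defaultdict
from typing import NamedTuple

class QueryRecord(NamedTuple):
    qid: int
    day: str      # e.g. "2004-09-04" (format assumed for this sketch)
    ip: int       # anonymized IP number
    time_s: int   # seconds since midnight (TimeS)

def assign_groups(records):
    """Return {group_id: [records sorted by time_s]}, one group per (day, ip)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[(rec.day, rec.ip)].append(rec)
    groups = {}
    # Sorting the keys makes the group numbering reproducible across runs.
    for group_id, key in enumerate(sorted(buckets), start=1):
        groups[group_id] = sorted(buckets[key], key=lambda r: r.time_s)
    return groups

# Invented records: two from IP 7 on one day, one from IP 9, one on the next day.
records = [
    QueryRecord(1, "2004-09-04", 7, 69260),
    QueryRecord(2, "2004-09-04", 9, 100),
    QueryRecord(3, "2004-09-04", 7, 69000),
    QueryRecord(4, "2004-09-05", 7, 50),
]
groups = assign_groups(records)
```

Here the same IP on two different days falls into two groups, matching the definition of a group as the queries from one identifier on one day.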

Algorithm 2. Deriving query intervals

    for each group with m queries
        assign $\Delta t_1 = 0$ (the first query in the group has no preceding query)
        calculate $\Delta t_{i+1} = T(q_{i+1}) - T(q_i)$, if m >= 2 (i = 1, ..., m-1)
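A minimal sketch of Algorithm 2 for a single group follows. The function name is hypothetical, and the sample values are the TimeS equivalents of three query times that appear in Table 1 (19:14:20, 19:15:58, and 19:17:52).

```python
def query_intervals(times_s):
    """Intervals for one group; the first query gets 0 by convention."""
    ordered = sorted(times_s)
    if not ordered:
        return []
    # Pair each query with its successor and take the difference in seconds.
    return [0] + [later - earlier for earlier, later in zip(ordered, ordered[1:])]

intervals = query_intervals([69260, 69358, 69472])
```

A group with a single query yields only the leading zero, so it contributes no real interval to the distribution.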

Figure 2. Distribution of query intervals for a Sample of the UTK Dataset

We suggest making cutoff decisions based on experimental results of the query corpus rather than adopting the values from published studies because

each Website or search engine may have different searchers and system features, which may affect interactions and session lengths. The observed

distribution of query intervals follows a Poisson-like shape, with an extremely long tail (see Figure 2). It is this tail that becomes problematic for session

boundary determination. Statistically, the 80-20 rule can be applied: the time value at the 80th percentile of the distribution of query

intervals may be a reasonable basis for deciding a threshold value for the cutoff between sessions. The rule is used in several environments, serves as an

easily applied rule of thumb for other social phenomena with observed inverse relationships, and takes into account the distribution characteristics of

each dataset.
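One way to read a cutoff off the interval distribution is sketched below. The nearest-rank percentile definition is an assumption chosen for simplicity (the paper does not say which percentile definition was used), and the sample intervals are invented. The leading zero of each group would normally be excluded before calling the function, since it is not an observed interval.

```python
import math

def percentile_cutoff(intervals, pct=80):
    """Nearest-rank percentile of the observed query intervals (in seconds)."""
    if not intervals:
        raise ValueError("no intervals to summarize")
    ordered = sorted(intervals)
    # Nearest-rank: the smallest value with at least pct percent of data at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Invented intervals in seconds, including one long-tail value.
sample = [5, 8, 12, 20, 31, 45, 60, 98, 114, 7200]
cutoff = percentile_cutoff(sample, 80)
```

In this toy sample the single very long interval does not move the 80th percentile, which is the property that makes a percentile-based cutoff attractive for long-tailed distributions.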

To determine the effect of different cutoff points on the average session length, query intervals representing the 10th through 90th percentiles of values

were selected with the resulting average session length calculated for each dataset. When graphed using a semi-log plot, there is a notable increase for

average session lengths over short query interval increases, but these increases become much smaller with larger cutoff times. The plots for each

dataset appear in Figures 3, 4 and 5. The average session values take on an S-shape, with the inflection point occurring high in the distribution. These

inflection points signify a marked slowdown in the increase of the average session size, and occur around or just before the 80th percentile. There is

some flexibility with this decision in that it does not greatly impact outcomes. For example, the differences in time between the 70th and 80th percentile

cutoffs for the datasets used represent a two to five-fold increase in time, but only result in approximately a 10% to 17% increase in the average session

lengths.
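The effect of the cutoff on average session length can be explored with a short calculation. The sketch assumes that session length is measured in queries and that each group is represented by its list of intervals with a leading zero; the function name and the sample groups are hypothetical.

```python
def average_session_length(groups_of_intervals, cutoff):
    """Mean number of queries per session for a given cutoff (seconds)."""
    total_queries = sum(len(g) for g in groups_of_intervals)
    # Every non-empty group opens one session; each later interval that
    # reaches the cutoff opens another.
    sessions = sum(1 + sum(1 for dt in g[1:] if dt >= cutoff)
                   for g in groups_of_intervals if g)
    return total_queries / sessions if sessions else 0.0

# Three invented groups with 4, 2, and 1 queries.
groups = [[0, 30, 400, 20], [0, 1000], [0]]
lengths = {c: average_session_length(groups, c) for c in (60, 600, 3600)}
```

Repeating the calculation for cutoffs taken at successive percentiles of the interval distribution gives the kind of curve plotted in Figures 3 to 5.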

Algorithm 3. Sessioning

    input a cutoff value
    for each group with more than one query (m >= 2)
        sort the group by TimeS
        assign a new sessionID, if Δti+1 >= cutoff
    sessions may have only one query

All the algorithms above are implemented using SQL queries and procedures. The queries and procedures will be published in a separate paper.
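Until those SQL procedures are available, the sessioning step can be approximated in Python as follows. This is a sketch, not the authors' implementation: the function name and the sample times are invented, and session numbers are local to one group.

```python
def assign_sessions(times_s, cutoff):
    """Return one session number (1-based) per query of a single group."""
    session_ids = []
    current = 0
    previous = None
    for t in sorted(times_s):
        # A new session starts at the first query and whenever the gap
        # to the previous query reaches the cutoff.
        if previous is None or t - previous >= cutoff:
            current += 1
        session_ids.append(current)
        previous = t
    return session_ids

# Invented group whose last gap is 308 seconds (5 minutes 8 seconds).
times = [0, 98, 212, 520]
split = assign_sessions(times, cutoff=300)
merged = assign_sessions(times, cutoff=309)
```

Because the session numbers are computed from the stored intervals on demand, different cutoff values can be tried without reprocessing the log, which is why sessionID is not stored during the experimental stage.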

Figure 3. Average Session Size for Different Cutoff Values - Excite Dataset

Figure 4. Average Session Size for Different Cutoff Values - UTK Dataset

Figure 5. Average Session Size for Different Cutoff Values - HealthLink Dataset

Experiments on Thresholds using an Interactive Tool

An interactive tool has been designed to explore session boundaries. Different cutoff values may be entered to generate different sessions. Figure 6

illustrates that when the cutoff value is set to 5 minutes, the group of 12 queries from the same address on September 4, 2004 is divided into two

sessions. If the cutoff is set to greater than 5 minutes and 8 seconds, this group will be considered to be one session. This interactive tool is accessible

at http://aquamarine.sis.utk.edu

Figure 6. An Interactive Tool for Exploring Session Boundaries

This tool does not consider sessions that may span multiple days. In fact, for the UTK query corpus, few queries were submitted from the same IPs on

the same day between 5 minutes before and 5 minutes after midnight.

It makes sense to treat this group of queries together as one search session. We can further analyze the searcher's interaction behavior. The search

session lasted 19 minutes 52 seconds with a total of 12 queries. From the Click table using a join query statement (Figure 1), we selected the related

click-through instances for individual search queries (Table 1). The searcher followed links from the results page after six of the 12 queries. During the

session, the searcher visited 7 different WebPages (identified by URLs) a total of 14 times. Three of the 7 pages were revisited between 1 and 3 more

times. The visited WebPages ranked between 1 and 10; the mean rank was 3.25 (based on unique ranks). No WebPage was visited after the last query. It

is likely that the search session ended without finding needed information.

The searcher went through various reiterations of the queries (Figure 6): (a) from 2 to 3, the term "Football" was added; (b) from 3 to 4, the term

"2004" was added and three webpages were visited; (c) from 4 to 5, the query was limited to title search and three webpages were visited; (d) from 5 to

6, the term "Games" was dropped in the title search and four webpages were visited; (e) from 6 to 7, the term "Schedule" was dropped; (f) from 7 to 8, the query was changed

to a title phrase search and the term "Schedule" was restored; (g) from 8 to 9, both the title limit and the quotation marks were dropped; (h) from 9 to 10, the term "2004"

was changed to "Official" and one webpage was visited; (i) from 10 to 11, the 9th query was submitted after capitalizing the terms and two webpages

were visited; (j) from 11 to 12, three terms "University of Tennessee" were inserted between "2004" and "Football". The session then ended. Although

the reiterations seemed systematic and the searcher knew advanced search features, the transactions show little understanding of Web information

retrieval. None of the URLs was close to an official Website for the university's football game schedule for 2004, which led to the 10th query adding the

term "Official." Many of the URLs had clues about what the webpages might be (personal page, Gibson school football schedule, etc.). This is a simple

information need with a fruitless search session for almost 20 minutes. Help was needed.

Analysis of linguistic features of the 12 queries shows that 8 terms were used during the search session: football, 11 times; schedule, 10 times; 2004, 8

times; Games, 5 times; and once for each of the rest: University, Official, Tennessee, of. Using term co-occurrence data, a semantic map may be

constructed to reflect the searcher's conceptual structure representing the information need.
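Term frequencies and within-query co-occurrence counts of this kind can be derived with a short script. The sketch below is illustrative only: the three queries are invented rather than taken from Figure 6, lowercasing is an assumption (the analysis above distinguishes capitalized forms), and counting each unordered pair once per query is one of several possible co-occurrence definitions.

```python
from collections import Counter
from itertools import combinations

def term_statistics(queries):
    """Count term occurrences and unordered within-query term pairs."""
    term_freq = Counter()
    pair_freq = Counter()
    for query in queries:
        tokens = query.lower().split()
        term_freq.update(tokens)
        # Each distinct pair is counted once per query in which it appears.
        pair_freq.update(combinations(sorted(set(tokens)), 2))
    return term_freq, pair_freq

# Invented queries resembling the reformulations described above.
queries = ["football schedule",
           "football schedule 2004",
           "football games schedule 2004"]
term_freq, pair_freq = term_statistics(queries)
```

The pair counts could serve as edge weights when drawing a semantic map of the terms used in a session.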

Table 1. Click-through behavior

Row     Query time  Rank  Click time  WebPage URL
3       19:14:20    10    19:14:53    http://web.utk.edu/~dharris4/ut_football_schedule.htm
4       19:15:58    2     19:16:15    http://www.cs.utk.edu/~heinrich/ncaa/ncaa.php
4       19:15:58    3     19:16:59    http://bioengr.ag.utk.edu/asae/photo_gallery/2003-2004/football/index.htm
5       19:17:52    3     19:18:15    http://web.utk.edu/~dharris4/ut_football_schedule.htm
5       19:17:52    3     19:18:38    http://web.utk.edu/~dhouston/page2.html
6       19:20:10    1     19:20:20    http://web.utk.edu/~dharris4/ut_football_schedule.htm
6       19:20:10    1     19:20:34    http://web.utk.edu/~dhouston/page2.html
6       19:20:10    2     19:20:46    http://volweb.utk.edu/school/gibson/yville/Football%20Schedule.htm
6       19:20:10    3     19:21:06    http://www.cs.utk.edu/~balajee/schedule.html
1 (10)  19:29:40    2     19:29:58    http://web.utk.edu/~dharris4/ut_football_schedule.htm
2 (11)  19:29:40    3     19:30:13    http://volweb.utk.edu/school/gibson/yville/Football%20Schedule.htm
2 (11)  19:31:02    1     19:31:17    http://www.utm.edu/departments/acadpro/library/departments/special_collections/archive/48_2_52_football.htm

Note: This table corresponds to Figure 6. Queries may be found by the row number in Figure 6.

Conclusions

We propose that Web searching should be examined along three dimensions: interaction, linguistic and cognitive behaviors. In this project, we identified

several gaps in current Web query log research. The current models for processing the logged data vary across projects whose researchers take

hypothesis-driven or data-driven approaches. A comparison of these diverse models reveals some terminological confusion and diversity. There is a

need for standardization of methods and terminology. One critical concept deserves special attention and poses challenges in data analysis. We

propose the Web search session as a vitally important concept for advancing research on Web search behaviors using query logs. Our attempt to define this

concept is made at both the theoretical level (modeling from three dimensions) and technical level (algorithms and techniques).

We have developed a highly granular, comprehensive relational model for data extraction and transformation, provided associated strategies and

methods for session identification, and built an interactive Web tool for exploring different thresholds for session identification. Our approach is based on the

fact that researchers in this line of research do not always know all the hypotheses that the log data can answer at the outset and the log data are

diverse across environments due to the lack of standardization. Thus a more inclusive and extensible model beyond individual projects is needed so that

new hypotheses can be studied as they arise during data analysis. This is an important distinction from typical user studies that start with research

questions and collect data to address these questions. Data mining of the vast amount of log data may be more like research in astronomy in that

researchers develop instruments that can help discover knowledge from real data.

The model has two important new features that can advance current Web query mining: (1) the derived entity Query_uniq, which aggregates repeated identical queries to enhance computational efficiency in linguistic analysis and to model conceptual structures of information needs; and (2) the groupID, a device for session identification that allows flexibility in adopting different thresholds for different query corpora (as illustrated using the interactive Web tool). Session identification is vital for further analysis along the three dimensions: interaction, linguistic, and cognitive behaviors. Focusing on transactions within a session allows further analysis of how the searcher iterated from an initial query to its subsequent queries, which are overt behaviors. The queries are seen as linguistic expressions of information needs. Within a session, the queries can be analyzed at the linguistic level to reveal how they changed (case, part of speech, word variation, word order, misspelling, etc.). Underlying query reiterations are cognitive factors: if a query was refined by adding a narrower concept, the searcher had activated associated concepts in memory and further elaborated his/her information need. If a searcher reiterated queries randomly, his/her knowledge structure is less coherent and closer to an anomalous state of knowledge (Belkin, Oddy, & Brooks, 1982).
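The two features can be illustrated with a minimal Python sketch. It is not the relational model itself: the record layout (user, timestamp, query), the function names, and the `session_no` label (which only loosely plays the role of a session grouping device such as the groupID) are assumptions made for illustration, as is the rule that a session break occurs when the gap between consecutive queries from the same user exceeds a chosen threshold.

```python
from collections import Counter
from datetime import datetime, timedelta

def assign_sessions(records, threshold):
    """Label each (user, timestamp, query) record with a session number.

    Assumption: a new session starts when the user changes or when the
    gap since that user's previous query exceeds `threshold`.
    """
    labeled = []
    prev_user, prev_time, session_no = None, None, 0
    # Order by user, then time, so consecutive records are comparable.
    for user, ts, query in sorted(records, key=lambda r: (r[0], r[1])):
        if user != prev_user or ts - prev_time > threshold:
            session_no += 1
        labeled.append((user, ts, query, session_no))
        prev_user, prev_time = user, ts
    return labeled

def unique_queries(records):
    """Aggregate identical query strings with their frequencies."""
    return Counter(query for _, _, query in records)

# Hypothetical log entries, invented for this example.
t0 = datetime(2007, 1, 1, 9, 0)
log = [
    ("u1", t0, "library hours"),
    ("u1", t0 + timedelta(minutes=2), "library hours"),
    ("u1", t0 + timedelta(minutes=50), "parking permit"),
    ("u2", t0 + timedelta(minutes=1), "library hours"),
]

sessions_30 = assign_sessions(log, timedelta(minutes=30))
sessions_60 = assign_sessions(log, timedelta(minutes=60))
query_uniq = unique_queries(log)
```

Re-running `assign_sessions` with a different threshold regroups the same transactions without touching the underlying records, which is the kind of flexibility the text describes; `unique_queries` lets linguistic analysis run once per distinct query string rather than once per submission.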

We propose that the choice of thresholds be based on specific corpus characteristics. The 80-20 empirical rule shows promise as a good candidate for finding the session boundary cutoff for a specific corpus without having to analyze individual sets of queries.
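One possible reading of the 80-20 rule can be sketched as follows. The interpretation is an assumption, not a restatement of the procedure used in this project: the cutoff is taken as the smallest observed interval between consecutive queries such that at least 80% of all intervals fall at or below it. The function name and the sample intervals are hypothetical.

```python
def cutoff_at_share(gaps, share_pct=80):
    """Return the smallest observed gap g such that at least
    `share_pct` percent of all gaps are <= g.

    Assumption: `gaps` holds the intervals (in any consistent unit)
    between consecutive queries from the same searcher.
    """
    if not gaps:
        raise ValueError("at least one gap is required")
    ordered = sorted(gaps)
    # Number of gaps that must be covered, rounded up (integer arithmetic).
    k = max(1, (share_pct * len(ordered) + 99) // 100)
    return ordered[k - 1]

# Hypothetical inter-query intervals in minutes, invented for this example.
gaps_minutes = [1, 1, 2, 2, 3, 4, 5, 8, 45, 600]
cutoff = cutoff_at_share(gaps_minutes)
```

Because the cutoff is computed from the corpus's own interval distribution, a different corpus yields a different threshold, in keeping with the recommendation that thresholds follow corpus characteristics.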

The data model and strategies for session identification provide the methods and techniques for modeling Web search behaviors along three dimensions. Despite some limitations, Web search logs capture real searches and provide valuable information on real interactions. In a separate paper, we report preliminary results on quantitative clustering of sessions to model interaction behaviors (Wolfram, Wang, & Zhang, 2007). In Phase Two of the project, we will conduct linguistic analysis to cluster queries semantically using lexicon tools.

Current Web searchers will bring their mental models to new systems. A potential implication of this research for Web 2.0 is that by identifying better conceptual structures among Web searchers with similar information needs, we can design more effective and efficient tools for Web-enabled collaborative information spaces.

Note: Gheorghe Muresan made the astronomy analogy during a panel session at the 2005 ASIST annual meeting.

Acknowledgements

This work is funded partially by the Institute of Museum and Library Services through a National Leadership Research grant (LG-06-05-0100-05). Any views, findings, conclusions or recommendations expressed in this paper do not necessarily represent those of the Institute of Museum and Library Services. We are thankful for the Web query corpora from UTK Website, Health-link and Excite@home. The first author wishes to acknowledge the support of the 2005 OCLC/ALISE Research Grant Award and the following individuals: Ed Cortez and Carol Tenopir for their encouragement; John Rose, Cindy Lancaster, David Ratledge, and Matt Grayson for technical consultations over the years; and Vuttichai Chatpattananan and Jason Rieger for writing parser programs.

References

Baeza-Yates et al. (2005). Modeling user search behavior. Proceedings of the Third Latin American Web Congress, 31 Oct.-2 Nov. 2005. IEEE (0-7695-2471-0/05). 10 pages.

Belkin, N. J., Oddy, R. N., & Brooks, H. M. (1982). ASK for information retrieval: Part I. Background and theory. Journal of Documentation, 38(2), 61-71.

Borgman, C. L. (1996). Why are online catalogs still hard to use? Journal of the American Society for Information Science 47(7), 493-503.

Buzikashvili, N., & Jansen, B. J. (2006). Limits of the Web log analysis artifacts. Workshop on Logging Traces of Web Activity: The Mechanics of Data Collection, the Fifteenth International World Wide Web Conference (WWW 2006), 22-26 May, Edinburgh, Scotland.

Göker, A., & He, D. (2002). Analysing Web search logs to determine session boundaries for user-oriented learning. Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (pp. 319-322). London: Springer-Verlag.

Jansen, B. J., Spink, A., Blakely, C., & Koshman, S. (2007). Defining a session on Web search engines. Journal of the American Society for Information Science and Technology, 58(5), 862-871.

Jansen, B. J., & Pooch, U. (2001). Web user studies: A review and framework for future work. Journal of the American Society for Information Science and Technology, 52(3), 235-246.

Jansen, B. J. (2006). Search log analysis: What is it; what's been done; how to do it. Library and Information Science Research, 28(3), 407-432. http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/jansen_search_log_analysis.pdf

Murray, G.C., Lin, A., & Chowdhury, A. (2006). Identification of user sessions with hierarchical agglomerative clustering. Proceedings of the ASIS&T Annual Meeting [CD-ROM]. Medford, NJ: Information Today, Inc.

Silverstein, C., Marais, H., Henzinger, M., & Moricz, M. (1999). Analysis of a very large Web search engine query log. SIGIR Forum, 33(1), 6-12. http://doi.acm.org/10.1145/331403.331405

Spink, A., & Jansen, B. J. (2004). Web search: Public searching of the Web. Kluwer.

Wang, P., Berry, M. W., & Yang, Y. (2003). Mining longitudinal Web queries: Trends and patterns. Journal of the American Society for Information Science and Technology, 54(8), 743-758. http://web.utk.edu/~peilingw/publications/

Wolfram, D. (2006). Applications of SQL for informetric frequency distribution processing. Scientometrics, 67(2), 301-313.

Wolfram, D., Wang, P., & Zhang, J. (2007). Modeling Web session behavior using cluster analysis: A comparison of three search settings. In Proceedings of the 2007 Annual Meeting of the American Society for Information Science and Technology.

Yang, K. (2005). Information Retrieval on the Web. Annual Review of Information Science and Technology, 39, 33-80.
