Chapter 7: Web Usage Mining
Part I
L. Malak Bagais
Its main goal is to discover usage patterns from web data in order to understand and better serve the needs of web-based applications.
Web Usage Mining
Web usage mining consists of three phases:
- Preprocessing
- Pattern discovery
- Pattern analysis
Web Usage Mining
Data sources, generated by users’ interaction with the Web, include:
- web-server access logs
- proxy-server logs
- browser logs
- user profiles
- registration data
- user sessions and transactions
- cookies
- user queries
- bookmark data
- mouse clicks and scrolls
Web-Usage Mining
A server log is a set of files containing the details of the activity performed by a server.
- The files are automatically created and maintained by the server.
- The World Wide Web Consortium (W3C) has specified a standard format for web-server log files.
- There are also proprietary formats for web-server logs.
Web-Log Processing
Most web logs contain:
- IP address of the client making the request
- date and time of the request
- URL of the requested page
- number of bytes sent to serve the request
- user agent (such as a web browser or web crawler)
- referrer (the URL that triggered the request)
Logs can all be stored in one file. A better alternative is to separate them into:
- access log
- error log
- referrer log
Web-Log Processing
Common log format (http://www.W3.org/Daemon/User/Config/Logging.html#common-logfile-format)
Format of Web Logs
140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] "GET /s.htm HTTP/1.0" 200 2267
140.14.7.18 - raj [06/Sep/2001:11:23:53 -0300] "POST /s.cgi HTTP/1.0" 200 499
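The first entry is a GET request that retrieves the file s.htm. The second is a POST request that sends data to the program s.cgi.
Fields:
- client machine’s IP address (140.14.6.11)
- RFC 1413 identity of the client, which is missing (-)
- user ID of the person making the request (pawan)
- date and time of the request
- request line
- HTTP status code (200)
- number of bytes transferred (2267)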
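The two sample entries above can be split into their fields with a regular expression. The following is a minimal sketch, not a complete log parser: the names `CLF_PATTERN` and `parse_clf_line` are hypothetical, and the pattern assumes well-formed entries in the common log format.

```python
import re
from datetime import datetime

# One common-log-format entry:
# host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    parts = entry["request"].split()
    if len(parts) != 3:
        return None  # assumption: request line is "METHOD URL PROTOCOL"
    entry["method"], entry["url"], entry["protocol"] = parts
    # %b expects English month abbreviations (the default C locale).
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = ('140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] '
          '"GET /s.htm HTTP/1.0" 200 2267')
entry = parse_clf_line(sample)
print(entry["host"], entry["user"], entry["method"], entry["url"],
      entry["status"], entry["size"])
```

A missing value such as the RFC 1413 identity is kept as the literal string "-", so the caller decides how to treat it.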
Examples of Common Log Format
An example of a log file in extended format
Examples of Common Log Format
#Version: version of the extended log file format used
#Fields: fields recorded in the log
#Software: software that generated the log
#Start-Date: date and time at which the log was started
#End-Date: date and time at which the log was finished
#Date: date and time at which the entry was added
#Remark: comments that are ignored by analysis tools
Format of Web Logs
The directives #Version and #Fields are mandatory and must appear before all the entries
Each field in the #Fields directive can be specified in one of the following ways:
- an identifier, e.g., time
- an identifier with a prefix, separated by a hyphen, e.g., cs-method
- a prefix followed by a header in parentheses, e.g., sc(Content-type)
Format of Web Logs
No prefix is used for: date, time, time-taken, bytes, cached
A prefix is used for: ip, dns, status, comment, method, uri, uri-stem, uri-query, host
Prefixes can be:
- cs: client to server
- sc: server to client
- sr: server to remote server (this prefix is used by proxies)
- rs: remote server to server (this prefix is used by proxies)
- x: application-specific identifier
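The directive and prefix rules above can be illustrated with a short sketch that reads the #Fields directive and uses it to label each entry. The function names `parse_extended_log` and `split_field` and the sample log lines are made up for illustration. The sketch assumes values are separated by whitespace and contain no quoted strings with spaces.

```python
# Prefixes listed on the slide.
PREFIXES = {"cs", "sc", "sr", "rs", "x"}


def split_field(field):
    """Split a field name into (prefix, identifier, header)."""
    if "(" in field and field.endswith(")"):
        # Form: prefix(header), e.g. sc(Content-type)
        prefix, header = field[:-1].split("(", 1)
        return prefix, None, header
    prefix, sep, identifier = field.partition("-")
    if sep and prefix in PREFIXES:
        # Form: prefix-identifier, e.g. cs-method
        return prefix, identifier, None
    # Form: bare identifier, e.g. time or time-taken
    return None, field, None


def parse_extended_log(lines):
    """Return one dict per entry, keyed by the names in the #Fields directive."""
    fields = []
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            name, _, value = line[1:].partition(":")
            if name == "Fields":
                fields = value.split()
            continue  # other directives are skipped in this sketch
        entries.append(dict(zip(fields, line.split())))
    return entries


# Made-up sample log for illustration only.
sample_log = [
    "#Version: 1.0",
    "#Fields: time cs-method cs-uri sc-status",
    "10:46:07 GET /s.htm 200",
    "11:23:53 POST /s.cgi 200",
]
for entry in parse_extended_log(sample_log):
    print(entry)
print(split_field("cs-method"), split_field("sc(Content-type)"))
```

Because #Fields must appear before all entries, a single pass over the file is enough: every entry line can be labeled with the field list already seen.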
Format of Web Logs
Analyzing Web Logs
General Summary from Analog
Analyzing Web Logs
Monthly report from Analog
Analyzing Web Logs
Daily summary from Analog
Analyzing Web Logs
Hourly summary from Analog
Analyzing Web Logs
Organization report from Analog
Search-word report from Analog
Operating-system report from Analog
Status-code report from Analog
File size report from Analog
File type report from Analog
Directory report from Analog
Request report from Analog
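Reports like the hourly summary and the status-code report come down to counting parsed log entries by one field. The sketch below shows that idea only and is not how Analog is implemented. The function name `summarize` and the dictionary layout are assumptions, and the third sample entry is made up.

```python
from collections import Counter
from datetime import datetime


def summarize(entries):
    """Count requests per status code and per hour of the day."""
    status_counts = Counter(entry["status"] for entry in entries)
    hourly_counts = Counter(entry["time"].hour for entry in entries)
    return status_counts, hourly_counts


entries = [
    {"time": datetime(2001, 9, 6, 10, 46, 7), "status": 200},
    {"time": datetime(2001, 9, 6, 11, 23, 53), "status": 200},
    {"time": datetime(2001, 9, 6, 11, 40, 0), "status": 404},  # made-up entry
]
status_counts, hourly_counts = summarize(entries)
for hour in sorted(hourly_counts):
    print(f"{hour:02d}:00  {hourly_counts[hour]} requests")
for status in sorted(status_counts):
    print(f"status {status}: {status_counts[status]} requests")
```

These are counts of requests, not of people or visits.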
Analysis of Clickstream: Studying Navigation Paths
Clickstream using Pathalizer with a seven-link specification
Analysis of Clickstream: Studying Navigation Paths
Clickstream using Pathalizer with a twenty-link specification
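A navigation-path diagram of this kind can be approximated by counting how often visitors move from one page to the next and keeping only the most frequent links. The sketch below assumes that the "link specification" is a cap on the number of links shown. It is not Pathalizer's actual algorithm, and the function name `top_links` and the sample paths are made up.

```python
from collections import Counter


def top_links(paths, limit):
    """Return the `limit` most frequent page-to-page transitions."""
    link_counts = Counter()
    for path in paths:
        # Pair each page with the page requested immediately after it.
        for source, target in zip(path, path[1:]):
            link_counts[(source, target)] += 1
    return link_counts.most_common(limit)


# Made-up navigation paths, one list of requested URLs per visitor.
paths = [
    ["/", "/news", "/board"],
    ["/", "/board"],
    ["/", "/news", "/board"],
]
for (source, target), count in top_links(paths, 2):
    print(f"{source} -> {target}: {count}")
```

Raising the limit from seven to twenty links would add less frequent transitions to the picture, which shows more detail at the cost of a denser graph.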
Analysis of Clickstream: Studying Navigation Paths
A brief on-campus session identified by StatViz that browses the bulletin board
Visualizing Individual User Sessions
A brief off-campus session identified by StatViz with three distinct activities
Visualizing Individual User Sessions
A long on-campus session identified by StatViz with multiple activities
Visualizing Individual User Sessions
Requests may not always reach the server, as they may be served from a proxy server’s cache.
You do not really know:
- Identity of readers
- Number of visitors
- Number of visits
- User’s navigation path through the site
- Entry point and referral
- How users left the site or where they went next
- How long people spent reading each page
- How long people spent on the site
Caution in Interpreting Web-Access Logs
I’ve presented a somewhat negative view here, emphasizing what you can’t find out. Web statistics are still informative: it’s just important not to slip from “this page has received 30,000 requests” to “30,000 people have read this page.” In some sense these problems are not really new to the web—they are present just as much in print media too. For example, you only know how many magazines you’ve sold, not how many people have read them. In print media we have learnt to live with these issues, using the data which are available, and it would be better if we did on the Web too, rather than making up spurious numbers.
Turner (2004)