Preprocessing of Web Log Data for Web Usage Mining


Shahid Rajaee Teacher Training University, Faculty of Computer Engineering

PRESENTED BY:

Amir Masoud Sefidian


Outline:

• Introduction

• Web Log Files

• Phases of Web Usage Mining

• Steps of Data Preprocessing

• Data Cleaning

• User Identification

• Session Identification

• Path Completion

• Main references


Introduction
• The Web has been growing as a dominant platform for retrieving information and discovering knowledge from web data.
• Web usage mining (also called web usage analysis, web log mining or click stream analysis):
  • The process of extracting useful knowledge from web server logs, database logs, user queries, client-side cookies and user profiles in order to analyze web users' behavior.
  • Applies data mining techniques to log data to extract user behavior, which is then used in various applications.


Types of Web Server Log Files
• Access logs:
  • Store information about which files are requested from the web server.
• Referrer logs:
  • Store the URLs of web pages on other sites that link to the server's pages.
  • If a user gets to one of the server's pages by clicking a link on another site, the URL of that site appears in this log.
• Agent logs:
  • Record information about the web clients that send requests to the web server, including the browser type and platform, which determine what a user is able to access on a web site.
• Error logs:
  • Store information about errors and failed requests of the web server.

Types of Web log file formats
• Common Log Format (CLF)
• W3C extended log file format
• Microsoft IIS (Internet Information Services) log file format
• NCSA Common log file format
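To make the access log fields concrete, here is a minimal Python sketch that parses one entry in Common Log Format. The sample line, the regular expression and the function name `parse_clf_line` are illustrative assumptions, not taken from the slides; real logs often carry extra fields (referrer, agent) that this pattern ignores.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Parse one CLF entry into a dict, or return None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no bytes were returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


# Hypothetical sample entry (documentation IP range).
sample = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_clf_line(sample)
print(entry["host"], entry["method"], entry["url"], entry["status"])
```

Parsing each line into named fields first keeps the later preprocessing steps (cleaning, user and session identification) independent of the raw log format.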

Sources of Log Data for Web Usage Mining

Server side:
• All click streams are recorded in the web server log.
• The log contains basic information, e.g. the name and IP address of the remote host and the date and time of the request.
• The web server stores data about the requests performed by the client.

Client side:
• The client itself sends information about the user's behavior to a repository.
• Done either with an ad-hoc browsing application or through a client-side application running in standard browsers.

Proxy side:
• Proxy-level collection is an intermediary between server-level and client-level collection.
• Proxy servers collect data on groups of users accessing large groups of web servers.

We consider only web server log data.


Phases of Web Usage Mining:

Data Preprocessing:
• Transforms the raw click stream data into a set of user profiles.
• One of the most complex phases of the web usage mining process.

Pattern Discovery:
• Extracts information from the preprocessed data.
• Data mining, statistics, machine learning and pattern recognition are applied to web usage data to discover users' access patterns.

Pattern Analysis:
• Extracts the interesting patterns from the output of pattern discovery by eliminating the irrelevant ones.
• Involves:
  • Validation: removing the irrelevant patterns.
  • Interpretation: using visualization techniques to make the mathematical results interpretable for humans.

Our focus: Data Preprocessing.


Steps of Data Preprocessing


Data Cleaning:
• Irrelevant or redundant log records are removed.
• Cleans out requests for accessorial resources embedded in HTML files, robot requests and error requests.
• Almost no research focuses purely on web log cleaning.

Attributes Involved in Web Log Cleaning and Intrusion Detection:
• Multimedia files (images, videos and audio):
  • Categorized as useless files in web log preprocessing.
  • A web log file can be reduced to less than 50% of its original size by eliminating image requests.
• Web robot requests:
  • Dramatically affect the web site's traffic statistics.
  • They are not important from the mining perspective and hence must be removed.
• HTTP status codes:
  • Log entries with an unsuccessful HTTP status code are usually eliminated during web log cleaning.
  • The widely accepted definition of an unsuccessful HTTP status code is a code under 200 or over 299.
  • For intrusion detection:
    • [3] removes all log entries with status code 200.
    • [2] argued that log entries with 200-series status codes should be retained, as they may include web attacks such as SQL injection and XSS that were executed successfully.
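The usage-mining side of these cleaning attributes can be sketched as a simple filter. This is a minimal Python illustration, assuming entries are already parsed into dicts with `url`, `status` and `agent` keys; the extension list, the robot markers and the function names are illustrative assumptions rather than values from the cited papers, and the intrusion-detection variants above would keep some of the entries this filter drops.

```python
import os
from urllib.parse import urlparse

# Illustrative values; a real deployment would tune both lists.
MULTIMEDIA_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".mp3", ".mp4", ".avi"}
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_multimedia(url):
    """True if the URL path ends in a multimedia file extension."""
    path = urlparse(url).path
    return os.path.splitext(path)[1].lower() in MULTIMEDIA_EXTENSIONS


def is_robot(entry):
    """Crude robot test: a robots.txt request or a robot-like user agent."""
    agent = entry.get("agent", "").lower()
    return entry["url"] == "/robots.txt" or any(m in agent for m in ROBOT_MARKERS)


def is_successful(status):
    """Successful means a 200-series code (not under 200 and not over 299)."""
    return 200 <= status <= 299


def clean(entries):
    """Keep only successful, non-robot, non-multimedia requests."""
    return [
        e for e in entries
        if not is_multimedia(e["url"]) and not is_robot(e) and is_successful(e["status"])
    ]


# Hypothetical parsed entries.
entries = [
    {"url": "/index.html", "status": 200, "agent": "Mozilla/5.0"},
    {"url": "/img/logo.GIF?v=1", "status": 200, "agent": "Mozilla/5.0"},
    {"url": "/robots.txt", "status": 200, "agent": "Googlebot/2.1"},
    {"url": "/about.html", "status": 200, "agent": "Googlebot/2.1"},
    {"url": "/missing.html", "status": 404, "agent": "Mozilla/5.0"},
]
cleaned = clean(entries)
print([e["url"] for e in cleaned])
```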

Attributes Involved in Web Log Cleaning:
• HTTP methods:
  • Few studies have included the HTTP method as an attribute in web log cleaning.
  • In the LODAP data cleaning module, all log entries with an HTTP request method other than GET are removed, as they are not significant in web usage mining.
  • For intrusion detection:
    • One proposal: HTTP requests with the POST method should be kept.
    • Another proposal: keep the log entries with HTTP GET and HEAD requests to obtain more accurate referrer information.
• Other files:
  • Log entries requesting accessorial resources (e.g. CSS files) embedded in an HTML file should be removed.

Algorithm Design of Newest Methods:
• This method is used for data cleaning and intrusion detection from log files.
• A total of six cleaning conditions are applied.

First:
• Logs with HTTP status code 200 are removed (the probability that web logs meeting this criterion contain malicious web attacks is almost zero).

Second:
• Web logs with multimedia file extensions are removed if:
  • the HTTP request in the web log is not HTTP POST, and
  • the HTTP status code is not in the 400 or 500 series.
• Web logs with 400-series and 500-series status codes should be kept, as they may indicate a malicious attempt.
• Users who trigger many web logs with HTTP error status codes are suspect.
• Commonly, to launch a web defacement attack, the attacker uses the HTTP POST method to replace part or all of the web interface components.

Third:
• Legitimate web robot requests, such as those from Googlebot, are removed.
• Specific IP addresses are included in a web robot IP whitelist.
• Web logs containing web robot requests from whitelisted IP addresses are removed.

Fourth:
• Web logs with a legitimate file extension (.css, .pdf, .txt and .doc) are removed if the status code is not in the 400 or 500 series and the HTTP method is not POST.

Fifth:
• Web logs with the HTTP HEAD method (used in web monitoring systems) and a legitimate IP address are removed.
• A large number of HTTP HEAD requests may indicate malicious web robot activity.

Sixth:
• Web logs with the HTTP POST method are removed if the posted file is legitimate.
• For instance, a web log with a .svc file extension in uri-stem and the HTTP POST method is legitimate: a .svc file is a special content file that represents a Windows Communication Foundation (WCF) service hosted in IIS.
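Taken literally as worded on these slides, the six conditions can be sketched as an ordered chain of checks. The Python below is only an illustration under stated assumptions: the multimedia extension subset, both IP lists, the "bot" user-agent test and the name `removal_condition` are hypothetical, and the original paper may qualify the conditions in ways the slides do not show.

```python
import os

# Illustrative values only; the slides do not list the actual extensions or IPs.
MULTIMEDIA_EXT = {".jpg", ".gif", ".png", ".mp4"}
OTHER_EXT = {".css", ".pdf", ".txt", ".doc"}   # fourth condition
LEGIT_POST_EXT = {".svc"}                      # sixth condition
ROBOT_WHITELIST_IPS = {"192.0.2.50"}           # hypothetical whitelist
MONITORING_IPS = {"192.0.2.60"}                # hypothetical monitoring hosts


def extension(url):
    """Lower-cased file extension of the URL path, ignoring any query string."""
    return os.path.splitext(url.split("?", 1)[0])[1].lower()


def removal_condition(entry):
    """Return the number (1-6) of the first condition that removes the entry, else None."""
    ext = extension(entry["url"])
    is_error = 400 <= entry["status"] <= 599
    is_post = entry["method"] == "POST"
    is_robot = "bot" in entry["agent"].lower()   # crude robot test, an assumption

    if entry["status"] == 200:
        return 1
    if ext in MULTIMEDIA_EXT and not is_post and not is_error:
        return 2
    if is_robot and entry["ip"] in ROBOT_WHITELIST_IPS:
        return 3
    if ext in OTHER_EXT and not is_post and not is_error:
        return 4
    if entry["method"] == "HEAD" and entry["ip"] in MONITORING_IPS:
        return 5
    if is_post and ext in LEGIT_POST_EXT:
        return 6
    return None  # entry is kept for intrusion analysis


# Hypothetical entry: an image request that failed with 404 is kept.
suspicious = {"ip": "198.51.100.7", "method": "GET", "url": "/logo.gif",
              "status": 404, "agent": "Mozilla/5.0"}
print(removal_condition(suspicious))
```

Returning the condition number rather than a boolean makes it easy to count how many entries each condition removes.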

Implementation
• Web log format: Internet Information Services (IIS) log format.
• Attacks were simulated using three web vulnerability assessment tools:
  • Acunetix (run on Microsoft Windows)
  • Nikto and w3af (run on BackTrack GNOME)
• An e-commerce site web server is configured to send web logs to the log collector server via User Datagram Protocol (UDP).

Figure: Architectural Diagram for Simulation Attack and Web Log Collection

Comparison of existing frameworks:

[1]: Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K. 2011
[2]: Patil, P., Patil, U. 2012
[3]: Yew Chuan Ong and Zuraini Ismail 2014

• [1] and [2] considered only three file extensions (.jpg, .gif and .css).
• Algorithm [3] defined a total of sixteen multimedia file extensions plus four other file extensions.

Comparison Factor                 [1]                            [2]    [3]
Multimedia Files                  Yes                            Yes    Yes
Web Robots Request                No                             No     Yes
HTTP Status Code                  200, 400 series, 500 series    200    200, 400 series, 500 series
HTTP Method                       No                             GET    GET, POST, HEAD
Other Files                       Yes                            Yes    Yes
Number of Rules and Conditions    2                              1      6

Evaluation of existing frameworks:

Evaluating the cleaning capability:
• Size of the web log file in bytes.
• Number of web log entries, based on the total number of lines in the web log file.

Percentage of reduction = (total # of web log entries removed / total # of web log entries) × 100%
The higher the percentage of reduction, the better the cleaning capability.

Evaluating the intrusion detection readiness:

False negative rate = total # of malicious requests removed / total # of malicious requests
The lower the false negative rate, the better the intrusion detection readiness.

Measuring Factor               [1]        [2]         [3]
File Size Reduced (bytes)      6945603    32423581    18957149
Number of Entries Removed      52916      215616      153372
Percentage of Reduction (%)    13.94      56.81       40.41
False Negative Rate            0.00144    0.15789     0.00531

Algorithm [3] has the second highest percentage of reduction and the second lowest false negative rate of the three algorithms.
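The two evaluation measures are simple ratios and can be written directly as functions. In this Python sketch the function names and the input counts are hypothetical; the slides report only the resulting values, not the total number of entries or of malicious requests, so the table figures are not reproduced here.

```python
def percentage_of_reduction(entries_removed, total_entries):
    """Share of log entries removed by cleaning, in percent (higher is better)."""
    return 100.0 * entries_removed / total_entries


def false_negative_rate(malicious_removed, total_malicious):
    """Share of malicious requests wrongly removed by cleaning (lower is better)."""
    return malicious_removed / total_malicious


# Hypothetical counts, chosen only to illustrate the arithmetic.
reduction = percentage_of_reduction(entries_removed=400, total_entries=1000)
fnr = false_negative_rate(malicious_removed=3, total_malicious=600)
print(reduction, fnr)
```

The two measures pull in opposite directions: removing more entries raises the percentage of reduction but risks discarding malicious requests, which raises the false negative rate.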


User Identification:
• Identify each distinct user.
• User identification is one way of introducing state into the stateless web.
• A very complex task because of proxy servers and caches.

1. User identification by IP address:
• “Each different IP address represents a different user.”
• Problems:
  • Several users can share the same IP address or computer (e.g. at a college or an internet café).
  • One user can have several IP addresses, since a user who accesses the Web from different machines will have a different IP address on each.

2. User identification using user registration data:
• If users log in with their information, it is easy to identify them. Username and password are also stored in the web log files.
• Problems:
  • This facility is not available on every website, so it is not suitable for general web browsing.
  • Many users do not register their information.

3. User identification using cookies:
• Cookies are HTTP headers in string format. Using cookies, we can extract details of users and of the resources they access.
• Problems:
  • Users can block the use of cookies.
  • Users can delete cookies.

Two heuristics have been proposed that can help identify unique users:
• (P. Pirolli, J. Pitkow, and R. Rao) and (K.R. Suneetha, Dr. R. Krihnamoorthi (2009)) proposed that even if the IP address is the same, a change of browser software or operating system in the agent log indicates a different user:
  “Each different agent type for an IP address represents a different user.”

User 1: A → B → E → K → I → O → E → L
User 2: A → C → G → M → H → N
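The agent-based heuristic amounts to grouping requests by the pair (IP address, user agent). A minimal Python sketch follows, assuming each log record is a tuple of IP, agent string and page; the IP address, both agent strings and the function name `identify_users` are hypothetical values chosen to reproduce the two paths above.

```python
from collections import defaultdict


def identify_users(records):
    """Group requested pages by (IP address, user agent); each pair is one user."""
    users = defaultdict(list)
    for ip, agent, page in records:
        users[(ip, agent)].append(page)
    return dict(users)


# Hypothetical log: one shared IP address, two different agent strings.
IP = "203.0.113.5"
AGENT_1 = "Mozilla/4.0 (Windows)"
AGENT_2 = "Mozilla/5.0 (Linux)"
records = [(IP, AGENT_1, page) for page in "ABEKIOEL"]
records += [(IP, AGENT_2, page) for page in "ACGMHN"]

users = identify_users(records)
for key, pages in users.items():
    print(key, " -> ".join(pages))
```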

Another heuristic (L. Chaofeng (2006) and V. Chitraa (2010)):
• Use the access log in conjunction with the referrer log and site topology to construct browsing paths for each user:
  “If a page is requested that is not directly reachable by a hyperlink from any of the pages visited by the user, assume that there is another user with the same IP address.”

Following the referrer field along user 1’s path through the Web site:
• Unexpectedly, there is no referrer shown for the I.html request.
• There is no direct link between K.html and I.html, so it appears highly unlikely that the user who was traversing A → B → E → K then proceeded to I.
• It is more likely that the request for I.html came from a third user with the same browser version and operating system, who accessed the page directly, probably by entering the URL into the browser.

User 1: A → B → E → K → E → L
User 2: A → C → G → M → H → N
User 3: I → O
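The topology heuristic can be sketched as follows for the requests of a single IP/agent pair. This Python example is a simplified illustration: the `links` dictionary is a hypothetical site topology invented so that the example reproduces the split of user 1's path shown above, and the function name is an assumption. It uses only the topology, not the referrer field.

```python
def split_users_by_topology(pages, links):
    """Split one IP/agent pair's time-ordered requests into separate users.

    A page joins the first user who has already visited a page that links
    to it; if no such user exists, a new user is assumed.
    """
    users = []  # each user is the list of pages attributed to them
    for page in pages:
        owner = None
        for path in users:
            if any(page in links.get(visited, set()) for visited in path):
                owner = path
                break
        if owner is None:
            owner = []
            users.append(owner)
        owner.append(page)
    return users


# Hypothetical topology: page -> set of pages it links to.
links = {"A": {"B"}, "B": {"E"}, "E": {"K", "L"}, "I": {"O"}}
requests = ["A", "B", "E", "K", "I", "O", "E", "L"]
paths = split_users_by_topology(requests, links)
print(paths)
```

Because no visited page links to I, the request for I opens a new user, and O follows it there; the second request for E and the request for L stay with the original user.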

• P. Yeng, Y. Zheng (2010) focused solely on user identification through inspired rules:
  • Four constraints are used to identify users: IP address, agent information, site topology and time information.
  • Efficiency is low, but accuracy increased significantly.
• Renáta Iváncsy and Sándor Juhász analyze different user identification methods in “Analysis of Web User Identification Methods”.
• Heuristics are not error-proof.
• Different heuristics must be selected depending on the situation and application.


Session Identification (Sessionization, Session Reconstruction):

Session definition:
• The group of activities performed by a user from the moment they enter the website to the moment they leave it.
• A set of user clicks across Web servers, usually referred to as a click stream, is defined as a user session.
• The sequence of web pages a user browses in a single access.

Session identification:
• Grouping the different activities of a single user.
• The process of segmenting the access log of each user into individual access sessions.

Session identification goal:
• Group the page accesses of each user into individual access sessions.
• Identify how much time each user has spent on the website.
• Each heuristic h scans the user activity logs into which the web server log is partitioned.

Two general approaches:
• Time-oriented heuristic methods
• Navigation-oriented heuristic methods

Session Identification:
• Time-oriented heuristic methods: a set of pages visited by a specific user is considered a single user session if the pages are requested within a time interval no larger than a specified time period.

First heuristic:
• The total session duration may not exceed a threshold θ.
• t0: the timestamp of the first request in a constructed session S.
• “The request with a timestamp t is assigned to S, iff t − t0 ≤ θ” (Liu, 2007).
• θ = 30 min has been recommended from empirical findings (Spiliopoulou, Mobasher, Berendt, & Nakagawa, 2003).

Second heuristic (page-stay-time-based method):
• The total time spent on a page may not exceed a threshold δ.
• t1: the timestamp of the request most recently assigned to the constructed session S.
• The next request, with timestamp t2, is assigned to S iff t2 − t1 ≤ δ (Liu, 2007).
• A conservative threshold for page-stay time of δ = 10 min has been proposed to capture the time for loading and studying the contents of a page (Spiliopoulou et al., 2003).
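Both time-oriented heuristics reduce to a single pass over one user's time-ordered request timestamps. The Python sketch below uses the 30-minute and 10-minute thresholds quoted above; the function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def sessions_by_duration(times, theta=timedelta(minutes=30)):
    """First heuristic: a request joins session S iff t - t0 <= theta."""
    sessions = []
    for t in times:
        if sessions and t - sessions[-1][0] <= theta:
            sessions[-1].append(t)
        else:
            sessions.append([t])  # t becomes t0 of a new session
    return sessions


def sessions_by_page_stay(times, delta=timedelta(minutes=10)):
    """Second heuristic: a request joins session S iff t2 - t1 <= delta."""
    sessions = []
    for t in times:
        if sessions and t - sessions[-1][-1] <= delta:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Hypothetical requests at minutes 0, 8, 20 and 35 after noon.
base = datetime(2024, 1, 1, 12, 0)
times = [base + timedelta(minutes=m) for m in (0, 8, 20, 35)]

print([len(s) for s in sessions_by_duration(times)])
print([len(s) for s in sessions_by_page_stay(times)])
```

On the same input the two heuristics cut the click stream differently: the duration rule splits only when 30 minutes have elapsed since the session started, while the page-stay rule splits at every gap longer than 10 minutes.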

Session Identification:
• Navigation-oriented heuristic methods: assume that web users reach pages by following hyperlinks rather than by typing URLs.

Topology-based heuristic:
• “If a web page is not connected with previously visited page in a session, then it is considered as a different session.”

Referrer-based heuristic (Cooley et al. (1999)), based on the referrer information:
• The referrer of a requested page P should be a page already in the session (a previously visited page); otherwise P is assigned to a different session.
• If the page has an empty referrer, it is likely to be the first page of a new session.
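A minimal Python sketch of the referrer-based heuristic follows, assuming each request of one user is a (page, referrer) pair with `None` for an empty referrer. The sample requests and the function name are hypothetical, and searching the most recent session first is a design choice of this sketch rather than something the slides specify.

```python
def sessions_by_referrer(requests):
    """Assign each (page, referrer) request to a session containing its referrer.

    Sessions are searched from most recent to oldest; a request whose
    referrer is empty or unknown starts a new session.
    """
    sessions = []
    for page, referrer in requests:
        target = None
        if referrer is not None:
            for session in reversed(sessions):
                if referrer in session:
                    target = session
                    break
        if target is None:
            target = []
            sessions.append(target)
        target.append(page)
    return sessions


# Hypothetical click stream of one user.
requests = [
    ("A", None), ("B", "A"), ("C", "B"),
    ("X", None),            # empty referrer: first page of a new session
    ("Y", "X"),
    ("D", "C"),             # referrer belongs to the first session
]
print(sessions_by_referrer(requests))
```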

Session Identification:
• Spiliopoulou evaluates different heuristics in “A Framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis”:
  • Time-based methods are not reliable, because users may become involved in other activities after opening a web page.
  • Referrer-based heuristics are more restrictive than topology-based heuristics, because there are cases where a page request has an empty referrer.
  • Different methods are used by different applications.
  • Experiments showed that there is no best heuristic for all cases.
  • Even for a simple application, two variations in the method of assessing reconstruction quality led to significantly different precision scores among the heuristics.
• G. Shivaprasad, N.V. Subba Reddy, U. Dinesh Acharya and Prakash K. Aithal (2016) proposed:
  • A combined technique based on both kinds of heuristic for session identification.
  • It uses web topology and page-stay time.

Session Identification (time-oriented heuristic example):

For user 1, there is more than a 30-minute delay between the request for page K.html and the second request for page E.html, so the sessions are:

Session 1 (user 1): A → B → E → K
Session 2 (user 2): A → C → G → M → H → N
Session 3 (user 3): I → O
Session 4 (user 1): E → L


Path Completion:
• A critical phase of preprocessing.
• The number of URLs recorded in the log may be less than the real number:
  • Some important page requests are not recorded in the server log because of proxy servers, local caching and use of the browser's Back button.
• Definition:
  • “The process of reconstructing the user’s navigation path, by appending missed page requests (page requests that are not recorded in server log) in order to analyze the data in a proper way within the identified sessions”.
• Used to obtain the complete user access path.

Path Completion:
• Methods similar to those used for user identification can be used for path completion.
• Heuristic methods based on the referrer log and site topology are employed.

Cooley, R., Mobasher, B., & Srivastava, J. (1999): missing pages are added as follows.
• Each page request is checked to see whether it is directly linked to the last page.
• If there is no link from the last page, the recent history is checked.
• If the page is found in the recent history, it is assumed that the “Back” button was used, with cached pages being revisited until the linking page was reached.

Path Completion:

Considering session 2:

Session 2 (user 2): A → C → G → M → H → N

There is no direct link between page M.html and page H.html. Therefore, the user is presumed to have hit the “Back” button on the browser twice.

The path completion process leads us to insert “→ G → C” into the session path for session 2:

Session 2 (user 2): A → C → G → M → G → C → H → N
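The Back-button reconstruction for session 2 can be sketched with a navigation stack. In this Python example the `links` topology is a hypothetical one chosen to match the example (in particular, it assumes C links to H), and the function name is an assumption; it illustrates the idea rather than the exact procedure of Cooley et al.

```python
def complete_path(session, links):
    """Insert pages presumably revisited from cache via the Back button.

    `links` maps each page to the set of pages it links to. When a request
    is not linked from the current page, walk back through the navigation
    stack to the most recent page that does link to it.
    """
    completed, stack = [], []
    for page in session:
        if stack and page not in links.get(stack[-1], set()):
            # Find the most recent earlier page that links to `page`.
            depth = None
            for i in range(len(stack) - 2, -1, -1):
                if page in links.get(stack[i], set()):
                    depth = i
                    break
            if depth is not None:
                # Pages passed through while pressing Back, newest first.
                completed.extend(reversed(stack[depth:-1]))
                del stack[depth + 1:]
        completed.append(page)
        stack.append(page)
    return completed


# Hypothetical topology consistent with the example above.
links = {"A": {"C"}, "C": {"G", "H"}, "G": {"M"}, "H": {"N"}}
session_2 = ["A", "C", "G", "M", "H", "N"]
print(" -> ".join(complete_path(session_2, links)))
```

If no earlier page links to the request, the sketch leaves the path unchanged at that point instead of guessing.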


Main References:

1. Ong, Y. C., & Ismail, Z. (2014). Enhanced Web Log Cleaning Algorithm for Web Intrusion Detection. In Recent Advances in Information and Communication Technology (pp. 315-324). Springer International Publishing.

2. Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K.: Web Server Logs Preprocessing for Web Intrusion Detection. Computer and Information Science 4, 123–133 (2011)

3. Patil, P., Patil, U.: Preprocessing of web server log file for web mining. World Journal of Science and Technology 2, 14–18 (2012)

4. Cooley, R., Mobasher, B., Srivastava, J.: Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems 1 (1999).

5. Das, R., Turkoglu, I.: Creating meaningful data from web logs for improving the impressiveness of a website by using path analysis method. Expert Systems with Applications 36(3), 6635–6644 (2009)

6. P. Yeng, Y. Zheng. (2010). Inspired Rule-Based User Identification, LNCS 6440, pp. 618-624.

7. K.R. Suneetha, Dr. R. Krihnamoorthi. (2009). Identifying User Behavior by Analyzing Web Server Access Log File, IJCSNS, 2009.

8. …

Questions?
