Identifying the linkability between Web servers for Enhanced Internet Computing

Ammar El Halabi, Ali Hachem, Louay Al-Akhrass, Hassan Artail, and Habib Ullah Khan*
Electrical and Computer Engineering Department, American University of Beirut, Beirut, Lebanon
Emails: {aae48, ahh42, lma22, hartail} @ aub.edu.lb
*College of Business and Economics, Qatar University, Doha, Qatar. Email: [email protected]

17th IEEE Mediterranean Electrotechnical Conference (MELECON 2014), Beirut, Lebanon, 13-16 April 2014.

Abstract— There are some methods to track general users' behavior on the Internet; however, they are private to companies and are not meant for identifying referral flows through public Web servers, especially when these flows span multiple hops. We present a public method for such tracking, with the end goal of studying the interactions among Web servers. The outcome of this study could lead to building partnerships among Web servers, and could be used for deciding on the particular services to be offered by cloud servers. As a proof of concept, we implemented a prototype in Java and used a message broker to provide inter-server communication. Experimental results were used to build graphs that illustrate the referrals among the servers.

Index Terms— Cloud Computing; Session Tracking; Collaborating Web Servers; Click Stream; ActiveMQ

I. INTRODUCTION AND RELATED WORK

The Internet has witnessed an exponential growth of online businesses over the last two decades, with the number of live websites reaching about 367 million in 2011 [1]. In addition, the number of users has also increased significantly since the nineties [2]. This growth shows that studying the different aspects of Web surfing is of major importance, especially for Internet flow analysis. Such analysis provides insights into the future and helps website providers modify the attributes of their sites, and hence become more effective in reaching the intended customer base and increasing profits.

Many tools for Internet flow analysis already exist, but they mostly aim at providing tracking data for a single website using cookies. Our work, in contrast, aims to map the user click stream onto a graph whose vertices are the visited web servers and whose edges are the user clicks on links (i.e., navigations from one server to another). For this, several technical challenges have to be overcome, including working within the design of the HTTP protocol and finding a way to track the user's session across the visited web servers. Among the desired requirements is user privacy, which should not be violated while tracking the sessions. The developed mechanism should accommodate any number of servers generating click stream reports. Finally, the implemented scheme should not affect the response time of the user who is navigating the servers, and the whole process should remain transparent to him or her.

Few previous works have tackled the topic of tracking a user's click stream in a web session. Our first task was to define a Web session in order to track the user's behavior. Previous research in this area suggested five minutes of inactivity as a means to end a session [3]. Using this criterion, another study found that over a period of two months, a user's click stream is split into an average of 520 sessions. It was also found that a typical session lasts for about ten minutes and includes about sixty requests to twelve different Web servers. Due to the strong dependence on the particular timeout used, the work in [4] sought an alternative for defining a session. An algorithm was devised to segment the user's click stream into many logical sessions and to assign each Web request to the session with the most recent use of the referring URL. A logical session connects requests related to the same browsing behavior.
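The assignment rule of [4] can be sketched as follows; the record fields and names below are illustrative, not taken from [4]:

import java.util.*;

// Sketch of the logical-session rule from [4]: a request joins the session
// in which its referring URL was most recently used; otherwise it starts a
// new session.
class LogicalSessions {
    record Request(String url, String referrer) {}

    static Map<Request, Integer> segment(List<Request> clickStream) {
        Map<Request, Integer> sessionOf = new LinkedHashMap<>();
        Map<String, Integer> lastSessionUsingUrl = new HashMap<>(); // URL -> session
        int nextSession = 0;
        for (Request r : clickStream) {
            Integer s = lastSessionUsingUrl.get(r.referrer()); // most recent use
            if (s == null) s = nextSession++;                  // start a new session
            sessionOf.put(r, s);
            lastSessionUsingUrl.put(r.url(), s);               // record this use
        }
        return sessionOf;
    }
}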

On the other hand, the authors in [5] describe how to track the activity of a user who visits certain websites and how to build sessions. There are several methods to track users, such as cookies and URL rewriting. Cookies could be provided by a server, or application cookies could be used to organize and provide them to users. The URL rewriting method rewrites each URL on each page and includes a special tracking ID in the query string of the URL. Using several tracking attributes, such as cookies, query strings, IP addresses, user agents, and timestamps, sessions can be built. A session is defined as the sequence of pages viewed and actions taken by a single user. There are two ways to end a session: a period of inactivity (for example, 30 minutes), or the user logging out from a website that provides authentication. When a user visits a web page for the first time, the first request made by the user's browser to the web server does not contain a user-tracking cookie (the de-heading problem), while subsequent requests to the web server carry cookies with them. The de-heading problem causes two issues: loss of the session referrer (it does not tell how the visitor arrived at the web server) and an overcount of sessions (the first request will have its own session since it does not carry an ID or a cookie). The solution to the de-heading problem proposed in [5] is re-heading. First, the web log records are subset by the user agent. That is, the records are divided into those that have no ID (the very first request to the web server, or requests from a user who turned off cookies in the browser) and those that have an ID (for example, cookies from a server). Records are then compared to see which share some attributes (such as an IP) in order to move the records that have no ID to the records that have an ID. Time stamps are also said to help in determining which session a NoID record belongs to.
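A minimal sketch of this re-heading idea follows, under the simplifying assumption that a log record carries only a user agent, an IP, an optional cookie ID, and a timestamp (the field names are ours, not from [5]):

import java.util.*;

// Sketch of re-heading [5]: attach each "NoID" record (no cookie) to the
// session of an identified record that shares the same user agent and IP
// and is closest in time.
class ReHeading {
    record LogRecord(String userAgent, String ip, String cookieId, long time) {}

    static Map<LogRecord, String> reHead(List<LogRecord> log) {
        Map<LogRecord, String> sessionOf = new HashMap<>();
        for (LogRecord r : log)
            if (r.cookieId() != null) sessionOf.put(r, r.cookieId());
        for (LogRecord noId : log) {
            if (noId.cookieId() != null) continue;
            LogRecord best = null;
            for (LogRecord r : log) {                    // same user agent + IP
                if (r.cookieId() == null) continue;
                if (!r.userAgent().equals(noId.userAgent())) continue;
                if (!r.ip().equals(noId.ip())) continue;
                if (best == null || Math.abs(r.time() - noId.time())
                                  < Math.abs(best.time() - noId.time())) best = r;
            }
            if (best != null) sessionOf.put(noId, best.cookieId()); // re-head
        }
        return sessionOf;
    }
}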

Finally, Glommen in his work in [6] attempted to obtain a flow path using cookies and timestamp data from web servers' log files. However, this assumes allowing a centralized authority access to the log files of the web servers, which would give rise to several problems.

II. DESIGN OF SESSION TRACKING MECHANISM

A high-level overview of the proposed system is illustrated in Figure 1. At the core of our system's design is the HTTP protocol. An HTTP message contains many header fields, one of which is the referrer field; it contains the URL of the page from which the user followed a link to the current one. Two scenarios for the tracking mechanism were considered. In both, the system assigns an ID to each user session, which serves as a unique identifier of the session local to the server. The first scenario uses URL rewriting to append to the query string the unique ID which identifies the session, along with the URL of each website visited. Hence, the last website visited will contain in its query string the ID and the entire click stream. Below is an example of how to append data to the query string of a URL:

http://www.fyp.com/reviews.asp?article=24386;sessionid=IE50076359
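For illustration, the corresponding rewriting step can be sketched in Java; the method name is ours, and the ';' separator and the "sessionid" parameter simply mirror the example above:

// Minimal sketch: append a session ID to a URL's query string,
// mirroring the example URL above.
class UrlRewriter {
    static String appendSessionId(String url, String sessionId) {
        String sep = url.contains("?") ? ";" : "?";
        return url + sep + "sessionid=" + sessionId;
    }
    // appendSessionId("http://www.fyp.com/reviews.asp?article=24386", "IE50076359")
    // returns the example URL shown above.
}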

Figure 1. General overview of proposed system

In the second scenario, the ID is perpetuated using URL rewriting, but the click stream, along with timestamps, is obtained by sending the consecutive URLs via a message broker, namely Apache ActiveMQ [7]. Our approach was implemented based on the second scenario. Below are the details of these scenarios, which are shown in Figures 2 and 3.

A. Tracking scenario using URL Rewriting
1. The user requests a webpage from server A (denoted the "first server"); his browser issues a GET request to the server.
2. Server A checks whether the user has a session ID in the query string. Since there is none, Server A creates an ID for him, injects this ID, appended to the URL of the webpage on server A, into all external links, and sends the webpage back to the user's browser (see the sketch after this list).
3. The user clicks on a link leading to a page on server B, causing his user agent to send a GET request to server B.
4. Server B receives the GET request, which includes in the query string the session ID plus the URL of Server A.
5. Server B modifies the external links to include the user's session ID, Server A's webpage URL, and its own webpage URL. The modified webpage is sent back to the user. Server B also sends a notification to Server A via a message broker; it knows the IP of Server A from the referrer field in the HTTP request.
6. When the user clicks on a link leading to a webpage on server C, the same procedure occurs.
7. The session ends when the last server does not receive a notifying message from a possible next server for a period of time, e.g., 20 minutes.
8. The last server sends the sequence of URLs to a Base Server, which does the processing, as we shall explain.
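As referenced in step 2, a minimal sketch of how a server might implement this handling with the Java Servlet API follows; loadPage and the naive link-rewriting regex are simplifications of ours, not the paper's implementation:

import java.io.IOException;
import java.util.UUID;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch of steps 1-2: create a session ID if the query string has none,
// then propagate it (plus this page's URL) on every external link.
public class RewritingServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String sessionId = req.getParameter("sessionid");
        if (sessionId == null) {                      // first server: no ID yet
            sessionId = UUID.randomUUID().toString(); // create one (step 2)
        }
        String page = loadPage(req.getRequestURI());  // hypothetical page lookup
        String rewritten = rewriteExternalLinks(page, sessionId,
                req.getRequestURL().toString());
        resp.setContentType("text/html");
        resp.getWriter().write(rewritten);            // send page back to browser
    }

    // Naive illustration: tack the tracking data onto every absolute link.
    private String rewriteExternalLinks(String html, String id, String self) {
        return html.replaceAll("href=\"(http[^\"]+)\"",
                "href=\"$1?sessionid=" + id + ";url=" + self + "\"");
    }

    private String loadPage(String uri) {
        // Stand-in for reading the requested JSP/HTML page.
        return "<html><a href=\"http://server-b.example/page\">to B</a></html>";
    }
}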

Figure 2. Illustration for the URL Rewriting Scenario

B. Tracking scenario using inter-server communication
In this scenario, each web server contains a database comprising two tables: ClickStream_Data and Ship_Data. The first table contains information about the sessions which started at this server and are still in progress (i.e., have not been terminated yet). The second table contains a list of session summaries waiting to be shipped to the Base Server, where they get analyzed and processed along with records from other "first servers". Each record of ClickStream_Data comprises the following data about a click on a link in a session (out of the ensemble of clicks starting at the server containing this record): the ID of the session to which the click belongs (SID), a timestamp reflecting the time of the click (TimeStamp), and the IP of the server visited by the user through the click (ServerIP). The second table (Ship_Data), on the other hand, contains all the IDs of sessions in which the server is the first visited server. Each of its records contains the ID of the session (SID), a flag (ShipReady), and a timestamp representing the last click in the session (Tlast). The relationship between the two tables is shown in the example of Table 1, with two sessions.

Table 1. Ship_Data and ClickStream_Data Tables
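In code, the two tables can be pictured as the following record types; this is a sketch, where the field names follow the text and the Java types are our choice:

// One row per click in a session that started at this server.
record ClickStreamData(
        String sid,        // ID of the session the click belongs to (SID)
        long timeStamp,    // time of the click (TimeStamp)
        String serverIp) { // server visited through the click (ServerIP)
}

// One row per session for which this server is the first visited server.
record ShipData(
        String sid,        // ID of a session that started at this server (SID)
        boolean shipReady, // set when the session is ready to be shipped
        long tLast) {      // timestamp of the last click in the session (Tlast)
}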

The scenario is described in the following list of steps:

1. The user opens a website residing on server A. His browser issues a GET request.

2. Server A checks that the Referrer field of the GET request contains no URL. It then retrieves its own IP - which is always the IP of the first server in the click stream - and appends to it a randomly generated code to form the session ID. Upon generating this code, Server A checks in table Ship_Data, which contains the IDs of all the sessions previously initiated at Server A, whether this code is unique. If it is not, it keeps creating new codes until a unique one is found (see the sketch after this list). The obtained string is used as an ID for the session and is stored in a new record in table ClickStream_Data (session ID, server IP, and a timestamp reflecting the start of the session). The server then injects this ID into all external links on the webpage which it sends to the browser.

3. The user clicks on a link on the webpage sent by Server A leading to a website residing on server B.

4. Server B extracts the ID mentioned above from the query string, injects it into all external links of its webpage, and sends back the webpage to the browser. It should be noted that Server B will not add any information to its own ClickStream_Data table.

5. Server B also extracts from the ID the IP of the server initiating the session (i.e., Server A), and uses it to send Server A its own IP, along with a timestamp reflecting the time of the visit to Server B.

6. When Server A receives the message from Server B, it creates a new record in ClickStream_Data comprising the IP of server B, the received time stamp, and the session ID obtained from the query string.

7. After receiving the webpage from Server B, the user may click on a link leading to a website residing on server C.

8. As was the case with Server B, Server C will send the modified webpage to the user.

9. Server C, which is able to extract the IP of Server A from the query string, sends to Server A its own IP, and the timestamp denoting the user’s click.

10. As was the case in Step 6, Server A inserts in ClickStream_Data a new record comprising the information about the visit to Server C.

11. The process continues until Server A receives no further messages from other web servers for the same session. To detect this, it uses a timeout scheme to decide on the end of the click stream session. Then, Server A inserts a record in the Ship_Data table, including the information mentioned at the top of this section.

12. Server A, after a short period of time, sends the data to the Base Server, as elaborated in the section that follows.
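As referenced in step 2, a sketch of how the session ID could be generated follows; the exact format of the random code is our assumption:

import java.security.SecureRandom;
import java.util.Set;

// Sketch of step 2: session ID = first server's IP + random code, retried
// until it does not collide with any ID already recorded in Ship_Data.
class SessionIdFactory {
    private static final SecureRandom RNG = new SecureRandom();

    static String newSessionId(String ownIp, Set<String> existingIds) {
        String id;
        do {
            id = ownIp + "-" + Long.toHexString(RNG.nextLong()); // random code
        } while (existingIds.contains(id)); // uniqueness check against Ship_Data
        existingIds.add(id);
        return id;
    }
}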

Figure 3. Illustration for the Second Scenario

C. Pooling the data
We call the process by which the web servers send the click stream data to the Base Server the pooling mechanism. As explained in the previous section, each server contains a database made up of two tables: ClickStream_Data and Ship_Data. In addition to the fields which we mentioned previously for the Ship_Data table (SID, Tlast), there is also a flag, named ShipReady, that is set to 1 when the records of ClickStream_Data for a session have become ready to be shipped to the Base Server. Every time a message corresponding to a click is received by the server (which is the first server in the click stream), a new record is created in ClickStream_Data containing the session ID, a timestamp reflecting the time of the click, and the IP of the server visited by the click.

For a given session, the flag ShipReady is initially set to 0. Every tcheck (e.g., 5 seconds), a process on the server gets the timestamp of the latest click in the session and stores it in the field Tlast in Ship_Data. After an amount of time equal to tthreshold passes beyond the last update of Tlast, the process sets ShipReady to 1. Once ShipReady is set to 1, the records of the session are ready to be sent at the next time tick, which occurs every Tship (e.g., 1 minute). However, if an update occurs before the data is sent, ShipReady is reset to 0. If at the time tick ShipReady is equal to 1, a process on the server extracts the records from ClickStream_Data and transfers them to the Base Server. Afterwards, the records are deleted from ClickStream_Data.

Every time an update to the session is received by the first visited server, Tlast and ShipReady in table Ship_Data will be reset accordingly, i.e., Tlast will be set to the timestamp corresponding to the last update, and ShipReady to 0.

In this case, the same process will occur, i.e., when tcurrent − Tlast ≥ tthreshold, ShipReady will be set to 1, and if at the next time tick ShipReady has a value of 1 (no new updates), the related data will be sent to the Base Server and appended to the previously sent data of the same session. Hence, the click stream of a session is sent in chunks of records.

If a session witnesses a period of 24 hours without any update, the session will be considered terminated, and the corresponding record will be deleted from Ship_Data.
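Putting the above rules together, a self-contained sketch of the pooling logic on a first server follows; the in-memory maps stand in for the two tables, the Base Server transfer is reduced to a print, and the tthreshold value is our assumption (the text does not fix it):

import java.util.*;

// Sketch of the pooling mechanism on a "first server". check() runs every
// T_CHECK, ship() runs every T_SHIP, and onClick() handles broker messages.
class PoolingSketch {
    static final long T_CHECK = 5_000;       // check period (5 s, from the text)
    static final long T_THRESHOLD = 60_000;  // idle time before ship-ready (assumed)
    static final long T_SHIP = 60_000;       // shipping tick (1 min, from the text)

    static class Session {
        long tLast;                          // Tlast: time of the latest click
        boolean shipReady;                   // ShipReady flag
        List<String> clicks = new ArrayList<>();
    }
    final Map<String, Session> shipData = new HashMap<>(); // SID -> session

    // Every T_CHECK: mark sessions idle for T_THRESHOLD as ship-ready.
    void check(long now) {
        for (Session s : shipData.values())
            if (now - s.tLast >= T_THRESHOLD) s.shipReady = true;
    }

    // Every T_SHIP: send ship-ready chunks to the Base Server, then clear.
    void ship() {
        for (Session s : shipData.values())
            if (s.shipReady) {
                System.out.println("to Base Server: " + s.clicks); // stand-in
                s.clicks.clear();
                s.shipReady = false;
            }
    }

    // A new click resets the idle timer and the flag.
    void onClick(String sid, String serverIp, long ts) {
        Session s = shipData.computeIfAbsent(sid, k -> new Session());
        s.clicks.add(serverIp + "@" + ts);
        s.tLast = ts;
        s.shipReady = false;
    }
}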

III. IMPLEMENTATION AND TESTING

To obtain a small network of servers, we used VMware, which allowed us to set up 9 virtual machines. VMware is software that leverages virtualization to transform datacenters into simplified cloud computing infrastructures and enables IT organizations to deliver flexible and reliable IT services [8]. Seven of these virtual machines were used as Web servers hosting Web applications, one was used as the Base Server, and the last one was used as a client which browses the websites on the Web servers. On each of the seven Web servers, we deployed a Web application that links to all the applications, including itself. These applications were developed using JSP, where each contained four Java classes: initiation, producer, receiver, and shipment (illustrated in Figure 4). The first class implements two main functions:

− Creating an ID for the session at the moment it starts if the server is a "First Server".

− Sending, in the event of a click in the session leading to the server in question, the corresponding data (session ID, the IP of the visited server, and a timestamp) to the "First Server", as elaborated in the design section.

The sending is performed in the first class using the producer class, which implements the ActiveMQ [7] functionality of the producer (the server sending the message) toward the receiver (another server), whose functionality is implemented in the receiver class. The last class (shipment) achieves the pooling of data as described in the design section; it also uses the producer class to send the corresponding data.
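For reference, the send path of the producer class might look as follows with the ActiveMQ JMS API; the queue name and the semicolon-separated message format are our assumptions:

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

// Minimal ActiveMQ producer sketch: send one click notification
// (session ID, visited-server IP, timestamp) to a first server's queue.
class ClickProducer {
    static void sendClick(String brokerUrl, String queueName,
                          String sid, String visitedIp, long ts) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory(brokerUrl);
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Destination queue = session.createQueue(queueName);
            MessageProducer producer = session.createProducer(queue);
            TextMessage msg = session.createTextMessage(sid + ";" + visitedIp + ";" + ts);
            producer.send(msg);
        } finally {
            connection.close();
        }
    }
}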

As for the "Base Server", it runs a Web Application which includes three java classes, two of which are BaseServer and GraphSet. The first is kept running all the time and has the function of receiving data from all possible "First Servers". The second class has the function of retrieving data from the "Base Server" database, and creating a list of clicks (Server1 - Server2) which is used to create the final graph where nodes represent the Web Servers, and the edge weights represent the number of clicks which referred the user from one server to the other. Also, this class computes how many times each click has been performed, using the list of clicks generated.

To test our system, we performed 173 sessions (click streams), surfing between the seven Web servers, named S1 to S7, with IPs 192.168.202.4 through 192.168.202.10. The click streams corresponding to these sessions were chosen randomly. Each session had an initiating server, which sent the resulting click stream to the Base Server (192.168.202.2), where the GraphSet class is used to obtain the list of clicks and to compute the weight of each edge (the total number of clicks from one server to the other across all the click streams).

Figure 4. Class Diagram illustrating the implemented classes

Figure 5 below shows a generated graph, where the vertices are the servers and each edge is a referral from one server to another, with the weight representing the number of referrals across all the sessions built. The graph was obtained using NodeXL, a free and open-source template for MS Excel 2007 used to explore network graphs [9].

Figure 5. Graph illustrating referrals between servers

In addition to direct visits from servers to others (shown in Figure 6), multi-hop visits from each server were obtained. Figure 7 shows the results for visits five hops away from server S2.


Figure 6. 1-hop-away visits from Server S2

Figure 7. 5-hop-away visits from Server S2

IV. CONCLUSION

This paper presented a system for tracking web server traversals that uses inter-server communication to report one-hop referrals, which are aggregated by first-visited servers before being sent to the base server for building or updating flow graphs. This system may have a large impact as an Internet flow analysis tool, as it would allow Web service providers to achieve major goals, some of which are:

1. Improving e-advertising: An e-business that finds out that it is heavily accessed through a particular website could advertise on other similar websites. For example, if the administrator of website A learns that it is frequently referred to by website B, she can study the background of that website and advertise on similar websites.

2. Offloading common services to cloud servers: Web servers that have services in common (as inferred from the referral traffic) can partner to establish, or outsource to, cloud servers that provide the common services or products identified as causing the referral traffic. This can help improve scalability and reduce Internet traffic by offloading the common services to the cloud, while potentially sharing the cost.

3. Improving competitiveness: This may be realized by checking the services provided on "visited-through" websites (one or several hops away), in order to compare them with one's own services and improve the latter.

4. Improving search engines: By including partner websites in the search results. For example, if a search engine displays website A in some results, it may also display the most visited websites that are referred to by website A.

5. Imposing fees on referrals: An e-business can make profit based on the referral volumes to other e-business websites. It can also relate the most frequent visits to certain services and increase their fees. In addition, it can strike partnerships with other websites if it turns out that they form a major source of traffic into its website.

Hence, on the economic level, our proposed framework could create new opportunities for expanding the range of services, improving competitiveness, increasing revenue, and entering into partnerships that allow websites to complement or supplement each other's services. On the social level, the study of individuals' behavior on the Internet would have a large impact. By understanding the source and type of incoming traffic, website administrators can understand their customers better and can tailor their marketing strategies accordingly. Many other benefits can be realized through the various types of statistics that can be derived from the data at the Base Server; this, however, is left for future work.

ACKNOWLEDGEMENT

This work was supported by a generous grant from the Lebanese National Council on Scientific Research (LNCSR) under grant #4194.

REFERENCES

[1] Statistic Brain, "Total number of websites" [Online]. Available: http://www.statisticbrain.com/total-number-of-websites/

[2] Miniwatts Marketing Group, "Internet World Stats", 2012 [Online]. Available: http://www.internetworldstats.com/stats.htm

[3] F. Qiu, Z. Liu, and J. Cho, "Analysis of user web traffic with a focus on search activities," in Proc. 8th International Workshop on the Web and Databases (WebDB), 2005, pp. 103-108.

[4] M. Meiss, J. Duncan, B. Gonçalves, J. Ramasco, and F. Menczer, "What's in a session: tracking individual behavior on the Web," in Proc. 20th ACM Conference on Hypertext and Hypermedia, 2009, pp. 173-182.

[5] D. Koch, J. Brocklebank, and R. Roach, "Mining Web Server Logs: Tracking Users and Building Sessions."

[6] C. Glommen et al., "Internet website traffic flow analysis," U.S. Patent 6,393,479, May 21, 2002.

[7] B. Snyder, D. Bosanac, and R. Davies, ActiveMQ in Action. Manning, 2011.

[8] VMware Inc., Introduction to VMware vSphere.

[9] NodeXL Network Graphs, The Social Media Research Foundation. [Online]. Available: http://nodexl.codeplex.com/
