Page 1: CHAPTER 6 SOLUTION TO NETWORK TRAFFIC PROBLEM IN …shodhganga.inflibnet.ac.in/bitstream/10603/67170/7/chapter- 6.pdf · SOLUTION TO NETWORK TRAFFIC PROBLEM IN MIGRATING PARALLEL

CHAPTER 6

SOLUTION TO NETWORK TRAFFIC PROBLEM IN MIGRATING PARALLEL CRAWLERS USING FUZZY LOGIC

6.1 Introduction

The properties of the Internet that make web crawling challenging are its large volume of data, its dynamic page generation and its rapid rate of change. A web crawler must be scalable and robust and must make efficient use of the available bandwidth; most crawlers are built around the same standard components. Politeness is an important issue that needs to be addressed when designing a web crawler. A crawler should not overload a web server by requesting a large number of web pages in a short interval of time. It should follow the restrictions outlined by website administrators and should identify itself when requesting pages. Crawlers therefore observe a waiting time between two successive requests to the same web server. This waiting time is called the request interval, and it is generally 30 seconds between two downloads. One way to enforce it is a shuffling mechanism inside the queue: the queue is scrambled into a random order so that URLs from the same web server are spread out evenly throughout the queue. Other crawlers, such as Mercator, implement the URL queue as a collection of sub-queues in which each domain has its own queue.
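The sub-queue scheme and the request interval can be sketched together in a few lines. The following Python sketch is an illustration only and is not Mercator's implementation: the class name `PoliteFrontier`, its methods and the injectable clock are hypothetical, and URLs are grouped by host name as a stand-in for the per-domain queues described above.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse


class PoliteFrontier:
    """URL frontier with one sub-queue per host and a fixed request interval."""

    def __init__(self, interval=30.0, clock=time.monotonic):
        self.interval = interval          # seconds between requests to one host
        self.clock = clock                # injectable so the sketch can be tested
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.next_ok = {}                 # host -> earliest time of next request

    def add(self, url):
        """Append a URL to the sub-queue of its host."""
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        """Return a URL whose host may be contacted now, or None if all must wait."""
        now = self.clock()
        for host, queue in self.queues.items():
            if queue and self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.interval
                return queue.popleft()
        return None
```

A caller that receives `None` would sleep briefly and retry; because each host has its own sub-queue, waiting on one server never delays downloads from another.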

6.2 Quality and Network Metrics

There is always scope to improve the quality of the data collected during a crawl. The ordering of the URL queue determines the type of search performed on the web graph. The queue can be ordered by taking into account the in-link factor of pages, and breadth-first search can also improve the quality of downloaded pages. Ordering by in-link factor has a weakness, however: the Internet contains a large number of infinitely branching crawler traps and spam sites whose pages are dynamically generated and designed to have a very high in-link factor. This section discusses two network metrics, geographic distance and latency.

6.2.1 Geographic Distance

There exist services on the Internet that map IP addresses to geographic information. These services parse registration data to derive latitude and longitude from the registrar's address data. If two hosts share a common latitude and longitude, they are often managed by the same ISP, because the coordinates come from the registered address rather than from the physical location of each host. Once the latitude and longitude have been obtained for a pair of Internet hosts, their geographic distance can be calculated as a great-circle distance using spherical coordinates on the earth.
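The text does not say which spherical formula is used. A common choice is the haversine form of the great-circle distance, sketched below under that assumption; the function name and the mean Earth radius of 6371 km are illustrative choices, not values from the chapter.

```python
import math


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine formula: numerically stable for small distances.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))
```

Two hosts that resolve to the same coordinates give a distance of zero, which is one reason the metric cannot separate hosts registered under a single ISP address.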

6.2.2 Latency

There are various ways of determining the round-trip time (RTT) between two Internet hosts. The first method uses the Unix ping utility and the second uses the traceroute utility. Ping sends ICMP ECHO requests; however, the ICMP replies are sometimes blocked or manipulated by ISPs. Traceroute sends TTL-restricted UDP packets, which might be blocked by some routers.

6.2.3 Correlation between Metrics

There is a strong correlation between latency and geographic distance. The observation is that for lower values of linearized distance, the correlation between distance and RTT is stronger. Linearized distance, the sum of the geographic lengths of the individual hops along a path, implies a minimum end-to-end RTT, because a signal cannot cover that distance faster than its propagation speed. Linearized distance and RTT are more strongly correlated than end-to-end distance and RTT.
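The lower bound that a path length places on RTT can be made concrete with a one-line calculation. The sketch below assumes a propagation speed of 200 km per millisecond, roughly two thirds of the speed of light in vacuum and a typical figure for optical fibre; this constant and the function name are assumptions, not values from the chapter.

```python
def min_rtt_ms(linearized_km, propagation_km_per_ms=200.0):
    """Smallest possible round-trip time for a path of the given length."""
    # The signal travels the path twice: once out and once back.
    return 2.0 * linearized_km / propagation_km_per_ms
```

Under this assumption a path of 1000 km cannot have an RTT below 10 ms; queuing and processing delays only add to that floor.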

6.3 Case Study of Crawler Load

Figure 6.1 illustrates the client throughput in the traditional and active networks. The vertical axis denotes the client throughput (the number of bits received by clients per simulation time unit) and the horizontal axis denotes the arrival rate of client requests [159]. The client throughput for active indexing with 0% overhead is proportional to that for the 0% crawler case. This establishes the comparability of the remaining cases. Both simulations reach a similar peak throughput of about 222 bits/tick; as the systems become saturated the throughput drops rapidly and then remains steady at about 140 bits/tick [159].

Figure 6.1: Client throughput in all cases [159]

Figure 6.2 shows the crawler throughput in the traditional network. The vertical axis denotes the number of bits received by crawlers per simulation time unit and the horizontal axis denotes the total request arrival rate. The requests originate from both human clients and crawlers [159].

Figure 6.2: Crawler throughput

Figure 6.3 illustrates the average client request delay for active indexing. The vertical axis denotes the average client response delay and the horizontal axis denotes the rate at which requests are generated by human clients. The average client delay in the traditional network with 20% or 40% crawler traffic is higher than in the active network [159].

Figure 6.3: Average Client request delay in all cases [159]

Figure 6.4: Total Request Arrival Time vs. Average Crawler Request Delay [159]

Figure 6.4 plots the average crawler request delay against the total request arrival rate. The two curves are similar, which implies that an increase in crawler load does not affect the delay seen by crawler sites [159].

Figure 6.5: Completed Client Request Rates in all cases [159]

Figure 6.5 illustrates the fraction of client requests that are completed in all cases. When the request arrival rate is low, all requests are satisfied. The 20% and 40% crawler cases show a significant drop in the rate at which client requests are completed [159].

6.4 Fuzzy Inference Systems and Fuzzy Logic

A fuzzy inference system (FIS) uses a fuzzy inference engine to derive answers from a knowledge database. The fuzzy inference engine is like the brain of an expert system: it provides the methods for reasoning with the information in the knowledge database and for formulating results. Fuzzy logic is an extension of Boolean algebra that deals with partial truth; it denotes the degree to which a proposition is true. In Boolean algebra everything is expressed in terms of binary values, zero and one. Fuzzy logic replaces these values with a level of truth, which is used to capture imprecise modes of reasoning. Such reasoning plays an important role in the human ability to make decisions in an atmosphere of imprecision and uncertainty. In fuzzy sets, the membership function plays the role that the indicator function plays in classical set theory. Membership functions are curves that map each point in the input space to a value between 0 and 1. Common shapes of membership functions are triangular, trapezoidal and bell curves. The input space is called the universe of discourse.

Fuzzy inference systems are conceptually simple and easy to implement. A fuzzy inference system consists of three stages: an input stage, a processing stage and an output stage. The input stage maps the inputs onto membership functions. The processing stage invokes each appropriate rule, generates a result for each rule and combines the results of the rules. The output stage then converts the combined result into a crisp output value.

The processing stage is referred to as the inference engine. The inference engine is based on a set of logic rules in the form of IF-THEN statements, in which the IF part is the “antecedent” and the THEN part is the “consequent”. A fuzzy inference system has n rules, which are stored in a knowledge database. Fuzzy inference proceeds in the following steps:

• Fuzzification of input values

• Application of fuzzy operators

• Application of implication methods

• Aggregation of outputs

• Defuzzification of results

Fuzzification of inputs is the process of determining the degree to which each input belongs to its fuzzy sets via membership functions. Defuzzification takes a fuzzy set as its input and produces a crisp value as its output. Two inference methods are commonly used in fuzzy systems. The first is Mamdani's fuzzy inference method, proposed by Ebrahim Mamdani in 1975, and the second, proposed in 1985, is the Takagi-Sugeno-Kang method of fuzzy inference. The two methods are similar in many ways, such as the fuzzification of the inputs and the application of fuzzy operators. The output membership functions in Sugeno’s method are either linear or constant, while in Mamdani’s inference the output membership functions are fuzzy sets. Sugeno’s method is computationally efficient, works well with optimization and adaptive techniques, and lends itself to mathematical analysis.
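The computational advantage of constant Sugeno outputs can be seen in a small example: for a zero-order Sugeno system the crisp output is simply the average of the rule constants weighted by the rule firing strengths, so no aggregated fuzzy set has to be built and defuzzified. The sketch below is a generic illustration in plain Python; the function name and the sample numbers are hypothetical and are not taken from the chapter.

```python
def sugeno_output(firing_strengths, constants):
    """Weighted average of constant rule outputs (zero-order Sugeno)."""
    total = sum(firing_strengths)
    if total == 0:
        raise ValueError("no rule fired")
    return sum(w * z for w, z in zip(firing_strengths, constants)) / total


# Two rules: one fires weakly with output 0.0, one strongly with output 1.0.
crisp = sugeno_output([0.25, 0.75], [0.0, 1.0])
```

A Mamdani system would instead clip an output fuzzy set for every rule, take their union and compute its centroid, which costs more but keeps the output in linguistic form.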

The quality is maintained by the crawling process. Web crawling follows one of two approaches: the web crawlers are either allowed to communicate with each other or not allowed to communicate among themselves. Both techniques put an extra burden on network traffic. Here a fuzzy logic based algorithm is proposed and implemented in MATLAB using the Fuzzy Logic Toolbox; it predicts the load at a particular node and the route of network traffic.

6.5 Proposed Solution

1. Using Fuzzy Inference System to Solve Network Traffic problem in migrating parallel Crawlers.

2. Defining FIS variables and fuzzification of the input variables using membership function editor

3. Specifying rules for Fuzzy inference system using Rule Editor for Network Traffic problem in Migrating parallel Crawlers.

4. Rule Evaluation

5. Aggregation of the rule output

6. Defuzzification of the output value.

6.6 Description

1. Using Fuzzy Inference System to Solve Network Traffic problem in migrating parallel Crawlers.

The theory of fuzzy logic is based on fuzzy sets. A fuzzy set is a set without a clearly defined crisp boundary. Each point in the input space is mapped to a membership value between 0 and 1, which is determined by a curve called the membership function. The tools used for building and editing fuzzy inference systems in the Fuzzy Logic Toolbox are:

1. Fuzzy Inference System (FIS) Editor

2. Membership Function Editor

3. Rule Editor

4. Rule Viewer

5. Surface Viewer

The Mamdani method is used because it is widely accepted for capturing expert knowledge. It allows us to describe the expertise in a more human-like manner.

2. Defining FIS variables and fuzzification of the input variables using membership function editor

gaussmf: gaussmf is the built-in Gaussian curve membership function in the Fuzzy Logic Toolbox. The syntax is y = gaussmf(x,[sig c]). The symmetric Gaussian function depends on two parameters σ and c as given by

$$f(x;\sigma,c) = e^{-\frac{(x-c)^2}{2\sigma^2}}$$

For example:

x=0:0.1:10;

y=gaussmf(x,[2 5]);

plot(x,y)

xlabel('gaussmf, P=[2 5]')

Figure 6.6(a): gaussmf curve

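trimf: trimf is the built-in triangular-shaped membership function in the Fuzzy Logic Toolbox. The syntax is y = trimf(x,params). With y = trimf(x,[a b c]), the triangular curve is a function of a vector x and depends on three parameters a, b and c:

$$f(x;a,b,c) = \begin{cases} 0, & x \le a \\ \dfrac{x-a}{b-a}, & a \le x \le b \\ \dfrac{c-x}{c-b}, & b \le x \le c \\ 0, & c \le x \end{cases}$$

or, more compactly,

$$f(x;a,b,c) = \max\left(\min\left(\frac{x-a}{b-a}, \frac{c-x}{c-b}\right), 0\right)$$

The first parameter a and the third parameter c locate the base of the triangle, and the second parameter b locates its peak. For example: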

x=0:0.1:10;

y=trimf(x,[3 6 8]);

plot(x,y)

xlabel('trimf, P=[3 6 8]')

Figure 6.6(b): trimf function
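For readers without the toolbox, the two membership functions can be reproduced in a few lines of plain Python. This is a sketch of the standard Gaussian and triangular formulas, not the toolbox code: the Python functions take scalar arguments and separate parameters, unlike the MATLAB calls, and the triangular version assumes a < b < c.

```python
import math


def gaussmf(x, sigma, c):
    """Gaussian membership: exp(-(x - c)^2 / (2 * sigma^2))."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def trimf(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b (a < b < c)."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)


# Same grid as the MATLAB examples, x = 0:0.1:10.
xs = [i / 10 for i in range(101)]
gauss = [gaussmf(x, 2, 5) for x in xs]   # parameters [2 5]
tri = [trimf(x, 3, 6, 8) for x in xs]    # parameters [3 6 8]
```

Both curves peak at a membership of 1, the Gaussian at x = 5 and the triangle at x = 6, and the triangle is exactly zero outside the interval from 3 to 8.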

Figure 6.6(c): FIS editor for Network Traffic Problem

Figure 6.7: FIS variable Communication

Figure 6.8: FIS variable Bandwidth

Figure 6.9: FIS variable Noise

Figure 6.10: FIS output variable NetworkTraffic

Figure 6.6(c) shows the FIS editor for the network traffic problem. Figures 6.7, 6.8 and 6.9 show the FIS input variables Communication, Bandwidth and Noise, and Figure 6.10 shows the FIS output variable NetworkTraffic.

3. Specifying rules for Fuzzy inference system using Rule Editor for Network Traffic problem in Migrating parallel Crawlers.

Communication   Bandwidth   Noise    Network Traffic
low             low         low      low
low             low         medium   low
low             low         high     low
low             medium      low      low
low             medium      medium   medium
low             medium      high     medium
low             high        low      medium
low             high        medium   medium
low             high        high     high
medium          low         low      low
medium          low         medium   medium
medium          low         high     medium
medium          medium      low      medium
medium          medium      medium   medium
medium          medium      high     medium
medium          high        low      medium
medium          high        medium   medium
medium          high        high     high
high            low         low      medium
high            low         medium   medium
high            low         high     medium
high            medium      low      medium
high            medium      medium   medium
high            medium      high     high
high            high        low      medium
high            high        medium   high
high            high        high     high

Table 6.1: Rules for FIS
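To show how the 27 rules of Table 6.1 turn three crisp inputs into one crisp output, the following Python sketch runs a Mamdani-style inference over them. It is written in plain Python rather than with the Fuzzy Logic Toolbox, and several choices are assumptions because the text does not state them: all variables are normalized to the range 0 to 1, each variable has three evenly spaced triangular sets (low, medium, high), AND and implication use min, aggregation uses max, and defuzzification uses the centroid. The actual membership functions appear only in Figures 6.7 to 6.10.

```python
from itertools import product

LEVELS = ("low", "medium", "high")

# Assumed triangular sets on a normalized [0, 1] universe (feet, peak, feet).
MFS = {"low": (-0.5, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.5)}

# Output column of Table 6.1 in row order: Communication varies slowest,
# Noise fastest. L = low, M = medium, H = high.
RULE_OUTPUTS = "LLLLMMMMH" "LMMMMMMMH" "MMMMMHMHH"
CODE = {"L": "low", "M": "medium", "H": "high"}
RULES = {combo: CODE[ch] for combo, ch in zip(product(LEVELS, repeat=3), RULE_OUTPUTS)}


def trimf(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)


def network_traffic(communication, bandwidth, noise, samples=101):
    """Crisp NetworkTraffic value for three crisp inputs in [0, 1]."""
    # Step 1: fuzzify each input against low / medium / high.
    inputs = (communication, bandwidth, noise)
    degree = [{lvl: trimf(v, *MFS[lvl]) for lvl in LEVELS} for v in inputs]

    # Steps 2-3: AND the antecedents with min; keep the strongest firing
    # strength for each output label.
    strength = dict.fromkeys(LEVELS, 0.0)
    for (c, b, n), out in RULES.items():
        w = min(degree[0][c], degree[1][b], degree[2][n])
        strength[out] = max(strength[out], w)

    # Steps 4-5: clip each output set at its strength, aggregate with max,
    # and take the centroid over a sampled universe.
    num = den = 0.0
    for i in range(samples):
        y = i / (samples - 1)
        mu = max(min(strength[lvl], trimf(y, *MFS[lvl])) for lvl in LEVELS)
        num += y * mu
        den += mu
    return num / den
```

Calling `network_traffic(0.2, 0.9, 0.8)`, for example, evaluates every rule whose antecedents have non-zero membership and returns a single value between 0 and 1; with different membership functions the numbers change but the rule table and the five steps stay the same.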

Figure 6.11: Rules Editor for Network Traffic Problem

4. Rule Evaluation, Aggregation of the rule output and Defuzzification of the output value.

Figure 6.12: Rule Evaluation Aggregation of the rule output

Figure 6.13: Surface Viewer for Network Traffic Problem

Table 6.1 lists the rules for the FIS. Figure 6.11 shows the Rule Editor for the network traffic problem, Figure 6.12 shows rule evaluation and aggregation of the rule output, and Figure 6.13 shows the Surface Viewer for the network traffic problem.

6.7 Result

The above module is integrated with the algorithm. The code is generated with the help of MATLAB Compiler. The implementation is run on existing websites and compared with existing web crawlers.

                            Page 1   Page 2   Page 3   Total Load in KB
visit 1                        185      185      185                555
visit 2                        193      196      195
visit 3                        188      189      199
visit 4                        200      201      205
visit 5                        188      199      188
load caused (visits 1-5)       954      970      972               2896
visit 6                        188      189      188
visit 7                        198      198      189
visit 8                        178      176      189
visit 9                        189      187      189
visit 10                       199      189      198
load caused (visits 1-10)     1906     1909     1925               5740

Table 6.2: Load caused using Conventional Crawler

                            Page 1   Page 2   Page 3   Total Load in KB
visit 1                         78       87       98                263
visit 2                         87       89       98
visit 3                         76       98       98
visit 4                         87       98       87
visit 5                         87       98       89
load caused (visits 1-5)       415      470      470               1355
visit 6                         87       89       87
visit 7                         78       98       98
visit 8                         98       76       98
visit 9                         87       97       98
visit 10                        78       98       87
load caused (visits 1-10)      843      928      938               2709

Table 6.3: Load caused using Single threaded Crawler

                            Page 1   Page 2   Page 3   Total Load in KB
visit 1                         35       35       37                107
visit 2                         36       37       37
visit 3                         43       36       45
visit 4                         34       45       57
visit 5                         34       43       43
load caused (visits 1-5)       182      196      219                597
visit 6                         43       53       43
visit 7                         43       34       34
visit 8                         45       54       43
visit 9                         34       43       45
visit 10                        34       34       45
load caused (visits 1-10)      381      414      429               1224

Table 6.4: Load caused using Agent Based Crawler

                            Page 1   Page 2   Page 3   Total Load in KB
visit 1                         23       23       24                 70
visit 2                         24       24       24
visit 3                         24       28       27
visit 4                         27       26       27
visit 5                         24       27       27
load caused (visits 1-5)       122      128      129                379
visit 6                         27       27       27
visit 7                         26       26       26
visit 8                         26       26       27
visit 9                         27       25       27
visit 10                        25       26       27
load caused (visits 1-10)      253      258      263                774

Table 6.5: Load caused using Migrating Parallel Web Crawler

Figure 6.14: Graph showing network load caused in various approaches

Tables 6.2 to 6.5 list the load caused by the conventional crawler, the single-threaded crawler, the agent-based crawler and the migrating parallel web crawler, respectively, and Figure 6.14 plots the network load caused by the various approaches. To analyze and compare the approaches, three websites were taken. The average size of an HTML page was 205 KB, and the network traffic generated on the first visit by the traditional centralized crawling approach was 555 KB. In our approach the pages were compressed at the server side, and the traffic load on the first visit was 70 KB. After five visits to the pages the load incurred was 2896 KB, 1355 KB, 597 KB and 379 KB respectively, and after ten visits the load was 5740 KB, 2709 KB, 1224 KB and 774 KB respectively, as shown in the above figure. These results show that the migrating parallel web crawler reduces network traffic.
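The size of the reduction can be read off the ten-visit totals directly. The short Python calculation below uses only the totals reported in Tables 6.2 to 6.5 and expresses each approach as a percentage saving relative to the conventional crawler; the function name is illustrative.

```python
# Total load after ten visits, in KB, taken from Tables 6.2 to 6.5.
conventional_kb = 5740
after_ten_visits_kb = {
    "single-threaded": 2709,
    "agent-based": 1224,
    "migrating parallel": 774,
}


def reduction_percent(baseline_kb, load_kb):
    """Percentage by which load_kb is lower than baseline_kb."""
    return 100.0 * (1.0 - load_kb / baseline_kb)


for name, load in after_ten_visits_kb.items():
    saving = reduction_percent(conventional_kb, load)
    print(f"{name}: {saving:.1f}% less load than the conventional crawler")
```

On these figures the migrating parallel web crawler generates roughly 86% less traffic than the conventional crawler over ten visits.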

6.8 Conclusion

In this chapter, the crawling process was discussed under two approaches: the crawlers are either allowed to communicate freely among themselves or not allowed to communicate among themselves at all. Both approaches put an extra burden on network traffic. A fuzzy logic based algorithm was proposed and implemented in MATLAB using the Fuzzy Logic Toolbox; it predicts the load at a particular node and the route of network traffic. The experimental results show that the migrating parallel web crawler reduces the network load.