
2008 IEEE International Conference on System of Systems Engineering (SoSE), Monterey, CA, USA, June 2-4, 2008

978-1-4244-2173-2/08/$25.00 ©2008 IEEE

Information Mining System Design and Implementation Based on Web Crawler

Shan Lin, You-meng Li, Qing-cheng Li
College of Information Technical Science

Nankai University Tianjin, 300072, CHINA

E-mail: [email protected], [email protected], [email protected]

Abstract - With the information explosion caused by the World Wide Web in recent years, how to process the enormous amount of information efficiently at a reasonable cost has become a concern of information providers, service agencies and end users. While much research focuses on how to design an efficient web crawler, we pay our attention to how to make the best use of a web crawler's results. In this paper, we describe the design and implementation of an information mining system that runs on the results of a web crawler to gain more metadata from unstructured documents for focused search (such as RSS search). We present the software architecture of the system, describe efficient techniques for achieving high performance and report preliminary experimental results showing that this system can address the issues of robustness, flexibility and accuracy at a low cost.

Keywords: Crawler, information mining, RSS, low cost.

1 Introduction

The explosive growth of the World Wide Web has dramatically changed people's lifestyles and ways of working. A study released in 2003 [1] estimated that the volume of directly accessible information on the Web is about 167 terabytes, comprising about 2.5 billion pages. According to the latest survey [2], by December 2007 the total number of Internet users in the world had grown to 1,320 million, a sharp increase of 265.6%. Although exponentially increasing amounts of material are available, finding and making sense of this potentially useful material is difficult with present search technology. How to make the best use of this huge body of data and manage the documents on the Internet efficiently has therefore become a very important task for information providers and web service agencies.

Our overall aim is to design a feasible and flexible distributed information mining system that makes the best use of the metadata produced by web crawlers, maximizes the benefit obtained per downloaded page and yields more by-products at a comparatively low cost. We implement the system architecture on the basis of a simple breadth-first crawler called 'Web Spider', although the system can be adapted to other crawling strategies. We report preliminary experimental results in Section 5, and conclusions and directions for future work are presented at the end of this paper.

1.1 Information Mining

Web information mining is a specially extended application of data mining techniques for managing the huge amount of information on the Internet. It is the process of scraping metadata from the Internet, analyzing it from different perspectives and summarizing it into useful information. It includes information extraction, information retrieval, natural language processing and document summarization.

Information mining can adopt some data mining techniques, but there are significant differences between the two. Information mining works with unstructured data, such as Web pages and text documents, in contrast to data mining, which operates on structured data such as relational tables.

1.2 Web Crawler

The huge amount of data on the Internet gave birth to web search engines, which are becoming more and more indispensable as the primary means of locating relevant information. Such search engines rely on massive collections of web pages that are acquired by web crawlers, also known as web robots or spiders.

A web crawler is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to automate maintenance tasks by scraping information automatically. Typically, a crawler begins with a set of given Web pages, called seeds, and follows all the hyperlinks it encounters along the way, eventually traversing the entire Web [3]. General crawlers insert the URLs into a tree structure and visit them in a breadth-first manner. There has been some recent academic interest in new types of crawling techniques, such as focused crawling based on the semantic web [6, 8], cooperative crawling [10], distributed web crawlers [7] and intelligent crawling [9], and the significance of soft computing comprising fuzzy logic (FL), artificial neural networks (ANNs), genetic algorithms (GAs) and rough sets (RSs) has been highlighted [11].

The behavior of a web crawler is the outcome of a combination of policies [4]:

- A selection policy that states which pages to download.
- A re-visit policy that states when to check pages for changes.
- A politeness policy that states how to avoid overloading websites.
- A parallelization policy that states how to coordinate distributed web crawlers.

To avoid repeated operations, crawlers need to keep a record of the web pages that have already been downloaded, typically in a hash table. This means that, after crawling, search engines store enormous numbers of pages in their databases. The harder task is that the crawling and storing work must be repeated periodically. Taking the most popular search engine, Google, as an example: in 2003 Google's crawler ran once a month, but it now crawls every two or three days. Crawling the massive number of pages at such a frequency carries a huge cost in network resources and storage. This is exactly the motivation of this paper: since we have to run a crawler that fetches numerous pages of data at an enormous cost in machine-hours and storage, why not take full advantage of it and try to get more useful information in the form of metadata, i.e., data about data?
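As an illustration only (not the authors' 'Web Spider' implementation, which is not listed in the paper), a minimal breadth-first crawl loop that records downloaded URLs in a hash table could look like the following C++ sketch; fetch_page and extract_links are hypothetical stand-ins for the real downloading and link-extraction routines:

    #include <queue>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Hypothetical helpers: download a page and extract its hyperlinks (stubs here).
    std::string fetch_page(const std::string& url) { return ""; }
    std::vector<std::string> extract_links(const std::string& page) { return {}; }

    void crawl(const std::vector<std::string>& seeds) {
        std::unordered_set<std::string> visited;   // hash table of already-downloaded URLs
        std::queue<std::string> frontier;          // breadth-first frontier
        for (const auto& seed : seeds)
            if (visited.insert(seed).second) frontier.push(seed);
        while (!frontier.empty()) {
            const std::string url = frontier.front();
            frontier.pop();
            const std::string page = fetch_page(url);     // download (and store) the page
            for (const auto& link : extract_links(page))
                if (visited.insert(link).second)          // skip URLs already recorded
                    frontier.push(link);
        }
    }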

2 RDF and RSS

This paper describes the design and implementation of an optimized distributed information mining system, taking as an example the application of scraping RSS (Really Simple Syndication) feeds, which are based on RDF, from the web.

The Resource Description Framework (RDF) is a general-purpose language for representing information on the Web. The RDF/XML specification defines an XML (Extensible Markup Language) syntax for RDF, called RDF/XML, in terms of Namespaces in XML, the XML Information Set and XML Base [5]. RDF allows the representation of rich metadata relationships beyond what is possible with the earlier flat-structured RSS.

Really Simple Syndication (RSS) is a standard format for describing and syndicating web information. It is a lightweight XML format designed for sharing headlines and handling other web content syndication, and it is widely used for Internet news, blogs and wikis. RSS is a format used to index information and metadata. For instance, not all Internet news content is free, but the metadata of the articles, such as title, author, link and abstract, is usually shared. RSS has therefore become the information platform for this metadata, and we can regard RSS as an efficient way to obtain and share web information. Figure 1 shows the main tags of the standard RSS 2.0 document format. By subscribing to RSS feeds, you can receive the newest information without any further action. That is the most important characteristic of RSS: syndication and aggregation. RSS has thus already become the most popular application of XML.

Figure 1. RSS 2.0 main tag tree representation.

Because RSS follows the XML standard format, we can parse RSS seed documents with the DOM (Document Object Model). The process of verifying an RSS document is divided into two steps:

1) The head of the document follows the RSS format.
2) The document can be built into a DOM and parsed successfully.

The detailed implementation is presented in the following sections.

3 Design Overview

3.1 Assumptions

In designing a web information mining system for our needs, we have been guided by assumptions that offer both challenges and opportunities and that are based on some preliminary observations.

The information mining system has to store a large amount of data and numerous files temporarily. Given the limited scale of our experimental equipment, we do not consider storage limits.

Because bandwidth is limited, we set a maximum response time for the Downloader so that the system can keep running continuously and normally. Timeouts, however, reduce the scraping speed, so high sustained bandwidth is more important than low latency. (A sketch of such a timeout-bounded download follows these assumptions.)

The system should be built from several components. Since fault tolerance and recovery are not the key problems addressed in this paper, we do not consider them here.
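The paper does not say which HTTP library the Downloader uses; purely as a hedged sketch of the maximum-response-time assumption above, a download bounded by a timeout could be written with libcurl roughly as follows:

    #include <curl/curl.h>
    #include <string>

    // Append received bytes to a std::string buffer.
    static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp) {
        static_cast<std::string*>(userp)->append(data, size * nmemb);
        return size * nmemb;
    }

    // Fetch a URL with a hard timeout; returns false on error or when the timeout is exceeded.
    bool download_with_timeout(const std::string& url, long timeout_sec, std::string& body) {
        CURL* curl = curl_easy_init();
        if (!curl) return false;
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
        curl_easy_setopt(curl, CURLOPT_TIMEOUT, timeout_sec);  // the longest allowed response time
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);    // follow server-side redirects
        CURLcode rc = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        return rc == CURLE_OK;
    }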

3.2 Architecture

This Information Mining System consists of four major kinds of components, the Crawler, the Information Mining Machine, the Filter and the Downloader, as shown in Figure 2. Each of these is typically a commodity computer running a user-level server process.


Figure 2. System architecture

In the system, the Crawler is used to scrape all kinds of web pages (html, xml, asp, jsp and so on) starting from a set of seed pages. The output of the Crawler is formatted with the attributes number, URL and Text (abstract information about the URL). Since the crawler itself is not the focus of our experimental setup, we do not describe its algorithm and detailed implementation in this paper. Note that 'Web Spider' parses pages only for hyperlinks, not for indexing terms, which would significantly slow down the application. The data is then sent to the Information Mining Machine, the key component of the system, which processes it with the help of the Filter; the detailed implementation is described in the following section. Finally, the Downloader is in charge of downloading the web pages in the list produced by the Information Mining Machine, scraping the metadata and storing it in the server database. To achieve high performance, meaning hundreds or even thousands of pages downloaded per second, the design of the cluster of Downloaders is quite important. For flexibility, the number of Downloaders is not fixed: we can add Downloaders to the system as needed to adapt to different experimental conditions and applications with a reasonable amount of work. Before downloading, the system automatically detects the number of Downloaders and the number of items in the output list.
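The exact layout of 'link.txt' is not specified beyond the three attributes; assuming, hypothetically, one tab-separated record per line, the Mining Machine could read the Crawler's output as in this sketch:

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    // One record of the Crawler's output: number, URL, Text.
    struct LinkRecord {
        long number = 0;
        std::string url;
        std::string text;   // abstract information about the URL
    };

    // Read 'link.txt', assuming tab-separated fields (an assumption, not the paper's stated format).
    std::vector<LinkRecord> read_link_file(const std::string& path) {
        std::vector<LinkRecord> records;
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream fields(line);
            LinkRecord rec;
            std::string number;
            if (std::getline(fields, number, '\t') &&
                std::getline(fields, rec.url, '\t') &&
                std::getline(fields, rec.text)) {
                rec.number = std::stol(number);
                records.push_back(rec);
            }
        }
        return records;
    }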

To guarantee the accuracy of the Information Mining System, after downloading a page file successfully the Downloader checks the file again to make sure that it is a valid RSS feed. Since all the work of parsing an XML file can be done by building a DOM, we can judge an RSS file by checking whether it can be built into a valid DOM structure. At the same time, the system scrapes metadata such as title, link and date through the DOM interfaces and stores it in the database.
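The paper does not name its DOM implementation; with TinyXML2 standing in as an assumption, validating a downloaded feed by building a DOM and scraping the per-item metadata could look roughly like this (only the RSS 2.0 rss/channel/item layout of Figure 1 is handled here):

    #include <tinyxml2.h>
    #include <string>
    #include <vector>

    struct FeedItem { std::string title, link, date; };

    // Return false if the file cannot be built into a valid DOM or lacks an
    // <rss><channel> root; otherwise collect title, link and pubDate of each <item>.
    bool scrape_rss(const std::string& xml, std::vector<FeedItem>& items) {
        tinyxml2::XMLDocument doc;
        if (doc.Parse(xml.c_str()) != tinyxml2::XML_SUCCESS) return false;
        const tinyxml2::XMLElement* rss = doc.FirstChildElement("rss");
        const tinyxml2::XMLElement* channel = rss ? rss->FirstChildElement("channel") : nullptr;
        if (!channel) return false;
        for (const tinyxml2::XMLElement* item = channel->FirstChildElement("item");
             item != nullptr; item = item->NextSiblingElement("item")) {
            FeedItem fi;
            if (const auto* e = item->FirstChildElement("title"))   fi.title = e->GetText() ? e->GetText() : "";
            if (const auto* e = item->FirstChildElement("link"))    fi.link  = e->GetText() ? e->GetText() : "";
            if (const auto* e = item->FirstChildElement("pubDate")) fi.date  = e->GetText() ? e->GetText() : "";
            items.push_back(fi);   // in the real system these fields would be stored in the database
        }
        return true;
    }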

4 Information Mining Machine

The mining machine component, implemented in C++, traverses the items listed in the file 'link.txt' in the data flow. It is convenient to scrape the links we need with regular expressions. RSS, for example, is a special XML file, an XML application that conforms to the W3C's RDF Specification and is extensible via XML namespaces and/or RDF-based modularization [12]. So we first define a regular expression for links ending in '.xml':

Exp(RSS)={,(.*)(?=\.xml),} (1)

After some experiments, we found that: 1) some web pages (html, xml, asp, jsp, php, ...) are redirected by their servers from a non-RSS link to an RSS link automatically; and 2) some URL directories point directly to an RSS link. For example, the following URL actually points at an RSS file about news:

http://www.unicornblog.cn/rss2.asp

Although it looks like an asp web page, it is by default redirected to an RSS file. So if we scrape only XML files, we will miss a lot of RSS seeds. We therefore redefine the regular expression as follows:

Exp(RSS)={,(.*)[(?=\.xml)|(?=\.asp)|(?=\.jsp)|(?=\.php)],} (2)

If a URL's format matches regular expression (2), the information mining machine inserts it into the list of potential handling targets. This handling list is then sent to the Filter through the data flow.
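Expression (2) above is written in the paper's own notation; an approximate equivalent for building the handling list with std::regex (case-insensitive matching is our assumption) is sketched below:

    #include <regex>
    #include <string>
    #include <vector>

    // Keep URLs ending in .xml, .asp, .jsp or .php, approximating expression (2).
    std::vector<std::string> build_handling_list(const std::vector<std::string>& urls) {
        static const std::regex rss_like(R"(.*\.(xml|asp|jsp|php)$)", std::regex::icase);
        std::vector<std::string> handling;
        for (const auto& url : urls)
            if (std::regex_match(url, rss_like))
                handling.push_back(url);
        return handling;
    }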

Empirically, execution time grows roughly linearly, because all the work is done by traversing the whole document and parsing it at different levels of detail.

Here, the challenge is to avoid traversing and over-parsing as far as possible. Thus, in our system we designed the component called the Filter to cooperate with the information mining machine and deal with this problem. Before fetching the valuable information hidden in unstructured web pages, the Filter pre-inspects these documents and reports to the Information Mining Machine which links are most likely to be RSS files and which are not. First, the Filter downloads the files for the links received from the Mining Machine into the system cache and reads only the first 50 bytes of each page, then checks whether these 50 bytes follow the RSS 1.0 standard (for details of RSS 1.0 see [13]). In RSS 1.0, all RSS files begin with the following format:

<?xml version="1.0" encoding="utf-8"?>

Of course, there are other encoding standards such as GB2312 and UTF-16. We again use a regular expression to check whether the first 50 bytes of a file follow the RSS 1.0 standard. If the result is TRUE, the Filter returns the link of the page to the Information Mining Machine; if not, the link is filtered out without any further unnecessary operation.
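As a rough sketch of the Filter's pre-inspection (interpreting the check above as reading the first 50 bytes of the cached file, and covering only the encodings mentioned), the test could be written as follows; a genuinely UTF-16-encoded file would first need byte-order handling before an ASCII regex can see its declaration:

    #include <regex>
    #include <string>

    // Pre-inspect the first 50 bytes of a cached page: does it begin with an XML
    // declaration whose encoding is utf-8, utf-16 or gb2312?
    bool looks_like_rss(const std::string& cached_page) {
        const std::string head = cached_page.substr(0, 50);
        static const std::regex decl(
            R"rx(^\s*<\?xml\s+version="1\.0"\s+encoding="(utf-8|utf-16|gb2312)"\s*\?>)rx",
            std::regex::icase);
        return std::regex_search(head, decl);
    }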

5 Experimental Results and Analysis

We present the preliminary experimental results and experience here and give some brief analysis. A detailed analysis of performance bottlenecks and scaling behavior is beyond the scope of this paper and would require a fast simulation testbed, since it would not be possible (or appropriate) to do such a study over our current Internet connection.

5.1 Experimental Results on Step 1

Since RSS is widely used for web news, blogs, wikis and so on, the initial seed links for the Crawler in our experiments should cover as many of these areas as possible. Because of our experimental conditions, the scope we can cover on the Internet is very limited, so choosing the 'right' seed link matters and keeps the system running more efficiently. According to our analysis, a seed page that is full of links can increase the mining hit rate.

In Step 1, we chose the following URLs as seed links for the Crawler and ran it on each of them for comparison:

Blogged.com: a popular blog discovery site.

Techcrunch: one of the most famous weblogs.

TABLE 1. Statistical results of the system on Step 1

                                 Blogged         Techcrunch
Visiting manner                  Breadth-first   Breadth-first
Height of tree structure         5               5
Downloaders                      1               1
Run time                         12              12
Link.txt requests                40100           52914
Read-timeout-exceeded errors     2396            2675
'.xml'-ending files              41              5
Directory / automatic jump       190             455
Valid RSS seeds                  231             460

5.2 Experimental Results on Step 2

In Step 1 of the experiment we identified the link 'http://findarticles.com/p/articles/?sm=rss' from BNET, which points to an RSS resource map page full of Internet news, and chose it as the seed for Step 2. After three Downloaders had run for 100 hours, the number of hyperlinks in the 'link.txt' request list was 105025, including 101872 valid URLs. The trend of the RSS information mining speed for one of the Downloaders is shown in Figure 3. The graph reveals that the number of valid RSS seeds scraped by the Information Mining Machine grows approximately linearly with execution time; the flat parts of the trend are related to the link structure of the web site.

Figure 3. Scraping trend.

In the end we scraped 2312 RSS feeds; after they were sent to the Filter, 2007 valid RSS feeds remained. The harvest rate is therefore about 0.3345 feeds per minute (2007 feeds over 100 hours, i.e., 2007 / 6000 minutes), which is limited by the bandwidth.

6 Future Work

We have described the Information Mining System, a distributed system for finding valuable structured metadata hidden in thousands of millions of unstructured web documents, and presented preliminary experiments along with some brief analysis. There are obviously improvements that can be made to this Information Mining System. A major open issue for future work is a detailed solution for increasing the harvest rate of our information mining system: although the harvest rate is tightly related to the bandwidth, we can still optimize the system architecture to improve it. Since complete web crawling coverage cannot be achieved, due to the vast size of the Internet and to resource availability, our system cannot scrape all the RSS feeds, so increasing the coverage rate is another task.

In future work we will monitor the RSS seeds and define measurement standards for them, such as life cycle and freshness, much like the measures applied to real seeds in the natural world. This will be a completely new way of looking at RSS seeds, but it is absolutely necessary for handling millions of them. In addition, we will improve the Downloader component with supervised learning to increase the harvest rate of RSS scraping: with guidance learned from sample data, the Downloader can easily decide which pages to download selectively. Last but not least, in order to manage the RSS seeds obtained from the mining system efficiently, a way of evaluating them should be considered. Together, these improvements will make the system more realistic, more reliable and friendlier to users.

References

[1] UC Berkeley School of Information Management and Systems, "How Much Information? 2003", http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm, 2003.

[2] Internet World Stats, "Usage and Population Statistics", http://www.internetworldstats.com, Dec. 2007.

[3] Sankar K. Pal and Varun Talwar, "Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions," IEEE Transactions on Neural Networks, vol. 13, no. 5, Sep. 2002.

[4] Vladislav Shkapenyuk and Torsten Suel, "Design and Implementation of a High-Performance Distributed Web Crawler," Proceedings of the 18th International Conference on Data Engineering (ICDE'02), 2002.

[5] W3C, "RDF/XML Syntax Specification (Revised)", http://www.w3.org/TR/rdf-syntax-grammar/, Feb. 2004.

[6] Vincenzo Loia, Witold Pedrycz, and Sabrina Senatore, "Semantic Web Content Analysis: A Study in Proximity-Based Collaborative Clustering," Proceedings of the 18th International Conference on Data Engineering (ICDE'02), 2002.

[7] Vladislav Shkapenyuk and Torsten Suel, "Design and Implementation of a High-Performance Distributed Web Crawler," Proceedings of ICDE'02, 2002.

[8] C. C. Aggarwal, F. Al-Garawi, and P. S. Yu, "Intelligent Crawling on the World Wide Web with Arbitrary Predicates," Proceedings of the 10th International World Wide Web Conference, May 2001.

[9] Sengor Altingovde and Ozgur Ulusoy, "Exploiting Interclass Rules for Focused Crawling," IEEE Intelligent Systems, 2004.

[10] Marina Buzzi, "Cooperative Crawling," Proceedings of the First Latin American Web Congress (LA-WEB 2003), 2003.

[11] Sankar K. Pal and Varun Talwar, "Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions," IEEE Transactions on Neural Networks, vol. 13, no. 5, Sep. 2002.

[12] S. Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data," Morgan Kaufmann, 2003.

[13] W3C, "RDF Site Summary (RSS) 1.0," http://web.resource.org/rss/1.0/spec, 2000.