Automatic classification of Web resources using Java and Dewey Decimal Classification



Computer Networks and ISDN Systems 30 (1998) 646-648

Short Paper

Abstract

The Wolverhampton Web Library¹ (WWLib) is a World Wide Web search engine that provides access to UK based information. The experimental version, developed in 1995, was a success but highlighted the need for a much higher degree of automation. An interesting feature of the experimental WWLib was that it organised information according to Dewey Decimal Classification (DDC) [1]. This paper discusses the advantages of classification and describes the automatic classifier that is being developed in Java as part of the new, fully automated WWLib. © 1998 Published by Elsevier Science B.V. All rights reserved.

Keywords: Search; Retrieval; Classification

1. Introduction

The advantages of document clustering and classification over keyword based indices have been debated in Information Retrieval (IR) research for quite some time. Documents that share the same frequently occurring keywords and concepts are usually relevant to the same queries. Clustering such documents together enables them to be retrieved together more easily and helps to avoid the retrieval of irrelevant, unrelated information. Another advantage is that classification usually makes it possible to browse through a hierarchy of logically organised information, which is often considered a more intuitive process than constructing a query string. Keyword indices are, however, comparatively simple to construct automatically. Consequently, classification is usually associated with human defined metadata or catalogue entries.

The evolution of automated World Wide Web search engines from manually maintained classified lists and directories has further demonstrated the strengths and weaknesses of these two approaches. The tendency of automated search engines to inundate users with irrelevant results has prompted reconsideration of the merits of classification. The combination of automation and classification has the potential to provide an accurate, intuitive, comprehensive classified search engine. This is the aim of WWLib.

* Corresponding author. E-mail: ex1253@wlv.ac.uk
¹ http://www.scit.wlv.ac.uk/wwlib/

2. WWLib

The original version of WWLib relied to a large degree on manual maintenance and as such can best be described as a classified directory that was organised according to DDC. The use of DDC to organise WWLib evolved from the notion that Library Science has a lot to offer the chaotic task of information resource discovery on the Web. The classified nature of WWLib was beneficial in that it clustered documents according to subject matter and enabled users to browse through documents that shared the same DDC classmark as those that appeared in the results of a query.


Fig. 1. Overview of the new WWLib architecture.

It was soon evident, however, that WWLib required a much higher degree of automation. A robot for resource discovery and an automatic indexer were required, but the automated WWLib would preserve its classified nature by employing an automatic classifier. An outline design of the new automated WWLib, shown in Fig. 1, identifies the automated components and their responsibilities.

There are six automated components:

(1) A Spider that automatically retrieves documents from the Web;

(2) An Archiver that receives Web pages from the Spider, stores a local copy, assigns to it a unique accession number and generates a new metadata template. It also distributes local copies to the Extractor, Classifier and Builder and adds subsequent metadata generated by the Classifier and the Builder to the assigned metadata template;

(3) An Extractor that analyses pages provided by the Archiver for embedded hyperlinks to other documents. If found, URLs are passed to the Archiver, where they are evaluated to check that they are pointing to locations in the UK before being passed to the Spider;

(4) A Classifier that analyses pages provided by the Archiver and generates DDC classmarks;

(5) A Builder that analyses pages provided by the Archiver and outputs metadata, which is stored by the Archiver in the document's metadata template and is also used to build the index database that will be used to quickly associate keywords with document accession numbers;

(6) A Searcher that accepts query strings from the user, uses them to interrogate the index database built by the Builder, uses the resulting accession numbers to retrieve the appropriate metadata templates and local document copies, and then uses all this information to generate detailed results, ranked according to relevance to the original query.

One of the reasons for deciding on such a componentised architecture was to allow for components to be distributed over a network if necessary.
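As a rough illustration of how the Archiver and Extractor might divide their work, the following Java sketch assigns accession numbers and forwards only UK URLs to the Spider. The class and method names are hypothetical, and the rule that a UK location is a host name ending in ".uk" is an assumption made for this example; the paper does not say how WWLib performs the check.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: all names and the ".uk" rule are assumptions, not WWLib code.
class ArchiverSketch {
    private final AtomicLong nextAccession = new AtomicLong(1);
    private final List<String> spiderQueue = new ArrayList<>();

    // Assign a unique accession number to a newly stored page.
    long assignAccessionNumber() {
        return nextAccession.getAndIncrement();
    }

    // Assumed rule: a URL points to the UK if its host name ends with ".uk".
    static boolean pointsToUk(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && host.toLowerCase(Locale.ROOT).endsWith(".uk");
        } catch (IllegalArgumentException e) {
            return false; // malformed URLs are rejected
        }
    }

    // URLs found by the Extractor are queued for the Spider only if they pass the UK check.
    void acceptFromExtractor(List<String> urls) {
        for (String url : urls) {
            if (pointsToUk(url)) {
                spiderQueue.add(url);
            }
        }
    }

    List<String> spiderQueue() {
        return spiderQueue;
    }

    public static void main(String[] args) {
        ArchiverSketch archiver = new ArchiverSketch();
        System.out.println("Accession number: " + archiver.assignAccessionNumber());
        archiver.acceptFromExtractor(
                List.of("http://www.scit.wlv.ac.uk/wwlib/", "http://example.com/"));
        System.out.println("Queued for the Spider: " + archiver.spiderQueue());
    }
}
```

Keeping the UK check inside the Archiver mirrors the description above, in which the Extractor only finds links and the Archiver decides what reaches the Spider.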

3. The classifier

Many automated search engines have deployed traditional IR indexing strategies and retrieval mechanisms, but very few have experimented with automatic classification. Previous experimentation with automatic classification was carried out during the development of the original WWLib. This original classifier [2] compared text in each document with entries in a DDC thesaurus file. The thesaurus entries consisted of the DDC classmark and accompanying header text, e.g. 641.568 Cooking for special occasions Including Christmas

The original classifier achieved approximately 40 percent accuracy. For the new version of the classifier, it was decided that a much more detailed thesaurus with a long list of keywords and synonyms for each classmark was required. These lists are referred to as class representatives. It was also decided that more use would be made of the hierarchical nature of DDC. The classifier would begin by matching documents against very broad class representatives representing each of the ten DDC classes at the top of the hierarchy: 000 Generalities; 100 Philosophy, paranormal phenomena and psychology; 200 Religion; 300 Social sciences; 400 Language; 500 Natural sciences and mathematics; 600 Technology; 700 The arts; 800 Literature and rhetoric; 900 Geography, history, and auxiliary disciplines.

The matching process would then proceed recursively down through the subclasses of those DDC classes that were found to have a significant measure of similarity with the document. A filtering effect is achieved using customised class representatives at each node. Ambiguous terms are concealed within lower nodes of the classification hierarchy, enabling them to be considered in context.
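The filtering effect of the recursive descent can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the WWLib classifier: the class name, the threshold, the terms in each class representative and the two-level example hierarchy are all invented, and the similarity measure is a simple stand-in (the fraction of representative terms that occur in the document).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of hierarchical matching; names, terms and threshold are assumptions.
class DdcNodeSketch {
    final String classmark;
    final Set<String> representative; // class representative terms for this node
    final List<DdcNodeSketch> subclasses = new ArrayList<>();

    DdcNodeSketch(String classmark, Set<String> representative) {
        this.classmark = classmark;
        this.representative = representative;
    }

    DdcNodeSketch add(DdcNodeSketch child) {
        subclasses.add(child);
        return this;
    }

    // Stand-in similarity: fraction of representative terms found in the document.
    double similarity(Set<String> documentTerms) {
        if (representative.isEmpty()) {
            return 0.0;
        }
        long shared = representative.stream().filter(documentTerms::contains).count();
        return (double) shared / representative.size();
    }

    // Descend only into nodes that match; collect the classmarks of matching leaves.
    static void classify(List<DdcNodeSketch> nodes, Set<String> documentTerms,
                         double threshold, List<String> matches) {
        for (DdcNodeSketch node : nodes) {
            if (node.similarity(documentTerms) < threshold) {
                continue; // prune this branch: its subclasses are never consulted
            }
            if (node.subclasses.isEmpty()) {
                matches.add(node.classmark);
            } else {
                classify(node.subclasses, documentTerms, threshold, matches);
            }
        }
    }

    // Invented two-level hierarchy in which "java" appears only at the lower nodes.
    static List<DdcNodeSketch> exampleHierarchy() {
        DdcNodeSketch generalities =
                new DdcNodeSketch("000", Set.of("computer", "software", "programming"))
                        .add(new DdcNodeSketch("005", Set.of("java", "programming", "language")));
        DdcNodeSketch geography =
                new DdcNodeSketch("900", Set.of("history", "geography", "island"))
                        .add(new DdcNodeSketch("959", Set.of("java", "indonesia")));
        return List.of(generalities, geography);
    }

    public static void main(String[] args) {
        List<String> matches = new ArrayList<>();
        classify(exampleHierarchy(), Set.of("java", "programming", "software", "applet"),
                0.5, matches);
        System.out.println(matches);
    }
}
```

Because the ambiguous term "java" appears only in the lower-level representatives, it is compared with the document only after a broader class has already matched, which is the sense in which ambiguous terms are considered in context.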

4. Design and implementation

The classifier has two main processes: firstly the document is indexed; secondly the document is classified. The indexing process results in the formation of a document object. The document object comprises a number of keyword objects, each one representing a word found within the document. Keywords have a weight, assigned according to where the word was found, and a position associated with them.
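The indexing step might look something like the following Java sketch, in which a document object is built from keyword objects that carry a weight and a position. The class names, the tokenisation rule and the weights given to title and body text are assumptions made for illustration; the paper does not specify them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical keyword object: a word with a weight and a position.
class KeywordSketch {
    final String word;
    final double weight;
    final int position;

    KeywordSketch(String word, double weight, int position) {
        this.word = word;
        this.weight = weight;
        this.position = position;
    }
}

// Hypothetical document object made up of keyword objects.
class DocumentSketch {
    static final double TITLE_WEIGHT = 3.0; // assumed value
    static final double BODY_WEIGHT = 1.0;  // assumed value

    final List<KeywordSketch> keywords = new ArrayList<>();

    // Index the title first, then the body; positions count words from the start.
    static DocumentSketch index(String title, String body) {
        DocumentSketch doc = new DocumentSketch();
        int next = doc.addWords(title, TITLE_WEIGHT, 0);
        doc.addWords(body, BODY_WEIGHT, next);
        return doc;
    }

    // Split text into lower-case alphanumeric tokens and record each as a keyword.
    private int addWords(String text, double weight, int start) {
        int position = start;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            keywords.add(new KeywordSketch(token, weight, position++));
        }
        return position;
    }

    public static void main(String[] args) {
        DocumentSketch doc = index("Java Classifier", "The classifier uses Java.");
        for (KeywordSketch k : doc.keywords) {
            System.out.println(k.position + " " + k.word + " " + k.weight);
        }
    }
}
```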

The classification process uses a classify object, which takes the newly formed document object and compares it with a number of DDC objects. DDC objects inherit their structure and behaviour from an abstract class Linked. They too are made up of a series of weighted keyword objects that together make up the class representative. Each DDC object has a classmark object specifying its Dewey Decimal classmark, and can have up to ten subclasses, which are themselves DDC objects representing the next layer of the hierarchy. The classify object begins by comparing the document object with the ten DDC objects representing the top of the DDC hierarchy. If the document matches significantly with a DDC object, instances of that DDC object's subclasses are created and the document is compared with those. This process continues recursively down the hierarchy until a significant match is found with a leaf node (a DDC object with no subclasses). In this event the classmark object belonging to the DDC object is copied into the document object. Measures of similarity are calculated using the Dice Coefficient [3]. The indexing and classification processes are co-ordinated by the Ace (Automatic classification engine) object.
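For two sets of terms X and Y, the Dice coefficient is 2|X ∩ Y| / (|X| + |Y|), which ranges from 0 (no shared terms) to 1 (identical sets). The Java sketch below computes this unweighted, set-based form. The class and method names are hypothetical, and how the classifier combines keyword weights with the coefficient is not described here, so weights are ignored in this example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the set-based Dice coefficient; keyword weights are ignored.
class DiceSketch {
    // Returns 2|X ∩ Y| / (|X| + |Y|), or 0.0 when both sets are empty (a convention chosen here).
    static double dice(Set<String> x, Set<String> y) {
        if (x.isEmpty() && y.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(x);
        shared.retainAll(y); // keep only the terms present in both sets
        return 2.0 * shared.size() / (x.size() + y.size());
    }

    public static void main(String[] args) {
        Set<String> document = Set.of("java", "applet", "network");
        Set<String> representative = Set.of("java", "network", "protocol");
        System.out.println(dice(document, representative));
    }
}
```

In the example, two of the three terms in each set are shared, so the coefficient is 2 × 2 / (3 + 3) = 2/3.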

The classifier has been implemented in Java. This has enabled easy networking, multithreading and memory management.

5. Conclusion

The new classifier is in the early stages of evaluation. It appears, however, that use of a hierarchical classifier results in context sensitive classifications. The use of manually defined class representatives that perform context sensitive filtering encourages accuracy. To increase the accuracy of the classifier further, a more comprehensive set of DDC class representatives is required. When sufficient DDC classes have been defined, formal testing will be required to prove that the new classifier is achieving a higher rate of accurate classifications than the original one. There is a working paper² that describes the design and implementation of the classifier in more detail.

References

[1] L. Mai Chan, J.P. Comaromi, J.S. Mitchell and M.P. Satija, Dewey Decimal Classification: A Practical Guide, Forest Press, ISBN 0-910608-55-5, 1996.

[2] J. Wallis, P. Burden, Towards a classification-based approach to resource discovery on the web, University of Wolverhampton, 1995, http://www.scit.wlv.ac.uk/wwlib/position.html

[3] C.J. van Rijsbergen, Information Retrieval: Second Edition (Chapter 3, http://www.dcs.glasgow.ac.uk/Keith/Chapter.3/Ch.3.html), Butterworths, London, ISBN 0-408-10775-8, 1981.

² Automatic classification of Web resources using Java and Dewey Decimal Classification, working paper, http://www.scit.wlv.ac.uk/~ex1253/classifier/