
The Web-DL Environment for Building Digital Libraries from the Web

Pavel P. Calado, Marcos A. Gonçalves, Edward A. Fox, Berthier Ribeiro-Neto, Alberto H. F. Laender,
Altigran S. da Silva, Davi C. Reis, Pablo A. Roberto, Monique V. Vieira, Juliano P. Lage

Federal University of Minas Gerais, Dep. of Computer Science, 31270-901, Belo Horizonte, MG, Brazil
{pavel, alti, berthier, laender, palmieri, davi, pabloa, monique}@dcc.ufmg.br

Virginia Tech, Dep. of Computer Science, Blacksburg, VA 24061, USA
{mgoncalv, fox}@vt.edu

Federal University of Amazonas, Dep. of Computer Science, 69077-000, Manaus, AM, Brazil
[email protected]

Abstract

The Web contains a huge volume of unstructured data, which is difficult to manage. In digital libraries, on the other hand, information is explicitly organized, described, and managed. Community-oriented services are built to attend specific information needs and tasks. In this paper, we describe an environment, Web-DL, that allows the construction of digital libraries from the Web. The Web-DL environment allows us to collect data from the Web, standardize it, and publish it through a digital library system. It provides support for the services and organizational structure normally available in digital libraries, while benefiting from the breadth of the Web contents. We experimented with applying the Web-DL environment to the Networked Digital Library of Theses and Dissertations (NDLTD), thus demonstrating that the rapid construction of DLs from the Web is possible. Web-DL also provides a large-scale solution for interoperability between independent digital libraries.

1. Introduction

The Web contains a huge volume of information. Almost all of it is stored in the form of unstructured data and is, therefore, difficult to manage. Access to the information is granted through browsing and searching, which normally involve no assumptions about the users' tasks or their specific information needs. On the other hand, we have databases, where data has a rigid structure and services are provided for specialized users. Digital libraries (DLs) stand in the middle. We can say that DL users have broader interests than database users, but more specific interests than regular Web users. Also, within DLs information is explicitly organized, described, and managed, targeted for communities of users with specific information needs and tasks, but without the rigidity of database systems.

In this paper we present Web-DL, an environment that allows the construction of digital libraries from the Web. Web-DL allows us to collect data from Web pages, normalize it to a standard format, and store it for use with digital library systems. By using standard protocols and archival technologies, Web-DL enables open, organized, and structured access to several heterogeneous and distributed digital libraries, as well as the easy incorporation of powerful digital library and data extraction tools. The overall environment thus supports the services and organization available in digital libraries, while benefiting from the breadth of the Web contents.

By moving from the Web to a DL, we provide quality services for communities of users interested in specific domain information. Services like searching over several different DLs, browsing, and recommending are made available with high quality, since we reduce the search space, restricting it to the data related to the users' interests, and structure and integrate such data through canonical metadata standards.

We demonstrate the feasibility of our approach by implementing the proposed environment for a digital library of electronic theses and dissertations (ETDs), in the context of the Networked Digital Library of Theses and Dissertations (NDLTD). The NDLTD currently has over 160 members among universities and research institutions, providing support for the implementation of DL services using standard protocols, but is deficient in dealing with members that publicize their ETDs only through the Web. Fortunately, our approach matches the growing tendency among sites that publish ETDs to create a Web page for each ETD, containing all the relevant data (or metadata). Using our proposal, we will be able to add such ETDs to the NDLTD collection with little user effort.

The Web-DL environment builds upon tools and techniques for collecting Web pages, described in [10], extracting semi-structured data, described in [6, 14], and managing digital libraries, described in [12]. In this paper we show how these tools are seamlessly integrated under Web-DL and extended to provide solutions for the data normalization problems usually found when extracting data from the Web. Experiments performed in the context of the NDLTD confirm the quality of the results reported in [4], now obtained with a more general solution and less user effort, since the data extraction process has been further automated.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 presents an overview of the architecture proposed for the Web-DL environment. Sections 4, 5, and 6 describe the main components of Web-DL: the ASByE, DEByE, and MARIAN tools, respectively. Section 7 presents our approach to the Web data normalization problem. Section 8 shows an example digital library built using Web-DL. Finally, in Section 9 we discuss some of the problems found and present our conclusions.

2. Context and related work

Digital libraries involve rich collections of digital objects and community-oriented specialized services such as searching, browsing, and recommending. Many DLs are built as federations of autonomous, possibly heterogeneous DL systems, distributed across the Internet [8, 17]. The objective of such federations is to provide users with a transparent, integrated view of their collections and information services. Challenges faced by federated DLs include interoperability among different digital library systems/protocols, resource discovery (e.g., selection of the best sites to be searched), issues in data fusion (merging of results into a unique ranked list), and aspects of quality of data and services.

One such federated digital library is the Networked Digital Library of Theses and Dissertations (NDLTD) [7], an international federation of universities, libraries, and other supporting institutions focused on efforts related to electronic theses and dissertations (ETDs). Although providing many of the advantages of a federated DL, NDLTD has particular characteristics that complicate interoperability and transparent resource discovery across its members. For instance, institutions are autonomous, each managing most services independently and not being required to report either collection updates or changes to central coordinators. Also, not all NDLTD members (yet) support the same standards or protocols. The diversity in terms of natural language, metadata, protocols, repository technologies, character coding, nature of the data (structured, semi-structured, unstructured, multimedia), as well as user characteristics and preferences, makes them quite heterogeneous. Finally, NDLTD already has many members and will eventually aim at supporting all those that produce ETDs. New members are constantly added and there is a continuing flow of new data, as theses and dissertations are submitted.

In DL cases like NDLTD, there are basically three approaches to interoperability and transparent resource discovery. They differ in the amount of standardization or effort required by the DL [19], as follows:

- Federated services: In this approach to interoperability, a group of organizations decide that their services will be built according to a number of agreed-upon specifications, normally selected from formal standards. The work of forming a federation is the effort required by each organization to implement and keep current with all the agreements. This normally does not provide a feasible solution in a dynamic environment such as the NDLTD.

- Harvesting: A difficulty in creating large federations is increasing motivation. So, some recent efforts aim at creating looser groupings of digital libraries. The underlying concept is that the participants make some small efforts to enable some basic shared services, without specifying a complete set of agreements. The best example is illustrated by the Open Archives Initiative (OAI) [16], which promotes the use of Dublin Core as a standard metadata format and defines a simple standard metadata harvesting protocol. Metadata from DLs implementing the protocol can be harvested to central repositories upon which DL services can be built. Particularly in the case of OAI, there is an initial impedance to its implementation by some archives, since it involves small amounts of coding and building of middleware layers, especially for local repositories that sometimes do not match the OAI infrastructure very well, such as, for example, repositories based on the Z39.50 protocol. Further, very small archives may lack staff resources to install and maintain a server. Moreover, some archives will not take any active steps to open their contents at all, making gathering, the next approach, the only available option.

- Gathering: If the various organizations are not prepared to cooperate in any formal manner, a base level of interoperability is still possible by gathering openly accessible information. The best example of gathering is via Web search engines. Because there is minimal staff cost, gathering can provide services that embrace large numbers of digital libraries, but the services are of poorer quality than those that can be achieved by partners who cooperate more fully. This is mainly due to the quality of the data that can be gathered, including its lack of structure and the absence of provenance information.

For NDLTD, a combination of federated search (for a small number of members with Z39.50 support), harvesting (from institutions that agree to use a set of standard protocols), and gathering (from institutions that cannot, or do not want to, use such protocols) is the best solution.
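
To make the harvesting option concrete, the sketch below shows what a minimal OAI-PMH client looks like: it issues a ListRecords request with the oai_dc metadata prefix and follows resumption tokens until the result set is exhausted. The base URL is a hypothetical placeholder, and the Web-DL environment relies on MARIAN's own harvesters rather than on this code; this is only an illustration of the protocol.

    # Minimal OAI-PMH harvesting sketch (standard library only).
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    def list_records(base_url, metadata_prefix="oai_dc"):
        """Yield (identifier, title) pairs from an OAI-PMH repository,
        following resumption tokens until the result set is exhausted."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                tree = ET.parse(response)
            for record in tree.iter(OAI + "record"):
                identifier = record.findtext(OAI + "header/" + OAI + "identifier")
                title = record.findtext(".//" + DC + "title")
                yield identifier, title
            # OAI-PMH paginates large result sets via resumption tokens.
            token = tree.findtext(".//" + OAI + "resumptionToken")
            if not token:  # empty or absent token: no more pages
                break
            params = {"verb": "ListRecords", "resumptionToken": token}

    # Hypothetical usage:
    # for oai_id, title in list_records("http://example.edu/oai"):
    #     print(oai_id, title)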

Although the problem of quality with Web data is well known, many have collected data from the Web in order to develop collections of suitable size for various DL-like systems. The Harvest system, one of the first systems to apply focused gathering, had simple HTML-aware extraction tools [3]. PhysNet [20], a project to collect Physics information from the Web, still uses Harvest. The New Zealand Digital Library (http://www.nzdl.org) has been developing collections since 1995 based on content distributed over the Internet. Recent enhancements to the Greenstone system provide additional support, but require the manual construction and programming of wrappers, called plugins and classifiers [21]. Taking a different approach, the CiteSeer system [18] collects scientific publications from the Web and automatically extracts citation information. The data extraction process, however, is specific to identifying author, title, citations, and other fields common to scientific papers. Similarly, Bergmark [2] proposes the use of clustering techniques to collect pages on scientific topics from the Web, but does not approach the issue of how to extract relevant data from such pages. Nevertheless, these works show that, with sufficient manual intervention, useful services can be built with data from the Web.

In the following, we present the architecture of the Web-DL environment, which (1) combines harvesting and gathering to broaden the scope of interoperability in federated digital libraries, and (2) provides a framework to integrate a number of technologies, such as focused crawling, data extraction, and digital library toolkits. Ultimately, Web-DL provides an infrastructure for building high-quality digital libraries from Web contents. We illustrate the usefulness of our approach by using the Web-DL environment to integrate data from OAI and non-OAI-compliant members of NDLTD.

3. The Web-DL environment architecture

To build an archive from the Web, data must be collected from Web sites and integrated into a DL system. This operation has three main steps: (1) crawl the Web sites to collect the pages containing the data, (2) parse the collected pages to extract the relevant data, and (3) make the data available through a standard protocol. Figure 1 shows the Web-DL environment and architecture for the integration and building of a digital library from the Web.

Collecting Web pages with the target information is done by using the ASByE tool, described in detail in Section 4. After providing ASByE with a simple navigation example, a Web crawler is created for the site. This crawler collects all the relevant pages, leaving them available for data extraction.

Collected pages must then be parsed to extract the relevant data. This is accomplished by the DEByE tool, described in detail in Section 5. Given one or more example pages, DEByE is able to create a wrapper for the site to be collected. The site pages are then parsed by DEByE-generated wrappers and the data is extracted and stored locally in a relational database.

In order to be used by most digital library systems (in our case, the MARIAN system [12]), data must be stored in a structured way (e.g., MARC or XML), usually using community-oriented semantic standards (e.g., Dublin Core, or FGDC for geospatial data). In the work reported in this paper, we use ETD-MS, a metadata standard for electronic theses and dissertations [1], which builds upon Dublin Core. Nonetheless, since data in Web sites is frequently in non-standard, non-structured formats, we need some normalization procedure. Our approach to normalizing the extracted data is described in Section 7. This approach presents a more general solution than the one proposed in [4], allowing Web-DL to be easily used in different domains.

After the data in ETD-MS format is stored, an OAI server set up on top of the local database makes it available to anyone using the OAI Protocol for Metadata Harvesting (OAI-PMH); in our particular case, to the MARIAN system. The MARIAN system, described in Section 6, uses an OAI harvester to collect the metadata provided by DEByE, extracted from the Web pages. This data is stored in a union archive, using MARIAN's indexing modules. Regular DL services are made available to users through the union archive created by MARIAN.
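
The sketch below summarizes this pipeline in code. All names (crawl_agent, wrapper, filters, db) are illustrative stand-ins for the ASByE-generated agent, the DEByE-generated wrapper, the normalization filters of Section 7, and the local database; none of them are the tools' actual interfaces.

    def build_collection(site_entry_url, crawl_agent, wrapper, filters, db):
        """Hypothetical glue code for the three Web-DL steps."""
        for page in crawl_agent.collect(site_entry_url):   # (1) crawl target pages
            record = wrapper.extract(page)                 # (2) extract raw metadata
            for field, value in record.items():            # (3) normalize to ETD-MS
                for apply_filter in filters.get(field, []):
                    value = apply_filter(value)
                record[field] = value
            # Stored records are later exposed through an OAI server
            # and harvested into MARIAN's union archive.
            db.store(record)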

The following sections describe in detail all the mechanisms used to build the architecture proposed here.

Page 4: Calado03web R

Figure 1. Proposed architecture for the Web-DL environment. (The diagram shows Web ETD sites, in HTML, collected by ASByE-generated Web crawlers; DEByE wrappers extracting ETD-MS records; an OAI server exposing them to the MARIAN/NDLTD OAI harvester and indexer, which build a union archive; and services such as search and browse offered to users on top of it.)

4. Obtaining pages from the ETD sites: the ASByE tool

In this section we describe how we use the ASByE tool to generate the agents that automatically collect pages containing data of interest from the Web. These agents can be seen as specialized crawlers that automatically traverse the publishing sites, exploring hyperlinks, filling forms, and following threads of pages until they find the target pages, that is, the pages that contain the data of interest. Each target page found is retrieved and can have its data extracted by a wrapper.

ASByE (Agent Specification By Example) is a user-driven tool that generates agents for automatically collecting sets of dynamic or static Web pages. The ASByE tool features a visual metaphor for specifying navigation examples, automatic identification of collections of related links, automatic identification of threads of answer pages generated from queries, and dynamic filling of forms from parameters provided to the agents by the user. In a typical interaction with the tool, the user provides examples of (1) how to reach the target pages, filling any forms, if needed, and (2) how to group together related pages. The output of the tool is a parameterized agent that fetches the selected pages. The ASByE tool is fully described in [10].

The graphical interface of the ASByE tool uses a graph-like structure in which nodes displayed in a workspace represent pages (or page sets) and directed arcs represent hyperlinks. The user navigates from node to node, exploring the hyperlinks according to her interests. The source nodes in the graph (i.e., the ones not pointed to by any other node) are called Web entry points and are directly selected by the user by entering the URL of the page used to start the exploration. The tool then fetches the page and builds a node corresponding to it. From this point onward, the user can select, for each node, an operation to perform. The set of operations available depends on the type of node reached. The most common and simple operation allows the user to access a document to explore by selecting one of the hyperlinks.

In Figure 2, we illustrate other features of the ASByE tool, showing how to generate an agent for retrieving pages from the Virginia Tech ETD Collection. The user begins by selecting the URL http://scholar.lib.vt.edu/theses/browse/by author/all.htm as an entry point. The page at this URL contains a list of hyperlinks to each one of the target pages containing the documents available in the Virginia Tech ETD Collection. Using a number of heuristics based on criteria such as hyperlink distribution, hyperlink placement, similarity among URLs, and similarity among hyperlink labels, the tool identifies the list of links to the target pages, i.e., the pages to be collected. The user can then select the agent generation operation. The agent resulting from this specification session will first retrieve the entry point URL, extract from it all URLs currently belonging to the link collection, and then retrieve each target page corresponding to these URLs, giving them as its output.

Figure 2. Snapshot of an agent specification session with the ASByE tool.

In some sites, there is no way to browse the whole document collection. The only way of reaching the target pages is by filling an HTML form, submitting it, and then navigating through the answer pages. Although ASByE is capable of generating agents to perform such operations, this feature was not used for the problem presented in this paper. A detailed description of the feature can be found in [10].
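
To give an idea of what such a generated agent does, here is a rough sketch in Python. It approximates ASByE's link-collection heuristics with a single, much cruder rule (grouping links by URL directory and taking the largest group); the entry URL in the usage comment is a hypothetical placeholder, not one of the sites used in the paper.

    # Crude stand-in for an ASByE-generated crawling agent.
    import urllib.request
    from collections import defaultdict
    from html.parser import HTMLParser
    from posixpath import dirname
    from urllib.parse import urljoin, urlparse

    class LinkCollector(HTMLParser):
        """Collect all anchor hrefs from a page."""
        def __init__(self):
            super().__init__()
            self.hrefs = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.hrefs.append(value)

    def collect_target_pages(entry_url):
        """Fetch the entry page, pick the largest group of links sharing a
        URL directory (a rough proxy for ASByE's link-collection heuristics),
        and yield the content of each target page."""
        with urllib.request.urlopen(entry_url) as response:
            html = response.read().decode("utf-8", errors="replace")
        collector = LinkCollector()
        collector.feed(html)
        groups = defaultdict(list)
        for href in collector.hrefs:
            url = urljoin(entry_url, href)
            groups[dirname(urlparse(url).path)].append(url)
        for url in max(groups.values(), key=len):
            with urllib.request.urlopen(url) as response:
                yield url, response.read()

    # Hypothetical usage:
    # for url, page in collect_target_pages("http://example.edu/etds/all.htm"):
    #     ...  # hand each page to a DEByE-generated wrapper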

5. Wrapping publishing sites: the DEByE tool

We now describe the use of the DEByE tool to generate wrappers that extract data from pages in the collected sites. For a full discussion of the DEByE tool and the DEByE approach, we refer the interested reader to [14].

DEByE (Data Extraction By Example) is a tool that generates wrappers for extracting data from Web pages. It is fully based on a visual paradigm which allows the user to specify a set of examples of the objects to be extracted. These example objects are taken from a sample page of the same Web source from which other objects (data) will be extracted. By examining the structure of the Web page and the HTML text surrounding the example data, the tool derives an Object Extraction Pattern (OEP), a set of regular expressions that includes information on the structure of the objects to be extracted and also on the textual context in which the data appears in the Web pages. The OEP is then passed to a general-purpose wrapper that uses it to extract data from new pages in the same Web source, provided that they have structure and content similar to the sample page, by applying the regular expressions and some structuring operations.
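
As a toy illustration of what an OEP amounts to, the sketch below applies hand-written, per-field regular expressions to a page. Real OEPs are generated by DEByE from the user's examples and also encode object structure, which is omitted here; the patterns and the sample page are invented.

    # Toy Object Extraction Pattern: per-field regexes derived from the
    # HTML context around example data (patterns invented for illustration).
    import re

    OEP = {
        "dc.title":   re.compile(r"<b>Title:</b>\s*(.*?)\s*<br", re.S),
        "dc.creator": re.compile(r"<b>Author:</b>\s*(.*?)\s*<br", re.S),
        "dc.date":    re.compile(r"<b>Date:</b>\s*(.*?)\s*<br", re.S),
    }

    def extract(page_html):
        """Apply each field pattern to a page and return the matched values."""
        record = {}
        for field, pattern in OEP.items():
            match = pattern.search(page_html)
            if match:
                record[field] = match.group(1)
        return record

    sample = ("<b>Title:</b> A Study of X <br>"
              "<b>Author:</b> J. Doe <br>"
              "<b>Date:</b> May 2002 <br>")
    print(extract(sample))
    # {'dc.title': 'A Study of X', 'dc.creator': 'J. Doe', 'dc.date': 'May 2002'}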

DEByE is currently implemented as a system that functions as a Web service, to be used by any application that wishes to provide data extraction functionality to end users. This allows us to implement any type of interface on top of the DEByE core routines. For instance, for general data extraction solutions, we use a DEByE interface based on the paradigm of nested tables [5], which is simple, intuitive, and yet powerful enough to describe the hierarchical structures very common in data available on the Web. For the Web-DL environment, we have built an ETD-MS-specific interface, with which the user can extract examples and assign them directly to ETD-MS fields. The DEByE/Web-DL interface was fully implemented in Javascript and can be used via any Web browser that supports the language.

In Figure 3 we show a snapshot of a user's session for specifying an example object on one or more sample pages. The sample pages are displayed in the upper window, also called the Source window. In the lower window, also called the Fields window, all the ETD-MS fields, such as Identifier, Title, etc., are available. The user can select pieces of data of interest from the source window and "paste" them on the respective cells of the fields window. After giving an example attribute, the user can select the "Test Attribute" button to verify whether DEByE is able to collect the selected attributes from the sample pages. Finally, after specifying all the example objects, the user can click on the "Generate Wrapper" button to generate the corresponding OEP, which encompasses structural and textual information on the objects present in the sample pages. Once generated, this OEP is used by an Extractor module that, when receiving a page similar to the sample page, will perform the actual data extraction of new objects and then output them using an XML-based representation.
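
The paper does not specify the extractor's XML representation, so the snippet below shows one plausible rendering of an extracted object, purely for illustration; the element names are guesses.

    # Guessed XML rendering of an extracted object (format not specified
    # in the paper).
    import xml.etree.ElementTree as ET

    def record_to_xml(record):
        """Render an extracted object as a small XML document."""
        root = ET.Element("object")
        for field, value in record.items():
            ET.SubElement(root, "field", name=field).text = value
        return ET.tostring(root, encoding="unicode")

    print(record_to_xml({"dc.title": "A Study of X", "dc.creator": "J. Doe"}))
    # <object><field name="dc.title">A Study of X</field>
    #         <field name="dc.creator">J. Doe</field></object>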

Since we are using ETD-MS, all the extracted objects are plain, i.e., they do not have a hierarchical or nested structure. In practice, the ETD-MS field thesis.degree contains four nested fields: name, level, discipline, and grantor. However, to simplify the interface, we chose to represent them as independent fields. It is interesting to note that DEByE is also capable of dealing with more complex objects, by using a so-called bottom-up assembly strategy, explained in [14].

Figure 3. Snapshot of an example specification session with the DEByE/Web-DL interface.

6. Providing DL services: the MARIAN system

MARIAN is a digital library system designed and built to store, search over, retrieve, and browse large numbers of diverse objects in a network of relationships [12] (see also Java MARIAN at http://www.dlib.vt.edu/projects/MarianJava/index.html). MARIAN is built upon four basic principles: unified representation based on semantic networks, weighting schemes, a class system and class managers, and extensive use of lazy evaluation.

In MARIAN, semantic networks, which are labeled directed graphs, are promoted to first-class objects and used to represent any kind of digital library structure, including internal structures of digital objects and metadata and different types of relationships among objects and concepts (e.g., as in thesauri and classification hierarchies). In order to support information retrieval services, nodes and links in MARIAN's semantic networks can be weighted. The fundamental concept is that of the weighted object set: a set of objects whose relationship to some external proposition is encoded in their decreasing weight within the set. Nodes and links are further organized in hierarchies of object-oriented classes. Each class in a particular digital library collection is the responsibility of a class manager. Among their other functions, each MARIAN class manager implements one or more search methods. All MARIAN searchers are designed to operate "lazily". During result presentation, only a small subset of results is presented until the user explicitly requests the remaining answers. The number of instances requested, and thus the transmission costs across the network, are severely limited relative to the size of the sets they manage.
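
The notions of a weighted object set and lazy evaluation can be illustrated with a small sketch (this class is illustrative, not MARIAN's actual API): results are kept ordered by decreasing weight and are materialized one at a time, only as the user asks for them.

    # Illustrative weighted object set with lazy result delivery.
    import heapq

    class WeightedObjectSet:
        def __init__(self, weighted_objects):
            # Max-heap by weight (weights negated for Python's min-heap).
            self._heap = [(-w, obj) for obj, w in weighted_objects]
            heapq.heapify(self._heap)

        def __iter__(self):
            # Lazy: each next() pops exactly one result, so a user who views
            # only the first page of results never pays for the rest.
            while self._heap:
                neg_w, obj = heapq.heappop(self._heap)
                yield obj, -neg_w

    results = WeightedObjectSet([("etd:123", 0.91), ("etd:456", 0.75), ("etd:789", 0.88)])
    it = iter(results)
    print(next(it))  # ('etd:123', 0.91), the only result computed so far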

In the context of the Web-DL environment, MARIAN provides searching and browsing services for the DL built from the Web. Data from OAI providers and from non-OAI-compliant members coming from the Web-DL environment are integrated into a Union Catalog. MARIAN is equipped with OAI harvesters able to collect data periodically from the Union Catalog.

MARIAN is completely reconfigurable for different DL collections; it uses digital library generators and a special DL declarative language called 5SL [11] for this purpose. Using these, specific loaders for different metadata formats (e.g., ETD-MS) can be generated. Once a new sub-collection is harvested, the loading process is applied. For every OAI record in the new sub-collection, a new part of the semantic network for the metadata record is created, representing its internal structure according to a metadata standard and the connections among text terms and text parts. The new part of the semantic network for the record is then integrated into the MARIAN knowledge base. At the end of the loading process, weights for the resulting collection network are recomputed to consider global statistics.

Structured searches are supported by processing classes, class managers, and specific user interfaces, also created during the DL generation process. Results of structured queries are displayed as ranked lists for browsing, with entries and links created by specific XSL stylesheets. Presentations of full documents, also generated with special stylesheets, contain links that allow navigation to the originally collected Web page.

7. Converting the extracted data

For our particular problem, to store the data extracted by DEByE wrappers we chose to use the ETD-MS format, to comply with the OAI-PMH. Web sites, however, are far from containing standardized data, and some normalizing operations need to be performed. Four main problems were found when converting data to the standard format: (1) mandatory data is not present in the page; (2) data is present, but only implicitly; (3) data is not in a required format; and (4) the extracted data is not in the appropriate encoding.

Regarding the first problem, when data is not present in the page, some replacement must be found. The solution for most mandatory fields is to use a default value, like "none". For other fields, like "identifier", a unique value must be generated, for instance by using sequential values or timestamps. The second problem happens when some piece of information is known, but the data is not explicitly represented in the page. For instance, for the dc.publisher field, we may know we are collecting from the Virginia Tech site, but this information appears nowhere in the page. The third problem occurred mainly for the dc.date field. As required by ETD-MS, the date should be in ISO 8601 format. Therefore, dates collected from the Web pages must be converted before being stored. Finally, in many ETD pages, many formatting HTML tags and HTML entities are found within the extracted text fields. Also, non-English sites use many different character encodings to represent foreign characters. Some cleaning routines are needed to eliminate spurious tags and to convert between character encoding systems.

A general solution to this data cleaning and conversion problem is very hard to find. In Web-DL, we chose an intermediate solution between fully automating the process and manual user intervention. A set of predefined modules for processing the data is available, and the user can select which ones to apply to the data being extracted. This process is fully implemented in the DEByE/Web-DL interface, providing seamless integration with the Web-DL environment. For instance, as shown in Figure 4, for the date field the user can apply a filter that converts the collected date to ISO 8601 format. A filter to insert a default value can also be applied to all fields. Filters to convert the character encoding and to strip HTML tags can be selected using the checkboxes at the bottom of the window, since these will be applied to all objects collected, independently of their value or type. When extracting the data from a Web page, the DEByE-generated parser applies the selected modules to the objects. As a result, all data will be in the desired standard format and can be stored using ETD-MS.

The data cleaning and conversion modules are simply string processing routines: they take a string as input, process it, and return the resulting string as output. This provides great flexibility for the construction of such modules. Thus, users can implement data cleaning modules according to their own specific needs, using any available programming language. More complex modules can be built using an API provided by DEByE, which allows, for instance, the passage of parameters other than the string to be processed. Of course, a set of predefined modules is already included in DEByE, to provide users with no programming experience with as much data cleaning functionality as possible. These are fully reusable and appropriate for any project. This approach solves the problems found in our preliminary experiments with Web-DL [4], while maintaining the modularity of the environment and minimizing user intervention in the process of building a digital library from the Web.
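
Since the modules are plain string-in/string-out routines, they are easy to mimic. The filters below are simplified stand-ins for the predefined DEByE/Web-DL modules (HTML stripping, ISO 8601 date conversion, default values); the set of date layouts handled is an assumption made for the example.

    # Sketch of string-in/string-out cleaning filters (not the actual modules).
    import re
    from datetime import datetime
    from html import unescape

    def strip_html(value):
        """Remove formatting tags and decode HTML entities."""
        return unescape(re.sub(r"<[^>]+>", "", value)).strip()

    def to_iso8601(value):
        """Convert a few common date layouts to ISO 8601 (YYYY-MM-DD)."""
        for fmt in ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y"):
            try:
                return datetime.strptime(value.strip(), fmt).date().isoformat()
            except ValueError:
                pass
        return value  # leave unrecognized dates untouched

    def default_value(default):
        """Filter factory: substitute a default when the field is empty."""
        return lambda value: value if value.strip() else default

    filters = [strip_html, to_iso8601]
    raw = "<i>May 5, 2002</i>"
    for apply_filter in filters:
        raw = apply_filter(raw)
    print(raw)  # 2002-05-05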

Once all the normalizing problems are solved, data can be stored in a relational database, later to be rendered using ETD-MS. The database is then made accessible through an OAI server. Using the OAI-PMH, the data extracted from the Web can be shared with any DL acting as an OAI service provider. In our environment, the extracted data is harvested and integrated with data harvested from other NDLTD members within MARIAN.

8. An example Web ETD digital library

For this work, we collected pages containing ETDs from the sites of 21 different institutions selected from the list of NDLTD members, available at http://www.theses.org. These experiments were performed in the same context as reported in [4], but using the new integrated data cleaning and conversion modules. The ETD sites contained a total of 9595 ETDs. It was not possible to collect information from the sites of 7 institutions, since these were off-line or available only through a search interface.

Of the 6 mandatory ETD-MS fields, an average of 29.5% were missing from the collected pages and were therefore filled with a default value. This value was inserted by the DL builder through the "default value filter" of the DEByE/Web-DL interface, thus requiring only one simple operation per field. The default value filter also allowed for the creation of unique identifiers, by appending a serial number to the dc.identifier field. This was one of the major problems found in our previous experiments [4], which had required the manual implementation of data insertion routines. Here, it was solved by simply selecting options from the user interface. Table 1 shows the number of ETDs in which mandatory fields were missing.

Figure 4. Data cleaning and conversion in the DEByE/Web-DL interface.

Field name       ETDs missing
dc.title         43 (0.4%)
dc.creator       23 (0.2%)
dc.subject       2349 (24%)
dc.date          283 (3%)
dc.type          703 (7%)
dc.identifier    4800 (50%)

Table 1. Mandatory fields missing from the collected ETDs.

Table 2 shows the numbers for each site collected. It can be seen that most, although not all, of the information was collected and extracted. It is interesting to note that fields like dc.publisher or dc.type, which are often implicit in the collected site entry pages but not available as extractable examples, could be easily inserted as a default value for the whole site. This means that the user needed only to type one value for each site, whereas in our previous experiments each site required the implementation of a separate routine.

The work required to include a site in the digital library consisted of providing sets of examples to the ASByE and DEByE tools. For each collected site, only one example was needed to create the crawling agents. To generate parsers for data extraction, an average of 2-3 examples per field were required. This represented an average of 9 minutes of work per site by a specialized user, much less than previously reported in [4]. The reduction in time was largely due to the new automated process of converting data to a standard format. An interesting example is that of the dc.date field, which previously required that the user extract each part of the date (day, month, year) individually or implement a conversion routine for the ISO 8601 format.

For the 21 institutions in our example, the total effort of the user summed up to approximately 3 hours and 15 minutes. Notice that most of this is due to processing time, which can be improved by further optimizing the system code or using faster hardware. Since we do not expect Web sites to be massively submitted to the system, this is a reasonable human effort to collect the data of interest. In the future, we expect to further automate this process, to reduce the time required as more sites are harvested.

To illustrate, Figure 5 shows an ETD published by Uppsala University. Once collected and extracted, all the metadata is stored and made available by the MARIAN system. Figure 6 shows the results of a query over the ETDs collected from the Web, using the MARIAN system. By using Web-DL, not only searching but any number of DL services, such as browsing and filtering, among others, can be performed over the data extracted from the Web.

ETD Site             Number of ETDs   Fields per ETD   Mandatory fields missing   Optional fields inserted
Adelaide U.          19               4                3                          5
Australia N.U.       39               5                3                          4
Concordia U.         3                9                0                          2
Curtin U.T.          57               10               0                          2
Griffith U.          40               5                3                          4
H-U. Berlin          439              7                1                          2
N.S.Y.U. Taiwan      1786             9                1                          3
OhioLINK             932              6                2                          4
Queensland U.T.      53               5                3                          4
Rhodes U.            134              5                3                          5
U. Kentucky          30               9                1                          2
U. New South Wales   89               5                3                          4
U. Tennessee         10               8                1                          3
U. Virginia          619              8                0                          2
U. Waterloo          105              5                3                          5
U. Wollongong        6                5                3                          4
U.P. Valencia        264              6                1                          3
Uppsala U.           1567             3                3                          5
Victoria U.T.        3                5                3                          4
Virginia Tech        3278             9                0                          2
Worcester P.I.       122              10               0                          2

Table 2. Statistics for the data collected from the ETD sites.

Figure 5. Metadata for an ETD, available at the Uppsala University Web site.

Figure 6. Search results for query "fusion medical images" over the ETDs collected from the Web.

9. Summary and conclusions

We proposed the Web-DL environment for the construction of digital libraries from the Web. Our demonstration environment integrates standard protocols, data extraction, and digital library tools to build a digital library of electronic theses and dissertations. The proposed environment provides an important first step towards the rapid construction of large DLs from the Web, as well as a large-scale solution for interoperability between independent digital libraries.

In this paper, Web-DL was applied to the Networked Digital Library of Theses and Dissertations, where we were able to collect data from more than 9000 electronic theses and dissertations. Due to the flexibility of the tools that compose Web-DL, we expect it to be easily applicable to any other domain, requiring, at most, changes in the user interface. Different interfaces are easily implementable for specific areas. Alternatively, a general interface like nested tables can be used for the majority of data available on the Web.

9.1. Lessons learned

Moving from the Web to a digital library is not a trivial task. Besides page collecting, we are faced with the difficult problem of transforming semi-structured data into structured data. Since there may not be a general solution for this problem, it is important to summarize the problems found and solutions applied when building the digital library of ETDs from the Web.

One of the main problems found was that some of the ETD sites to be collected provide access to their data only through search interfaces, resulting in the hidden Web problem [13]. Although we did not approach this problem in our experiments, it can be partially solved by the use of the ASByE tool, which allows filling forms and submitting queries to reach the hidden pages. Thus, although it is impossible to guarantee that all data will be collected, the Web-DL environment is able to minimize the hidden Web problem, allowing us to obtain information otherwise unavailable to common Web crawlers.

Although there are many approaches to data extraction, as discussed in [15], cases will always be found where wrappers must be built manually. For instance, Web pages within a site can be very different from each other, making it very hard to build a generic wrapper for the whole site. In our experiments, the use of the DEByE tool avoided all such problems and all wrappers were built with minimum effort. This may be due to the fact that most ETD sites were quite regular, but other experimental results [14] have shown that our approach to Web data extraction might be equally effective in more general and complex environments.

Finally, we face the problem of making the unstructured Web data fit a standard pattern. In Web-DL, we adopted a compromise solution, where a set of predefined data cleaning and conversion modules is available and can be selected by the user collecting data. To keep the solution as general as possible, we allow users to implement their own extra modules, according to their specific needs. This solution still requires some user intervention, but it is very general, and user effort is reduced to a minimum.

In sum, each of the tasks for extracting information from the Web into a DL environment presents its own set of problems. A general solution for building digital libraries from the Web depends on general solutions for each of these tasks and on an efficient integration of such solutions. The Web-DL environment provides such an integration and, through experiments, has shown itself to be a fast and efficient DL collection building tool. Further, using Web-DL to achieve interoperability between independent digital libraries requires as little effort as a gathering solution, but provides the quality of data and services usually obtained only by harvesting or federated solutions.

9.2. Future work

The MARIAN system allows for harvesting data from NDLTD member sites using a variety of standard protocols. Therefore, an immediate first step is to integrate the data extracted from the Web with data collected from other member sites. A need resulting from this integration is that of deduping: recognizing two instances of the same object coming from different sources, or combining search results coming from internal repositories and external sources. Approaches to these problems are currently being studied and will be implemented in the future. MARIAN also allows for the use of probability estimates for the quality of the extracted data and their utilization in retrieval operations [12]. We are currently studying a coherent way of computing these probabilities directly from the DEByE tool.

At the current stage of our work, the generation of wrappers for each Web source was accomplished by using the DEByE tool to select example objects (i.e., bibliography entries) from sample pages from each of the sources. As we expect the number of sources to increase rapidly, we intend to deploy the automatic example generation method described in [9]. Such a method uses data available in a pre-existing repository (e.g., titles, author names, keywords, subject areas, etc.) to automatically identify similar data in sample pages of new sources and to assemble example objects. By using it, we expect to automate the generation of wrappers, at least for a considerable number of cases.

We will also be extending the current Web-DL environment to consider classification of the data extracted from the Web, using a number of classification schemes, such as the ACM or the Library of Congress classification schemes, and domain-specific ontologies. Finally, the current work on the Web-DL environment is largely concentrated on improving the quality of data. In the near future we will extend and incorporate new kinds of networks (e.g., belief networks) into MARIAN to improve the quality of current and future DL services.

10. Acknowledgments

Thanks are given for the support of NSF through its grants IIS-0086227 and DUE-0121679. The first author is supported by MCT/FCT scholarship SFRH/BD/4662/2001. The second author is supported by AOL and by CAPES, 1702-980. Work on MARIAN also has been supported by the National Library of Medicine. Work at UFMG has been supported by CNPq project I3DL, process 680154/01-9.

References

[1] A. Atkins, E. A. Fox, R. K. France, and H. Suleman. ETD-MS: an interoperability metadata standard for electronic theses and dissertations. http://www.ndltd.org/standards/metadata/, 2001.

[2] D. Bergmark. Collection synthesis. In Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL'02, pages 46-56, Portland, Oregon, USA, June 2002.

[3] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz. The Harvest information discovery and access system. Computer Networks and ISDN Systems, 28(1-2):119-125, December 1995.

[4] P. Calado, A. S. da Silva, B. A. Ribeiro-Neto, A. H. F. Laender, J. P. Lage, D. de Castro Reis, P. A. Roberto, M. V. Vieira, M. A. Gonçalves, and E. A. Fox. Web-DL: an experience in building digital libraries from the Web. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, pages 675-677, McLean, Virginia, USA, November 2002. Poster session.

[5] A. S. da Silva, I. M. R. E. Filha, A. H. F. Laender, and D. W. Embley. Representing and querying semistructured Web data using nested tables with structural variants. In Proceedings of the 21st International Conference on Conceptual Modeling, ER 2002, pages 135-151, October 2002.

[6] D. de Castro Reis, R. B. Araujo, A. S. da Silva, and B. Ribeiro-Neto. A framework for generating attribute extractors for Web data sources. In Proceedings of the 9th Symposium on String Processing and Information Retrieval (SPIRE'02), pages 210-226, Lisboa, Portugal, September 2002.

[7] E. A. Fox, M. A. Gonçalves, G. McMillan, J. Eaton, A. Atkins, and N. Kipp. The Networked Digital Library of Theses and Dissertations: Changes in the university community. Journal of Computing in Higher Education, 13(2):102-124, Spring 2002.

[8] N. Fuhr. Networked information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 344, August 1996.

[9] P. B. Golgher, A. S. da Silva, A. H. F. Laender, and B. A. Ribeiro-Neto. Bootstrapping for example-based data extraction. In Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, pages 371-378, Atlanta, Georgia, USA, November 2001.

[10] P. B. Golgher, A. H. F. Laender, A. S. da Silva, and B. Ribeiro-Neto. An example-based environment for wrapper generation. In Proceedings of the 2nd International Workshop on The World Wide Web and Conceptual Modeling, pages 152-164, October 2000.

[11] M. A. Gonçalves and E. A. Fox. 5SL: A language for declarative generation of digital libraries. In Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL'02, pages 263-272, Portland, Oregon, USA, June 2002.

[12] M. A. Gonçalves, P. Mather, J. Wang, Y. Zhou, M. Luo, R. Richardson, R. Shen, L. Xu, and E. A. Fox. Java MARIAN: From an OPAC to a modern digital library system. Lecture Notes in Computer Science, Springer, 2476:194-209, September 2002.

[13] P. G. Ipeirotis, L. Gravano, and M. Sahami. Probe, count, and classify: categorizing hidden Web databases. SIGMOD Record, 30(2):67-78, June 2001.

[14] A. H. F. Laender, B. Ribeiro-Neto, and A. S. da Silva. DEByE - data extraction by example. Data and Knowledge Engineering, 40(2):121-154, February 2002.

[15] A. H. F. Laender, B. Ribeiro-Neto, A. S. da Silva, and J. S. Teixeira. A brief survey of Web data extraction tools. SIGMOD Record, 31(2):84-93, June 2002.

[16] C. Lagoze and H. V. de Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. In Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL'01, pages 54-62, June 2001.

[17] C. Lagoze, D. Fielding, and S. Payette. Making global digital libraries work: Collection services, connectivity regions, and collection views. In Proceedings of the 3rd ACM International Conference on Digital Libraries, DL'98, pages 134-143, Pittsburgh, Pennsylvania, USA, June 1998.

[18] S. Lawrence, C. L. Giles, and K. Bollacker. Digital libraries and Autonomous Citation Indexing. IEEE Computer, 32(6):67-71, June 1999.

[19] K. Maly, M. Zubair, and X. Liu. Kepler - an OAI data/service provider for the individual. D-Lib Magazine, 7(4), April 2001.

[20] PhysNet. http://physnet.uni-oldenburg.de/PhysNet/, 2002.

[21] I. H. Witten, S. J. Boddie, D. Bainbridge, and R. J. McNab. Greenstone: A comprehensive open-source digital library software system. In Proceedings of the 5th ACM International Conference on Digital Libraries, pages 113-121, San Antonio, Texas, USA, June 2000.