
even lacked sufficient basic world knowledge to be able to recognize "answers."

Unlike library catalogs, most publicly available search services on the Web (apart from those provided in digital library environments) are funded by advertising dollars, and the effort to draw customers (i.e., searchers) to a particular "brand" tends to emphasize packaging over product. Most people appear to find one or two search engines which they like, often for ease of use or speed of response, and they become loyal customers (see, for instance, the survey by Stobart & Kerridge, 1996). To promote that brand loyalty, search engine providers pay inordinately large fees to be the service which is linked to a browser button labeled "search" (Andrews, 1997b). It also seems that most casual users use one or two words to drive a search. Despite this, many search services now provide an array of search tools approaching those available in the realm of commercial online searching (except for the enhancements provided by indexing languages and authority control).

This review looks briefly at the history of World Wide Web search engine development, considers the current state of affairs, and reflects on the future (history is an interesting term in this context—McMurdo, writing in 1995, notes ironically that Oliver McBryan's World Wide Web Worm, which was released in early 1994, was already considered an ancestor of later models by the end of the same year).

The Past

The Internet became widely available to the scholarly—and eventually business and consumer—community as a research and communication tool towards the end of the 1980s. The discovery process for access to the resources provided under each newly available function (beginning with telnet and FTP, and moving through listservs, newsgroups, Gopher, WAIS, and the Web) frequently began with word of mouth, or, more commonly, word of E-mail—colleague telling colleague about some

Web Search Engines

Candy Schwartz
Graduate School of Library and Information Science, Simmons College, 300 The Fenway, Boston, MA 02115-5898. E-mail: [email protected]

This review looks briefly at the history of World Wide Web search engine development, considers the current state of affairs, and reflects on the future. Networked discovery tools have evolved along with Internet resource availability. World Wide Web search engines display some complexity in their variety, content, resource acquisition strategies, and in the array of tools they deploy to assist users. A small but growing body of evaluation literature, much of it not systematic in nature, indicates that performance effectiveness is difficult to assess in this setting. Significant improvements in general-content search engine retrieval and ranking performance may not be possible, and are probably not worth the effort, although search engine providers have introduced some rudimentary attempts at personalization, summarization, and query expansion. The shift to distributed search across multitype database systems could extend general networked discovery and retrieval to include smaller resource collections with rich metadata and navigation tools.

The term "search engine," as used by the average citizen of the World Wide Web, encompasses a wide variety of services which provide access to Internet resources. In the field of information retrieval research, a distinction is made between the interface and the engine—the former is the means by which the user interacts with the latter. Not so to Jane or John Q. Surfer, to whom the concept of search engine includes the interface, the retrieval and presentation mechanism, and the database. In common with library catalogs, users of Internet search engines are concerned with results, rarely understand or even consider the mechanisms, and even more rarely make full use of the capabilities provided by sophisticated search tools. Pollock and Hockley's (1997) study of search engine use by Internet-naive adults (whether computer-literate or otherwise) certainly backs this up. Subjects misunderstood what the Internet is, what types of resources it contains, why searches might require several iterations, what types of keywords might be fruitful, and

© 1998 John Wiley & Sons, Inc.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 49(11):973–982, 1998. CCC 0002-8231/98/110973-10

meta-engines, subject-specific services, personal desktop search agents, and "push" services. Like the Web, the world of search services is now complex, rich, volatile, and frequently frustrating.

The Present

The "Literature"

The body of work on Web search engines is already quite extensive. Print and electronic trade literature (and Web resources) in the information and financial industries feature product announcements and discussions of the business aspects of managing and financing search services. Search engine guides, feature tables, and practical tips are published in professional journals or mounted on the Web by libraries, search service companies, and individuals with an interest in searching. Journals and conference proceedings in computer science, engineering, library and information science, and related fields carry performance comparisons and research reports on new work in networked information retrieval. Traugott Koch (1997) maintains an extensive bibliography of print and electronic sources on search service comparisons, retrieval, and indexing the Internet. Continuously updated sites such as Search Insider, maintained by LEO: Librarians and Educators Online (1997), Danny Sullivan's (1997) Search Engine Watch, and ZDNet's (1997) Whole Web Catalog/Search hold original content as well as collections of links to print and electronic sources on Web search services. Most recently, Maze, Moxley, and Smith (1997) have published a textbook which explains the technology behind search engines, and focuses in detail on the seven most popular.

Types of Search Services

There are two basic types of search engines: Classified lists, of which Yahoo is the best-known example, and query-based engines, which are far more common (for example, AltaVista, HotBot, Excite, and so on). Both maintain databases containing representations of Web pages (and sometimes other resources) in some form. Classified lists present arrays of resource links in systematically arranged categories, often quite complex hierarchies. Query-based engines run search algorithms based on user-input text expressions. Classified lists usually allow query-based search of category labels and resource titles, and query-based services often provide browsable categories as well, but it is generally obvious that a search engine is primarily of one kind or the other.

Web users and researchers (especially librarians and other information professionals) being who they are, it is not surprising that aids have been developed to cope with the proliferation of general search services and the availability of numerous specialized services. These aids include directories and meta-engines. Directories are lists

new file or site. Printed directories of electronic discussion groups, E-journals, telnet-accessible services, and so on, were published, but print publishing has never been a particularly appropriate method for keeping up to date with Internet resources.

Fortunately, each function was also followed quite quickly by the deployment of one or more electronic discovery devices. Files available via anonymous FTP could be found using archie. Listserv archives could be searched via commands sent to the server. Online directories such as HYTELNET and LIBS pointed to libraries and other collections that could be reached with telnet. Widespread adoption of Gopher in the early 1990s was attended by the development of veronica, and then jughead, both of which provided keyword search through the text of Gopher menu lines, and could be used within the confines of one site, or on information gathered from all of Gopher space. WAIS (Wide Area Information Server) was somewhat different. Developed by Brewster Kahle, then at Thinking Machines, Inc., WAIS drew from work on two fronts: Thirty years of research in the information science community on using statistical characteristics of text for retrieval, and more recent developments in the library community on the Z39.50 protocol for interoperability between multitype automated library catalogs. Public WAIS sites presented directories of collections available for search, and search through specified collections resulted in a list of files ranked primarily on the basis of search term occurrence.

The year 1991 saw the first general release of WWW line mode browsers at CERN. Windows and Macintosh graphical browsers arrived in 1993, and the subsequent rapid growth of the World Wide Web is well documented (Cailliau, 1995; Gray, 1996). There is little agreement as to the actual number of Web resources available on servers around the world, but the word "overwhelming" is usually considered apt. In the early days, when servers were few and knowledge of markup rare, resource discovery started at the CERN Web site, which included an alphabetized subject listing of links to pages forming the World Wide Web Virtual Library (the opening CERN page as it appeared on November 3, 1992, has been archived at http://www.w3.org/History/19921103-hypertext/hypertext/WWW/TheProject.html). In 1994, as the number of HTTP resources increased, the services that we now know as search engines began to appear. Most seem to have started as research—or recreation—projects undertaken by graduate students, faculty, systems staff, and other "Web-heads." Some fell by the wayside as the task began to exceed the capacity of limited human and technical resources; most of those that survived were either acquired by corporations, financed by advertising and capital investment, or funded by research initiatives. By 1996, search engines began to be featured in trade journals, and then in business and daily newspapers, and on network television. Differentiated search products proliferated—search engine directories,


Some general-content query services reduce the size of the haystack by establishing subset databases of selected and reviewed resources, or most popular ones. Some provide proprietary, and perhaps fee-based, information sources. In addition to indexing HTML pages, many query-based services include Usenet news (a limited timespan) and Gopher menus, and some harvest and index a wider variety of formats, such as ASCII, VRML, SGML, and PDF files.

The roster of indexed elements in the representation varies from service to service. Some services index every word on a page, in some cases including the URL, ALT text in the <IMG> tag, and comment text. Positional and markup tag information may be stored with indexed text to improve retrieval and ranking effectiveness. Others index only frequently occurring words, or only words occurring within certain markup tags, or only the first so many words or lines of HTML files. Stopwords may or may not be applied, and if applied, may include words of very high frequency, such as "Web," "Internet," "html," and so on. Representations might (rarely) be enhanced by the intellectual addition of keywords, category terms, or (even more rarely) reviews and summaries, all of which may contribute to retrieval, but have a negative effect on timeliness in that they require human intervention.
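To make the range of indexing choices concrete, here is a minimal sketch in Python (not the code of any actual service) of an inverted index that applies an illustrative stopword list and records position and title-text information with each posting; the sample page, URL, and stopword list are invented for the example.

# Minimal sketch of the indexing choices described above: which words to
# keep, whether to apply stopwords, and whether to record positions and
# markup context with each posting for later ranking.
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "web", "internet", "html"}  # illustrative only

def tokenize(text):
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def index_page(url, html, index):
    """Add one page to an in-memory inverted index.

    index maps term -> list of (url, position, appears_in_title) postings.
    """
    title_match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    title_terms = set(tokenize(title_match.group(1))) if title_match else set()
    body_text = re.sub(r"<[^>]+>", " ", html)          # strip markup tags
    for pos, term in enumerate(tokenize(body_text)):
        if term in STOPWORDS:                          # a stopword policy, if applied
            continue
        index[term].append((url, pos, term in title_terms))

index = defaultdict(list)
index_page("http://example.org/",
           "<html><title>Toy page</title><body>Search engines index words.</body></html>",
           index)
print(index["search"])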

Search Features

The opening screen of a typical query-based search engine presents an input box and possibly a choice as to how the query terms are to be processed (e.g., "any/all/this exact phrase"). Defaults with respect to automatic stemming, case sensitivity, matching to irregular plurals or alternate forms (such as "jail" and "gaol"), fields being searched, stopwords, and so on, are rarely obvious, even to searchers who might want to know. Many search engines do provide a full array of sophisticated search commands, but the casual user is usually protected from these by their being in a help file or only accessible under the guise of "power" or "advanced" search. Common advanced capabilities (although one suspects that they are rarely used) include:

• Boolean search (in some cases with nested parentheses);
• specification of terms which must or must not be present;
• truncation (terminal and internal) or conversely, inhibition of automatic stemming (Excite is one of the few engines to search character strings, allowing for initial truncation);
• exact phrase match;
• proximity searching (which can be as sophisticated as that found in commercial online searching);
• fielded search (based on markup tags identifying title text, meta text, heading text, link, and so on);
• specification as to case sensitivity;

of search engines, usually organized into useful categories (e.g., searching for people, searching for companies, etc.). Some directories simply provide links, many provide input boxes so that queries can be sent directly to selected engines, and some provide detailed annotations as to scope and search features (Eureka, for example). A meta-engine sends a query to more than one search engine (as many as 20 or 30), sometimes of the user's choosing. Results may be merged, with duplicates removed (for instance, MetaCrawler, the first meta-engine), or may be presented as separate lists from each engine (SuperSeek, for example, displays results from each engine in a separate frame). While most meta-engines are sites on the Web, personal search agents such as QuarterDeck's WebCompass or FerretSoft's WebFerret are desktop examples.
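A rough sketch of the meta-engine idea just described, with placeholder functions standing in for the individual engines (no real engine API is used); the merge step simply collapses duplicate URLs in the order received.

# A minimal meta-engine sketch: one query goes to several engines, and the
# hit lists are merged with duplicates (same URL) removed. The per-engine
# search functions below are stand-ins, not real services.
def search_engine_a(query):
    return ["http://example.org/a", "http://example.org/shared"]

def search_engine_b(query):
    return ["http://example.org/shared", "http://example.org/b"]

def metasearch(query, engines):
    seen, merged = set(), []
    for engine in engines:
        for url in engine(query):
            if url not in seen:              # duplicate removal across engines
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("web search engines", [search_engine_a, search_engine_b]))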

Search Service Content

It is usually possible to submit a Web page to a search service for inclusion in the indexed database of representations. There are Web sites which facilitate this process, and most search services provide a "submit your URL" procedure on their home pages. Most search services also acquire database information from Web pages through the use of agents, or robots, which retrieve URLs and then representation data from the 'Net, following link paths and seeking new or changed resources. For example, AltaVista's Web spider "Scooter" collects data on roughly 6,000,000 Web pages daily. The frequency with which a database is refreshed with new or changed information varies—Melee's (1997) weekly report is an interesting look at the relative coverage of some popular search engines. Sullivan's (1997) pages include a search engine "EKG" which charts the frequency with which seven search engines drop in on two sites.

Agents can learn to reexamine frequently those sites which change a great deal, or to which many other pages link. Alternatively, they may emphasize resources which are "isolated," that is, not referenced by many other pages. When retrieving one URL among a complex collection of inter- and intralinked pages, an agent can mark the site for reexamination, to acquire at a later point information about all or a sampling of the remaining pages. Robot strategies for following complex inter-document links are generally based on the assumption that Web sites are arranged hierarchically or at least logically (for example, shorter path names should reflect superordinate pages), an assumption which can lead to failure to collect key materials (Smith, Moxley, & Maze, 1997). Also, use of frames, imagemaps, CGI, Java, and so on, can impede the progress of some agents in the "trawling" process.
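The basic trawling behavior can be sketched as follows; the in-memory FAKE_WEB dictionary is a stand-in for real HTTP fetching, and the link-following and revisit policies here are deliberately simplistic rather than those of any particular robot.

# A toy trawling sketch: start from seed URLs, fetch pages, extract links,
# and queue newly discovered pages for later visits.
import re
from collections import deque

FAKE_WEB = {
    "http://example.org/": '<a href="http://example.org/about">about</a>',
    "http://example.org/about": '<a href="http://example.org/">home</a>',
}

def extract_links(html):
    return re.findall(r'href="([^"]+)"', html)

def crawl(seeds, max_pages=10):
    frontier, visited = deque(seeds), set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited or url not in FAKE_WEB:
            continue                       # skip revisits and unreachable pages
        visited.add(url)
        for link in extract_links(FAKE_WEB[url]):
            if link not in visited:
                frontier.append(link)      # newly discovered resource
    return visited

print(crawl(["http://example.org/"]))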

Classified list services review submitted or harvested information for inclusion. Subject-specialized search services such as LawCrawler also engage in an intellectual selection process to maintain content quality, and use tailored agents to limit discovery to appropriate resources.


• date of entry into the database;
• URL;
• language;
• category label (if the search service includes a classified array);
• search terms present in the resource.

Additional elements may be included with results displays. Excite and several other services provide a relevance feedback option ("more like this") using high frequency terms from the identified relevant document to restate the query. Lycos' long format includes the number of external links in a resource. HotBot identifies duplicates when they enter the database, and presents them as "alternates" under one representative in the results list. Infoseek displays customized category or site listings based on query terms. All of these devices are intended to help searchers make sense of results, but still the product of a query in current search engine circumstances is often poorly ordered and bewildering.
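A "more like this" restatement can be approximated by taking the most frequent non-stopword terms of the chosen document as the new query; the sketch below is only illustrative and does not reproduce any particular engine's method, and the stoplist and sample text are invented.

# A minimal relevance-feedback sketch: the most frequent non-stopword terms
# of a document judged relevant become the restated query.
from collections import Counter

STOPLIST = {"the", "a", "of", "and", "to", "in"}

def more_like_this(document_text, n_terms=5):
    tokens = [t for t in document_text.lower().split() if t not in STOPLIST]
    return [term for term, _ in Counter(tokens).most_common(n_terms)]

sample = "search engines rank pages and search engines index the text of pages"
print(more_like_this(sample))   # e.g. ['search', 'engines', 'pages', 'rank', 'index']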

Performance Evaluation

Search service comparisons abound, and most consist principally of feature charts. These are useful in themselves, but date very quickly, and do not say much about retrieval performance. Matt Slot's (1996) features chart for The Matrix is superior to most in that he scores each engine for each feature. Information industry Web sites and trade publications publish some performance ratings, although, in most cases, the methods of assessment and evaluation are either unspecified or unsystematic. ZDNet holds an annual "search engine showdown" (Lake, 1997) and publishes other ratings (Randall, 1997)—the showdown testers are business people who use the Internet, and the queries show some complexity, but this is not "research" such as one might find in the information retrieval literature. Internet World and PC World carry similar lab tests, ratings, and feature charts (Haskin, 1997; Overton, 1996; Scoville, 1996; Venditto, 1996).

Search engine comparisons by and for information professionals display more depth. Courtois, Baer, and Stark (1995) assessed seven services based on "known-URL" searches. Feldman (1997) tested precision in seven search engines using real user queries for information about companies, products, medical data, foreign information, technical reports, and current events. Text and links in the first 10 retrieved items were examined for relevance. Peterson (1997) compared eight search engines for the results of two queries repeated in three time periods. Kimmel (1996) performed simple searches in nine engines, mostly to examine coverage and compare features. Westera (1997) used queries of different types (single keyword, plural keyword, phrase, Boolean, and proper name), and replicated her tests 6 months apart in time. In one of the best examples of this type of comparison, Zorn, Emanoil, Marshall, and Panek (1996) used three complex Boolean

• restriction by date, domain, language, or file type (based on file name extension).

Search Results

Although experience with search engines sometimes makes this hard to believe, search results are usually ranked by relevance, with options for sorting the top of the list instead by URL (useful for spotting pages from the same site) or by date. Being proprietary, the means by which ranking is accomplished are rarely described in any great detail. At the very least, query term frequency in the document is taken into account, and calculations may involve normalization for document length. Additional possible parameters include the following (a toy scoring sketch follows the list):

• Proximity of query terms to each other in the document;
• frequency of query term across the database;
• term location in the document (with higher weight assigned to <TITLE> and <META> text or to terms which appear earlier in the document);
• term location in the query (earlier terms being more important);
• document popularity (frequency with which other documents in the database link to the document);
• whether or not the document forms part of the "reviewed" content provided by the search service.
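As promised above, here is a toy scoring function combining a few of these parameters (length-normalized term frequency, rarity of the term across the database, and a boost for title text); the weights and sample data are arbitrary and not drawn from any actual engine.

# A toy relevance score: within-document term frequency, an IDF-like rarity
# factor, and a boost when the term appears in the title.
import math

def score(query_terms, doc, collection_size, doc_freq):
    """doc: {"title": str, "text": str}; doc_freq maps term -> docs containing it."""
    text_tokens = doc["text"].lower().split()
    title_tokens = set(doc["title"].lower().split())
    s = 0.0
    for term in query_terms:
        tf = text_tokens.count(term) / max(len(text_tokens), 1)        # length-normalized
        idf = math.log((collection_size + 1) / (doc_freq.get(term, 0) + 1))
        boost = 2.0 if term in title_tokens else 1.0                   # markup location
        s += tf * idf * boost
    return s

doc = {"title": "Web search engines",
       "text": "search engines rank web pages by term frequency"}
print(score(["search", "ranking"], doc, collection_size=1000,
            doc_freq={"search": 400, "ranking": 20}))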

Word spamming (embedding repeated terms in <META> tags or elsewhere for purposes of promoting high rank) is penalized by several search engines. Some allow the user to participate in ranking decisions by, for instance, specifying the strength of match, the number of terms to be matched, and so on. The new Lycos Pro Java-based search panel (Fig. 1) has slide bars to direct the degree to which the following are important: Presence of all query terms; frequency of query terms across the database; and appearance of query terms early in the title, early in the text, close together, and in exact order. Using AltaVista's advanced mode, a searcher can enter "ranking terms" which will be considered in ordering results.

Most search engines present results 10 or so at a time, in a default format showing title and some text, and accompanied by a cheery message along the lines of "1–10 of 69,010." Both the number of hits per page and the format can usually be changed. Format displays (variations on short, medium, and long) can include any of the following:

• Title;
• relevance score (expressed in a variety of scales);
• summary (summaries may be prepared abstracts, outlines created by extracting text in heading tags, most frequent words, the first so many words, or some automatically constructed representation);
• file size in bytes;
• file date;


ments were made for the first 20 results, using a six-point scale. The study by Tomaiuolo and Packer (1996a, 1996b) is notable for its sheer scale—high precision searches for 200 topics gathered from a reference desk in an undergraduate setting, and searched on AltaVista, Infoseek, Lycos, Magellan, and Point. Precision was calculated for the first 10, and the researchers rather than users assessed relevance.

Leighton and Srivastava (1997) express a concern for the presence of bias and lack of statistical validity in most search engine performance evaluation, and attempt to overcome these shortcomings in their study of AltaVista, Excite, HotBot, Infoseek, and Lycos. Fifteen queries, most of which were reference desk questions, were input (in most cases) as unstructured text. Each query was run against all engines on the same day, and the first 20 results were merged and "blinded" (i.e., search engine identification was removed). Active links in the results were scored for relevance on a four-point scale. Results are presented with detailed explanations, and the analysis examines the effects of collapsing the relevance scale, weighting for item position in the list, adjusting for when results were fewer than 20, and penalizing for duplicates.

The performance evaluation literature is growing, although Su (1997) notes the absence of a systematic approach, points out the lack of consistency between researchers in choosing what to measure and how to mea-

FIG. 1. Lycos Pro Power panel.

searches across four search engines to illustrate a comparison of advanced features, indexing depth, and quality of help. While the low number of queries and other design factors prohibit any valid statistical analysis, the in-depth discussion of search results is illuminating.

Chu and Rosenthal (1996) evaluated AltaVista, Lycos, and Excite using 10 queries derived from reference questions, and using available command features for each engine. Relevance judgments for the first 10 results from each engine formed the basis for precision calculations. In the context of developing a meta-engine, Gauch and Wang (1996) calculated a "confidence factor" for six search engines based on 25 queries, taking into account not only precision in the first 10 results, but also ranking accuracy. Schlichting and Nilsen's (1996) work evaluated AltaVista, Excite, Infoseek, and Lycos based on searches using from four to six keywords gathered from topics submitted by academic faculty. First-10 results were merged, and scored by subjects using a seven-point scale for useful items (this is one of the few studies to use user rather than researcher relevance assessments). Search engine ratings incorporated not only relevant and non-relevant retrieved, but also relevant and non-relevant missed. Ding and Marchionini (1996) compared Infoseek, Lycos, and Open Text for precision, duplication, link validation, and degree of overlap. Five complex queries were run against each engine, and relevance judg-


I don't have to do searches per se, directly using these powerful indexing machines. . . . I want to use them indirectly. I want my agent to use them directly and I want to see the results, and I want to see the results in a way that's unintrusive and helpful. (p. 70)

He also refers to the need for personalized views of the Internet.

Why do we have the same experience, the same view, when we sit down at our respective browsers? . . . We have an impersonal view of that sea of information. I want a personal view that zeros me in on the one percent of the Internet that I care about. (p. 72)

Toolboxes for users, truly intelligent agents, customized personal views of the Internet, automated digital librarians—these are certainly desirable. Developments on a number of research and industrial fronts look in these directions. Rudimentary agents and push services have been available for several years, and much is being learned about filtering and rule-based resource discovery (Hermans, 1997). Investigations into visualization of large document spaces hold promise for offering methods of summarizing search engine results or database content—examples of applications can be seen in the online abstracts of the 34th Annual Clinic on Library Applications of Data Processing (Visualizing subject access, 1997). The annual Text Retrieval Conferences (TRECs) have encouraged continued work on natural language processing and statistical retrieval, laying the foundations for improvements in query processing and ranking (National Institute of Standards and Technology, 1997). Automated sound and image indexing (including moving images) can extend the search engine resource pool, and enhance media representation (Stix, 1997). Applications of these research areas can already be seen in digital libraries and in search tools developed for intranets. However, general-content search engines have been taking small steps in the same directions.

Personalization

Services already exist that are targeted to particular users (college students, senior citizens, and so on) and in specific subject areas or information types (such as yellow pages data). The ability to attract advertising for products directed to an identifiable market presumably defrays the costs of resource selection, evaluation, and description. At the least, a restricted and selected database is less likely to render overwhelming search results. The information needs of a known user group may be easier to predict and, in some cases, the data are better defined. These conditions support more parameter-rich templating in query forms, and purpose-built stable classification schemes for browsing. Even general-content search engines can offer users some customization and profiling.

sure it, and laments the absence of the end user from most such studies. Investigations are largely concerned with precision, since true recall is somewhat difficult in a Web environment. For that matter, true precision is elusive as well, given search results on the order of several thousand ranked items. Most studies take a practical "first-10" or "first-20" approach, assessing relevance only for the top of the ranked list, and in most cases researchers rather than users make relevance judgments. Even so, there is something to be said for attempting more than a comparison of features and a personal interpretation of effectiveness based on a small sample.

The purpose of almost all evaluation literature is to determine the best engine. The outcomes usually indicate that differences in performance among the best two or three are not large, and that different engines serve different search purposes. In any event, search engines introduce new features so frequently that many observations are almost obsolete by the time they are published. In their excellent summary of search service comparisons and evaluations, Barry and Richardson (1996) tabulate extracts from conclusions drawn in 11 different studies—the general gist is that no one service is "best," and that serious searchers should routinely use more than one.

The Future

One has to agree with Berghel that "search engines as they now exist represent a primitive, first cut at efficient information access on the Internet" (1997, p. 20). Berghel goes on to say that the fault lies not so much with search engines, but with the characteristics of the resources which they attempt to index, which he characterizes as "more wheat than chaff" (p. 21). He is not the first to suggest that exploring alternative methods for networked information discovery may be more fruitful than refining search engines. Berghel's alternatives include information agents, information customization, information providers whose "brands" are associated with value-added resource description and access, and push services. Larsen (1997) agrees that there is only so much to be accomplished by tweaking search engines, which, after all, evolved from a world where documents are generally homogeneous and well structured: "Increased document and information density resists discrimination by traditional search technologies. . . . Increased complexity of search tools is not likely to significantly assist the average Web searcher, whose queries include little more than two terms." He suggests that we turn our attention to developing tools to help users define an information space through which they can browse, rather than trying to help them zero in on the perfect answer. Jim White (1996) of General Magic imagines—and hopes to develop—agents that will take over the task of interacting with search engines.


popular search engines at the moment can be seen with AltaVista's Refine (Fig. 3), developed by Francois Bourdoncle. The results of a search can be refined with a list or a graph. The list shows additional search topics (labels followed by keywords) which occur frequently in the retrieved items, ranked by relevance, while the graph shows the same topics in connected clusters, with popup lists of keywords whose strength of association is indicated visually. Topics (or keywords in topics, in the case of the graph) can be selected for adding to a reexecuted search. Infohiway's Surf-N-Search default record display includes "fuzzy links," which presumably are high frequency words from the resource, and which can be used in a new search. Excite automatically extends a search by adding terms which are strongly statistically associated with query words, and a Search Wizard suggests additional terms for narrowing the results of a query.
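Statistical expansion of the sort Excite and Refine perform can be illustrated, in very reduced form, by counting terms that co-occur with the query words in the retrieved set; the sample documents below are invented, and real services use far richer association measures.

# A small statistical query-expansion sketch: terms that frequently co-occur
# with the query words in the retrieved set are suggested as additions.
from collections import Counter

def expansion_candidates(query_terms, retrieved_docs, k=3):
    query = set(query_terms)
    counts = Counter()
    for doc in retrieved_docs:
        tokens = set(doc.lower().split())
        if query & tokens:                      # doc matches at least one query term
            counts.update(tokens - query)       # count co-occurring non-query terms
    return [term for term, _ in counts.most_common(k)]

docs = [
    "jaguar speed in the rainforest",
    "jaguar habitat and rainforest prey",
    "classic jaguar cars for sale",
]
print(expansion_candidates(["jaguar"], docs))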

Coverage

New formats, increasing deployment of multimedia resources, and evolving markup standards cause problems for resource discovery agents (Andrews, 1997a). Those search service providers that incorporate access to media do so primarily on the basis of file name extensions and text extracted from contextual materials. HotBot searches, for example, can be restricted to specific media types (image, Shockwave, JavaScript, Java, audio, Acrobat, VBScript, ActiveX, video, and VRML). AltaVista's fielded search labels include "image:" and "applet:". Lycos' specialized picture and sound retrieval utility appears to be based on ALT text and file names. Infoseek's Imageseek also seems to use contextual text for retrieval, but results are displayed in thumbnails, and browsable categories of thumbnails are provided as well. Advances in abstracting video, as well as research in indexing image content (see, for example, the work of Swain, Frankel, & Athitsos, 1996), should result in expanded access to multimedia files in search engines.

Metadata

Returning to the theme of "less may be better," attempts to index Web resources using controlled subject analysis and entry convention tools have resulted in some substantial, specialized metadata collections which may play a greater role in general resource discovery in the future (Lynch, 1997). A good example of this is found in the Access to Network Resources project, part of the UK-based eLib program (Electronic Libraries Programme, 1997). In this project, a number of gateways have been developed for different subject areas (art and architecture, business education, engineering, medical information, social sciences, and so on). Each gateway is charged with selecting high quality network resources and creating metadata records containing descriptive information (including keywords, specialized thesauri, and classi-

HotBot and AltaVista, for example, let users save and reload search preferences such as date, domain, language, results format, media type, and so on. Excite will develop a personal "channel" for an individual, based on demographic data and identification of areas of interest. Once established, the personal channel at the Excite site displays customized information and Web links.

Another, very basic, aspect of "personal service" is knowing what searching the user has just done. The ability to modify search strategies is something we take for granted in commercial online searching, and lose once we move to the stateless world of Web searching. Many popular search engines now return search results preceded by a link to a page suggesting methods for improving search results, but these help files are generic rather than specific to the particular query. Just recently, several search engines have introduced methods of giving the appearance of retaining search sets for modification. Infoseek offers a "search these results" option once hits have been returned, and a pipe command (e.g., cats|food) which searches for the second word in items which contain the first. Neither of these is more than a Boolean AND (although reversing the term order in a piped statement affects ranking). Neither is at all the same as being able to manipulate sets. Still, perhaps it represents a perception that certain types of searchers need these kinds of capabilities.

Summarization

Whit Andrews (1996) paraphrases Nick Lethaby of Verity, Inc., on the topic of search engines:

Users don't want to interact with a search engine much beyond keying in a few words and letting it set out results. That puts the burden on the engine's vendor to make a product that gives the user the ability to find the general set of documents he or she needs without checking each document one at a time. (p. 42)

One way of doing that is to summarize results, so that at least users can identify smaller sets for detailed inspection. Inference Find appears to use a combination of domain names and <TITLE> text to organize results into categories. As noted earlier, several search engines permit results to be sorted by site (i.e., URL), but Inference Find goes the extra step of providing meaningful category labels. Another way to reduce overload is to decrease the size of the dataset. Yahoo and Infoseek both offer query search restricted to specific categories in their classified collections of Web pages. Northern Light is exploring both automatic and intellectual means to divide search results into useful folders (Fig. 2).
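Grouping hits by site is the simplest version of this kind of summarization; a sketch of that idea follows (the URLs are only examples, and real services add more meaningful category labels).

# A small result-grouping sketch: hits are bucketed by site (domain), giving
# the searcher labeled piles to inspect rather than one long ranked list.
from collections import defaultdict
from urllib.parse import urlparse

def group_by_site(result_urls):
    groups = defaultdict(list)
    for url in result_urls:
        groups[urlparse(url).netloc].append(url)   # bucket label = host name
    return dict(groups)

results = [
    "http://www.simmons.edu/~schwartz/mysearch.html",
    "http://www.simmons.edu/~schwartz/mysub.html",
    "http://searchenginewatch.com/",
]
print(group_by_site(results))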

Query Expansion

Quite a few search engines have recently added query expansion tools. One of the most sophisticated among


classification, alphabetical subject indexing, and authority control to Web resources—brief descriptions and links to many of these are maintained by McKiernan (1997) and Schwartz (1997). Many are designed for a particular audience—an academic community, users of a public library, and so on. Some are more general in scope. OCLC Inc. supports two such projects: NetFirst, the commercial database of network resources, and InterCat, the product

FIG. 2. Northern Light’s folders.

fication notation) to facilitate searching and browsing. Most gateways are mounted using ROADS (Resource Organisation And Discovery in Subject-based services), which provides a set of software tools and a standards framework. Users have a more or less consistent view from gateway to gateway, and are provided with effective discovery tools for both query search and browsing.

There are a number of other worthy projects that add

FIG. 3. AltaVista’s Refine.


nally, there is some indication that networked information discovery will extend to both large and small collections of representations. Rarely, at least in the field of information science, do the interests of academic research and commercial product development so closely coincide.

Appendix: Additional Sources

Links to most of the search services and sites mentioned in this review are provided on the following pages maintained by the author:
LIS on the Web <http://www.simmons.edu/~schwartz/mylis.html>
Search services <http://www.simmons.edu/~schwartz/mysearch.html>
Subject access on the Web <http://www.simmons.edu/~schwartz/mysub.html>

References

Andrews, W. (1996). Search engines gain tools for sifting content on the fly. Web Week, 2(11), 41–42.

Andrews, W. (1997a). Searching questions. Web Week, 3(28), 1, 44.

Andrews, W. (1997b, March 24). Windfall for Netscape: Search engines to pay $70M for inclusion on browser buttons [Online]. Web Week, 3(7). Available: http://www.webweek.com/97Mar24/news/windfall.html [1997, September 2].

Barry, T., & Richardson, J. (1996, November 5). Indexing the Net. A review of indexing tools [Online]. Available: http://www.scu.edu.au/sponsored/ausweb/ausweb96/educn/barry1/paper.html [1997, September 2].

Berghel, H. (1997). Cyberspace 2000: Dealing with information overload. Communications of the ACM, 40(2), 19–24.

Cailliau, R. (1995, October 3). A little history of the World Wide Web [Online]. Available: http://www.w3.org/History.html [1997, September 8].

Chu, H., & Rosenthal, M. R. (1996). Search engines for the World Wide Web: A comparative study and evaluation methodology. In S. Hardin (Ed.), Global complexity: Information, chaos, and control: Proceedings of the 59th ASIS Annual Meeting (pp. 127–135). Medford, NJ: Information Today. Also available: http://www.asis.org/annual-96/ElectronicProceedings/chu.html [1997, September 8].

Courtois, M. P., Baer, W. M., & Stark, M. (1995). Cool tools for searching the Web. Online, 19(6), 14–32.

Ding, W. I., & Marchionini, G. (1996). A comparative study of Web search service performance. In S. Hardin (Ed.), Global complexity: Information, chaos, and control: Proceedings of the 59th ASIS Annual Meeting (pp. 136–142). Medford, NJ: Information Today.

Electronic Libraries Programme. (1997, June 17). Project details [Online]. Available: http://www.ukoln.ac.uk/services/elib/projects/ [1997, September 11].

Feldman, S. (1997, August 29). Just the answers, please: Choosing a Web search service [Online]. Searcher Magazine. Available: http://www.infotoday.com/searcher/may/story3.htm [1997, September 2].

Gauch, S., & Wang, G. (1996, September 8). Information fusion with ProFusion [Online]. Available: http://www.csbs.utsa.edu:80/info/webnet96/html/155.htm [1997, September 8]. (Presented at WebNet '96.)

Gray, M. (1996, December 6). Internet statistics: Growth and usage of the Web and the Internet [Online]. Available: http://www.mit.edu/people/mkgray/net/ [1997, September 8].

Haskin, D. (1997, September). The right search engine: IW Labs test [Online]. Internet World, 9. Available: http://www.iw.com/1997/09/report.html [1997, September 2].

of an international cooperative experiment in applying cataloging tools to Internet objects. Metadata records in the University of California's scholarly resource collection called INFOMINE include keywords, subject headings, and annotations. A growing community of information providers is applying the Dublin Core and other emerging metadata standards for resource description (International Federation of Library Associations and Institutions, 1997), and funded digital library projects in the U.S. and elsewhere deploy publicly accessible customized retrieval tools for access to collections of largely scholarly materials.

The advantages of services such as these are many—enhanced representations, well-developed search and browsing tools, quality control at the selection stage, and results which are likely to exhibit better precision and recall. The labor costs are high, so this is not a likely model for general-content search engines. The content is, of course, specialized by region, subject, or audience, but, even so, it is probable that some percentage of the queries put to general search engines would be better served by these smaller, more easily navigated collections. Unfortunately, they are not in the public eye, certainly not to the degree that Yahoo, AltaVista, HotBot, and others of that ilk are. Furthermore, search engine agents do not generally retrieve the metadata content, since results are created on the fly in response to queries. This content, along with Web-accessible OPAC records, and any other data pulled from a database and converted into HTML in response to a local search, are some examples of why the claim to index "everything on the Web" is a gross exaggeration by general-content search engines. However, it appears that networked discovery may move towards a distributed model, and that may bring about some dramatic changes. Infoseek has just received a patent on a "novel technique for performing searches of Web sites on the Internet" (Infoseek Corporation, 1997). This enables results from different search engines (and DIALOG is one cited example) to be merged and ranked at the client end. The retrieval issue then becomes one of identifying the search engines most likely to be fruitful for a specific query, and this is where smaller, specialized, enhanced collections may come into play.

Concluding Remarks

We are at an interesting moment in information retrieval research history. It is computationally feasible to run fairly complex retrieval and ranking algorithms in large databases in a tolerable amount of real time. Existing collections of very large, if somewhat heterogeneous, databases are owned by corporations whose commercial interests are served by improvements in interface design and retrieval effectiveness. Concurrently, government and private funding initiatives support scholarly research into digital libraries, providing testbeds for exploring networked discovery and retrieval in controlled settings. Fi-


Hermans, B. (1997, March 3). Intelligent software agents on the Internet [Online]. First Monday, 2(3). Available: http://www.firstmonday.dk/issues/issue2_3/ch_123/index.html [1997, September 11].

Infoseek Corporation. (1997, September 8). Infoseek patents Internet search technique [Online]. Available: http://software.infoseek.com/patents/dist_search/press_release.html [1997, September 8].

International Federation of Library Associations and Institutions. (1997, August 5). Digital libraries: Metadata resources [Online]. Available: http://www.nlc-bnc.ca/ifla/II/metadata.htm [1997, September 10].

Kimmel, S. (1996). Robot-generated databases on the World Wide Web. Database, 19(1), 40–49.

Koch, T. (1997, June 10). Literature about search services [Online]. Available: http://www.ub2.lu.se/desire/radar/lit-about-search-services.html [1997, September 2].

Lake, M. (1997, August 10). 2nd Annual Search Engine Shoot-out: AltaVista, Excite, HotBot, and Infoseek square off [Online]. Available: http://www4.zdnet.com/pccomp/features/excl0997/sear/sear.html [1997, September 2].

Larsen, R. L. (1997, April). Relaxing assumptions, stretching the vision [Online]. D-Lib Magazine. Available: http://www.dlib.org/april97/04larsen.html [1997, September 6].

Leighton, H. V., & Srivastava, J. (1997, June 16). Precision among World Wide Web search services (search engines): AltaVista, Excite, HotBot, Infoseek, Lycos [Online]. Available: http://www.winona.msus.edu/is-f/library-f/webind2/webind2.htm [1997, September 2].

LEO: Librarians and Educators Online. (1997, June 17). Search Insider [Online]. Available: http://www.searchinsider.com/index.html [1997, September 2].

Lynch, C. (1997). Searching the Internet. Scientific American, 276(3), 52–56.

Maze, S., Moxley, D., & Smith, D. J. (1997). Authoritative guide to Web search engines. New York: Neal-Schuman.

McKiernan, G. (1997, August 28). Beyond bookmarks: Schemes for organizing the Web [Online]. Available: http://www.public.iastate.edu/~CYBERSTACKS/CTW.htm [1997, September 11].

McMurdo, G. (1995). How the Internet was indexed. Journal of Information Science, 21, 479–489.

Melee Productions. (1997, September 1). Melee's indexing coverage analysis [Online]. Available: http://www.melee.com/mica/ [1997, September 2].

National Institute of Standards and Technology. (1997, July 11). Text REtrieval Conference (TREC) home page [Online]. Available: http://www-nlpir.nist.gov/TREC/ [1997, September 11].

Overton, R. (1996, September). Search engines get faster and faster, but not always better [Online]. PC World, 14. Available: http://www.pcworld.com/workstyles/online/articles/sep96/1409_engine.html [1997, September 2].

Peterson, R. E. (1997, February). Eight Internet search engines compared [Online]. First Monday, 2(2). Available: http://www.firstmonday.dk/issues/issue2_2/peterson/ [1997, September 2].

Pollock, A., & Hockley, A. (1997, March). What's wrong with Internet searching? [Online]. D-Lib Magazine. Available: http://www.dlib.org/dlib/march97/bt/03pollock.html [1997, September 6].

Randall, N. (1997, June 7). The search engine that could [Online]. Available: http://www.zdnet.com/pccomp/features/internet/search/ [1997, September 5].

Schlichting, A., & Nilsen, E. (1996, December 17). Signal detection analysis of WWW search engines [Online]. Available: http://www.microsoft.com/usability/webconf/schlichting/schlichting.htm [1997, September 2].

Schwartz, C. (1997, February 1). Subject access on the Web [Online]. Available: http://www.simmons.edu/~schwartz/mysub.html [1997, September 11].

Scoville, R. (1996, January). Special report: Find it on the Net! [Online]. PC World, 14(1). Available: http://www.pcworld.com/reprints/lycos.htm [1997, September 8].

Slot, M. (1996, December 5). The matrix of Internet catalogs and search engines [Online]. Available: http://www.ambrosiasw.com/~fprefect/matrix/matrix.shtml [1997, September 2].

Smith, D., Moxley, D., & Maze, S. (1997). Exploiting search engines. Business and Finance Bulletin, 105, 17–22.

Stix, G. (1997). Finding pictures on the Web. Scientific American, 276(3), 54–55. (Inset within Lynch, 1997.)

Stobart, S., & Kerridge, S. (1996, November 8). WWW search engine study [Online]. Available: http://osiris.sunderland.ac.uk/sst/se/ [1997, September 2].

Su, L. T. (1997). Developing a comprehensive and systematic model of user evaluation of Web-based search engines. In M. E. Williams (Ed.), National Online Meeting: Proceedings—1997 (pp. 335–345). Medford, NJ: Information Today.

Sullivan, D. (1997, August 5). Search Engine Watch [Online]. Available: http://searchenginewatch.com/ [1997, September 2].

Swain, M. J., Frankel, C., & Athitsos, V. (1996, July). WebSeer: An image search engine for the World Wide Web (University of Chicago Tech. Rep. TR-96-14) [Online]. Available: http://www.cs.uchicago.edu/~swain/pubs/TR-96-14.pdf [1997, September 2].

Tomaiuolo, N. G., & Packer, J. G. (1996a). An analysis of Internet search engines: Assessment of over 200 search queries. Computers in Libraries, 16(6), 58–62.

Tomaiuolo, N. G., & Packer, J. G. (1996b, May 20). Results of 200 subject searches in AltaVista, Infoseek, Lycos, Magellan and Point, performed Oct. to Dec. 1995 [Online]. Available: http://neal.ctstateu.edu:2001/htdocs/websearch.html [1997, September 2].

Venditto, G. (1996). Search engine showdown. Internet World, 7(5), 79–86.

Visualizing subject access for 21st century information resources [Online]. (1997, February 12). Available: http://edfu.lis.uiuc.edu/dpc/ [1997, September 11].

Westera, G. (1997, July 4). Robot-driven search engine evaluation: Overview [Online]. Available: http://www.curtin.edu.au/curtin/library/staffpages/gwpersonal/senginestudy/ [1997, September 2].

White, J. (1996). Tricks of the agent trade: General Magic conjures PDA agents [Interview]. Internet World, 7(5), 67–76.

ZDNet. (1997, August 21). Whole Web Catalog/Search [Online]. Available: http://www5.zdnet.com/zdwebcat/content/search/ [1997, September 2].

Zorn, P., Emanoil, M., Marshall, L., & Panek, M. (1996, May). Advanced searching: Tricks of the trade [Online]. Online, 21(3). Available: http://www.onlineinc.com/onlinemag/MayOL/zorn5.html [1997, September 2].
