Michael Hunter Reference Librarian Hobart and William Smith Colleges

Search and the Net at 2004Trends, Challenges and Cutting-Edge Developments in Internet Search ServicesMichael HunterReference LibrarianHobart and William Smith Colleges

for Rochester Regional Library CouncilMember Libraries StaffSponsored by the Rochester Regional Library Council

Supported by Library Services and Technology Act (LSTA) and/or Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library 2003

For Today .State of the Net and its UsersSearch Industry OverviewRecent Developments in Established ServicesNew ServicesThe Deep Web at 2004Tracking the Living Web: Weblogs and RSSCutting-edge DevelopmentsTrends and Challenges to Todays Search Services

The Internet and its Users at 2004

How large is the Web?What do you mean by the Web?The totality of all Web sitesSounds simple .BUT IS IT?

UC Berkeleys How Much Information Projecthttp://www.sims.berkeley.edu/research/projects/how-much-info-2003/internet.htmNOTE: 10 terabytes = total print collections of the Library of Congress

Internet Use Worldwide

Internet Use in the UShttp://www.pewinternet.org

Top Ten things our users do onlinehttp://www.pewinternet.org

ACTIVITY% E-mail92Use search eng. For specific question88Consumer info.83Get a map79Hobby info.76

Top Ten things our users do onlinehttp://www.pewinternet.org

Leisure info.73Weather75Get news69Instant message67Health/medical info.66

Undergraduates and Search EnginesColaric, S. Instruction for Web Searching: An Empirical Study College and Research Libraries 64 (2) March 2003 p. 111-116

QUESTION%YES%NO%DontKnowAll work the same way671023Engines look at all sites641818Term(s) need to match index581923Gathers sites using a crawler15996OR retrieves more than AND621820

The Internet Search Industry:ConsolidationPerformance MeasuresPopularity

The Shrinking Search IndustryEditorial control of search is shared among fewYahoo owns AlltheWeb, Altavista, Inktomi, Overture (paid listings)GoogleMSNAskJeeves owns TeomaLookSmart owns WisenutGigablastNOTE: Ownership is different from database affiliation

GoogleDatabase Affiliates

Database Freshnesshttp://www.searchengineshowdown.com/stats/freshness.shtmlBased on a series of 6 current topic searches Pages that are updated dailyAND report that date on the pageQueries submitted May 17, 2003

Database Freshnesshttp://www.searchengineshowdown.com/stats/freshness.shtml

Most have some results indexed in the last few days The bulk of most of the databases is about 1 month old Some pages may not have been re-indexed for much longer

Popularity: Searches per day self-reported data, as of 2/28/03http://searchenginewatch.com/reports/article.php/2156461

SERVICESearches, in MillionsGoogle250Overture167Inktomi80LookSmart45FindWhat33AskJeeves20AltaVista18AlltheWeb12

Recent Developments among Established Services

GoogleFrooglePhonebookWildcard WordsInfo:Synonym featureSupplemental IndexSearch by locationNews Advanced Search and News Alerts???

Froogle

Locates information about products for sale onlineGives URLs of sites offering the itemProvides links to exact page in the site where you can make the purchase

FroogleRanking follows normal Google ranking processesPaid placements always clearly markedPrice range limits availableAccess at http://froogle.google.com or via Google Advanced Search

Phonebook Command SearchSearches US residential (rphonebook:) and business (bphonebook:) listings of Yahoo, MapQuest and other servicesrphonebook:MUST INCLUDE Last nameCity and/or StateMAY INCLUDEFirst namebphonebook:MUST INCLUDEBusiness name (min. 1 word) City and/or State MAY INCLUDEFull Business name

Wildcard Words

Google offers a word-sized asterisk to function as a wildcardStands for a whole wordCannot be used for part of a wordthree * mice = 22,000three bl* mice = 0

Wildcard WordsSeveral * can be used together

milosevic International * * HagueRetrieves military tribunal OR military court OR war tribunal OR military tribunal

info: Not exactly hidden, but not well-knownSearches for any information Google has about a siteConvenient way to monitor linkageTyping a URL in the search box will give the same results

Synonym FeaturePlace a tilde ~ immediately before a term to retrieve synonyms or related terms from the Google IndexEliminate the original term by placing a minus sign before it.~hiking -hiking

Googles Supplemental Index

For obscure or unusual searchesQueried when Google fails to find good matches within its main web index.Live 9/9/03Sample queries:St. Andrews United Methodist Church Homewood ILnalanda residential junior college alumniillegal access error jdk 1.2b4supercilious supernovas

Search by Location (beta)http://labs.google.com/locationU.S. onlyKeyword(s) combined with address, city, state or zipSearch results appear on a map

News Advanced Searchand News AlertsAdvanced News Search added this FallNews AlertsRequires a (free) accountOne query per alert; limit of 50 alerts per e-mail addressAlerts contain links to news containing your alert keywordsCannot edit a query; delete and create a new one insteadAlerts sent once a day or as it happens

More about Google.Google World http://indicateur.comMaintained by a French Search Engine Site and listed under Guides. Use Google translator (see Language Tools) to translate the site)Google Lab http://labs.google.comPlace for cutting edge developments, many in beta awaiting user feedback and testing.

Beyond Google: AskJeevesSimpler, cleaner interfaceTeoma crawler-based results blended with AJ answersImproved image databaseSmart AnswersPopular queries mapped to news, image and other sources appropriate to the query

ATW (FAST)http://alltheweb.comContinued commitment to a large database (2nd to Google)Powerful, new advanced search capabilitiesExtensive page customization optionsResults clustered by topic (Folders)Both HTML and Multimedia given, when available NOTE: Folders located at the BOTTOM of each results screen

AltavistaSimpler interfaceMore language optionsExpanded image and multimedia collectionsResults labeledRefreshed in last 48 hours Includes PDF filesUS and Local search optionsPrisma query refinement

AltavistaPrisma Query RefinementOffers a maximum of 12 terms having the strongest associations with the original query term(s)Selected from the top 50 results of the original queryNOTE: Clicking on a Prisma term adds it to your original query, creating a new set of Prisma terms.Similar to Refine (1997) but less graphic

TeomaRanking Includes a sites relationship to other sites with similar contentResultsRanked database results, with Related PagesRefineClustering of your results and other related sites based on term relationships and web community linkages derived from your original resultsResourcesLink Collections from experts and enthusiasts(Subject metasites)

HotbotSearches Hotbot (Inktomi) OR Google OR Lycos OR AskJeevesNot a true metaengineAdvanced features operable only if supported by source engines

MetacrawlerAlong with Dogpile and Webcrawler, owned by InfospaceSimpler interfaceOffers the following customizations:Selection of sources searchedTotal number of results retrievedLength of search (time-out period)Offers a wide range of vertical searches: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards

New Services Attracting Attention

GigablastLaunched April, 2002Smaller database than othersOver 200 million on 10/4/03pope canterbury Google:83,200 Gigablast:24,919Created and maintained by Matt Wells (alone)Only search engine continuously updated with index refreshed in real time (Site submissions are immediately searchable)Ranking depends less on linkage than Googles ranking, to avoid penalizing newer pages.No advertising (to date)

Gigablast Search FeaturesBasic search Full Boolean Advanced Search: Full Boolean and 2 (!) phrase boxesLimit by siteLimit by domain (URL)Links to a page availableMost generic html metatags indexed, searched and made available for displayUnique to Gigablast!!!

Gigablast Search FeaturesField searches include title, IP address and non-html filetypes:PDF, Word, Excel, PPT, PostScript, Ascii TextResults from one site clusteredCached version availableResults include date indexed and last modified (!!)Linking to Gigablast improves ranking there

KillerInfohttp://www.killerinfo.comMetaengine searching Google, AOL, Lycos, Gigablast, MSN, Altavista, LookSmart and Open Directory9 topical Deep Web channels offeredBoolean and phrase searchNo other Advanced Search featuresResults clustering (a la Vivisimo)Number of results not givenAdult content filter

Surfwaxhttp://surfwax.comDemo site for federated search softwareSimultaneous search of Deep Web, Intranets, Web and moreMetaengine searches Wisenut, AOL, MSN, Yahoo, Incarta, CNN, LookSmartFOCUS search refinement featureOnline thesaurus of related terms and definitions

Surfwaxhttp://surfwax.comSite SNAP of a result offersAuthor summary (from metatags)Related sitesSites FOCUS wordsKey Points (query-related sections)Results ranking options: Relevance, Alpha and SourcePreferences and Advanced Features require a (free) account; more options available to fee-based accounts

Nutchhttp://nutch.orgProject to implement an open source web search engineWhy open source?With open source, search results processing is transparent, not hidden. Bias (if any) can be examined by anyone.Open source applications are free and available for use, modification or for-profit use. Users are asked to contribute their innovations back to the code baseNutch is seeking volunteer developers and donations

The Deep Web at 2004

The Topography of the Internetor The Layers of the WebMapping the web is challengingUnregulated in natureInfluences from all over the globeFulfills many purposes, from personal to commercialChanges rapidly and unexpectedlyDivisions and terminology are inherently ambiguous eg. Deep vs Invisible Web

May I suggest a biological, nautical metaphor, perhaps the ocean?SURFACE WEBSHALLOW WEBOPAQUE WEBDEEP WEBDARK WEB

Surface WebStatic html documents

Crawler-accessible

Shallow WebStatic html documents loaded on servers that use ColdFusion or Lotus Domino or other similar softwareA different URL for the same page is created each time it is served.Crawlers skip these to avoid multiple copies of the same page in their databaseTechnically human accessible via directories, Deep Web gateways or links from other sites

Opaque WebStatic html documentsTechnically crawler accessible2 types: Downloaded and indexed by crawlerNot downloaded or indexed by crawler

Opaque WebDownloaded and indexed by crawlerBuried in search results you never look atA casualty of relevance rankingNot downloaded or indexed by crawler due to programmed download limitsDocument buried deep in the sitePart of a large document that did not get downloaded (Typical crawl per page is 110 K or less)Document added since last crawler visit (Even the best revisit on an average of every 2 weeks, depending on amount of change at a site)

Opaque WebAccess to the Opaque Web Specialized search enginesGeneral and specialized directoriesSubject metasitesThese services typically index more thoroughly and more often than large, general search engines

Deep WebTechnically inaccessible to crawlersDynamically created pagesDatabasesNon-textual filesPassword protected sitesSites prohibiting crawlersTechnically accessible to crawlersTextual files in non-html formats

Dark Webhttp://research.arbornetwords.comUp to 5% of the web is completely unreachable due toMisconfigured routersContractual disputes between ISPsBroadband users with personal or corporate firewallsUS Military sites

UC Berkeleys How Much Information Projecthttp://www.sims.berkeley.edu/research/projects/how-much-info-2003/internet.htmNOTE: 10 terabytes = total print collections of the Library of Congress

http://www.sims.berkeley.edu/research/projects/how-much-info-2003/internet.htm

Reducing the Deep Web:mod_rewriteMaking dynamic pages available to crawlersMod_rewrite software loaded onto a web server containing dynamic pages (databases, etc)Crawler follows a link to a stable URL on the server www.mydomain.com/dvdplayers.htmlMod_rewrite searches all the servers dynamic pages containing dvdplayers and creates temporary pages with stable URLs. These pages are linked to each other, creating a stream of virtual pages that can be crawled by any of the search enginesSearch engines often check the stream for spam or duplicate pages

Mining the Deep Web:Directed Query Engines or Intelligent AgentsDesigned to access distributed Deep Web resourcesSome can be configured to search specific URLsDatabasesSubject metasitesreport collectionsdynamic pagesonline newsletters

Directed Query Engines for purchaseSimultaneous search of Deep Web and other resources with many additional featuresLexibot http://www.lexibot.comIf you complete survey: $189 upgrades $15If you dont: $289 upgrades $50BullsEye http://info.intelliseek.comBullsEye Pro:$199 with free upgrades for 6 months

Hunters Maximfor the Deep Web

Plan to first locate the category of information you want, then browse. Dont be too specific in your searches. Cast a wide net.

TRACKING THE LIVING WEB:WEBLOGS AND RSS FEEDS

Blogs: What are they?Online diaries or journals, usually by one person, though many invite commentsFirst developed in 1997Within the same blog tone can range from personal musings to discussion of recent issues in technology and researchHigh link-to-word ratioOften link to other weblogs of similar content

Blogs: What are they?Can contain rumor, inside information, speculation, blatant errors as well asBreaking news: political and technical/researchCommentary on new software or websitesConsumer reaction to products or servicesBlog authoring tools are basic content management software, useful in ways other than online diariesTypify the spirit of information sharing that has fueled the Internet since its beginnings

How large is the blogosphere?2.4 to 2.9 million active blogs (est.)

Whos blogging?Jupiter Research2% of Internet users have created a blogAbout 50% women, 50% menOver 50% are in English; remaining language, in order of prevalence:Portuguese, Polish, Farsi, French, Spanish, German, Italian, Dutch and Icelandic

More About 4% of Internet users read blogs, 60% men, 40% womenOn average, blogs are updated every 3 days About 4% of online Americans have gone to blogs for information about the Iraq WarLiveJournal (large blog host) was the 650th most popular site on the Internet (May, 2003)184,000 readers every 10 daysSpend average of 22 minutes at the site

Creating a Blog Blogger http://new.blogger.comFree, automated Web publishing toolRequires no new softwareSend posts to an existing website or create a free blog at BloggerProvide a site template and where you want the postings to appearTo update, create posting, submit permission form and Blogger will sent FTPAdvanced options available

Locating BlogsBlog Hosting Siteswww.livejournal.comdiaryland.comradio.userland.com ($39.95 with added features)Blog metasiteswww.lights.com (library-related, world-wide)www.blogrunner.comwww.llrx.com/columns/notes46.htmportal.eatonweb.com/

Locating BlogsSubject Directoriesdmoz.org/Computers/Internet/On_the_WebGeneral Search EnginesBlog keyword(s) or URL(bloghost) keyword(s)Professional Association homepagesSubject MetasitesUse Teoma.com Resources

Searching Blog ContentBlog hosting siteswww.livejournal.comBlog Search EnginesFeedster.com (includes RSS feeds also)Daypop.com (current events)Blogdex.media.mitwww.technorati.comblogging-news.infoTopical Blog Search EnginesDetod (blawgs.detod.com) Exclusively legal weblogs

Blogs and General Search EnginesBlog-rich sites are increasingly visited by major crawler-based search services

HOWEVER

ANY rapidly-changing content can easily be missed by crawlers

Obstacles to Crawling and Indexing Blog ContentOnly the most recent postings appear on the blog homepage (older are archived, and inaccessible to crawlers)Many bloggers post dozens of times a dayFrequent postings may contain critical information to time-sensitive topicsEven a daily crawl would miss these postings (typical crawl is about once every 3 weeks)

Obstacles to Crawling and Indexing Blog Content Page DesignSeveral postings usually appear on the blog homepagePostings are NOT indexed separately, as crawler indexes the page as a wholeRetrieval of an individual posting on a topic is unreliable

Blogs and LibrariesBlogs can offer an opportunity to post content on the Web quicklyno delay of FTP uploading or submission to a webmasterWhats NewFavorite BooksRecent AcquisitionsProgram Changes due to the Weather

Blogs and LibrariesGet more people involved in posting content on the Library (or library-sponsored) websiteNo knowledge of html, RSS or XML neededLog onto the blog hosting website, create content, and update the pageCurrent awareness without the annoyance of un-wanted e-mailsChoose when YOU want that information by visiting your blogs of choice

Blogs and Libraries:MetasitesBlogs and Libraries: A Bibliography (online)http://www.etches-johnson.com/nolibrary/bib.htmlLibrary Weblog Directoryhttp://www.libdex.com/weblogs.htmlBlogs at the University of Minnesota Librarieshttp://www.lib.umn.edu/san/mt/Fichter, D. (2003). Why and how to use blogs to promote your library's services. Marketing Library Services 17(6). http://www.infotoday.com/mls/nov03/fichter.shtml

RSSRich Site SummariesReally Simple SyndicationReally Stops Spam

Before RSS:Tracking latest news and site updatesSoftware packages that monitored and reported changes at sites of your choosingNews alert services, free and feeManual checking of your bookmarksHit or miss Listserv and Usenet postings

RSS: What is it?XML filetype with content that isStructured (tags, standard and/or author-defined)Re-useable (can be integrated into web, e-mail, multimedia and many other formatsOriginally developed by Netscape as a content management tool for personalizing home pagesMy News My Sports My WeatherRSS in detail http://blogs.law.harvard.edu/tech/rss

RSS: What can it do?Creates a broadcast version of frequently updated content from a website, blog, news page or other sourceAuthors can Summarize new contentBroadcast new content eg. online newslettersCan be used as a way to distribute content to subscribers (syndication) independent of e-mail. Subscribers logon or access via aggregators.

How do I access them?As RSS is in XML, may require downloading reader software (older versions of browsers cannot read XML). Sources for reader software includewww.lights.comblogspace.comSites with RSS feeds display a small icon (usually orange) labeled RSS or XMLGeneral search engines (limited, but worth a try)filetype:xml keyword(s)

RSS Directories and Search EnginesSyndic8 syndic8.comDirectory of available syndicated news feedsProvides no reading areaUses Open Directory classificationFeedster www.feedster.com The best search engine for blogs and RSS feedsYahoo news.yahoo.com/rssCanadian Government tinyurl.com/vrh7Often found in Blog Directories and Engines

RSS aggregatorsReceive general or topical RSS feeds and blog postingsMany are focused on news onlyPresent content in compact formCombine multiple sources in one interfaceProvide links to full contentIn personal desktop versions or online

Personal desktop aggregatorsLets you specify any feeds you want access toAmpheta Desk www.disobey.com/amphetadesk/Radiouserland radio.userland.com ($$)Feedreader feedreader.com

Feedreader.com

Online aggregatorsSelection of feeds may be limitedNewsIsFree NewsIsFree.com7379 sources grouped into 16 channelsCreate custom pages$$ offers more Premium optionsMany RSS sites include links to other aggregators

Authoring and Producing RSS Lockergnome rss.lockergnome.comDocuments, tools, developers, aggregators, free feed generator for you siteRSS Primer for Publishers www.eevl.ac.uk/rss_primer/Producing RSS feedsTechnical informationFeed promotionFeedster www.feedster.com

Blogs and RSSBlogs may offer some or all of their content as RSS feeds, or notBlogs can exist as pure html documents, updated frequentlyMaking content available in RSS increases a blogs access and exposure via aggregators and other RSS-based search services

The Living WebWhat can blogs and RSS feeds tell us about an authors point of view?Which ones does an author list on their blog/homepage?Which ones does an author visit/subscribe to?Sometimes I want to know what the world thinksGOOGLESometimes I want to know what I thinkMY WEBLOGSometimes I want to know what those I respect thinkBLOGS AND FEEDS I READ

Beyond todays(free) search engines:Cutting edge developments

Including Context in System DesignContext matters (!!??!)Textual contextQuery context Who is asking and why?Traditional approaches to retrieval have been deductiveData organized and mapped to anticipated query terms (controlled vocabularies, taxonomies)Human created and maintainedToo slow for rapid data streams

Bayesian approachesUses statistical inference based on Bayes Theorem of Probability (Thomas Bayes, 1702-1761)Inductive approach (adaptive processing)Take the users information environmentInfer structures, relationships, likely queriesInferred structures and relationships can then be mapped to a human-created classification schemeCurrently used in corporate intranet and fee-based content management softwareWill be used more in general information systems of the future

Adaptive ProcessingLearning the searchers interestsWhat term(s) did you search?What did you select?How long did you look at it?What is its source?How old was it?Direct input from searcherRank the sourcesRate individual resultsEliminate certain sources, sites

Inquirushttp://inquirus.nj.nec.comQuery interface research projectAttempts to improve precision of resultsMonitors users search behavior to infer intent of queriesRe-formulates queries to increase likelihood of desired answers

Inquirushttp://inquirus.nj.nec.comUSER: How do you make salsa?SYSTEM: salsa and (recipe or ingredients or food)Eliminates pages on salsa dancing

Ranking relies heavily on proximity of query terms and system-provided cognates to each other in the document

Vector-Space Model3-dimensional retrievalA way of ordering documents by word frequency/context in a term spaceand matching them to queriesDocuments are assigned coordinatesOne document may be in many term spacesor vectorsQueries that fall within a given vector are likely to be answered by documents located in that vector

A Multi-dimensional BooleanBoolean limited to term matches

Vector-space modelMore complex relationships can be mappedDegrees of relatedness of document to queryQuery and document weights based on length and direction of their vectorsfemaleterrierpuppy

Documents in Vector SpaceWhat do you have on movie stars diets?STARDoc about astronomyDoc about movie starsDoc about mammal behaviorDIET

Phibothttp://phibot.orgProject of the Univ. of Mainz and German Institute of Artificial IntelligenceCrawls science, medicine and news web sites`200 million general science sites70 million medical sitesTraditional: Google-like processingVector-SpaceOptimization: greater vector-space processing

Digital Video SearchSearches actual visual contentProject of Dublin City Universityhttp://www.cdvp.dcu.ieDetermine structure of the video by identifying shots with the greatest degree of change (keyframes)Use these to create a structure, and allow user to refine query based on theseNeeded by journalists, governments and airport security

Current Trends in and Challenges to Todays Search Industry

User Interface TrendsToolbars, Toolbars, everywhereReview site: searchenginewatch.com/links/article.php/2156381Search by Location Major engines with local search options and local specialized onesMakes the haystack smaller; important in e-commerceP2P networks (Peer-to-peer)File-sharing networks, a la NapsterKaZaA - most popular download EVER!Shares any filetype90% of files shared are audio-visual in nature

User Interface TrendsApplication Program Interface (API)Published set of programming hooksthat lets you interact directly with a companys open serversYou can mine the companys databases for freeWHY? To attract more traffic to the siteExample http://www.googlerace.comEnter 1 or 2 terms/phrases and see how Bush and Democratic candidates stack up!Created by Tara Calishain

Search in Corporate Settings Drive Search Engine R&DUniform, seamless access to all information:Internal & external, data & contentXMLMore natural language processingHybrid systems to search structured AND unstructured dataAdaptive processing (Bayesian)Use of intelligent agent softwareEasier user interfacesPersonalization

Industry-wide TrendsDistributed CrawlingVolunteer your PC when not in useGrub.com, LooksmartSearch continues to be driven by advertising and revenueFewer services maintain their own crawler-created databaseIncreased crawling of non-html filetypes

Challenges to the IndustryRevenueE-content providers have cut into search software sales with their proprietary enginesFighting fraudCloaking, ranking manipulationScalabilitySize of surface Web increasesOver 300 million queries a day to all Web S.E.s

Challenges to the IndustryFreshnessCompetitive edge demands recent crawlsDeep WebEmbedded databasesNon-html filetypesReal-time informationGrowing importance of the Living Web

Challenges to the IndustryAmbiguous query refinementNot very hopeful among general search enginesUser group too largeUser profiling difficultIndexing the smaller, newer sitesGoogles link-based PageRank penalizes these sites

The Biggest Challenge:Just what are you looking for?A known needle in a known haystackA known needle in an unknown haystackAn unknown needle in an unknown haystackAny needle in a haystackThe sharpest needle in a haystackMost of the sharpest needles in a haystackAll the needles in a haystack

The Biggest Challenge:Just what are you looking for?Affirmation of no needles in the haystackThings like needles in any haystackLet me know if any new needles show upWhere are the haystacks?Needles, haystacks, .whatever

Thank You andHappy Holidays!

Michael HunterReference LibrarianHobart and William Smith CollegesGeneva, NY 14456

(315) [email protected]

Michael Hunter Reference Librarian Hobart and William Smith Colleges

Documents

Transcript of Michael Hunter Reference Librarian Hobart and William Smith Colleges