Mining the Michael Hunter Reference Librarian Hobart and William Smith Colleges For Western New York...

Mining the Michael Hunter Reference Librarian Hobart and William Smith Colleges For Western New York Library Resources Council Member Libraries’ Staff Sponsored by the Western New York Library Resources Council
  • date post

    21-Dec-2015

Transcript of Mining the Michael Hunter Reference Librarian Hobart and William Smith Colleges For Western New York...

Mining the

Michael Hunter
Reference Librarian
Hobart and William Smith Colleges

For Western New York Library Resources Council Member Libraries’ Staff

Sponsored by the Western New York Library Resources Council

For today . . .

From Web to Deep Web
Search Services: Genres and Differences
The Topography of the Internet
Mining the Deep Web: Techniques and Tips
Hands-on Session
Evaluating Deep Web Resources
Using Proprietary Software

Web to Deep Web

1991 – Gopher
• Menu-based, text only
• You had to KNOW the sites

1992 – Veronica
• Menus of menus
• Difficult to access

Web to Deep Web

1991 – Hyper-Text Markup Language (HTML)
• Linkage capability leads you to related information elsewhere

“Classic” Web Site
• Relatively stable content of static, separate documents or files
• Typically no larger than 1,000 documents navigated via static directory structures

Web to Deep Web

1994 – Lycos launched
• First crawler-based search engine, with a database of 54,000 html documents (CMU)

Growth of html documents unprecedented and unanticipated
• 2000 (April): “The Web is doubling in size every 8 months” (FAST)
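To get a feel for what "doubling every 8 months" implies, here is a small sketch of the arithmetic. It assumes steady exponential growth, which the FAST quote suggests but does not guarantee; the function name and the sample time spans are illustrative only.

```python
def growth_factor(months: float, doubling_period_months: float = 8.0) -> float:
    """Return how many times larger a collection becomes after `months`
    if it doubles once every `doubling_period_months` (assumed constant)."""
    return 2.0 ** (months / doubling_period_months)


# Example: growth over 8, 16 and 24 months at the quoted doubling rate.
for months in (8, 16, 24):
    print(months, "months ->", growth_factor(months), "times the starting size")
```

At that rate, two years of growth multiplies the number of documents eightfold, which helps explain why crawlers of the period struggled to keep up.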

Web to Deep Web

1996 – Three phenomena pivotal for the development of the Deep Web:

HTML-based database technology introduced
• Bluestone’s Sapphire/Web, Oracle

Commercialization of the Web
• Growth of home PC users and e-commerce

Web servers adapted to embrace “dynamic” serving of data
• Microsoft’s ASP, Unix PHP and others

Web to Deep Web

1998 – Deep Web comes of age

Larger sites redesigned with a database orientation rather than a static directory structure
• U.S. Bureau of the Census
• Securities and Exchange Commission
• Patent and Trademark Office

Search Services: Genres and Differences

Exclusively crawler-created
• Search engines
• Meta search engines

Human created and/or influenced
• Directories
• Specialized search engines
• Subject metasites
• Deep Web gateway sites

[Diagram: crawlers (CR) and web servers (WS) arranged around a DATABASE. Legend: CR - Crawler, WS - Web Server]

[Diagram: DATABASE, Search Engine, and Users 1 through 7]

Search Services: Exclusively Crawler Created

Database compiled through automated, link-dependent crawling and site submission

Unable to access
• Dynamically-created pages
• Proprietary, non-html filetypes
• Multimedia
• Software
• Password-protected sites
• Sites prohibiting crawlers (robots.txt exclusion)
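The robots.txt exclusion in the last bullet can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name "ExampleBot" and the example.com URLs below are made up for illustration; a well-behaved crawler would fetch the real file from the site it is visiting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is an invented crawler name.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/private/report.html"))
```

Anything under the disallowed path stays out of a compliant crawler's database, however useful the content may be.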

Dynamically-created Web pages

Created at the moment of the query using the most recent version of the database.

Database-driven

Require interaction
• Amazon.com: What titles are available? At what price? Are there recent reviews? What about shipping?

Used widely in e-commerce, news, statistical and other time-sensitive sites.
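A minimal sketch of what "created at the moment of the query" means in practice. It uses an in-memory SQLite database as a stand-in for a site's catalog; the table, the book titles and the prices are invented, and real sites use far more elaborate systems.

```python
import sqlite3


def make_catalog() -> sqlite3.Connection:
    """Build a tiny in-memory catalog (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (title TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO books VALUES (?, ?)",
        [("Deep Web Basics", 19.95), ("Gopher Revisited", 12.50)],
    )
    return conn


def render_results(conn: sqlite3.Connection, keyword: str) -> str:
    """Generate an HTML page from the current database contents."""
    rows = conn.execute(
        "SELECT title, price FROM books WHERE title LIKE ? ORDER BY title",
        (f"%{keyword}%",),
    ).fetchall()
    items = "".join(f"<li>{title}: ${price:.2f}</li>" for title, price in rows)
    return f"<html><body><ul>{items}</ul></body></html>"


catalog = make_catalog()
print(render_results(catalog, "Web"))
# Change the price; the next request produces a different page.
catalog.execute("UPDATE books SET price = 14.95 WHERE title = 'Deep Web Basics'")
print(render_results(catalog, "Web"))
```

No page exists until someone asks for it, so there is no file for a link-following crawler to find.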

Dynamically-created Web pages

Why can’t crawlers download them?

Technically they can interact, within limits of programming capability

Very costly and time-consuming for general search services

Dynamically-created Web pages

How can a crawler detect a dynamically-created page?
• From any of the following in the URL:
  ? , % , $ , = , ASP , PHP , CFM and others

proquest.umi.com/pqdweb?Did=000000209668731&Fmt=1&Deli=1&Mtd=1&Idx=5&Sid=1&RQT=309
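The URL test described above can be sketched as a simple heuristic. This is only an illustration of the idea on the slide, not the rule any particular crawler uses; the function name and the example.com URLs are hypothetical, and the marker lists cover only the items named above.

```python
from urllib.parse import urlsplit

# Markers taken from the slide; "and others" are not covered here.
DYNAMIC_CHARS = ("?", "%", "$", "=")
DYNAMIC_EXTENSIONS = (".asp", ".php", ".cfm")


def looks_dynamic(url: str) -> bool:
    """Guess whether a URL points to a dynamically-created page."""
    if any(ch in url for ch in DYNAMIC_CHARS):
        return True
    path = urlsplit(url).path.lower()
    return path.endswith(DYNAMIC_EXTENSIONS)


print(looks_dynamic("proquest.umi.com/pqdweb?Did=000000209668731&Fmt=1"))
print(looks_dynamic("http://www.example.com/catalog/search.asp"))
print(looks_dynamic("http://www.example.com/about.html"))
```

A crawler applying a test like this would skip the first two addresses and keep the third.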

Proprietary Filetypes

PDF
Spreadsheets
Word-processed documents

Google does it! Why can’t you?

Google’s Deep Web Components: Non-html filetypes (1.75%)

SEARCH SYNTAX
“california power shortage” filetype:pdf

Adobe Portable Document Format (pdf)
Adobe PostScript (ps)
Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk
Lotus WordPro (lwp)
MacWrite (mw)
Microsoft Excel (xls)
Microsoft PowerPoint (ppt)
Microsoft Word (doc)
Microsoft Works (wks, wps, wdb)
Microsoft Write (wri)
Rich Text Format (rtf)
Text (ans, txt)
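The search syntax above follows a fixed pattern: a quoted phrase followed by the `filetype:` operator and an extension from the list. A tiny helper, purely illustrative (the function name is made up), shows how such queries are put together:

```python
def filetype_query(phrase: str, extension: str) -> str:
    """Build a query of the form "phrase" filetype:ext."""
    ext = extension.lstrip(".").lower()
    return f'"{phrase}" filetype:{ext}'


print(filetype_query("california power shortage", "pdf"))
print(filetype_query("homeland security", ".PPT"))
```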

Google Non-html Filetypes: Warning!

FOR NON-HTML FILES
• Clicking on a title in the results list opens the application as well, involving risk of a virus or worm that may be attached to the file
• INSTEAD, click the “View as HTML” option; no applications will be opened and there is no risk of a virus or worm
• NOTE: Titles for non-html files are frequently not descriptive of content

“homeland security” filetype:ppt

Search Services: Human created or influenced

Directories – general and specialized
Specialized search engines
Subject metasites or gateways
Deep Web gateways

Search Services: Human created or influenced

Content of sites is examined and categorized, or crawling is human-focused and refined

CAN include sites with dynamically created pages

CAN be limited to database-driven sites (Deep Web)

CAN include non-html files

NOTE: Some specialized search engines may include little human influence, e.g. Search.edu

The Topography of the Internet
or
The Layers of the Web

Mapping the web is challenging
• Unregulated in nature
• Influences from all over the globe
• Fulfills many purposes, from personal to commercial
• Changes rapidly and unexpectedly

Divisions and terminology are inherently ambiguous, e.g. “Deep” vs “Invisible” Web

May I suggest a biological, nautical metaphor, perhaps the ocean?

SURFACE WEB

SHALLOW WEB

OPAQUE WEB

DEEP WEB

Surface Web

Static html documents

Crawler-accessible

Shallow Web

Static html documents loaded on servers that use ColdFusion or Lotus Domino or other similar software

A different URL for the same page is created each time it is served.

Crawlers skip these to avoid multiple copies of the same page in their database

Technically human accessible via directories, Deep Web gateways or links from other sites
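The duplication problem can be illustrated with a short sketch. The slide says crawlers simply skip such pages; the content-hash comparison below is a swapped-in technique used only to show why one page served under many URLs is troublesome. The URLs and page bodies are invented.

```python
import hashlib
from typing import Dict


def content_fingerprint(html: str) -> str:
    """Return a hash that is identical for identical page bodies."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def unique_pages(fetched: Dict[str, str]) -> Dict[str, str]:
    """Keep only the first URL seen for each distinct page body."""
    first_url_for = {}
    for url, html in fetched.items():
        first_url_for.setdefault(content_fingerprint(html), url)
    return {url: fetched[url] for url in first_url_for.values()}


# Hypothetical fetch results: the same page served under two session URLs.
fetched = {
    "http://www.example.com/page.cfm?session=111": "<html>Annual report</html>",
    "http://www.example.com/page.cfm?session=222": "<html>Annual report</html>",
    "http://www.example.com/contact.html": "<html>Contact us</html>",
}
print(list(unique_pages(fetched)))
```

Three addresses turn out to be only two distinct pages. Doing this comparison across an entire crawl is extra work, so skipping such servers altogether is the cheaper choice.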

Opaque Web

Static html documents

Technically crawler accessible

2 types:
• Downloaded and indexed by crawler
• Not downloaded or indexed by crawler

Opaque Web

Downloaded and indexed by crawler
• Buried in search results you never look at
• A casualty of “relevance” ranking

Not downloaded or indexed by crawler due to programmed download limits
• Document buried deep in the site
• Part of a large document that did not get downloaded (Typical crawl per page is 110 K or less)
• Document added since last crawler visit (Even the best revisit on an average of every 2 weeks, depending on amount of change at a site)
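The effect of a per-page download limit can be sketched as follows. The "110 K" figure is read here as 110 × 1024 bytes, which is an assumption, and the helper names are hypothetical.

```python
# "110 K" from the slide, interpreted as 110 * 1024 bytes (an assumption).
CRAWL_LIMIT_BYTES = 110 * 1024


def indexed_portion(page: bytes, limit: int = CRAWL_LIMIT_BYTES) -> bytes:
    """Return only the part of a page a size-limited crawler would download."""
    return page[:limit]


def is_term_indexed(page: bytes, term: bytes, limit: int = CRAWL_LIMIT_BYTES) -> bool:
    """Check whether a term falls inside the downloaded portion of the page."""
    return term in indexed_portion(page, limit)


# A long document whose key phrase appears only near the end.
long_page = b"x" * (120 * 1024) + b"groundwater survey results"
print(is_term_indexed(long_page, b"groundwater"))
print(is_term_indexed(b"groundwater survey results", b"groundwater"))
```

Text beyond the cutoff never reaches the index, so a search for it fails even though the page itself was visited.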

Opaque Web

Access to the Opaque Web
• Specialized search engines
• General and specialized directories
• Subject metasites

These services typically index more thoroughly and more often than large, general search engines

Deep Web: Two Categories

Technically inaccessible to crawlers

Technically accessible to crawlers

Deep Web

Technically inaccessible to crawlers
• Dynamically created pages
• Databases
• Non-textual files
• Password protected sites
• Sites prohibiting crawlers

Deep Web

Technically accessible to crawlers
• Textual files in non-html formats (Google does it!)
• Pages excluded from crawler by editorial policy or bias

Mining the Deep Web

Techniques and Tips

How large is the Deep Web?

White Paper by Michael K. Bergman published in the Journal of Electronic Publishing in 2000.
• http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp

Currently a scarcity of unbiased research due to its fluid nature, dynamic content and multiple points of access

How large is the Deep Web? Bergman Study

Over 150,000 databases
Over 95% publicly available
Perhaps 500 times larger than the Surface Web
Growth rate currently greater than the Surface Web

What’s in the Deep Web?

Information likely to be stored in a database
• People, address, phone number locators
• Patents
• Laws
• Dictionary definitions
• Items for sale or auction
• Technical reports
• Other specialized data

What’s in the Deep Web?

Information that is new and dynamically changing
• News
• Job postings
• Travel schedules and prices
• Financial data
• Library catalogs and databases

Topical coverage is extremely varied.

Mining the Deep Web
A world different from search engines . . .

Hunter’s Maxim for Searching the Deep Web

Plan to first locate the category of information you want, then browse. Don’t be too specific in your searches. Cast a wide net.

Brush up on your Gopher-type search skills (if you were searching the ‘Net back then). We’ve become accustomed to search engine free-text searching. This is a different world.

Basic Strategies for Mining the Deep Web

Using directories, general and specialized
Using general search engines
Using specialized (subject-focused) search engines
Using subject metasites (link-oriented)
Using Deep Web gateway sites (database-oriented)

NOTE: Many sites contain elements of all of the above, in varying degrees and combinations

Using directories

Yahoo! > “web directories” > 840 category matches
Yahoo! > database > 22 categories and 7423 site matches
Google Directory > link collections > 493,000
Databases may also be found under general subject categories
Also use research directories such as Infomine, LII, WWWVL and others

Using general search engines

Combine subject terms with one or more of these possibilities:
• directory
• crawler
• search engine
• database
• webring or web ring
• link collection
• blog
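The combinations listed above can be generated mechanically. This sketch pairs a subject with each term and wraps the result in quotation marks as a phrase search; the function name is made up, and whether to use a phrase search or loose keywords is left to the searcher.

```python
from typing import List

# Terms taken from the list above.
GENRE_TERMS = [
    "directory", "crawler", "search engine", "database",
    "webring", "web ring", "link collection", "blog",
]


def candidate_queries(subject: str) -> List[str]:
    """Return one quoted phrase query per subject/term combination."""
    return [f'"{subject} {term}"' for term in GENRE_TERMS]


for query in candidate_queries("toxic chemicals"):
    print(query)
```

Trying each query in turn is a quick way to find out which kinds of finding aids exist for a subject.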

Using general search engines

Google (11/4/02)
“toxic chemicals database” > 45
“punk rock search engine” > 77
“science fiction webring” > 97
(web rings are cooperative subject metasites, maintained by experts or aficionados)

Remember, when using a search engine you must match words on the page.

Using specialized (subject-focused) search engines

AKA
• Limited-area engines
• Targeted search engines
• Expert search services
• Vertical Portals
• Vortals

Using specialized (subject-focused) search engines

Non-html textual files
• http://searchpdf.adobe.com/
• Google

Non-textual files
• Image, MP3 search engines
• Media search at Google, et al.

Software

Blogs
• Blogdex http://blogdex.media.mit.edu/

Web logs or blogs

Online personal journals
Postings are often centered around a particular topic or issue and may contain links to recent relevant information
Frequently updated
Differ from newsgroups in that they are generally by one author

Web logs or blogs

How do you search them?
• Blogdex http://blogdex.media.mit.edu
• Open Directory http://dmoz.org
  Computers / Internet / On the Web / Weblogs

Are they part of the Deep Web?
• Yes and No

Web logs or blogs

Google (5/23/02 and 11/4/02)
• allinurl:blogspot 171,000 | 301,000 53% (mostly blog home pages)
• allinurl:oxblog 2 | 39 1900% (home page and 1 posting)

FAST (5/23/02 and 11/4/02)
• URL:blogspot > 355,671 | 2,434,871 146% (mostly blog home pages)
• URL:oxblog > 0 | 5,510

Start your own at http://blogspot.com

Using subject metasites (link-oriented)

Locate subject metasites via
• Directories
• Professional organizations’ home pages
• Specialized search engine gateways (handout)
• Colleagues/Researchers

Once into a subject metasite, scan the page for search boxes and determine whether they search only the “surface web” of the site or embedded databases as well. (This is often not clearly indicated.)

Using Deep Web gateway sites (database-oriented)

Become familiar with several (see handout)

Most search only the home pages of the databases they include. A few will actually enter your search terms and display results

Explore their subject areas; some subjects may not be included at all.

Deep Web gateways are still in an early stage of development, seeking broad appeal rather than a narrow focus.

Using serendipity

Sometimes the Deep Web “comes to you”!

Mine your bookmarks/favorites and add Deep Web resources when you come across them by chance.

Evaluating Information from the Deep Web

Evaluating Deep Web Information

Embedded databases
Non-html textual files and password protected sites
Non-textual files
Software

Embedded Databases

Typically targeted, focused information

Content usually generated and used by knowledgeable parties

Database creation and maintenance require expertise and commitment

Site location is usually stable

Embedded Databases

Check author and/or sponsor
Check for freshness
Check for breadth or range of coverage
Compare with other Deep Web sources offering similar information, especially for online shopping or other e-commerce uses.

Non-html textual files and password protected sites

Evaluate as you would any other information from the Internet

BEWARE: If using Google, open non-html textual files as html when possible. Opening the file and its application may transmit a virus.

Image, audio, multimedia files

Check for image/audio quality
Check for plug-in requirements
Check for depth of coverage in the area of your query
FEE or FREE???

Software

Check for sponsor/source/maintainer
• Is there a contact person?

Check for freshness
• Latest versions available?

Check for stability and reliability
• Has any virus scanning been done?

Check for breadth
• Are programs available for all operating systems?

FEE or FREE???

Mining the Deep Web with Proprietary Software

Directed Query Engines or Intelligent Agents

Designed to access distributed Deep Web resources

Can be configured to search specific URLs
• Databases
• Subject metasites
• Report collections
• Dynamic pages
• Online newsletters

Directed Query Engines or Intelligent Agents

Several DQEs can be “nested” – one query launches several others in a cascading fashion

Publicly-available examples:
• PubMed
• Department of Energy’s Information Bridge
• NASA’s Technical Report Server

Apple’s Sherlock (bundled with Mac OS 8.5 or higher)
• Searches Deep Web databases that you specify
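The "nested" idea can be sketched in a few lines: one query is handed to several sources, and a group of sources can itself be treated as a single source inside a larger group. Everything here is hypothetical (the function names, the source names and the sample records); it does not describe how PubMed, Sherlock or any commercial product is actually built.

```python
from typing import Callable, Dict, List

# A "source" is anything that takes a query and returns matching records.
Source = Callable[[str], List[str]]


def make_source(records: List[str]) -> Source:
    """Wrap a list of records as a searchable source (stand-in for a database)."""
    def search(query: str) -> List[str]:
        q = query.lower()
        return [r for r in records if q in r.lower()]
    return search


def make_dqe(sources: Dict[str, Source]) -> Source:
    """Combine several sources into one. The result is itself a source,
    so directed query engines can be nested inside one another."""
    def search(query: str) -> List[str]:
        results: List[str] = []
        for name, source in sources.items():
            results.extend(f"{name}: {hit}" for hit in source(query))
        return results
    return search


# Invented data for illustration.
reports = make_source(["Solar energy report", "Wind turbine study"])
news = make_source(["Energy prices rise"])
inner = make_dqe({"reports": reports})
outer = make_dqe({"inner": inner, "news": news})
print(outer("energy"))
```

A single query to `outer` cascades down to every source beneath it, and each result is labeled with the path it came from.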

Directed Query Engines for purchase

Simultaneous search of Deep Web and other resources with many additional features

Lexibot http://www.lexibot.com
• If you complete survey: $189, upgrades $15
• If you don’t: $289, upgrades $50

BullsEye http://info.intelliseek.com
• BullsEye Pro: $199 with free upgrades for 6 months

How does the Deep Web fit into my overall search strategy?

What types of queries are well-suited to the Deep Web?

Information stored in databases
• “One of many similar things”
• Statistics, census data
• City, county, state, national and international public records, data and laws
• Online reference books

What types of queries are well-suited to the Deep Web?

Information that is new and dynamically changing
• News
• Pricing and availability of goods and services
• Financial data, national and international
• Job postings
• Travel schedules and pricing
• Library catalogs and databases

What types of queries are well-suited to the Deep Web?

Non-html textual files
Non-textual files
Software
Searching blogs

A few words from Sherman and Price …

Authors of The Invisible Web (Cyber Age Books, 2000)

Datamine your Bookmark/Favorites Collection

Explore reviewed sites thoroughly
• They often contain Deep Web resources not mentioned by the reviewer

Subscribe to lists that are focused and relevant to your needs
• No main Deep Web list exists
• Resources appear in subject-based lists

A few words from Sherman and Price …

Create your own “monitoring service”
• Identify “What’s New” pages and key sites you find valuable
• Use C4U to alert you to changes at these sites. Gives you the type of change and keywords from the new text. Enables you to determine whether it’s worth checking or not
• Available FREE at http://www.c4u.com
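The idea behind such a monitoring service can be sketched simply: keep the last copy of a page, compare it with the current copy, and report any words that are new. This is an illustration of the concept only and makes no claim about how C4U works; the function names and sample text are invented.

```python
import re
from typing import Dict, Set


def words(text: str) -> Set[str]:
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def describe_change(old_text: str, new_text: str) -> Dict[str, object]:
    """Report whether a page changed and which keywords are new."""
    if old_text == new_text:
        return {"changed": False, "new_keywords": []}
    added = words(new_text) - words(old_text)
    return {"changed": True, "new_keywords": sorted(added)}


# Hypothetical snapshots of a "What's New" page taken on two visits.
last_visit = "What's new: nothing"
this_visit = "What's new: patent database added"
print(describe_change(last_visit, this_visit))
```

A glance at the new keywords is often enough to decide whether the site deserves a return visit.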

Remember Hunter’s Maxim for the Deep Web

Plan to first locate the category of information you want, then browse.

Don’t be too specific in your searches.

Cast a wide net.

Thank you and best of luck in discovering and taming this new Cyber Frontier!!!

Michael Hunter
Reference Librarian
Warren Hunting Smith Library
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552
[email protected]@hws.edu