Presented by: Allen Brown IS/SE Date : 2003 - 05-12

Searching the Web Or “If there’s so much out there, why can’t I find it?” Presented by: Allen Brown IS/SE Date: 2003-05-12


Searching the Web
Or “If there’s so much out there, why can’t I find it?”

Presented by: Allen Brown IS/SE

Date: 2003-05-12


Outline - Searching the Web

1. Information Cartography
2. Visible and Invisible Web Information
3. Information Finding Strategies
4. Reference Tools, Pathfinders, Specialized Information Repositories, Subject Directories, and Search Engines
5. Information Search Strategies
6. Information Evaluation Strategies
7. Information Finding Summary
8. Search Engines and their Characteristics


Information Cartography

Imagine a physical map of an ocean basin
• identifiable areas of the sea floor
• large abyssal plain
• many undulating hills above the plain
• occasional higher elevations or plateaus
• sparse atolls and seamounts

Imagine the Web
• some information content identifiable by subject
• vast amounts of very low value information
• some good stuff distributed across many sites
• occasional high quality site with quality and quantity
• sparse stunningly useful sites (to die for)


Information Cartography - 2

In searching for information we need to adjust the:
• breadth of search to find all that is relevant in an “ocean” of information
• quality level to find only “atolls” of information quality

to find everything that is important and useful.

Information issues: quality, completeness + location!
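One way to make the two dials concrete is to treat "quality" as precision and "completeness" as recall. Those two terms are borrowed from information retrieval and are an assumption here, not the slide's wording; the page IDs below are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision ~ "quality": share of retrieved items that are relevant.
    recall ~ "completeness": share of relevant items that were retrieved.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four relevant pages exist somewhere in the "ocean".
relevant_pages = {"a", "b", "c", "d"}
broad_search = ["a", "b", "c", "x", "y", "z", "w", "v"]  # wide net
narrow_search = ["a", "b"]                               # fine sieve

broad = precision_recall(broad_search, relevant_pages)
narrow = precision_recall(narrow_search, relevant_pages)
```

In this made-up case the broad search finds more of what exists but buries it in noise, while the narrow search returns only good pages but misses half of them.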


Visible and Invisible Information

Visible = indexed by search engine
Invisible = not indexed but accessible

[Diagram: an information space of sites and databases (db); search engines 1 to 4 each index part of it.]


Search Engines Won’t Do It All!

According to a recent study reported in Nature (1), no search engine indexes more than 16% of the Web. Even though search engine databases are enormous, they cover very little of what's actually available on the Web.

1) Steve Lawrence and C. Lee Giles. (July 8, 1999). Accessibility of Information on the Web. Nature, 400, 107 - 109


Information Finding Strategies

Identify starting points based on your question:

• What type of information do you need? Facts, statistics, government documents, scholarly articles, popular opinion, music, pictures, multimedia, news, …
• What form do you want the information in? Dictionary definition, encyclopedia entry, journal article, elementary school project, video file, audio file, …
• What type of site would offer this information? Academic, commercial, government, non-government organization
• How much information do you need? Introduction, in-depth, references, …


Information Finding

Reference Materials (Often invisible)
– dictionaries, thesauri, encyclopedias, newspapers

Information Pathfinders (Sometimes invisible) / Portals / Vortals
– subject specific, highly relevant, sometimes bizarre
– usually high quality
– managed by dedicated enthusiasts, possibly amateur
– e.g., Web design, Perl, micro cars, Curta calculators, …

Specialized Information Repositories (Often invisible) / Portals
– institution-based, sometimes obscure
– usually high quality
– managed by information professionals
– e.g., government documents, archives, …


Information Finding - 2

Subject Indices (Often invisible, but this is changing)
– subject-based
– e.g., Yahoo

Search Engines and Search Brokers (Visible web)
– e.g., Google, Alta Vista, Hot Bot, Lycos, Vivisimo, Dogpile


Reference Tools - Dictionaries http://www.yourdictionary.com/


Reference Tools - Thesauri http://www.visualthesaurus.com/index.jsp


Reference Tools - Encyclopedia http://www.britannica.com/


Pathfinders

A pathfinder site provides an information map of what is available within a fairly narrow area of interest, usually compiled by domain experts. These sites are often called “vortals” (vertical portals).


Specialized Information Repositories - National Library of Canada

A specialized information repository often collects and catalogues relatively specific information; usually compiled by information experts. Some are considered to be vortals.


Subject Directories: www.yahoo.com

Subject directories are lists compiled by people. They are organized in a hierarchy where each subject includes a list of sub-topics. These sites are often called “portals”: a one-site starting location for general information seeking.


Subject Directories

Subject lists are usually evaluated, but sites are not presented in order of relevancy. In other words, the best sites on a topic are not necessarily listed first. Sites are compiled through submission of URLs by site creators, followed by human evaluation and selection.

One advantage of subject directories is their browsability, although this is only useful for fairly general topics. A disadvantage is their relatively small size.

Other examples of subject directories:
Infomine: http://infomine.ucr.edu
Scout Report Signpost: http://www.signpost.org/signpost


Invisible Web Directories

Look at http://www.invisible-web.net/


Search Engines

Search engines use computer programs called "spiders" or "robots" to collect web pages automatically. The pages are indexed and stored in an index database.

To query a search engine, type topic keywords and Boolean connectors into a search "box." The search engine scans its index and returns links to websites containing the specified keyword relationships.

Size matters - an advantage of using search engines is their coverage (though size is relative), but this can also be a disadvantage if relevance ranking is poor.
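The index lookup described above can be sketched with a tiny inverted index. This is a minimal illustration, not how any named engine works; the page texts, the example.org URLs, and the function names `build_index` and `search` are all invented here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, all_urls, must=(), any_of=(), exclude=()):
    """Boolean query: AND over `must`, OR over `any_of`, NOT over `exclude`."""
    results = set(all_urls)
    for word in must:
        results &= index.get(word.lower(), set())
    if any_of:
        either = set()
        for word in any_of:
            either |= index.get(word.lower(), set())
        results &= either
    for word in exclude:
        results -= index.get(word.lower(), set())
    return sorted(results)


# Hypothetical pages standing in for crawled web content.
pages = {
    "http://example.org/dogs": "Large male dog breeds",
    "http://example.org/cats": "Large cat and small dog compared",
    "http://example.org/birds": "Male birds sing",
}
index = build_index(pages)
and_hits = search(index, pages, must=["large", "dog"])
not_hits = search(index, pages, must=["large", "dog"], exclude=["cat"])
or_hits = search(index, pages, any_of=["male", "cat"])
```

The query never touches the page text itself, only the index, which is why a page that was never crawled cannot be found.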


Search Engines: Operational Concepts

[Diagram: the User sends a query to the Search Engine and receives query results. The Search Engine performs crawling of the World Wide Web, page contents extraction and indexing into the index database, and query parsing, index lookup, results ranking and management.]


Search Engines - Does Size Matter?


Size

If you are looking for unusual or hard-to-find information, you should try one or more of the search engines with a large index to check more web content. This improves the likelihood of finding what you seek. However, for general searches or when looking for information about popular topics, a large index does not necessarily equal better results. Also, large indexes may have longer re-visit intervals.


Search Engines: Search Scoping and Ranking / Results Management

It is essential to learn and apply each engine's specialized search formats to narrow results, filter them, and push the most relevant pages to the top of the results list. Use Boolean operators, proximity connectors, stems, wild cards, sounds-like, media-type and metadata filters.

Result relevancy ranking also depends on the size of the search index and how the search engine interprets and uses your query.

Each engine determines result relevancy ranking in unique ways. Consult the help file of each engine to learn about these.

Some engines offer search refinement and conceptual clustering for better focus (tighter “hit cluster”) or greater accuracy / validity (centred on the “right stuff”).


Search Engines - Search Scoping

+ expands the scope, - reduces the scope
• Exact phrase (-): quotes, e.g., “We hold these things to be self-evident”
• Boolean operators: and (-, default), or (+, caution!), not (-, extreme caution!), e.g., large male dog, large or male or dog, not cat
• Proximity connectors (-): near (depends on engine), e.g., spring near flower
• Stemming and wildcards (+): e.g., swim* matches swim, swimming, swimmer, swimmers, swimmingly, …
• Sounds-like (+): e.g., table also matches cable, able, fable, …
• Media type (-): e.g., image, audio file, …
• Concept-based (+ or -): e.g., synonym thesaurus, antonym, homonym, …
• Metadata-based (-): in some systems
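Three of the scoping operators above can be sketched as simple matchers over one page's text. This is an illustration only: real engines differ in syntax and behaviour, the `max_gap` parameter is an invented stand-in for whatever "near" means on a given engine, and the sample sentence is made up.

```python
import re
from fnmatch import fnmatch


def words(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has_phrase(text, phrase):
    """Exact phrase: the words must appear consecutively (reduces scope)."""
    w, p = words(text), words(phrase)
    return any(w[i:i + len(p)] == p for i in range(len(w) - len(p) + 1))


def has_wildcard(text, pattern):
    """Wildcard: 'swim*' matches any word starting with 'swim' (expands scope)."""
    return any(fnmatch(word, pattern.lower()) for word in words(text))


def has_near(text, first, second, max_gap=3):
    """Proximity: both words occur within max_gap word positions (reduces scope)."""
    w = words(text)
    pos_a = [i for i, x in enumerate(w) if x == first.lower()]
    pos_b = [i for i, x in enumerate(w) if x == second.lower()]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)


doc = "The swimmers admired a spring meadow where every flower was in bloom"

phrase_hit = has_phrase(doc, "spring meadow")
phrase_miss = has_phrase(doc, "meadow spring")
wildcard_hit = has_wildcard(doc, "swim*")
near_tight = has_near(doc, "spring", "flower", max_gap=3)
near_loose = has_near(doc, "spring", "flower", max_gap=5)
```

Here "spring" and "flower" are four words apart, so the proximity test passes or fails depending on the gap the engine allows, which is one reason to check each engine's help file.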


Search Engines - Ranking

Result relevancy ranking (=“usefulness”) can be done according to two techniques (or some combination):

• Conventional - using intra-page information
• Relative - using extra-page information


Search Engines - Conventional Ranking

Conventional (intra-page):
• frequency of words (number and density)
• phrases (exact word sequences)
• hierarchy (e.g., closer to the top of the document)
• adjacency (proximity of words)
• metadata (keywords provided by content owners)
• font size and style (relative intra-page)
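A toy scorer shows how some of these intra-page signals might be combined. The weights are arbitrary assumptions chosen for illustration, not any engine's formula; adjacency is covered only through the exact-phrase check, font size and style are not modelled, and the function name and sample pages are invented.

```python
import re


def words(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def conventional_score(page_text, query, keywords=()):
    """Score one page for a query using intra-page signals only."""
    w = words(page_text)
    q = words(query)
    if not w or not q:
        return 0.0
    meta = {k.lower() for k in keywords}
    score = 0.0
    for term in q:
        positions = [i for i, x in enumerate(w) if x == term]
        if not positions:
            continue
        score += len(positions) / len(w)      # frequency, as density
        score += 1.0 - positions[0] / len(w)  # hierarchy: earlier is better
        if term in meta:
            score += 0.5                      # metadata keyword (assumed weight)
    # phrase: the query words appear as one exact sequence
    if len(q) > 1 and any(w[i:i + len(q)] == q for i in range(len(w) - len(q) + 1)):
        score += 1.0
    return score


# Hypothetical pages.
pages = {
    "repair": "Curta calculator repair and cleaning. Curta parts are scarce.",
    "history": "A history of mechanical devices, including one Curta calculator.",
}
ranked = sorted(
    pages,
    key=lambda name: conventional_score(pages[name], "curta calculator"),
    reverse=True,
)
```

Both pages contain the phrase, but the first mentions the terms more often and nearer the top, so it ranks higher under these assumed weights.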

Jack Christensen repairs CURTA calculators. I've known Jack for many years and can highly recommend him. Here are a few questions I asked Jack:

What do you charge to clean a Curta?
Typically $65 to $95, depending on the work involved. More often than not, the upper carriage needs a complete disassembly, whereas the main body can be cleaned without a complete disassembly. If the main body needs to be completely disassembled, something is usually bent, out of adjustment, or broken.

What do you charge when repairing a Curta?
I charge $20 per hour of my time. It seems my hours are about 90 minutes long, however, because I rarely finish in the time I originally quoted. Extended repair time is absorbed by me.

What spare parts do you have? Are they expensive?
I actually have many hundreds of new original Curta parts. Most are for inside the instrument, though. I use them when I do general cleaning and repairs. Outer body pieces, replacement canisters, and external parts that are easily damaged or broken due to abuse are not generally available, although I do occasionally locate some of these items. Sometimes I have to fabricate a part, or repair an item as best I can. Obviously, this takes time, and the cost is high. Parts costs are charged as the traffic will bear. I usually try to be blunt about this to the Curta owner, often telling them that a severely damaged unit is best sold as a "parts Curta". Unfortunately, I've sometimes had to tell this to someone who wanted to repair a Curta looked upon as an heirloom. What to them appears to be a minor issue actually turns out to be a major problem (e.g., a crank handle tilted downward is due to a broken main shaft). I think the most I ever charged for a repair was about $375. There were many severe problems with the unit. Generally, when the price gets to be above $175 most people simply decide to keep the damaged Curta as a memento.

Can you replace a clearing ring? What costs are involved?
The plastic clearing rings are easy to install. I have several new ones, but I typically do not sell them separately as a spare part. Rather, I install them during a general cleaning and repair. Metal rings are more difficult to replace. As with the plastic clearing rings, I will only install a metal clearing ring during a general cleaning and repair. It takes a special tool to properly swage the rivet in place. [Editor's note: Very old Type I clearing rings were held on with a screw and nut. The nut was also crimped to the screw threads.] I used all the new metal clearing rings I had about five years ago, but I do have a few used ones that were removed from other damaged Curtas. I have these for both the Type I and


Search Engines - Relative Ranking

Relative (extra-page):
• popularity (page visits - from the search engine)
• citation (links pointing to the item)
• relevance of the pages containing the links pointing to the item (!)

[Diagram labels: Yahoo; Web Pages]
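The citation idea, where a link counts for more when it comes from a page that is itself well linked, can be sketched with a simplified link-analysis iteration in the spirit of PageRank. This is an assumption-laden toy rather than any engine's actual ranking: the damping value, the iteration count, and the five-page link graph are all invented, and popularity by page visits is not modelled.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Iteratively share each page's rank among the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page keeps a small baseline rank
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # a page with no outgoing links spreads its rank evenly
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
links = {
    "hub": ["a", "b", "target"],
    "a": ["target"],
    "b": ["target"],
    "target": ["hub"],
    "loner": ["a"],
}
rank = link_rank(links)
best = max(rank, key=rank.get)
```

In this graph "target" collects links from three pages, one of them a well-linked hub, so it ends up with the highest score, while "loner", which nothing links to, stays at the baseline.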


Search Engines: Keys to Success

• World Wide Web size: large index and / or several engines
• Scoped query: “wide net” but appropriate “sieve”, carefully constructed for your needs
• Ranked and manageable results: query construction and search engine features


Meta Search Engines

“Meta” search tools are able to search the index databases of multiple engines “simultaneously”, via a single interface.

“Meta” search tools don’t really search metadata. They are simply brokers that reformulate a query and hand it off to a set of search engines, then combine the results.

“Meta” engines are very fast but they do not offer the same level of control over the relationship between keywords as do individual search engines.

Also, meta search engines may produce poor ranking of combined results.
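The broker behaviour described above can be sketched as follows. The two "engines" are stub functions returning invented URLs, not real search APIs, and the merge rule (summing reciprocal ranks) is just one simple choice made for this illustration. The sketch also passes the query through unchanged, whereas a real broker would reformulate it for each engine.

```python
def meta_search(query, engines, per_engine=10):
    """Send one query to several engines and merge their ranked lists."""
    scores = {}
    for engine in engines:
        results = engine(query)[:per_engine]
        for position, url in enumerate(results, start=1):
            # a URL gains more credit the higher an engine ranks it
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    # best combined score first; ties broken alphabetically
    return sorted(scores, key=lambda url: (-scores[url], url))


def engine_one(query):
    """Stub engine (hypothetical): returns a fixed ranked list."""
    return ["http://example.org/a", "http://example.org/b", "http://example.org/c"]


def engine_two(query):
    """Stub engine (hypothetical): returns a different ranked list."""
    return ["http://example.org/b", "http://example.org/d"]


merged = meta_search("curta calculator", [engine_one, engine_two])
```

Page "b" wins because both engines returned it, even though only one ranked it first. The merge sees positions only, not each engine's own relevance reasoning, which hints at why combined rankings can be poor.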


Search Engines

Examples of popular search engines include:

Google: http://www.google.com
Alta Vista: http://www.altavista.com
All the Web: http://www.alltheweb.com
Northern Light: http://www.northernlight.com

Also see the KartOO clustering visual engine: http://www.kartoo.com/
For meta engines, try Vivisimo at http://vivisimo.com/


Information Search Strategies

• Think hard about what you are looking for!
• Use a Reference Tool, if appropriate
• Use a Pathfinder, if you know one
• Use a Specialized Information Repository, if appropriate
• Use Subject Indexes, if it is a common topic
• Use several Search Engines, if needed, especially for the obscure or academic topic, but learn how they work
• Use keywords - be narrow, and specific (and technical)
• Use phrases - try synonyms or related concepts
• Use Boolean connectors - but find out if / how the engine uses them
• Use stemming and wildcards - but find out if / how the engine uses them
• Use media-type filters or metadata, if appropriate


Information Search Tools - Use

[Diagram: the tools placed in the information space along axes of breadth and depth, ranging from easy to use to hard to use well.]

• Reference Tool: generic simple lookup; created by professionals; contains “invisible” content
• Pathfinder: focused content; pre-selected by domain experts
• Specialized Information Repository: related or themed; pre-selected by professionals; contains “invisible” content
• Subject Indexes: popular or common; pre-selected by interested people
• Search Engines and Meta-engines: obscure or academic; caveat emptor!


Information Evaluation Strategies: CARS

CARS checklist: http://library.queensu.ca./inforef/guides/evalchart.htm

• Credibility
- author credentials stated with email contact
- evidence of quality control (site location)

• Accuracy
- timeliness
- comprehensiveness
- audience & purpose

• Reasonableness
- fairness
- objectivity
- consistency
- world view

• Support
- source documentation or bibliography


Summary

There is much information on the Web, but it’s not:
- all there
- all good (or all bad)
- always easy to locate

Use an information search strategy that:
- matches the information sought
- uses the appropriate tools
- uses them in the correct ways

Use an information evaluation strategy, e.g., CARS methodology.

Choose and use search engines wisely, knowing their strengths, features, and their limitations.


How Do Search Engines Work?

Three Activities Occur:

1. Crawling
– fetch pages
– compile URL list (a db)
– re-visit pages

2. Page harvesting
– parse page
– add to index db and establish ranking

3. Responding to search requests
– parse query
– apply to index
– present and rank results
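The three activities can be sketched end to end over a made-up, in-memory "web". Everything named here is hypothetical: the `WEB` dictionary stands in for real page fetching, ranking is a bare word count, and re-visiting is not modelled beyond the fact that `crawl` could simply be run again.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links)
WEB = {
    "http://example.org/": (
        "Welcome to the Curta pages",
        ["http://example.org/repair", "http://example.org/history"],
    ),
    "http://example.org/repair": ("Curta repair and cleaning", ["http://example.org/"]),
    "http://example.org/history": ("History of the Curta calculator", []),
}


def crawl(seed):
    """1. Crawling: fetch pages and compile the URL list."""
    url_db, queue, fetched = [seed], deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url not in WEB:
            continue  # page cannot be fetched
        text, links = WEB[url]
        fetched[url] = text
        for link in links:
            if link not in url_db:
                url_db.append(link)
                queue.append(link)
    return url_db, fetched


def harvest(fetched):
    """2. Page harvesting: parse each page and add its words to the index."""
    index = defaultdict(dict)
    for url, text in fetched.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def respond(index, query):
    """3. Responding: parse the query, apply it to the index, rank results."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    scores = defaultdict(int)
    for term in terms:
        for url, count in index.get(term, {}).items():
            scores[url] += count
    return sorted(scores, key=lambda url: (-scores[url], url))


url_db, fetched = crawl("http://example.org/")
index = harvest(fetched)
results = respond(index, "curta repair")
```

Only pages reachable by links from the seed ever enter the URL list, which is one way content ends up "invisible".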


Search Engines: Operation

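[Diagram: inside the Search Engine, a Crawler Robot fetches pages from the World Wide Web, maintains the URL database and re-visits pages; a Harvester Robot fetches page contents by URL and feeds the Index database; a Query Processor receives the User's query and returns query results. Two parts are annotated “Really clever stuff in here” and “Fairly clever stuff in here”.]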


Search Engine - Hardware

(not really …)


How Do Search Engines Work?

• See “The Anatomy of a Large-Scale Hypertextual Web Search Engine” at http://www-db.stanford.edu/~backrub/google.html


References

• Information Search Strategies: <http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html>
• Information Evaluation Strategies: <http://www.vuw.ac.nz/~agsmith/evaln/evaln.htm>
• Search Engines:
<http://www.library.arizona.edu/search.htm>
<http://www.brightplanet.com/deepcontent/tutorials/search/index.asp>
<http://www.searchenginewatch.com/>
• Susan Maze, David Moxley, Donna Smith: Authoritative Guide to Web Search Engines, Neal Schuman Pub, 1997, ISBN 1555703054