CP3024 Lecture 10

Search Engines

What is the main WWW problem?

With an estimated 800 million web pages finding the one you want is difficult!

What is a Search Engine?

A page on the web connected to a backend program

Allows a user to enter words which characterise a required page

Returns links to pages which match the query

A Typical Search Engine

Types of Search Engine

Automatic search engine e.g. Altavista, Lycos

Classified Directory e.g. Yahoo!

Meta-Search Engine e.g. Dogpile

Components of a Search Engine

Robot (or Worm or Spider)
– collects pages
– checks for page changes

Indexer
– constructs a sophisticated file structure to enable fast page retrieval

Searcher
– satisfies user queries
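
The indexer and searcher roles above can be sketched with an inverted index, one common choice of file structure for fast retrieval. This is a minimal illustration only: the slides do not say which structure a real engine uses, and the names `tokenize`, `build_index`, `search` and the example pages are all hypothetical.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Searcher: return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages, standing in for what a robot would have collected.
pages = {
    "http://example.org/a": "Search engines index web pages",
    "http://example.org/b": "Robots collect pages by following links",
}
index = build_index(pages)
```

Here `search(index, "index pages")` looks up each word once and intersects the resulting sets, so the cost depends on the number of query words rather than on the number of pages scanned.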

Query Interface

Usually a boolean interface
– (Fred and Jean) or (Bill and Sam)

Normally allows phrase searches
– "Fred Smith"

Also proximity searches

Not generally understood by users

May have extra 'friendlier' features
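
As a rough sketch of how the boolean and phrase queries above might be evaluated, boolean operators can be treated as set operations over the pages that contain each word. The `postings` data, the page numbers and the helper names are invented for illustration and do not describe any particular engine.

```python
# Hypothetical postings: word -> set of page identifiers containing it.
postings = {
    "fred": {1, 2, 5},
    "jean": {2, 5, 7},
    "bill": {3, 4},
    "sam": {4, 9},
}


def pages_with(word):
    """Return the set of pages containing the word (empty if unknown)."""
    return postings.get(word.lower(), set())


def contains_phrase(text, phrase):
    """Phrase search: the words must appear adjacent and in order."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


# (Fred and Jean) or (Bill and Sam): "and" is intersection, "or" is union.
matches = (pages_with("Fred") & pages_with("Jean")) | (
    pages_with("Bill") & pages_with("Sam")
)
```

The phrase check is deliberately naive (whitespace splitting, no punctuation handling); it only shows why "Fred Smith" as a phrase is stricter than Fred and Smith as separate terms.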

Search Results

Presented as links

Supposedly ordered in terms of relevancy to the query

Some Search Engines score results

Normally organised in groups of ten per page

Problems

Links are often out of date

Usually too many links are returned

Returned links are not very relevant

The Engines don't know about enough pages

Different engines return different results

U.S. bias

Improving query results

To look for a particular page use an unusual phrase you know is on that page

Use phrase queries where possible

Check your spelling!

Progressively use more terms

If you don't find what you want, use another Search Engine!

Who operates Search Engines?

People who can get money from venture capitalists!

Many search engines originate from U.S. universities

Often paid for by advertisements

Engines monitor carefully what else interests you (paid by the click)

How do pages get into a Search Engine?

Robot discovery (following links)

Self submission

Payments

Robot Discovery

Robots visit sites while following links

The more links the more visits

Make sure you don't exclude Robots from visiting public pages
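
One way to check that robots are not excluded from public pages is to test a site's exclusion rules before publishing them. The sketch below assumes the rules are expressed in the usual robots.txt convention and uses Python's standard `urllib.robotparser`; the rules, the robot name `ExampleBot` and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: everything is open except /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A public page should be fetchable by any robot; the private area should not.
public_ok = parser.can_fetch("ExampleBot", "http://www.example.org/index.html")
private_ok = parser.can_fetch(
    "ExampleBot", "http://www.example.org/private/notes.html"
)
```

If `public_ok` came back false for a page meant to be found, the rules would be hiding that page from every robot that honours them.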

Payments

Some search engines only index paying customers

The more you pay the higher you appear on answers to queries

Self submission

Register your page with a search engine

Pay for a company to register you with many search engines

Get registration with many search engines for free!

Getting to the top

Only pages relevant to the query should be ranked highly

Search engines only look at text

Search engine operators try to stop "search engine spamming"

Some queries are pre-answered

Get where you should be!

Put more than graphics on a page

Don't use frames

Use the <ALT….> tag

Make good use of <TITLE> and <H1>

Consider using the <META> tag

Get people to link to your page
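
To see why these elements matter, it can help to look at a page the way a text-only indexer might. The sketch below uses Python's standard `html.parser` to pull out the title, the H1 headings, META name/content pairs and image ALT text. The class name `PageTextExtractor` and the sample page are hypothetical, and real engines may weigh these elements quite differently.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collect the text a text-only indexer could see in a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.meta = {}
        self.alt_texts = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = "title"
        elif tag == "h1":
            self._current = "h1"
            self.headings.append("")
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.headings[-1] += data


# Hypothetical page used only to exercise the extractor.
html = """<html><head><title>Widget Catalogue</title>
<meta name="keywords" content="widgets, gadgets">
</head><body><h1>Our Widgets</h1>
<img src="widget.gif" alt="A blue widget"></body></html>"""

extractor = PageTextExtractor()
extractor.feed(html)
extractor.close()
```

A page made only of graphics with no ALT text would leave every one of these fields empty, which is the situation the advice above is trying to avoid.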

Summary

Search Engines are vital to the Web user

Search Engines are not perfect by a long way

There are tactics for better searching

Page design can bring more visitors via Search Engines

The more links the better!

WWLib-TNG

A Next Generation Search Engine

In the beginning

WWLib-TOS
– Manually constructed directory
– Classified on Dewey Decimal
– Simple data structure
– Proof of concept

The New Architecture

The Classifier

Motive - Why Generate Metadata Automatically?

Meta tags are not compulsory

Old pages are less likely to have meta tags

Available data can be unreliable

The Web of Trust requires comprehensive resource description

An essential prerequisite for widespread deployment of RDF applications

Method - How can Metadata be Generated Automatically?

Using an automatic classifier

The classifier classifies Web Pages according to Dewey Decimal Classification

Other useful metadata can be extracted during the process of automatic classification

Automatic Classification

Intended to combine the intuitive accuracy of manually maintained classified directories with the speed and comprehensive coverage of automated search engines

DDC has been adopted because of its universal coverage, multilingual scope and hierarchical nature

Automatic Classifier - How does it work?

Firstly, the page is retrieved from a URL or local file and parsed to produce a document object

The document object is then compared with DDC objects representing the ten top-level DDC classes

Each time a word in the document matches a word in the DDC object, the two associated weights are added to a total score

A measure of similarity is then calculated using a similarity coefficient

If there is a significant measure of similarity the document will be compared with any subclasses of that DDC class

If there are no subclasses (i.e. the DDC class is a leaf node) the document is assigned the classmark

If the result is not significant, the comparison process will proceed no further through that particular branch of the DDC object hierarchy
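
The steps above can be sketched as a recursive walk over a class hierarchy. This is an illustration under stated assumptions, not the WWLib-TNG implementation: the slides do not name the similarity coefficient or the significance threshold, so a Dice-style ratio and a cut-off of 0.3 are assumed here, and the `DDCNode` type, the word weights and the tiny hierarchy are invented.

```python
from dataclasses import dataclass, field


@dataclass
class DDCNode:
    classmark: str
    weights: dict  # word -> weight for this class (made-up values below)
    children: list = field(default_factory=list)


def similarity(doc_weights, node):
    """Assumed Dice-style coefficient: matched weights over all weights."""
    # For every word present in both, add the two associated weights.
    matched = sum(
        doc_weights[w] + node.weights[w] for w in doc_weights if w in node.weights
    )
    total = sum(doc_weights.values()) + sum(node.weights.values())
    return matched / total if total else 0.0


def classify(doc_weights, nodes, threshold=0.3):
    """Return classmarks of leaf classes reached through significant matches."""
    classmarks = []
    for node in nodes:
        if similarity(doc_weights, node) < threshold:
            continue  # not significant: go no further down this branch
        if node.children:
            classmarks.extend(classify(doc_weights, node.children, threshold))
        else:
            classmarks.append(node.classmark)  # leaf node: assign classmark
    return classmarks


# Tiny hypothetical fragment of the hierarchy with invented weights.
science = DDCNode(
    "500",
    {"science": 1.0, "physics": 1.0, "animal": 1.0},
    [
        DDCNode("530", {"physics": 2.0, "energy": 1.0}),
        DDCNode("590", {"animal": 2.0, "zoo": 1.0}),
    ],
)
arts = DDCNode("700", {"art": 1.0, "music": 1.0})

doc = {"physics": 2.0, "energy": 1.0, "science": 1.0}
result = classify(doc, [science, arts])
```

Pruning at the first non-significant class means most of the hierarchy is never compared against the document, which is what keeps the approach practical on a large classification scheme.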

Metadata elements

The automatic classification process can also be used to extract useful metadata elements besides the classmarks:

Keywords

Classmarks

Word count

Title

URL

Abstract

A unique accession number and associated dates can be obtained and supplied by the system

Metadata elements - Wolverhampton Core

No.  Wolverhampton Core         Dublin Core
1    Unique Accession number    Identifier
2    Title                      Title
3    URL                        Identifier
4    Abstract                   Description
5    Keywords                   Subject
6    Classmarks                 Subject
7    Word count                 -
8    Classification date        -
9    Last modified date         Date
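
The correspondence in the table can be written down as a simple lookup, which makes it easy to see which Wolverhampton Core elements collapse onto the same Dublin Core element. The mapping values come from the table; the function name `to_dublin_core` and the sample record are hypothetical.

```python
# Wolverhampton Core element -> Dublin Core element (None: no equivalent).
WOLVERHAMPTON_TO_DUBLIN = {
    "Unique Accession number": "Identifier",
    "Title": "Title",
    "URL": "Identifier",
    "Abstract": "Description",
    "Keywords": "Subject",
    "Classmarks": "Subject",
    "Word count": None,
    "Classification date": None,
    "Last modified date": "Date",
}


def to_dublin_core(record):
    """Group a Wolverhampton Core record's values under Dublin Core names."""
    result = {}
    for element, value in record.items():
        dc_element = WOLVERHAMPTON_TO_DUBLIN.get(element)
        if dc_element is not None:
            result.setdefault(dc_element, []).append(value)
    return result


# Made-up record used only to show the projection.
record = {
    "Title": "Widget Catalogue",
    "URL": "http://www.example.org/widgets.html",
    "Keywords": "widgets",
    "Classmarks": "670",
    "Word count": 412,
}
dublin = to_dublin_core(record)
```

Keywords and Classmarks both land under Subject, and Word count is dropped, so a Dublin Core application sees less detail than the Wolverhampton Core record holds.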

RDF Data Model

RDF Schema

There is a significant overlap with the Dublin Core element set

Requirement for implementation clarity

Those that have Dublin Core equivalents are declared as sub-properties

Maintain interoperability with Dublin Core applications

RDF Schema

<rdf:Description ID="Keyword">
  <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
  <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
  <rdfs:label>Keyword</rdfs:label>
</rdf:Description>

<rdf:Description ID="Classmark">
  <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
  <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
  <rdfs:label>Classmark</rdfs:label>
</rdf:Description>

Classifier Evaluation

Automatic metadata generation will become important for the widespread deployment of RDF based applications

Documents created before the invention of RDF-generating authoring tools also need to be described

RDF utilised in this manner may encourage interoperability between search engines

More info: http://www.scit.wlv.ac.uk/~ex1253/

Current Status of WWLib-TNG

New results interface proposed
– R-wheel (CirSA)

Builder and searcher constructed, now being tested

Classifier constructed

Test Dispatcher/Analyser/Archiver in place