
WebWatch

Ian Peacock

UKOLN

University of Bath

Bath BA2 7AY

UK

+44 1225 323570

Email: i.peacock@ukoln.ac.uk

WebWatching the UK: Robot software for analysing UK web resources

UKOLN is funded by the British Library Research and Innovation Centre, the Joint Information Systems Committee of the Higher Education Funding Councils, as well as by project funding from the JISC’s Electronic Libraries Programme and the European Union.

UKOLN also receives support from the University of Bath where it is based.

Robot software

• WebWatch.

• WebWatch experiences.

• General robot issues.

• The need for robots.

• Bad press.

• Awareness.

The WebWatch project

• A one-year post funded by RIC (the British Library Research and Innovation Centre).

• “…to develop a set of tools to audit and monitor design practice and use of technologies on the web…”.

• Communities: UK web communities.

• Information to benefit institutions/communities.

The WebWatch project

Information on the project can be found at <URL:http://www.ukoln.ac.uk/web-focus/webwatch/>.

WebWatch aims

• Evaluation of robot technologies.

• Making recommendations on appropriate technologies.

• Working within UK web communities.

• Analysis of the results of web crawling, and liaising with the various communities in interpreting the results.

WebWatch aims

• Working with the web robot community.

• Analysing other related resources, such as web logs.

WebWatch robot

• Experimentation.

• Harvest.

• Perl-based robot.
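
A minimal sketch of the sort of Perl-based robot described here, built on the libwww-perl (LWP) modules; the agent name is hypothetical and the starting URL is illustrative:

use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI;

# Identify the robot in the User-Agent header.
my $ua = LWP::UserAgent->new;
$ua->agent('WebWatch-example/0.1');   # hypothetical agent name

my $url = 'http://www.ukoln.ac.uk/';
my $res = $ua->request(HTTP::Request->new(GET => $url));
die $res->status_line, "\n" unless $res->is_success;

# Extract <a href> links, resolving them against the response base URL.
my @links;
HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @links, URI->new_abs($attr{href}, $res->base)->as_string
        if $tag eq 'a' and defined $attr{href};
})->parse($res->content);

print "$_\n" for @links;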

WebWatch analyses

• Production of a report.

• SOIF records.

• CSV (see the conversion sketch below).

• Excel, SPSS,…

• Current developments.
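
One plausible way (not necessarily the project's own code) to flatten SOIF-style "attribute{length}: value" records into CSV for import into Excel or SPSS; the quoting is deliberately naive:

use strict;

my (%rec, @rows);
while (my $line = <>) {
    chomp $line;
    if ($line =~ /^([^{]+)\{\d+\}:\s?(.*)$/) {
        $rec{$1} = $2;                # attribute name => value
    } elsif ($line !~ /\S/ and %rec) {
        push @rows, { %rec };         # a blank line ends a record
        %rec = ();
    }
}
push @rows, { %rec } if %rec;

# Header row: the union of all attribute names seen.
my %union = map { %$_ } @rows;
my @cols = sort keys %union;
print join(',', @cols), "\n";
for my $r (@rows) {
    print join(',', map { defined $r->{$_} ? '"' . $r->{$_} . '"' : '' } @cols), "\n";
}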

WebWatch benefits


• Communities.

• Web managers and designers.

• Knowledge base.

WebWatch robot

• History
– Harvest
– Experiences with Perl
– ?

• Features

• Future plans

WebWatch robot

Examples of robot output: HTML element information

(In SOIF-style records, the number in braces is the byte length of the attribute value.)

Type{4}: HTML
Type-recognition by{4}: MIME
Linked from{23}: http://www.ukoln.ac.uk/
Context{4}: Link
Element-referrer{5}: LINKS
p-count{1}: 3
a-21-attrib{55}: href=http://www.ukoln.ac.uk/services/elib/papers/other/
img-9-attrib{110}: width=87|src=http://www.ukoln.ac.uk/resources/images/ukoln-logo/logo|height=101|alt=UKOLN|align=right|border=0

Robot issues

• Definition of a (web) robot.

• The need for robots.

Robot issues

The need for robots?

• Web expansion and increasing non-linearity.

• Understanding the nature of the web to help solve problems.

• Maintenance.

• Construction of index-space.

• Navigable document-space.

Increasing non-linearity

[Diagram: URLs A, B, C and D linked to one another, illustrating the web's increasingly non-linear document space.]

Benefits of robots

• End-user satisfaction.

• Reduced network traffic in document space.

• Populating caches, archiving, mirroring.

• Monitoring changes relevant to users.

• ‘Schooling’ network traffic into localised neighbourhoods.

Benefits of robots

• A user view (as opposed to a file-system view).

• Non-fatiguing.

• Next generation.

• Do these properties offer a feasible solution to web problems?

Robot design

• Is it necessary?

• Traversal algorithm (depth-first vs breadth-first); see the sketch after this list.

• Black holes and correct implementations (e.g. redirects).

• Bounds on activity.

• Multiple requests.
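
A sketch of a breadth-first traversal with simple bounds; the depth limit, per-server page cap and URLs are all illustrative, and a real robot would add further guards:

use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI;

my $MAX_DEPTH = 3;     # illustrative bound on traversal depth
my $MAX_PAGES = 100;   # illustrative per-server cap (limits black-hole damage)

my $ua = LWP::UserAgent->new;
$ua->agent('WebWatch-example/0.1');

my @queue = ([ 'http://www.foo.bar/', 0 ]);   # (URL, depth) pairs
my (%seen, %per_server);

while (my $item = shift @queue) {   # shift from the front: breadth-first
    my ($url, $depth) = @$item;
    next if $seen{$url}++ or $depth > $MAX_DEPTH;

    my $uri = URI->new($url);
    next unless $uri->scheme and $uri->scheme eq 'http';
    next if ++$per_server{ $uri->host } > $MAX_PAGES;

    my $res = $ua->request(HTTP::Request->new(GET => $url));
    next unless $res->is_success and $res->content_type eq 'text/html';

    # Queue out-links one level deeper.
    HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @queue, [ URI->new_abs($attr{href}, $res->base)->as_string, $depth + 1 ]
            if $tag eq 'a' and defined $attr{href};
    })->parse($res->content);
}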

Example of a ‘black-hole’

Client requests:

http://www.foo.bar/generate_report?date=02021998&time=1250

Server returns a document containing this link:

<A HREF="http://www.foo.bar/generate_report?date=02021998&time=old_time+5">

Each response embeds a new link with an incremented time parameter, so a naive robot is drawn into an endless chain of fresh URLs.

Robot design (continued)

• Caching directives.
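
For instance, a robot that records when it last fetched a page can issue a conditional request and skip unchanged documents; an illustrative HTTP/1.0 exchange:

GET /index.html HTTP/1.0
User-Agent: WebWatch-example/0.1
If-Modified-Since: Mon, 02 Feb 1998 12:50:00 GMT

HTTP/1.0 304 Not Modified

The 304 response carries no body, so the document is transferred only when it has actually changed.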

Ethical robots

• Reuse of robot code.

• Appropriate identification.

• Thorough testing (locally!).

• Speed/frequency bounding (see the sketch after this list).

• Selective retrieval.

• Performance monitoring.

• Dissemination of results.
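
libwww-perl's LWP::RobotUA bakes in several of these guidelines: it identifies itself and its operator, honours /robots.txt, and rate-limits requests. A sketch with hypothetical agent name and contact address:

use LWP::RobotUA;
use HTTP::Request;

my $ua = LWP::RobotUA->new(
    agent => 'WebWatch-example/0.1',      # identifies the robot in server logs
    from  => 'webmaster@example.ac.uk',   # contact address for administrators
);
$ua->delay(1);   # wait at least one minute between requests to the same server

# Requests are checked against the target server's /robots.txt automatically;
# a disallowed URL comes back as a 403 response generated by the library.
my $res = $ua->request(HTTP::Request->new(GET => 'http://www.foo.bar/'));
print $res->status_line, "\n";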

Ethical web crawling

• Advantages vs disadvantages.

• Guidelines.

Robot Exclusion

Robot exclusion refers to the means available to users and server administrators to control robot navigation through a particular server.

Advantages. Disadvantages.

There are currently two kinds of Robot Exclusion Protocol (REP).

Robot exclusion protocols

• Server-wide method (/robots.txt)
– Directives for the whole server must be placed in the top-level /robots.txt file.

• META element method (per page)
– Directives are inserted per page with the META element. Directives allow for indexing (or not) and parsing for links (or not).
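
For example, a /robots.txt excluding all robots from two directories (paths illustrative):

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

and the per-page equivalent in the META element, asking robots neither to index the page nor to follow its links:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">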

Other methods of robot control

• Blocking at the server configuration level (e.g. Apache's allow from and deny from directives; see the fragment after this list).

• Blocking at the TCP level (TCP wrappers?)

• Page design?
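
An illustrative Apache access-control fragment admitting everyone except one badly behaved host (directory path and host name hypothetical):

<Directory /usr/local/httpd/htdocs>
  Order allow,deny
  Allow from all
  Deny from robot.foo.bar
</Directory>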

Network performance

• Bandwidth issues.

• Comparison with a human user.

• Bottlenecks.

• New developments in robots: good or bad? Decentralisation.

Server concerns

• Rapid-fire requests (TCP, HTTP).

• Skewing of server logs (see the filtering sketch after this list).

• Identification of robots.
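
One way to correct for skewed statistics is to filter known robots out of the logs before analysis. A rough one-liner, assuming the combined log format (which records the User-Agent as the final quoted field):

perl -ne 'print unless /"[^"]*(robot|crawler|spider)[^"]*"\s*$/i' access_log > access_log.humans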

The future of web robots

• Intelligent agents.

• Metadata standards (XML, RDF, CDF, embedded metadata).

• Robots becoming part of the web.

WebWatch findings: Analysis of URLs

Domains for public library web sites

WebWatch findings: Server software

Servers used to serve eLib project pages

WebWatch findings: File size analyses

HTML file sizes for UK University entry-points

WebWatch findings: HTML analyses

Top ten tags used within the eLib community

WebWatch findings: Hyperlink profiles

Top ten external domains linked to from all eLib pages

WebWatch findings: Analysis of other document content

Use of metadata in UK university homepages