
WebWatch

Ian Peacock

UKOLN

University of Bath

Bath BA2 7AY

UK

+44 1225 323570

Email: i.peacock@ukoln.ac.uk

WebWatching the UK: Robot software for analysing UK web resources

UKOLN is funded by the British Library Research and Innovation Centre, the Joint Information Systems Committee of the Higher Education Funding Councils, as well as by project funding from the JISC’s Electronic Libraries Programme and the European Union.

UKOLN also receives support from the University of Bath where it is based.

Robot software

• WebWatch.

• WebWatch experiences.

• General robot issues.

• The need for robots.

• Bad press.

• Awareness.

The WebWatch project

• A one-year post funded by RIC (the British Library Research and Innovation Centre).

• “…to develop a set of tools to audit and monitor design practice and use of technologies on the web…”.

• Communities: UK web communities.

• Information to benefit institutions/communities.

The WebWatch project

Information on the project can be found at <URL:http://www.ukoln.ac.uk/web-focus/webwatch/>.

WebWatch aims

• Evaluation of robot technologies.

• Making recommendations on appropriate technologies.

• Working within UK web communities.

• Analysis of the results of web crawling, and liaising with the various communities in interpreting the results.

WebWatch aims

• Working with the web robot community.

• Analysing other related resources, such as web logs.

WebWatch robot

• Experimentation.

• Harvest.

• Perl-based robot.
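
A minimal sketch of the sort of Perl-based robot described here, built on the libwww-perl (LWP) modules; the agent name is hypothetical and the starting URL is illustrative:

use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI;

# Identify the robot in the User-Agent header.
my $ua = LWP::UserAgent->new;
$ua->agent('WebWatch-example/0.1');   # hypothetical agent name

my $url = 'http://www.ukoln.ac.uk/';
my $res = $ua->request(HTTP::Request->new(GET => $url));
die $res->status_line, "\n" unless $res->is_success;

# Extract <a href> links, resolving them against the response base URL.
my @links;
HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @links, URI->new_abs($attr{href}, $res->base)->as_string
        if $tag eq 'a' and defined $attr{href};
})->parse($res->content);

print "$_\n" for @links;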

WebWatch analyses

• Production of a report.

• SOIF records.

• CSV (see the conversion sketch below).

• Excel, SPSS,…

• Current developments.
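
One plausible way (not necessarily the project's own code) to flatten SOIF-style "attribute{length}: value" records into CSV for import into Excel or SPSS; the quoting is deliberately naive:

use strict;

my (%rec, @rows);
while (my $line = <>) {
    chomp $line;
    if ($line =~ /^([^{]+)\{\d+\}:\s?(.*)$/) {
        $rec{$1} = $2;                # attribute name => value
    } elsif ($line !~ /\S/ and %rec) {
        push @rows, { %rec };         # a blank line ends a record
        %rec = ();
    }
}
push @rows, { %rec } if %rec;

# Header row: the union of all attribute names seen.
my %union = map { %$_ } @rows;
my @cols = sort keys %union;
print join(',', @cols), "\n";
for my $r (@rows) {
    print join(',', map { defined $r->{$_} ? '"' . $r->{$_} . '"' : '' } @cols), "\n";
}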

WebWatch benefits


• Communities.

• Web managers and designers.

• Knowledge base.

WebWatch robot

• History
– Harvest
– Experiences with Perl
– ?

• Features

• Future plans

WebWatch robot

Examples of robot output: HTML element information

(In SOIF-style records, the number in braces is the byte length of the attribute value.)

Type{4}: HTML
Type-recognition by{4}: MIME
Linked from{23}: http://www.ukoln.ac.uk/
Context{4}: Link
Element-referrer{5}: LINKS
p-count{1}: 3
a-21-attrib{55}: href=http://www.ukoln.ac.uk/services/elib/papers/other/
img-9-attrib{110}: width=87|src=http://www.ukoln.ac.uk/resources/images/ukoln-logo/logo|height=101|alt=UKOLN|align=right|border=0

Robot issues

• Definition of a (web) robot.

• The need for robots.

Robot issues

The need for robots?

• Web expansion and increasing non-linearity.

• Understanding the nature of the web to help solve problems.

• Maintenance.

• Construction of index-space.

• Navigable document-space.

Increasing non-linearity

[Diagram: URLs A, B, C and D linked to one another, illustrating the web's increasingly non-linear document space.]

Benefits of robots

• End-user satisfaction.

• Reduced network traffic in document space.

• Populating caches, archiving, mirroring.

• Monitoring changes relevant to users.

• ‘Schooling’ network traffic into localised neighbourhoods.

Benefits of robots

• A user view (as opposed to a file-system view).

• Non-fatiguing.

• Next generation.

• Do these properties offer a feasible solution to web problems?

Robot design

• Is it necessary?

• Traversal algorithm (depth-first vs breadth-first); see the sketch after this list.

• Black holes and correct implementations (e.g. redirects).

• Bounds on activity.

• Multiple requests.
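
A sketch of a breadth-first traversal with simple bounds; the depth limit, per-server page cap and URLs are all illustrative, and a real robot would add further guards:

use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI;

my $MAX_DEPTH = 3;     # illustrative bound on traversal depth
my $MAX_PAGES = 100;   # illustrative per-server cap (limits black-hole damage)

my $ua = LWP::UserAgent->new;
$ua->agent('WebWatch-example/0.1');

my @queue = ([ 'http://www.foo.bar/', 0 ]);   # (URL, depth) pairs
my (%seen, %per_server);

while (my $item = shift @queue) {   # shift from the front: breadth-first
    my ($url, $depth) = @$item;
    next if $seen{$url}++ or $depth > $MAX_DEPTH;

    my $uri = URI->new($url);
    next unless $uri->scheme and $uri->scheme eq 'http';
    next if ++$per_server{ $uri->host } > $MAX_PAGES;

    my $res = $ua->request(HTTP::Request->new(GET => $url));
    next unless $res->is_success and $res->content_type eq 'text/html';

    # Queue out-links one level deeper.
    HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @queue, [ URI->new_abs($attr{href}, $res->base)->as_string, $depth + 1 ]
            if $tag eq 'a' and defined $attr{href};
    })->parse($res->content);
}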

Example of a ‘black-hole’

Client requests:

http://www.foo.bar/generate_report?date=02021998&time=1250

Server returns a document containing this link:

<A HREF="http://www.foo.bar/generate_report?date=02021998&time=old_time+5">

Each response embeds a new link with an incremented time parameter, so a naive robot is drawn into an endless chain of fresh URLs.

Robot design (continued)

• Caching directives.
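
For instance, a robot that records when it last fetched a page can issue a conditional request and skip unchanged documents; an illustrative HTTP/1.0 exchange:

GET /index.html HTTP/1.0
User-Agent: WebWatch-example/0.1
If-Modified-Since: Mon, 02 Feb 1998 12:50:00 GMT

HTTP/1.0 304 Not Modified

The 304 response carries no body, so the document is transferred only when it has actually changed.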

Ethical robots

• Reuse of robot code.

• Appropriate identification.

• Thorough testing (locally!).

• Speed/frequency bounding (see the sketch after this list).

• Selective retrieval.

• Performance monitoring.

• Dissemination of results.
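
libwww-perl's LWP::RobotUA bakes in several of these guidelines: it identifies itself and its operator, honours /robots.txt, and rate-limits requests. A sketch with hypothetical agent name and contact address:

use LWP::RobotUA;
use HTTP::Request;

my $ua = LWP::RobotUA->new(
    agent => 'WebWatch-example/0.1',      # identifies the robot in server logs
    from  => 'webmaster@example.ac.uk',   # contact address for administrators
);
$ua->delay(1);   # wait at least one minute between requests to the same server

# Requests are checked against the target server's /robots.txt automatically;
# a disallowed URL comes back as a 403 response generated by the library.
my $res = $ua->request(HTTP::Request->new(GET => 'http://www.foo.bar/'));
print $res->status_line, "\n";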

Ethical web crawling

• Advantages vs disadvantages.

• Guidelines.

Robot Exclusion

Robot exclusion refers to the means available to users and server administrators to control robot navigation through a particular server.

Advantages. Disadvantages.

There are currently two kinds of Robot Exclusion Protocol (REP).

Robot exclusion protocols

• Server-wide method (/robots.txt)
– Directives for the whole server must be placed in the top-level /robots.txt file.

• META element method (per page)
– Directives are inserted per page with the META element. Directives allow for indexing (or not) and parsing for links (or not).
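
For example, a /robots.txt excluding all robots from two directories (paths illustrative):

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

and the per-page equivalent in the META element, asking robots neither to index the page nor to follow its links:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">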

Other methods of robot control

• Blocking at the server configuration level (e.g. Apache's allow from and deny from directives; see the fragment after this list).

• Blocking at the TCP level (TCP wrappers?)

• Page design?
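
An illustrative Apache access-control fragment admitting everyone except one badly behaved host (directory path and host name hypothetical):

<Directory /usr/local/httpd/htdocs>
  Order allow,deny
  Allow from all
  Deny from robot.foo.bar
</Directory>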

Network performance

• Bandwidth issues.

• Comparison with a human user.

• Bottlenecks.

• New developments in robots: good or bad? Decentralisation.

Server concerns

• Rapid-fire requests (TCP, HTTP).

• Skewing of server logs (see the filtering sketch after this list).

• Identification of robots.
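
One way to correct for skewed statistics is to filter known robots out of the logs before analysis. A rough one-liner, assuming the combined log format (which records the User-Agent as the final quoted field):

perl -ne 'print unless /"[^"]*(robot|crawler|spider)[^"]*"\s*$/i' access_log > access_log.humans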

The future of web robots

• Intelligent agents.

• Metadata standards (XML, RDF, CDF, embedded metadata).

• Robots becoming part of the web.

WebWatch findings: Analysis of URLs

Domains for public library web sites

WebWatch findings: Server software

Servers used to serve eLib project pages

WebWatch findings: File size analyses

HTML file sizes for UK University entry-points

WebWatch findings: HTML analyses

Top ten tags used within the eLib community

WebWatch findings: Hyperlink profiles

Top ten external domains linked to from all eLib pages

WebWatch findings: Analysis of other document content

Use of metadata in UK university homepages