Page 1: WebWatch Ian Peacock UKOLN University of Bath Bath BA2 7AY UK +44 1225 323570 Email: i.peacock@ukoln.ac.uk.

WebWatch

Ian Peacock

UKOLN

University of Bath

Bath BA2 7AY

UK

+44 1225 323570

Email: i.peacock@ukoln.ac.uk

Page 2:

WebWatching the UK: Robot software for analysing UK web resources

UKOLN is funded by the British Library Research and Innovation Centre, the Joint Information Systems Committee of the Higher Education Funding Councils, as well as by project funding from the JISC’s Electronic Libraries Programme and the European Union.

UKOLN also receives support from the University of Bath where it is based.

Page 3:

Robot software

• WebWatch.

• WebWatch experiences.

• General robot issues.

• The need for robots.

• Bad press.

• Awareness.

Page 4:

The WebWatch project

• A one year post funded by RIC.

• “..to develop a set of tools to audit and monitor design practice and use of technologies on the web..”.

• Communities. UK web communities.

• Information to benefit institutions/communities.

Page 5:

The WebWatch project

Information on the project can be found at <URL:http://www.ukoln.ac.uk/web-focus/webwatch/>.

Page 6:

WebWatch aims

• Evaluation of robot technologies.

• Making recommendations on appropriate technologies.

• Working within UK web communities.

• Analysis of the results of web crawling and liaising with various communities in interpreting the results.

Page 7:

WebWatch aims

• Working with the web robot community.

• Analysing other related resources, such as web logs.

Page 8:

WebWatch robot

• Experimentation.

• Harvest.

• Perl based robot.

Page 9:

WebWatch analyses

• Production of a report.

• SOIF records.

• CSV.

• Excel, SPSS,…

• Current developments.
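
The step from robot records to CSV, ready for loading into a package such as Excel or SPSS, might look roughly like the following minimal sketch. The field names and sample values are hypothetical and are not actual WebWatch output.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Write a list of dict records as CSV text, one row per crawled resource."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        restval="",              # missing fields become empty cells
        extrasaction="ignore",   # fields not listed in fieldnames are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Hypothetical records; the field names are illustrative only.
sample = [
    {"url": "http://www.example.ac.uk/", "type": "HTML", "p-count": 3},
    {"url": "http://www.example.ac.uk/logo", "type": "image"},
]
csv_text = records_to_csv(sample, ["url", "type", "p-count"])
```

Keeping the column list explicit means every row has the same shape even when a resource lacks some attributes.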

Page 10:

WebWatch benefits

Benefits

• Communities.

• Web managers and designers.

• Knowledge base.

Page 11:

WebWatch robot

• History
  – Harvest
  – Experiences with Perl
  – ?

• Features

• Future plans

Page 12:

WebWatch robot

Type{4}: HTML

Type-recognition by{4}: MIME

Linked from{23}: http://www.ukoln.ac.uk/

Context{4}: Link

Element-referrer{5}: LINKS

p-count{1}: 3

a-21-attrib{55}: href=http://www.ukoln.ac.uk/services/elib/papers/other/

img-9-attrib{110}: width=87|src=http://www.ukoln.ac.uk/resources/images/ukoln-logo/logo|height=101|alt=UKOLN|align=right|border=0

Examples of robot output

HTML element information
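
Each line of the robot output shown on this slide has the shape `Name{size}: value`, which resembles the SOIF attribute syntax used by Harvest. The number in braces appears to be the length of the value. The sketch below parses such a line and checks that length, assuming single-line values only. The function names are illustrative and are not part of the WebWatch robot.

```python
import re

# One "Name{size}: value" line, as in the slide's sample output.
LINE_RE = re.compile(r"^(?P<name>[^{}]+)\{(?P<size>\d+)\}:\s*(?P<value>.*)$")


def parse_line(line):
    """Return (name, value, size_ok) for one attribute line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an attribute line: {line!r}")
    value = match.group("value")
    declared = int(match.group("size"))
    # Assumption: the declared size counts the bytes of the value.
    size_ok = len(value.encode("utf-8")) == declared
    return match.group("name"), value, size_ok


def parse_attribs(value):
    """Split 'key=value|key=value' element attributes into a dict."""
    return dict(pair.split("=", 1) for pair in value.split("|"))


name, value, size_ok = parse_line(
    "a-21-attrib{55}: href=http://www.ukoln.ac.uk/services/elib/papers/other/"
)
attribs = parse_attribs(value)
```

A mismatch between the declared and actual size is a cheap signal that a record was truncated or garbled in transit.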

Page 13:

Robot issues

• Definition of a (web) robot.

• The need for robots

Page 14:

Robot issues

The need for robots?

• Web expansion and increasing non-linearity.

• Understanding the nature of the web to help solve problems.

• Maintenance.

• Construction of index-space.

• Navigable document-space.

Page 15:

Increasing non-linearity

[Diagram: URL A, URL B, URL C and URL D]

Page 16:

Benefits of robots

• End-user satisfaction.

• Reduced network traffic in document space.

• Populating caches, archiving, mirroring.

• Monitoring changes relevant to users.

• ‘Schooling’ network traffic into localised neighbourhoods.

Page 17:

Benefits of robots

• A user view (as opposed to a file-system view).

• Non-fatiguing.

• Next generation.

• Do these properties offer a feasible solution to web problems?

Page 18:

Robot design

• Is it necessary?

• Traversal algorithm (depth vs breadth first).

• Black holes and correct implementations (e.g. redirects).

• Bounds on activity.

• Multiple requests.
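
The traversal and bounding points above can be illustrated with a minimal breadth-first sketch. It runs over a hypothetical in-memory link graph rather than real HTTP fetches, and the `max_depth` and `max_pages` limits are illustrative values, not WebWatch settings.

```python
from collections import deque


def crawl(start, get_links, max_depth=2, max_pages=100):
    """Breadth-first traversal, bounded by link depth and total pages."""
    seen = {start}                 # avoids requesting the same URL twice
    queue = deque([(start, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue               # bound on activity: do not follow further
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical link graph standing in for fetched pages.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "A"],
    "C": ["D"],
    "D": ["A"],
}
visited = crawl("A", lambda url: LINKS.get(url, []))
```

Taking items from the other end of the queue (`queue.pop()`) would turn this into a depth-first traversal. Breadth-first tends to spread requests across a site rather than drilling into one branch.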

Page 19:

Example of a ‘black-hole’

Client requests: http://www.foo.bar/generate_report?date=02021998&time=1250

Server returns document with this link:

<A HREF="http://www.foo.bar/generate_report?date=02021998&time=old_time+5">
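
In a black hole of this kind every response links to a URL the robot has not seen before, so a visited-URL set alone never stops the crawl. One possible defence, sketched here under stated assumptions, is to cap the number of distinct query strings accepted for the same host and path. The class name and the `max_variants` limit are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class BlackHoleGuard:
    """Refuse further URLs once a host+path has produced too many query variants."""

    def __init__(self, max_variants=50):
        self.max_variants = max_variants
        self._queries = defaultdict(set)   # (host, path) -> query strings seen

    def allow(self, url):
        parts = urlsplit(url)
        variants = self._queries[(parts.netloc, parts.path)]
        if parts.query in variants:
            return True                    # already counted
        if len(variants) >= self.max_variants:
            return False                   # probably generated without end
        variants.add(parts.query)
        return True


guard = BlackHoleGuard(max_variants=3)
base = "http://www.foo.bar/generate_report?date=02021998&time="
decisions = [guard.allow(base + str(t)) for t in (1250, 1255, 1300, 1305)]
```

The cap trades completeness for safety: a legitimate script with many parameter combinations is cut off too, so the limit needs choosing with the target sites in mind.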

Page 20:

Robot design (continued)

• Caching directives

Page 21:

Ethical robots

• Reuse of robot code.

• Appropriate identification.

• Thorough testing (locally!).

• Speed/frequency bounding.

• Selective retrieval.

• Performance monitoring.

• Dissemination of results.
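
Two of the points above, appropriate identification and speed/frequency bounding, lend themselves to a short sketch. The robot name, contact address and one-second interval are hypothetical. The example only builds requests and computes delays; it does not contact any server.

```python
import time
from urllib.parse import urlsplit
from urllib.request import Request

# Hypothetical identification details.
USER_AGENT = "ExampleRobot/0.1 (+http://www.example.org/robot.html)"
CONTACT = "robot-admin@example.org"


def build_request(url):
    """Identify the robot and give server administrators a contact address."""
    return Request(url, headers={"User-Agent": USER_AGENT, "From": CONTACT})


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}    # host -> earliest time of next request

    def wait(self, url):
        """Sleep if needed before requesting url; return the delay applied."""
        host = urlsplit(url).netloc
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed[host] = now + delay + self.min_interval
        return delay
```

A caller would invoke `throttle.wait(url)` immediately before each fetch. Passing the clock and sleep functions in makes the throttle testable without real delays.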

Page 22:

Ethical web crawling

• Advantages vs disadvantages.

• Guidelines

Page 23:

Robot Exclusion

Robot exclusion refers to the means available to users and server administrators to control robot navigation through a particular server.

Advantages. Disadvantages.

There are currently two kinds of Robot Exclusion Protocol (REP).

Page 24:

Robot exclusion protocols

• Server-wide method (/robots.txt).
  – Directives for the whole server must be placed in the top-level /robots.txt file.

• META element method (per page).
  – Directives are inserted per page with the META element. Directives allow for indexing (or not) and parsing for links (or not).
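
Both exclusion methods can be checked from a robot with the Python standard library, as in the sketch below. The robots.txt rules, robot name and URLs are made-up examples.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Example server-wide rules: no robot may enter /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_rules(robots_txt):
    """Parse the text of a /robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class MetaRobots(HTMLParser):
    """Read per-page directives from <META NAME="ROBOTS" CONTENT="...">."""

    def __init__(self):
        super().__init__()
        self.index = True    # may the page be indexed?
        self.follow = True   # may its links be followed?

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            tokens = {token.strip().lower() for token in content.split(",")}
            if "noindex" in tokens or "none" in tokens:
                self.index = False
            if "nofollow" in tokens or "none" in tokens:
                self.follow = False


def page_directives(html):
    """Return (index, follow) for one page."""
    parser = MetaRobots()
    parser.feed(html)
    return parser.index, parser.follow


rules = make_rules(ROBOTS_TXT)
blocked = not rules.can_fetch("ExampleRobot", "http://www.example.org/private/report.html")
```

The server-wide check has to happen before a page is requested, whereas the META directives can only be read after the page has been fetched.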

Page 25:

Other methods of robot control

• Blocking at the server configuration level (e.g. Apache’s allow from, deny from).

• Blocking at the TCP level (TCP wrappers?)

• Page design?

Page 26:

Network performance

• Bandwidth issues.

• Comparison with a human user.

• Bottlenecks.

• New developments in robots: good or bad? Decentralisation.

Page 27:

Server concerns

• Rapid fire requests (TCP, HTTP).

• Skewing of server logs.

• Identification of robots.
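
One rough way to identify robots in a server log, and so reduce the skew they introduce, is to treat any host that requests /robots.txt as a probable robot. The sketch below applies that heuristic to lines in Common Log Format. The log lines are invented, and a robot that never asks for /robots.txt would go undetected.

```python
import re

# Common Log Format: host ident user [date] "METHOD path protocol" status bytes
CLF_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def robot_hosts(log_lines):
    """Hosts that requested /robots.txt at least once."""
    hosts = set()
    for line in log_lines:
        match = CLF_RE.match(line)
        if match and match.group("path") == "/robots.txt":
            hosts.add(match.group("host"))
    return hosts


def split_log(log_lines):
    """Separate log lines into (human, robot) lists using robot_hosts()."""
    lines = list(log_lines)
    robots = robot_hosts(lines)
    human, robot = [], []
    for line in lines:
        match = CLF_RE.match(line)
        if match is None:
            continue                       # skip unparseable lines
        (robot if match.group("host") in robots else human).append(line)
    return human, robot


# Invented log lines for illustration.
LOG = [
    'crawler.example.net - - [02/Feb/1998:12:50:00 +0000] "GET /robots.txt HTTP/1.0" 200 68',
    'crawler.example.net - - [02/Feb/1998:12:50:01 +0000] "GET /index.html HTTP/1.0" 200 2048',
    'pc17.example.ac.uk - - [02/Feb/1998:12:51:30 +0000] "GET /index.html HTTP/1.0" 200 2048',
]
human_lines, robot_lines = split_log(LOG)
```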

Page 28:

The future of web robots

• Intelligent agents.

• Metadata standards (XML, RDF, CDF, embedded metadata).

• Robots becoming part of the web.

Page 29:

WebWatch findings: Analysis of URLs

Domains for public library web sites

Page 30:

WebWatch findings: Server software

Servers used to serve eLib project pages

Page 31:

WebWatch findings: File size analyses

HTML file sizes for UK University entry-points

Page 32:

WebWatch findings: HTML analyses

Top ten tags used within the eLib community
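
A tally like the top-ten tag count could be produced along the following lines. This is a sketch using Python's standard HTML parser and made-up pages, not the analysis WebWatch actually ran.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags in one document (tag names are lower-cased by the parser)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def top_tags(pages, n=10):
    """Return the n most frequent tags across an iterable of HTML strings."""
    totals = Counter()
    for html in pages:
        parser = TagCounter()      # fresh parser per page
        parser.feed(html)
        parser.close()
        totals.update(parser.counts)
    return totals.most_common(n)


# Made-up pages for illustration.
PAGES = [
    "<p>one</p><p>two <a href='x'>x</a></p>",
    "<P>three</P><img src='y'>",
]
ranking = top_tags(PAGES)
```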

Page 33:

WebWatch findings: Hyperlink profiles

Top ten external domains linked to from all eLib pages
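
A hyperlink profile of this kind can be approximated by resolving every link on a page against the page's own URL and counting the hosts that differ from it. The sketch below assumes that "external" simply means a different hostname. The page and its links are invented.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of A elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def external_domains(pages, n=10):
    """pages is an iterable of (url, html); return the n most linked-to external hosts."""
    counts = Counter()
    for url, html in pages:
        own_host = urlsplit(url).hostname
        parser = LinkExtractor(url)
        parser.feed(html)
        parser.close()
        for link in parser.links:
            host = urlsplit(link).hostname   # None for mailto: and similar
            if host and host != own_host:
                counts[host] += 1
    return counts.most_common(n)


# Invented page for illustration.
PAGE = (
    "http://www.example.ac.uk/a.html",
    '<a href="/b.html">b</a>'
    '<a href="http://www.ukoln.ac.uk/">u</a>'
    '<a href="http://www.ukoln.ac.uk/services/">s</a>'
    '<a href="mailto:x@example.org">m</a>',
)
profile = external_domains([PAGE])
```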

Page 34:

WebWatch findings: Analysis of other document content

Use of metadata in UK university homepages