Repository Statistics Peter Millington Technical Development Officer SHERPA, University of...

Post on 27-Mar-2015


Repository Statistics

Peter Millington

Technical Development Officer

SHERPA, University of Nottingham

Overview

Introduction

Global statistics

The what & why of repository statistics

Benchmarks & data sources

Compilation methods

Web usage logging tools

Google Analytics demo

Problems and solutions

Group session – Key issues

Global Repository Statistics

Data Sources – Global lists of repositories
• OpenDOAR - http://www.opendoar.org/
• ROAR - http://roar.eprints.org/
• Repository66 - http://www.repository66.org/

May be useful for advocacy work

Examples of types of chart & presentation

ROAR – Individual Growth Charts

ROAR – Individual Source Data

Month Records Archives
200407 12
200408 34
200409 77
200410 106
200411 149
200412 164
200501 187
200502 212
200503 272
200504 324
200505 389
200506 426
200507 446
200508 492
200509 547
200510 607
200511 631
200512 750
200601 794
200602 860
200603 1019
200604 1090
200605 1128
200606 1307
200607 1347
200608 1405
200609 1469
200610 1530
200611 1610
200612 1705
200701 1768
200702 1853
200703 1934
200704 2042
200705 2169
200706 2239
200707 2264
200708 2352
200709 2374
200710 2400
200711 2438
200712 2484
200801 2540
200802 2573
200803 2611
200804 2643
200805 2681
200806 2689
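The monthly figures above can be turned into month-on-month growth by differencing consecutive rows. The following is a minimal sketch, assuming the figures are running totals; the name `monthly_increase` is illustrative, and only the first four rows are reproduced.

```python
# First rows of the ROAR source data shown above.
# Assumption: each figure is a running total for the month shown.
totals = [
    ("200407", 12),
    ("200408", 34),
    ("200409", 77),
    ("200410", 106),
]


def monthly_increase(rows):
    """Return (month, increase) pairs by differencing consecutive totals."""
    increases = []
    for (_, previous), (month, current) in zip(rows, rows[1:]):
        increases.append((month, current - previous))
    return increases


for month, gain in monthly_increase(totals):
    print(month, gain)
```

A series of this kind could feed a growth chart for advocacy, for example to show the effect of a deposit campaign in a given month.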

Delegates’ What and Why of Statistics

Rate of growth
• For advocacy
• Measure of success – for our paymasters

Rate of usage
• Targeting weak areas – departments
• Measure of success
• Justifying funding

Most downloaded author/paper
• Promotes interest and engagement from authors

Delegates’ What and Why of Statistics

Where are visitors coming from – referrers
• Curiosity – is it being seen by the right people?

Citation statistics
• To demonstrate the beneficial impact of repositories

Drilling down for more detail
• For a sense of reality

Steep slopes, animation, etc.
• Glitzy marketing

Individual Repositories - Content

Growth & Deposition rates
• Measure of progress
• Impact of advocacy events
• Impact of mandatory deposition

Types of document or item
• Trend-watching?

Breakdown by department and/or author
• How much is everyone contributing?

Proportion of full text v metadata only
• Measure of usefulness

Item types: Universidade do Minho

Individual Repositories - Performance

Proportion of publications deposited
• How comprehensive is the archive?

Proportion of authors who are depositing
• Are they complying with local mandates?

Compliance with funders’ mandates
• Are you meeting your obligations?

Repository administration
• Are your turn round times acceptable?

Compliance with the CERN Mandate

Compliance Benchmarks

Counting publications
• Institution-wide bibliographies
  • e.g. Maintained by research managers
• Publication lists on departmental web pages
• Public/Commercial databases – ISI, Medline, etc.

Counting authors
• Who qualifies as an author?
  • Academic staff, Research students, Managers
• University Calendars & Departmental staff lists

Individual Repositories - Usage

Rates of usage
• Measure of usefulness
• Impact of news-related items

Most downloaded items
• Identifying research(ers) with most impact?
• Engendering competition between authors?

Downloads according to author
• Performance reviews?

Geographical distribution of users
• Are you reaching your intended audience?

Sources of Data

Repository’s own database

OAI-PMH

Server’s access log

Remote logging

Compilation Methods

Repository’s own database
• Copying from the human interface
• Interactive SQL commands

Copying from the Human Interface

Interactive SQL Commands

mysql> SELECT type,COUNT(*) FROM eprint GROUP BY type;

+-----------------+----------+
| type            | COUNT(*) |
+-----------------+----------+
| article         |      456 |
| book            |        5 |
| book_section    |       39 |
| conference_item |      173 |
| exhibition      |        1 |
| monograph       |       18 |
| other           |        3 |
| thesis          |        4 |
+-----------------+----------+
8 rows in set (0.00 sec)

[Pie chart of item types: article 64%, book 1%, book_section 6%, conference_item 25%, exhibition 0%, monograph 3%, other 0%, thesis 1%]
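The same kind of count can be scripted rather than typed interactively. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for the repository's MySQL database, and the one-column `eprint` table and its sample rows are hypothetical, not the real EPrints schema.

```python
import sqlite3

# Hypothetical, much-simplified stand-in for a repository's eprint table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eprint (eprintid INTEGER PRIMARY KEY, type TEXT)")
sample_types = ["article"] * 6 + ["thesis"] * 3 + ["book"] * 1
conn.executemany(
    "INSERT INTO eprint (type) VALUES (?)", [(t,) for t in sample_types]
)


def count_by_type(connection):
    """Run the GROUP BY query and return a {type: count} dictionary."""
    rows = connection.execute(
        "SELECT type, COUNT(*) FROM eprint GROUP BY type ORDER BY type"
    )
    return dict(rows.fetchall())


def percentages(counts):
    """Convert counts to whole-number percentages of the total."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


counts = count_by_type(conn)
print(counts)
print(percentages(counts))
```

The percentages are what a pie chart of item types would plot; against a live repository the connection and table details would need to match the local installation.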

Compilation Methods

Repository’s own database
• Copying from the human interface
• Interactive SQL commands

OAI-PMH
• Harvesting programs – e.g. ROAR’s Celestial

OAI-PMH ListIdentifiers

OAI-PMH ListRecords
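As a rough illustration of what a harvesting program does with ListIdentifiers, the sketch below builds a request URL and extracts the identifiers from a response. The base URL, the abbreviated sample response and the function names are assumptions made for the example; no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListIdentifiers request URL for an OAI-PMH base URL."""
    query = urlencode(
        {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    )
    return f"{base_url}?{query}"


def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Hypothetical, abbreviated response used only to exercise the parser.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2007-06-18</datestamp>
    </header>
    <header>
      <identifier>oai:example.org:2</identifier>
      <datestamp>2007-06-25</datestamp>
    </header>
    <resumptionToken>token-123</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

ids, token = parse_identifiers(SAMPLE)
print(list_identifiers_url("http://example.org/oai"))
print(len(ids), token)
```

A real harvester would repeat the request with each resumption token until none is returned, then count the identifiers and datestamps collected.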

ROAR - Celestial

date      identifier                url
20070618  oai:bora.uib.no:1956/2270  Department of Earth Science
20070625  oai:bora.uib.no:1956/2272  Department of History
20070625  oai:bora.uib.no:1956/2273  Department of the History of Religions
20070626  oai:bora.uib.no:1956/2274  Section for Endocrinology
20070626  oai:bora.uib.no:1956/2275  Department of the History of Religions
20070626  oai:bora.uib.no:1956/2276  Department of the History of Religions
20070626  oai:bora.uib.no:1956/2277  Department of the History of Religions
20070626  oai:bora.uib.no:1956/2278  Department of the History of Religions
20070626  oai:bora.uib.no:1956/2279  Department of Oral Sciences
20070626  oai:bora.uib.no:1956/2281  Department of the History of Religions
20070626  oai:bora.uib.no:1956/2282  Department of Sociology
20070626  oai:bora.uib.no:1956/2283  Else Æyen
20070628  oai:bora.uib.no:1956/2284  Section for Art History
20070629  oai:bora.uib.no:1956/2285  Section for Russian
20070629  oai:bora.uib.no:1956/2286  Department of Geography
20070629  oai:bora.uib.no:1956/2287  Department of Greek, Latin and Egyptology
20070702  oai:bora.uib.no:1956/2288  Section for Spanish
20070702  oai:bora.uib.no:1956/2289  Department of Mathematics
20070702  oai:bora.uib.no:1956/2290  Department of Geography
20070702  oai:bora.uib.no:1956/2291  Department of Geography
20070702  oai:bora.uib.no:1956/2292  Department of Biology
20070703  oai:bora.uib.no:1956/2293  Department of Biology

Compilation Methods

Repository’s own database
• Copying from the human interface
• Interactive SQL commands

OAI-PMH
• Harvesting programs – e.g. ROAR’s Celestial

Server’s access log
• Web usage statistics tools

Raw Web Access Logs

209.237.238.179 - - [10/Apr/2005:05:34:06 +0100] "GET /portfolio.css HTTP/1.0" 200 816 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:16:27 +0100] "GET /DAWN_Index.htm HTTP/1.0" 200 8392 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:17:44 +0100] "GET /Eric.htm HTTP/1.0" 200 6975 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:21:12 +0100] "GET /Library_Form.htm HTTP/1.0" 200 7709 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:22:48 +0100] "GET /cleansing.htm HTTP/1.0" 200 11016 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:25:02 +0100] "GET /index.htm HTTP/1.0" 200 7613 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:28:19 +0100] "GET /integration.htm HTTP/1.0" 200 8027 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:31:35 +0100] "GET /merging.htm HTTP/1.0" 200 9132 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:34:39 +0100] "GET /publication.htm HTTP/1.0" 200 5327 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:08:22:38 +0100] "GET /ABACUS_Index.htm HTTP/1.0" 200 5421 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:08:27:34 +0100] "GET /limitations.htm HTTP/1.0" 200 3781 "-" "ia_archiver"
210.173.179.17 - - [20/Dec/2004:13:22:03 +0000] "GET /robots.txt HTTP/1.1" 404 - "-" "gazz/5.0 (gazz@nttr.co.jp)"
210.173.179.17 - - [20/Dec/2004:13:23:51 +0000] "GET / HTTP/1.1" 200 7613 "-" "gazz/5.0 (gazz@nttr.co.jp)"
210.173.179.17 - - [20/Dec/2004:13:25:34 +0000] "GET /Logo.gif HTTP/1.1" 200 3838 "-" "gazz/5.0 (gazz@nttr.co.jp)"
210.173.179.17 - - [20/Dec/2004:13:27:17 +0000] "GET /contact.htm HTTP/1.1" 200 4626 "-" "gazz/5.0 (gazz@nttr.co.jp)"
210.173.179.17 - - [20/Dec/2004:13:29:00 +0000] "GET /profile.htm HTTP/1.1" 200 10533 "-" "gazz/5.0 (gazz@nttr.co.jp)"
210.173.179.17 - - [20/Dec/2004:13:37:35 +0000] "GET /index.htm HTTP/1.1" 200 7613 "-" "gazz/5.0 (gazz@nttr.co.jp)"
210.173.179.17 - - [20/Dec/2004:13:47:55 +0000] "GET /publication.htm HTTP/1.1" 200 5327 "-" "gazz/5.0 (gazz@nttr.co.jp)"
210.173.179.17 - - [20/Dec/2004:13:49:39 +0000] "GET /InsideInfo.jpg HTTP/1.1" 200 19372 "-" "gazz/5.0 (gazz@nttr.co.jp)"

Recorded fields include:
• IP Address of the computer requesting a file
• Date & time transaction completed
• Name of file requested
• Success code – usually 200 for “successfully completed”
• File size in bytes
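The fields listed above can be pulled out of each log line with a regular expression. This is a minimal sketch, assuming the lines follow the common "combined" web server log layout that the samples appear to use; `LOG_PATTERN` and `parse_line` are illustrative names.

```python
import re
from datetime import datetime

# Pattern for the "combined" log layout that the sample lines appear to use.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Month abbreviations such as "Apr" assume an English locale.
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # A "-" size means no body was sent, as in the 404 line for robots.txt.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


line = (
    '209.237.238.179 - - [10/Apr/2005:05:34:06 +0100] '
    '"GET /portfolio.css HTTP/1.0" 200 816 "-" "ia_archiver"'
)
entry = parse_line(line)
print(entry["ip"], entry["path"], entry["status"], entry["size"])
```

Tools such as Analog, Webalizer and AWStats do this parsing internally; a hand-rolled parser is mainly useful for one-off questions the packaged reports do not answer.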

Web Usage Statistics Tools

Analog
• http://www.analog.cx/

Webalizer
• http://www.mrunix.net/webalizer/

AWStats
• http://awstats.sourceforge.net/

etc.

Sample output from the Analog Statistics Package

Sample output from the Webalizer Statistics Package

Sample output from the AWStats Statistics Package

Compilation Methods

Repository’s own database
• Copying from the human interface
• Interactive SQL commands

OAI-PMH
• Harvesting programs – e.g. ROAR’s Celestial

Server’s access log
• Web usage statistics tools

Remote logging
• Google Analytics

Google Analytics

http://www.google.com/analytics

Sign up to a Google Account

Specify the URL to be logged

Obtain snippet of JavaScript code

Insert snippet into HTML of pages to be logged
• Ideally into a template file
• Make sure the modified pages are live!

Logging starts automatically

Log in to your account to view the analytics

Google Analytics

JavaScript snippet:

<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-3477654-3");
pageTracker._initData();
pageTracker._trackPageview();
</script>

Find URL Containing/Excluding
• String
  • e.g. “pdf”
• Regular expressions
  • e.g. /[0-9]*/ for EPrints IDs
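The two kinds of filter can be tried out on a handful of paths before being entered into an analysis tool. The sketch below is illustrative only: the sample paths are hypothetical, and the pattern uses `+` (one or more digits) rather than `*`, on the assumption that an empty ID segment should not match.

```python
import re

# Hypothetical request paths of the kind an EPrints repository might log.
paths = [
    "/123/",
    "/123/1/paper.pdf",
    "/style/site.css",
    "/view/year/2008.html",
]

# Assumption: an eprint page or file sits under a leading numeric ID segment.
# "+" (one or more digits) is used so that "//" does not match.
EPRINT_ID = re.compile(r"^/[0-9]+/")


def containing(items, text):
    """Keep the paths that contain a plain string."""
    return [p for p in items if text in p]


def matching(items, pattern):
    """Keep the paths that match a compiled regular expression."""
    return [p for p in items if pattern.search(p)]


print(containing(paths, "pdf"))
print(matching(paths, EPRINT_ID))
```

The string filter isolates full-text downloads, while the regular expression isolates anything belonging to an individual eprint, whether abstract page or file.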

Problems

Web bots and crawlers
• Inflating usage volume
• Skewing usage time series

Auxiliary files & non-eprint pages
• CSS style sheet files
• Image files – jpeg, gif, etc.
• Index pages

Linking URLs to bibliographic references
• What does that eprint number mean?

Problems and Solutions

Web bots and crawlers
• Use robots.txt & meta robots tags to prevent crawling
• Filtering out known bots
• Still leaves maverick hackers’ & students’ bots

Auxiliary files & non-eprint pages
• Configuring & tuning the analysis tool
• Filter using ‘regular expressions’

Linking URLs to bibliographic references
• Programmatic concordance
• e.g. IRStats
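The first two solutions can be sketched as a filter applied to parsed log entries before anything is counted. Everything here is an assumption for illustration: the bot list is deliberately tiny, the file extensions are a guess at what counts as auxiliary, and the sample requests are hypothetical.

```python
import re

# Assumption: a short, illustrative list of crawler user-agent fragments.
# A real analysis would need a much longer, regularly updated list.
KNOWN_BOTS = ("ia_archiver", "gazz", "googlebot")

# Auxiliary files that should not count as eprint usage.
AUXILIARY = re.compile(r"\.(css|js|gif|jpe?g|png|ico)$", re.IGNORECASE)


def is_bot(agent):
    """True if the user-agent string contains a known crawler name."""
    lowered = agent.lower()
    return any(name in lowered for name in KNOWN_BOTS)


def is_countable(path, agent):
    """True for requests that are neither bot traffic nor auxiliary files."""
    if is_bot(agent):
        return False
    if path == "/robots.txt" or AUXILIARY.search(path):
        return False
    return True


# Hypothetical (path, user-agent) pairs standing in for parsed log entries.
requests = [
    ("/123/1/paper.pdf", "Mozilla/5.0"),
    ("/portfolio.css", "Mozilla/5.0"),
    ("/index.htm", "ia_archiver"),
    ("/Logo.gif", "gazz/5.0 (gazz@nttr.co.jp)"),
]

counted = [path for path, agent in requests if is_countable(path, agent)]
print(counted)
```

A name-based filter of this kind only removes crawlers that identify themselves, so unannounced bots would still need to be spotted by other means, such as unusual request rates.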

Over to Chris for DSpace statistics…

What are your priorities for statistics?

Peter Millington

peter.millington@nottingham.ac.uk