
The use of bibliometrics to measure research quality in UK higher education institutions

Research report



Research Reports

This series of Research Reports published by Universities UK will present the results of research we have commissioned in support of our policy development function. The series aims to disseminate project results in an accessible form and there will normally be a discussion of policy options arising from the work.

This report has been prepared for Universities UK by Evidence Ltd.

Evidence Ltd (http://www.evidence.co.uk) specialises in research performance analysis and interpretation. It has extensive experience with, and databases on, research inputs, activity and outputs relating to research both globally and particularly for the UK research base. It has also developed innovative analytical approaches for benchmarking international, national and institutional research impact. The company has worked closely with national research funding agencies, and individual staff have experience at national policy level and as senior institutional research managers.

Evidence holds or has access to a range of publication and citation databases derived primarily from the databases of Thomson Scientific in Philadelphia, US. The core data are the expanded Citation Indexes from which Thomson ISI Web of Science® is derived. Thomson ISI Web of Science® currently covers publications from approximately 8,700 of the most prestigious, high-impact research journals in the world. These are augmented by additional information on publication usage in universities derived from research and consultancy work by the company and its predecessors.

The copyright for this publication is held by Universities UK. The material may be copied or reproduced provided that the source is acknowledged and the material, wholly or in part, is not used for commercial gain. Use of the material for commercial gain requires the prior written permission of Universities UK.



Contents

Foreword

1 Summary

2 Background

3 Bibliometrics as indicators of quality
3.1 HEFCE’s tender specification
3.2 Why bibliometrics?
3.3 The nature of bibliometric data
3.4 Can bibliometrics produce appropriate indicators of research quality?
3.5 Can bibliometric techniques provide indicators that are transparent and comprehensible?
3.6 Bibliometric indicators of research quality in non-STEM disciplines

4 An indicator process
4.1 What variables should be included?
4.2 Combining indicators to produce overall ratings or profiles
4.3 What is the correct citation count?
4.4 The timeframe for bibliometric analysis
4.6 The population for assessment
4.7 Equal opportunities
4.8 How the broad subject groups should be defined and constructed
4.9 What is assigned to subject groupings?
4.10 Accommodating research differences between subjects
4.11 Interdisciplinary research
4.12 Data acquisition, collection and preparation for analysis
4.13 Potential costs and workload implications for HEFCE and institutions

5 Other issues

Notes




Foreword

In December 2006 the Government announced that a new framework for assessing and funding university research would be introduced following the completion of the next research assessment exercise in 2008. The sector has welcomed the key features of the announcement, which include the creation of a new UK-wide indicator of research quality. The intention is that the new framework should produce an overall ‘rating’ or ‘profile’ of research quality for broad subject groups at each higher education institution.

It is widely expected that the ratings will initially be derived from bibliometric-based indicators rather than peer review. These indicators will need to be linked to other metrics on research funding and on research postgraduate training. In a final stage the various indices will need to be integrated into an algorithm that drives the allocation of funds to institutions. The quality indicators would ideally be capable not only of informing funding but also of providing benchmarking information for higher education institutions and stakeholders. They are also expected to be cost-effective to produce and should reduce the current assessment burden on institutions. The Higher Education Funding Council for England is currently working on the development of the new arrangements. Its work includes an assessment of how far bibliometric techniques can be used to produce appropriate indicators of research quality and an evaluation of options for a methodology to produce and use these indicators.

In preparation for the funding council consultation that will take place later in the year, Universities UK commissioned Evidence Ltd to explore some of the detailed and technical issues that arise when using metrics in the research assessment process. Its report does not suggest a preferred approach but identifies a number of issues for consideration. We are publishing this report with the aim of briefing the higher education sector on the kind of issues that it will need to consider when responding to the forthcoming consultation. The report will also help Universities UK to formulate its position on the development of the new framework in the context of its support for replacing the research assessment exercise after 2008. It is essential that the sector fully engages with this important consultation with the aim of ensuring that the new arrangements maintain the excellence of the UK research base and have the full confidence of the academic community.

Professor Eric Thomas
Chair, Research Policy Committee
Universities UK



1 Summary

It is the Government’s intention that the current method for determining the quality of university research – the UK Research Assessment Exercise (RAE) – should be replaced after the next cycle is completed in 2008. Metrics, rather than peer review, will be the focus of the new system and it is expected that bibliometrics (using counts of journal articles and their citations) will be a central quality index in this system.

The objective of any change in the assessment method should be to sustain recent improvements in UK research performance. To do this, the metrics system will need not only to be technically correct but also to be acceptable to, and inspire confidence among, the researchers whose performance is assessed.

Bibliometrics are probably the most useful of a number of variables that could feasibly be used to create a metric of some aspect of research performance. Thomson Scientific holds sound international databases of journals and their citations with good time, subject and institutional coverage. These data have characteristics (particularly in terms of the publication and citation cultures of different fields) which mean that they must be interpreted and analysed with caution. They need to be normalised to account for year and discipline and for the fact that their distribution is skewed. These factors will affect analyses and must be addressed with care.

There is evidence that bibliometric indices do correlate with other, quasi-independent measures of research quality – such as RAE grades – across a range of fields in science and engineering. But such correlations leave a substantial residual variance, and average citations per paper would be a poor predictor of grade. Furthermore, there may be fundamental differences between informed researcher perceptions and simple metrics of research quality.

There is a range of bibliometric variables as candidate quality indicators. There are strong arguments against the use of (i) output volume, (ii) citation volume, (iii) journal impact and (iv) frequency of uncited papers. A number of new methods have attracted interest but are either superficial (for example, the h-index) or remain unproven for the present (for example, webometrics). Output diversity is a potentially valuable attribute but challenging to index.

‘Citations per paper’ is a widely accepted index in international evaluation. Highly cited papers are recognised as identifying exceptional research activity. These measures are not usually applicable to individual researchers, but if incorporated in an approach to profiling the overall output of research units they could prove of value. If such profiling were associated with an analysis of performance trends then that could lead to an acceptable analysis, if other concerns can be satisfied.

Citation counts – their accuracy and appropriateness – are a critical factor. There are no simple or unique answers. It is acknowledged that the Thomson databases necessarily represent only a proportion of the global literature. This means that they account for only part of the citations to and from the catalogued research articles, and coverage is better in science than in engineering. The problems of obtaining accurate citation counts may be increasing as internet publication diversifies. There are also technical issues concerning the fractional assignment of citations to multiple authors, the relative value of citations from different sources and the significance of self-citation. The timeframe for assessment, and for citation counting relative to the assessment, will also affect the outcomes and may need to be adjusted for different subject groups.

The population to be assessed needs to be defined, in principle and operationally. In particular, is the assessment to be of individuals and their research activity, or of units and the research activity of the individuals working in them? How will this affect data gathering, and how will that be influenced by the census dates for more frequent assessment? There are equal opportunity issues to be considered. It is unlikely that bibliometrics will exacerbate existing deficiencies in this regard, except insofar as research managers perceive a sharper degree of differentiation, but metrics are unable to respond to contextual information about individuals.




The definition of the broad subject groups and the assignment of staff and activity to them will need careful consideration. While the RAE subject groups might appear sensibly to follow traditional faculty structures, this is no longer the unique structure for research activity. The most important aspect of the subject grouping, however, is the strategy that is used subsequently to normalise and aggregate the data for finer-grained subjects within each group. This is likely to be complex and to vary by group, but the precise level of normalisation of data will have a profound effect on outcomes. It is noted that similar considerations will apply to any other data, on funding or training.

Differences between subjects (at a broad and fine level) mean that no uniform approach to data management is likely to prove acceptable if all subjects are to be treated equitably. There will need to be sensitive and fine-scale adjustments of normalisation and weighting factors, and of the weighting between bibliometrics and other indicators. There is also a challenge to be addressed in the management of interdisciplinary research where, again, the insensitivity of metric algorithms will miss the benefits of peer responsiveness.

The management of the bibliometric data will need to be addressed. The licence cost will be significant and there will be a substantial volume of initial work to set up an effective database for this purpose. In the longer term, this development may produce a net return to institutions by providing additional local management information. Internal research management will be unchanged and much the same information will ultimately be required. In this context, the role of peer oversight needs to be clarified.

Profiling methodologies, based on normalised citation counts, appear to be the most likely route to developing comprehensive and acceptable metrics. They should also prove useful in differentiating excellence for benchmarking, but the strategy for normalising the raw citation data prior to analysis will be central and critical.

A number of potentially emergent behavioural effects will need to be addressed, although experience suggests both that many behavioural responses cannot be anticipated and that some of these responses could jeopardise the validity of the metrics themselves in the medium term.




2 Background

In December 2006, the UK Government announced that a new framework for higher education research assessment and funding would be introduced following the next national research assessment exercise (RAE2008). The Higher Education Funding Council for England (HEFCE), in collaboration with the other national higher education funding bodies, is developing this framework. An early priority is to establish a UK-wide indicator of research quality (for science-based subjects in the first instance). The intention is that the framework should produce an overall ‘rating’ or ‘profile’ of research quality for broad (faculty-based) subject groups at each higher education institution.

It is widely expected that the index will initially be derived (in part) from bibliometric-based indicators, but expert subject panels would be involved in producing the overall ratings[1]. The bibliometric indicators will also need to be linked to other metrics on research funding and on research postgraduate training. And the various indices will need to be integrated into an algorithm that drives the allocation of funds to institutions.

These quality indicators would ideally be capable not only of informing funding but also of providing benchmarking information for institutions and stakeholders. They are also expected to be cost-effective to produce and to reduce the current assessment burden on institutions.

This document reviews and comments on background issues related to HEFCE’s stated aims:

- to assess how far bibliometric techniques can be used to produce appropriate indicators of research quality; and
- to evaluate options for a methodology to produce and use these indicators.

The focus is specifically on published journal articles and not on published patents.



3 Bibliometrics as indicators of quality

3.1 HEFCE’s tender specification

What is addressed?

Early in 2007, HEFCE invited contractors to determine whether bibliometric techniques could provide indicators that are:

- acceptable and valid measures of research quality;
- comprehensive across science, engineering, technology and mathematics (STEM)[2] disciplines and all UK higher education institutions;
- robust and reliable when applied at the level of broad subject groups; and
- capable (at a broad level of aggregation) of identifying high quality research and of discriminating between varying degrees of excellence.

What is not addressed?

It will be noted that HEFCE’s specification does not incorporate any explicit reference to academic confidence in the outcome. However, it does make a reference to the question of ‘acceptability’, although this begs the question ‘acceptable to whom?’

HEFCE will also need to consider the implications of using bibliometric-based indicators of research quality within a new funding and assessment framework. Some of these implications will not become clear until the system is implemented although, at the outset, cautionary statements might be made regarding the likelihood that:

- once any social or economic indicator or other surrogate measure is made a target for the purpose of conducting policy, it will lose the information content that would qualify it to play such a role[3];
- there are potential behavioural effects of using bibliometrics as researchers respond to indicators instead of reality;
- there will be scope for any indicator system to be manipulated over time, especially by 50,000 intelligent academics; and
- substantive criticisms from key stakeholders will arise because metrics focus on select aspects of research, particularly the outputs of fundamental research, rather than on the process as a whole.

3.2 Why bibliometrics?

The research process can be simplified as:

INPUTS – ACTIVITY – OUTPUTS – OUTCOMES

What we are really interested in is the quality of the research activity. If it is high then we might reasonably expect that the output will be good and that this will lead to beneficial outcomes. However, we cannot measure the quality of research activity directly. Although peer experts can usually establish fairly quickly whether a laboratory or group in their field is any good or not, that perception does not translate into an objective measure.

Indicators

To overcome our limitations we use ‘indicators’ – and that is all they are. They indicate what we want to know but do not measure it directly. They are proxies.

Income

One indicator of competence is the ability to acquire a high level of scarce income for research support. Such an indicator would be made sharper if we restricted the analysis to income from peer-reviewed sources such as the research councils.

Income is a problematic indicator, however, and economists might challenge the use of an ‘input’ as an indicator of quality under any circumstances. In this instance, there is a cap on the total available income: a limitation determined by policy as much as by the scarcity or abundance of quality recipients or the size of the field as a whole. Furthermore, cost varies between theoretical and practical projects within a field. So, for these and other reasons, inputs are usually taken as only a partial measure, even if they are limited to peer-reviewed sources.

Outcome

Outcomes from basic research, which comprises much of the public-sector research base activity, can be disconnected from the original research. First, the outcome may not be clear for many years. Second, the outcome may be affected by many original discoveries, and one discovery may likewise influence many outcomes. In the absence of a one-to-one relationship it becomes challenging to index the value of activity satisfactorily.




Outputs

Outputs overcome some of these problems. Furthermore, citation of outputs provides an apparent quality measure. For these reasons, bibliometrics provide an attractive source of research performance data. Further benefits of using such data are that they cover many fields in a similar way and therefore enable some measure of comparability. They also cover many countries in the same way and so provide further value in comparisons. And Thomson Scientific Inc holds a database initiated by the Institute for Scientific Information (ISI) in the 1960s, so there is a well-developed data structure and a powerful back-resource on which to draw.

Because the Thomson databases provide the most effective and comprehensive ‘currency’ for indexing research performance they have become the de facto standard for many research evaluations in the natural sciences.

3.3 The nature of bibliometric data

Citations between papers are signals of intellectual relationships. They are a natural, indeed essential, part of the development of the knowledge corpus. They are therefore valuable as an external index of research because they are produced naturally as part of ‘what researchers do’ and because they relate naturally to ‘impact’ and ‘significance’. Not all indicators have such attributes.

Citation accumulation

Once a paper is published it starts to attract interest from other researchers, who may then use it as a reference point in their own work, adding it as a cited reference in subsequent publications. Thus citations accumulate over time, and the number of uncited papers for any one year gradually falls.

Figure 1: Citation accumulation for papers in Geological Sciences, 1981-2002. [Chart: citations to end-2006 (left axis) and percentage of papers uncited by end-2002 (right axis), by year of publication. Source: Thomson Scientific®; analysis: Evidence]




In this example, citations in geological sciences rise and then plateau after eight to ten years, but this would happen earlier in some other fields. The number of uncited papers falls over five years, but some are never cited.

It is necessary to take these dynamics into account in any bibliometric analysis. Older papers are likely to have had more time to increase their citation count, and there are likely to be fewer uncited papers in samples from more distant years. It is therefore necessary to adjust data by year, or ‘normalise’ to some common standard. Normalisation strategies are discussed in more detail in a later section.

Disciplines differ

Time is not the only factor causing systematic differences in samples of publication and citation data. Different disciplines have innate, cultural differences in the way in which they use the literature, in terms of article length, frequency and citation structures.

In crude terms, biomedical researchers tend to produce more, shorter papers where methodology and prior knowledge are extensively codified in a dense array of citations. Physical scientists and engineers produce less frequent but longer papers, with more detailed content and fewer cross-references. These characteristics, not relative quality, affect typical citation rates.

Table 1: UK publication and citation totals by broad subject group, 2002-2006

Subject group                    Average cites  Citations  Papers in
                                 per paper      to date    Thomson journals
Molecular biology and genetics   16.15          205,597    12,733
Whole-organism biology            4.82           95,387    19,804
Physics                           5.32          177,398    33,352
Engineering                       1.96           50,696    25,886

These are stereotypes of the differences between fields, but they point to the challenge of comparability. It should also be noted that the stereotype is only a partial truth, because there is great variation between fields within these broad subject groups. Molecular biologists do not use the literature in the same way as organismal biologists, and bibliometrics cannot compare the two directly. This further increases the complexity of satisfactory quantitative evaluation.

Rebased impact

Eugene Garfield, the founder of ISI, drew attention in the 1960s to processes for normalising impact by field and year. The process of normalisation to enable comparison across years and disciplines is also referred to as ‘rebasing’ the citation counts to a common standard. For this reason, normalised impact indices are referred to as rebased impact, or RBI. (RBI appears in various figures in this document.)

Skewed distributions

The distribution of almost all research data is skewed: there are many low-index data points and a few very high-index points. This applies to funding per person or per unit, and it applies to papers per person and to citations per paper. These positively skewed distributions typically have a mean (average) that is much greater than their median (central point).
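Both adjustments can be made concrete with a small worked sketch. The Python below uses invented citation counts and invented field-and-year world averages (they are illustrative assumptions, not Thomson data): it rebases raw counts so that 1.0 means world average for the same field and year, and it shows why the mean of a skewed sample is a poor guide to its centre.

```python
from statistics import mean, median

# Illustrative world-average citations per paper by (field, publication year).
# These baselines are invented for the sketch; a real analysis would take
# them from the Thomson field/year averages.
WORLD_AVERAGE = {
    ("molecular biology", 2002): 16.2,
    ("engineering", 2002): 2.0,
}

def rebased_impact(citations, field, year):
    """Normalise ('rebase') a raw citation count so that 1.0 equals the
    world average for papers of the same field and publication year."""
    return citations / WORLD_AVERAGE[(field, year)]

# Raw citation counts for two hypothetical 2002 cohorts. The engineering
# papers are not 'worse'; their field simply cites less densely.
mol_bio = [0, 3, 8, 12, 15, 20, 25, 40, 150]   # note one very high outlier
engineering = [0, 0, 1, 1, 2, 2, 3, 5, 18]

for field, counts in [("molecular biology", mol_bio), ("engineering", engineering)]:
    rbi = [rebased_impact(c, field, 2002) for c in counts]
    # The mean sits well above the median: a few highly cited papers skew
    # the distribution, so the average alone is uninformative.
    print(f"{field}: mean RBI = {mean(rbi):.2f}, median RBI = {median(rbi):.2f}")
```

After rebasing, a paper at RBI 1.0 is at world average for its field and year, so cohorts with very different raw citation counts become directly comparable.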



Figure 2: A typical skewed distribution for citation data. [Chart: frequency of UK Physics papers for 1995 (n = 2,323) by impact category, normalised to world average; the world average is marked. Source: Thomson Scientific®; analysis: Evidence]

This is UK physics after ten years’ citation accumulation. Most papers are cited less often than the world average, although the UK average is above both the median and the world average.

Skewed data are difficult to compare visually and to interpret. The average is nowhere near the centre of the distribution and is no guide to the median value. Because the data follow a negative binomial distribution they cannot be handled using parametric statistical analyses, and it is therefore necessary to transform them in some way in order to arrive at a more intuitive presentation and manageable analysis[4].

3.4 Can bibliometrics produce appropriate indicators of research quality?

Evidence has argued that bibliometric techniques can create indicators of research quality that are congruent with researcher perception.

Is the data source valid?

Are the target data – i.e. the Thomson journal databases – likely in principle to produce a valid outcome?

Thomson indexes a huge and diverse range of journals (serials) from countries around the world. It examines the citation relationships between articles in these journals and selects about 8,000 to 9,000 titles for inclusion in its leading bibliographic products, such as Web of Science®, Science Citation Index and so on.

To be included in the database it is an absolute requirement that the serial should appear regularly and that it should have a well-documented and implemented editorial and refereeing policy.

The selection of journals was in the past informed by panels of leading researchers but is increasingly influenced by relative citation levels. Journals are dropped when they come to be cited less often than similar journals, and added as their citation profile rises.

Geographical and disciplinary coverage is also a factor. In the past it has been clear that the database had a biomedical, Anglophone and American-centric bias. That is still a partial problem, but our analyses for the European Commission have shown that a wider range of languages and countries is now represented and that the relative coverage of the social sciences and humanities is improving.



Thomson relies on commercial credibility. It is in the company’s interest to ensure that what it covers is what researchers (for whom national agencies are usually the proxy customers) would agree is a sound representation of the best current research.

Overall, therefore, the database is a reasonable representation of higher quality research publications. Analytical outcomes from these data should lead to a valid indicator.

Are bibliometric outcomes linked to research quality?

There are very few reports that comprehensively establish a relationship between bibliometric impact and any other, independent, evaluation. That is not to say that the efficacy of bibliometrics should necessarily depend on establishing any such correlation. It may be that bibliometrics measure one dimension while another metric approaches a different dimension; the cartography created by a plurality of partial indicators may then reconcile to a third, subjective perception.

In practice, the presence of an article in a journal with good editorial practice suggests it has at least some intrinsic merit, established by the peer review of the editor and referees. If that article is then widely cited, that adds a second level of peer recognition (and, if the citations endorse the work, approval). It would be surprising, therefore, if there were no match between bibliometric indicators and peer perceptions.

Our experience is that bibliometrics can be of sufficient utility to prove amenable and commercially valuable to research managers. We have extensive experience in identifying and addressing their limitations. Over fifteen years we have built up a comprehensive understanding of the procedures by which publication and citation data are collected and processed at Thomson Scientific. In-house, we constantly seek to improve the ways in which we manage, analyse and interpret Thomson data.

Nonetheless, because many limitations remain, we believe that:

- bibliometric indicators should not be used in isolation if they are presented as single citation averages;
- citation data should only be presented in the context of funding and activity data (this informs the design of our core products); and
- where bibliometric data must stand alone, they should be treated as distributions and not as averages. Our Impact Profile™ product moves away from a citation average to profiling across the quality spectrum.

Do bibliometrics parallel other quality measures?

In a series of studies for HEFCE, Universities UK, the former Office of Science and Innovation and other UK agencies, we have analysed the relationship between variables associated with research activity and the categorical grading assigned by the RAE.

First, RAE data confirm that journal articles are the preferred mode of output submitted for research assessment in the STEM areas for which HEFCE seeks to apply a metrics-based system. We assume that the items submitted normally represent material that indicates the highest available level of achievement for the individual (a counter-argument put by Professor J E Midwinter, University College London, is that researchers choose to submit a sample of ‘typical’ work [personal communication]).

Second, a high proportion of the submitted articles are in journals catalogued by Thomson. This is particularly so for journals that are present at relatively high frequency in the data on research outputs submitted in form RA2 for the RAE.

Third, within a unit of assessment, the average impact of the submitted articles for an institution is correlated with the impact of the institution’s total output but is somewhat higher, confirming our earlier assumption about ‘best work’.

Fourth, as is also evident from the following diagrams, the average citation impact tends to increase with the grade awarded by the peer review panel. The ‘goodness of fit’ between impact and RAE grade can be examined via a more direct plot (Figure 4).



Figure 3a-b: The correlation between the average impact of publications submitted to the RAE by a unit and the average impact of all publications in that discipline by the same higher education institution over the same period. [Scatter plots of HEI five-year average RBI, 1996-2000, against RAE 2001 mapped articles (RBI), with points marked by RAE grade (3b, 3a, 4, 5, 5*). Figure 3a, Chemistry (unit of assessment 18): Spearman r = 0.67, P < 0.001; ratio mapped/NSI = 1.63. Figure 3b, Psychology (unit of assessment 13): Spearman r = 0.57, P < 0.001; ratio mapped/NSI = 1.93. Source: Thomson Scientific®; analysis: Evidence]



Figure 4: Average citations per paper – data from RAE2001 for biology (unit of assessment 14). [Chart: rebased impact, 1996-2000, of institutional submissions by RAE grade (1, 2, 3b, 3a, 4, 5, 5*); world average = 1.0. Source: Thomson Scientific®; analysis: Evidence]

Each blue data point is the average impact of the papers in an institutional submission and each grey square is the integrated average at a stated grade.

This shows several things. At a gross level, the average rebased impact at each grade (that is, the average impact taken across all the units that were awarded that grade) progresses steadily upwards with grade. The average value for grade 4 units is around world average, which also makes sense in terms of the RAE criteria. So not only is there a similar progression but the relationship is coherent.

Residual variance

There is rather less good news when one considers the variation within any grade. It then becomes evident that the average impact for any stated unit within a grade band can be very variable. There is, in other words, a great deal of residual variance whatever the value of the correlation.

To put this another way, in a metrics-based system, the information that a unit had an average impact close to world average would not enable one to tell whether that unit was graded 4 or 5, or whether it might even be a very good 3a or a bibliometrically weak 5*.

Conclusions on the validity of bibliometric indicators

At a grand level, there is sufficient evidence available from experience and analysis to justify the general use of bibliometrics as an index of research performance. This has been found useful as part of management information in universities. But the application of bibliometrics as a determining factor at a narrower level is not justified by these analyses, and there are many factors (considered below) that would affect specific outcomes.

3.5 Can bibliometric techniques provide indicators that are transparent and comprehensible?

As well as the technical limitations of bibliometric techniques there are perceived limitations and potential criticisms from stakeholders. A past lack of transparency has been one of the causes of the wide range of misconceptions that have developed. The transparency of the indicators is an issue of interpretation:

- how do most researchers and other stakeholders currently interpret bibliometric information?
- how can bibliometric indicators be better presented so as to improve their transparency?



The literature is insufficiently informative because most analysis and interpretation is done by experts (scientometricians, bibliometricians and other analysts), or at least by those fairly familiar with the data and methods. As a result, some of the literature raises issues which are certainly worthy of consideration and help us to understand specific outcomes, but which do not necessarily lead to pragmatic solutions that would support management applications. For example, few scientometric groups have experience in research management or in the differentiation between formal and workable outcomes.

Our experience with staff in institutions whom we have met during surveys, consultancy and at conferences is that a wide range of misconceptions about bibliometrics are in circulation, and have been for many years despite regular rebuttal[5].

There will need to be substantial work to dispel concerns. HEFCE will, of course, have the responses from the earlier DfES/HEFCE consultation on the original proposals for a shift to metrics. This is probably the most current database of informed opinion and might allow it to anticipate and prepare for many of the concerns that should be addressed.

It would be reasonable to expect that the next stage of consultation would make explicit reference to that. The outcomes could be compiled into a brief document for the relevant website, as part of the frequently asked questions that are likely to be required.

3.6 Bibliometric indicators of research quality in non-STEM disciplines

It should be noted that HEFCE’s immediate priority is to develop appropriate metrics for STEM subjects, with research in non-STEM areas still being assessed by a peer process of some kind.

Despite this, it is worth considering the limitations on the use of bibliometrics for non-STEM subjects. They fall into two categories. First, there are data limitations, where researchers’ outputs (of various types) are not comprehensively catalogued in bibliometric databases. Second, there are ‘behavioural’ limitations arising from the ways in which researchers in these disciplines cite, or do not cite, previous work.

Mathematics

For mathematics and statistics, journal articles formed over 90 per cent of items submitted on form RA2 for the RAE in 2001. Fewer than 70 per cent of these articles were in Thomson-indexed journals, however, which compares with 75 per cent on average for STEM subjects. This suggests that there are data limitations, which are compounded by behavioural factors that produce an unusually high rate of ‘uncited’ papers.

Social sciences

In 2005/06, Evidence carried out two studies for the Economic and Social Research Council (ESRC) on the validity of bibliometrics and the use of research performance indicators in UK social sciences[6].

The first study involved a review of available data sources, a bibliometric analysis of RAE2001 RA2 submissions for each unit of assessment and a consultation with researchers and learned societies. Bibliometric measures were reasonably robust for some subjects (for example, psychology and economics) because they were already in current use and accepted by stakeholders. In other subjects, particularly in applied and policy-related areas where key outputs were less likely to be in the form of journal articles, bibliometric indicators were of less significance, technically difficult to produce and less likely to be acceptable to researchers as a measure of quality.

In the second study we found that many performance indicators used in science were also employed by social scientists, but in an individualistic and expert way that did not translate into a systemic algorithm. There was a credibility gap between what researchers would themselves use within their community and what they would accept in application to their community.

Arts and humanities

Publication mode is different again in the arts and humanities, with a strong preference for books and monographs. Such outputs are not generated rapidly, nor are citations widely distributed and catalogued. There are fewer resources to pay for online databases, and none to pay for staff to index and cross-reference the citations in the publications.




An evaluation by Evidence confirmed that the Web of Science® publication and citation data for arts and humanities disciplines were rather limited. The coverage was found to be strongly Anglophone, limited in scope and with a marked American focus. It could be used to make some assessment of leading research units in the UK in areas where journals were more widely used, but other areas were hardly touched and international comparisons were invalid.



4 An indicator process

There is unlikely to be an independent range of solutions for each of the potentially problematic issues that will arise. The outcome has to be methodologically workable, so we see the need for optimisation within a pragmatic model. This implies that there may be some policy compromises: a workable, cost-effective outcome that satisfies a high proportion of factors may still leave some issues unresolved. Ideal solutions for each issue could, by contrast, make an overall solution unworkable or prohibitively expensive.

In this context, we note that there is presently a focus on bibliometrics in isolation, whereas the ultimate implementation must set this indicator alongside others on funding and training (for example). Conclusions reached at this stage may therefore differ from those that would be reached were the research process considered as a whole.

4.1 What variables should be included?

For HEFCE, this should be the core of its work to develop a new assessment system, assuming that stakeholder reservations on principles and methods can be met.

The bibliometric data can provide a number of component variables. This can be developed as a well-structured analysis, but the elements could equally be seen as different indicators. Components that could be addressed by an appropriate model would include the following.

Output volume

The number of papers produced by a unit, department or institution should not be included as a bibliometric indicator, for the following reasons:

- the quantity of outputs has no direct bearing on quality; and
- the UK’s output has in the last few years fallen as a share of the world total without any detriment to quality. In fact, the UK is producing fewer uncited papers.

The fractional assignment of output to authors and institutions has also been mooted. This would potentially have a severe impact on collaboration, since it would reduce the net value of an output to an author who ‘shared’ it. In discussing the assignment of citation counts (below) we note that fractional assignment of outputs favours those who collaborate least, such as the United States, which compares better with the European Union in analyses using this approach.

There are some circumstances where volume might be useful for management information. For example, trends in output volume might indicate some aspect of activity, while the relative size of two units might be an indicator of their research capacity. But these are issues for further analysis, not for an indicator process.

Diversity of outputs, by journal and subject

The subject diversity of papers produced by a unit, department or institution is informative but should probably not be included as a bibliometric indicator.

A key indicator that we developed for the former UK Office of Science and Innovation is based on research diversity. The concept is that a more diverse research base is also more agile and responsive. This is therefore a desirable attribute and might reasonably be one to encourage.

Diversity is not necessarily scale-independent, however, because greater capacity gives greater room for sustainable diversity. Hence, large units are more likely to carry diversity than smaller units. So although this is an informative indicator, it is likely to favour larger institutions and departments irrespective of their quality. We would not recommend using it as a metric for research assessment without further investigation.

Citation volume

The number of citations acquired by a unit, department or institution should not be included as a bibliometric indicator, for the following reasons:

- it is an indicator of market share, not of performance; and
- if more papers are published then citation counts will tend to increase, because there will be some cross-referencing and there are more ‘targets’ to be cited.

Journal impact

The journal impact factor of assessed publications should not be included as a bibliometric indicator (see also Table 1), for the following reasons:

- typical citation rates vary between broad subjects: biology papers are on average cited more frequently than physics papers;
- citation rates also vary within broad subject groups and thus affect individual journal citation rates; and
- this variance is due to characteristics such as field size, publication frequency and citation culture, not to any innate difference in quality.




The impact factor of a journal is an issue of significant commercial interest. There is no doubt that publishers seek to increase their average citation rates and believe that by doing so they will increase the number of subscribers and perhaps affect the quality of papers submitted for inclusion.

It is not true that papers published in lower impact journals are innately of lesser quality than other outputs. On the one hand, the process of getting a paper accepted for publication in Nature (for example) is highly competitive, and to pass the editorial and refereeing process is an indication of significant interest and likely value. On the other hand, many papers submitted to relatively low impact journals are targeted at specific channels that increase the likelihood they will be read by a particular group of researchers or a particular practitioner or user group.

UK soil science is an example of an area with low impact journals but where outputs are deliberately targeted at users. Our work for the Department for Environment, Food and Rural Affairs (Defra) has shown that UK soil science is of high relative international impact and its utility within the UK and elsewhere is unchallenged. It would be extremely unfortunate if such research were coerced into high impact journals not read by the relevant users.

Uncited material

The number of uncited papers produced by a unit, department or institution should not be included as a bibliometric indicator, for the following reasons:

- the number of uncited papers in any ‘cohort’ or sample falls over time, so account needs to be taken of time since publication; and
- we do not know why any specific paper fails to be cited: it may be of poor quality, but it may contain important but negative results.

The frequency of uncited papers provides useful management information, but it is not necessarily useful for a metrics algorithm since a reasonable level of informed interpretation is required to make sensible use of the information.

There has been little work on the nature of uncited papers, or on methodologies for accounting for the important work that identifies less fruitful areas of investigation but which itself remains uncited (if indeed it does remain uncited). It is argued that publication of negative results is a desirable component of a cutting-edge research base. It is certainly not to be discouraged, because it increases efficiency by avoiding repeated errors in choosing paths to explore.

New methods

There are alternative methodologies of which the community will be aware and with which it may wish to make comparison. For example:

- Hirsch (h) index – this is unlikely to prove effective for HEFCE’s purpose because it works better for high-output, high-impact researchers and it produces only a single metric with low information content[7]. It is not applicable to the general body of researchers. (A minimal sketch of the calculation follows this list.)
- Citing rank index – this is akin to Google page-rank indexing, in that the number of links is moderated by the quality of those links. A citation from a high impact journal is worth more than a citation from a low impact journal (we discuss this further below). This might ‘control’ for spurious self-cites from multiple low-value sources, but it increases data requirements and management costs, and the outcomes are as yet unproven.
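As referenced in the list above, the h-index can be stated in one line: a set of papers has index h if h of them have at least h citations each. A minimal sketch (Python, with invented citation counts) shows both the calculation and why it compresses so much information into a single number:

```python
def h_index(citation_counts):
    """Return the largest h such that at least h papers
    have h or more citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Two hypothetical researchers with very different output profiles
# end up with the same single-number summary.
steady = [6, 6, 5, 5, 5, 4]           # consistent mid-range impact
spiky = [120, 80, 9, 5, 5, 1, 0, 0]   # a few blockbusters, long tail

print(h_index(steady))  # 5
print(h_index(spiky))   # 5
```

The two hypothetical profiles above score identically, which illustrates the low information content noted in the text.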




Average citations per publication

The average citation count of papers produced by a unit, department or institution could be included as a bibliometric indicator.

This is often referred to as a measure of ‘impact’, from the original recognition by ISI’s founder, Eugene Garfield, that papers cited more frequently than the average within their field have a greater ‘impact’ on the work of others[8].

This index of research quality is widely used by the scientometrics community. It has been employed extensively for many years by Thomson Scientific® and by ISI, its predecessor. More recently it has been used in, for example, the European science and technology indicators, by the CWTS bibliometrics research group at the University of Leiden (which endorses it under the label of a ‘crown indicator’) and in our own PSA target indicators for the Office of Science and Innovation[9].

The characteristics of citation accumulation mean that impact must always be contextualised. That is to say, account must be taken of both the year and the field of publication so as to normalise or ‘rebase’ a specific citation count against a relevant average, and thereby enable comparison between years and – if necessary – across fields.

The problem with using an average citation count is that the average of a research performance distribution has little to do with the median, because the data are highly positively skewed (see above). Thus the average tells us little on its own about the balance of work between poor and high quality.

It is critical that the data should be appropriately treated before being aggregated. Normalisation strategies are a critical part of any metrics-based methodology and will be discussed in more detail below.

Highly-cited papers

The number of highly-cited papers produced by a unit, department or institution could be included as a bibliometric indicator.

Thomson has established a criterion for ‘highly-cited’ which captures the most frequently cited one per cent of outputs after taking into account the field and year of publication. The UK has about 13.3 per cent of world papers that meet this criterion, which is even better than its share of total papers. Rather than looking at the total output, some evaluations focus on these publications on the assumption that, if they are unusually highly cited, they are likely to have made the greatest contribution within their field, or to innovative products and processes.

There is no doubt that highly-cited papers are associated with exceptional research, but the metric is a poor index of more general research activity. The threshold is so high that in many fields few UK institutions would have more than a handful of papers in the index.

Profiles

This is the most informative approach to bibliometric assessments of research performance.

We noted above that bibliometric data should be considered in terms of distributions rather than averages where they must be used in isolation. This helps to overcome the extent to which the ‘average’ disguises the natural skew in the data. For management information, a profile of ‘impact’ is a helpful illustration that shows how performance is distributed and how it compares with a reference profile. For HEFCE’s purpose, the issue would be how to extract key variables from such a profile in order to capture the essential characteristics algorithmically.

In Figure 5, the naturally skewed distribution of citation data has been made visually more acceptable by sorting the data into ‘bins’ relative to the world average[10].


Figure 5: What an Impact Profile™ would look like – RAE 2008, sample unit of assessment. [Chart: percentage of output, 2001-2005, in each RBI category (0, >0-0.125, 0.125-0.25, 0.25-0.5, 0.5-1, 1-2, 2-4, 4-8, >8) for Imperial College London (3,525 papers), University of Manchester (2,773 papers), Bristol University (1,546 papers) and University of Southampton (1,800 papers). Source: Thomson Scientific®; analysis: Evidence]
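The binning step behind such a profile is mechanical. A minimal sketch (Python; the bin edges follow Figure 5, while the RBI values are invented for illustration) collates a unit’s rebased-impact values into the same categories and reports each bin as a percentage of output:

```python
# Upper bounds of the RBI bins used in Figure 5; the final bin is open-ended.
BIN_EDGES = [0.125, 0.25, 0.5, 1, 2, 4, 8]
BIN_LABELS = ["RBI = 0", ">0-0.125", "0.125-0.25", "0.25-0.5",
              "0.5-1", "1-2", "2-4", "4-8", ">8"]

def impact_profile(rbi_values):
    """Sort rebased-impact values into bins and return the percentage
    of a unit's output falling in each bin."""
    counts = [0] * len(BIN_LABELS)
    for rbi in rbi_values:
        if rbi == 0:
            counts[0] += 1            # uncited papers get their own bin
            continue
        for i, upper in enumerate(BIN_EDGES):
            if rbi <= upper:
                counts[i + 1] += 1
                break
        else:
            counts[-1] += 1           # beyond the last edge: RBI > 8
    total = len(rbi_values)
    return {label: 100 * n / total for label, n in zip(BIN_LABELS, counts)}

# Invented RBI values for a small hypothetical unit (1.0 = world average).
unit = [0, 0, 0.1, 0.3, 0.4, 0.8, 0.9, 1.1, 1.5, 2.5, 3.0, 5.0, 9.5]
for label, pct in impact_profile(unit).items():
    print(f"{label:>12}: {pct:4.1f}%")
```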




Figure 6: Comparative impact profiles for bibliometric data from two sets of researchers working in the same field, in research council units and in higher education institutions. [Chart: percentage of output, 1999-2003, in each RBI bin from uncited up to ≥8, for higher education institutions and research council units. Source: unnamed research council, Thomson Scientific®; analysis: Evidence]

What advantage does this offer compared with average normalised impact? Evidence recently completed some analyses for a research council which showed the value of looking at citation profiles as well as averages. Because of some extremely highly cited reviews in Nature arising from an international project, one group had a much higher average rebased impact, yet the citation profiles (above) were almost indistinguishable.

The average normalised citation impact of the two groups differs markedly (2.39 versus 1.86) because of the exceptionally high outliers in one group.

What values might be abstracted from such a profile? (A minimal sketch of these values follows the list.)

- uncited papers as a proportion of the total;
- the proportion of papers cited less often than a benchmark (UK average, world average);
- the proportion of papers cited more often than the benchmark; and
- the proportion of papers cited at exceptionally high levels for field and year (where ‘exceptional’ requires a threshold [such as more than four times world average] to be defined, as for highly-cited papers).
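A minimal sketch of those four summary values (Python; it works directly on rebased-impact values, and the 4× world-average threshold is simply the example threshold mentioned in the last bullet):

```python
def profile_summary(rbi_values, benchmark=1.0, exceptional=4.0):
    """Reduce a distribution of rebased-impact (RBI) values to four
    candidate indicators: uncited share, share below a benchmark, share
    above the benchmark, and share cited at exceptional levels."""
    total = len(rbi_values)
    share = lambda predicate: 100 * sum(1 for r in rbi_values if predicate(r)) / total
    return {
        "uncited (%)":           share(lambda r: r == 0),
        "below benchmark (%)":   share(lambda r: 0 < r < benchmark),
        "above benchmark (%)":   share(lambda r: r >= benchmark),
        "exceptional, >4x (%)":  share(lambda r: r > exceptional),
    }

# Invented RBI values for a hypothetical unit (1.0 = world average).
unit = [0, 0, 0.2, 0.4, 0.7, 0.9, 1.2, 1.8, 2.5, 4.5]
print(profile_summary(unit))
# {'uncited (%)': 20.0, 'below benchmark (%)': 40.0,
#  'above benchmark (%)': 40.0, 'exceptional, >4x (%)': 10.0}
```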

All of these values could be used as a metric on their own, but drawing on them in a structured way, from a prescribed distribution format that can itself be inspected by both the assessed and the assessors, may help to make the process more transparent and acceptable. Not least, it shows how the metric components link to the underlying data and how they are derived, and it tests whether they make sense.

Trends in output and quality

It is arguable that any point-metrics, even those developed from a profile, are problematic because they capture performance only at an instant. An informed observer, by contrast, should be able to take account not only of position but also of trajectory.




4.2 Combining indicators to produce overall ratingsor profiles

We can consider a series of steps to combiningindicator components:

p for each adopted component it would benecessary to produce a single number orindex value;

p indicators are then combined by scaling thenumbers to bring them within the same range(as one would do with head-counts of PhDsand £millions of income);

p suitable weightings need to be determined forthe contribution to quality and valuejudgments of other (funding and training)data; and

p the outcomes need to be integrated.

It is too early to propose a ‘best’ approach tocombining indicators but it is recognised thatUniversities UK will want to have some idea ofpossible approaches and the following is as anexample using the notion of a profile ofincreasing impact (normalised for year and by asubject related factor)11.

The curve in Figure 7 (below) shows the proportion of papers in any category, not the numbers. Table 2 shows an example of sampling and weighting to produce a single metric for these profiled papers.


It may therefore be desirable to include a factor that indexes current (or recent) performance against past performance and gives a greater weighting to improvement and to a sustained profile than to one that is declining. There is a two-fold gain:

- less weight is placed on historical or reputational aspects, and there is likely to be a greater reward for rising stars than for old troupers; and
- there is a reward for the management ability to sustain performance.

This also emphasises the integrated performance of a unit rather than the peak performance of individuals.
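One possible form for such a factor is sketched below: the ratio of recent to past normalised impact, clamped so that a single anomalous year cannot dominate. The ratio form, the clamping band and the function name are assumptions for illustration only:

    def trajectory_factor(past_impact, recent_impact, floor=0.5, cap=1.5):
        """Return >1 for improving units, <1 for declining ones, within a band."""
        if past_impact <= 0:
            return 1.0  # no meaningful history: neutral factor
        return max(floor, min(cap, recent_impact / past_impact))

    print(trajectory_factor(0.9, 1.4))  # rising star: 1.5 (capped)
    print(trajectory_factor(1.6, 1.1))  # declining profile: ~0.69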

Conclusions on variables to use as metrics

None of these metrics is uncontroversial. Different methodologies would more or less readily produce such a range of variables. The h-index, for example, would produce only a single variable. Impact Profiles would produce data for all the above variables.

Not output volume. Although it would reflect productivity, its use as a performance indicator could spur unnecessary levels of trivial output. Fractional assignment by author and institution would be even more problematic. The required balancing factors make the system unduly complex.

Not uncited publications. They include both important work, which remains uncited because it valuably identifies blind alleys (innovative researchers are more likely to detect these than followers), and inconsequential work, which remains uncited for its lack of value. Indexing 'uncited' volume may be open to manipulation.

Not journal impact factors. They are not an index of quality and their use in metrics would perturb behaviour. Variation in journal quality is linked to other factors that need to be considered in data normalisation strategies (see below).

Overall, we anticipate that a methodology emphasising the distribution of cited papers is most likely to be acceptable to most stakeholders, but they will need convincing of fairness, balance and sensitivity to disciplinary culture and nuance.

The adoption of some measure that takes trajectory as well as snapshot performance into account would be desirable, to reward improving and sustained overall performance as well as individual peaks.

Figure 7: Impact profiles for two research-based institutions.
[Chart: percentage of HEI output 1995–2004 in each RBI band, from uncited through > 8 times world average, comparing a leading research institution with a 1960s campus institution. Source: Thomson Scientific®; Analysis: Evidence.]


These are real institutions. The profile shows the distribution of quality, while Table 2 (below) translates this into the categorised numbers of papers. These categories can be changed according to the significance attached to different parts of the profile.

Once collated into the designated categories, the numbers of papers are weighted and the weighted counts summed to produce a score: the 'metric' of research quality. The weightings can also be adjusted to attach greater value to some areas.

The methodology as implemented reduces the degree of selectivity between A and B, but a change to the weighting (for example, changing the cited > 4 weighting from 4 to 8) would restore that differential.

The weighted score, once the categories and weightings are agreed, would then feed into a further stage in the development of HEFCE's metrics. At that point the bibliometric index would need to be scaled against the funding and studentship components.

It will be seen that this approach has some similarities to the revised 1* to 4* grade system proposed by Sir Gareth Roberts and adopted for RAE2008. This principle would affirm for stakeholders a link in the progression from RAE2001, through RAE2008 and into the metrics system.

Table 2: The weighted score from the metrics process is compared here with an 'impact power' measure derived from RAE data.

Category       Uncited   Cited < world   Cited > 1, < 4   Cited > 4       Weighted   Impact
                         average         world average    world average   score      power
Weighting      0         1               2                4
University A   640       993             1,053            303             4,311      5,703
University B   159       246             202              25              750        882

Source: Thomson Scientific®; Analysis: Evidence
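The weighted score column can be reproduced directly from the table, as the sketch below shows; the category keys are our own labels for the table's column headings:

    WEIGHTS = {"uncited": 0, "cited_below_world": 1, "cited_1_to_4": 2, "cited_above_4": 4}

    paper_counts = {
        "University A": {"uncited": 640, "cited_below_world": 993,
                         "cited_1_to_4": 1053, "cited_above_4": 303},
        "University B": {"uncited": 159, "cited_below_world": 246,
                         "cited_1_to_4": 202, "cited_above_4": 25},
    }

    for university, counts in paper_counts.items():
        score = sum(WEIGHTS[cat] * counts[cat] for cat in WEIGHTS)
        print(university, score)  # A: 4311, B: 750, as in Table 2

    # Raising the 'cited > 4 world average' weighting from 4 to 8, as suggested
    # in the text, widens the differential between A and B again.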

4.3 What is the correct citation count?

It is essential that there should be confidence that, if citation impact is to be used to index quality, then the citation counts are accurate and complete. It is also desirable that there should be agreement on how the accumulated citations are then categorised.

Database accuracy and completeness

Thomson uses well-established algorithms developed by ISI to link new publications through their reference lists to older publications, and hence to create citation counts for those older outputs.

This is not a simple process. While a researcher could take a reference list and immediately interpret where to go in a library to find a specific item, this is not always the case for a machine. Authors use diverse abbreviations for journals, provide inaccurate year and volume numbers, incorrect pagination and imaginative variations of article titles.

Thomson algorithms use field combinations to make a match, and the accuracy of citation counts is not usually seen as a serious issue. Even so, not all citations are collated, which may be an issue of concern to those in fields with typically low citation rates, where the difference between three and four cites for an article may be significant. Opportunities to validate data may be desirable, to create a higher level of confidence in the underlying data.
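Purely as an illustration of field-combination matching (this is not Thomson's actual algorithm), a cited reference can be linked to a catalogued record when enough independent bibliographic fields agree, so that a single authorial error does not defeat the match; the field names and threshold are assumptions:

    def fields_match(ref, record, required=3):
        """Match a cited reference to a source record if enough fields agree."""
        checks = [
            ref["journal"] in record["journal_variants"],  # tolerate abbreviations
            ref["year"] == record["year"],
            ref["volume"] == record["volume"],
            ref["first_page"] == record["first_page"],
        ]
        return sum(checks) >= required  # threshold is an assumption

    record = {"journal_variants": {"J Biol Chem", "J. Biol. Chem.", "JBC"},
              "year": 2004, "volume": 279, "first_page": 123}
    ref = {"journal": "JBC", "year": 2005, "volume": 279, "first_page": 123}
    print(fields_match(ref, record))  # True, despite the wrong year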

More importantly, it should be understood that Thomson data provide an internal linkage. What is being counted in the citation count for a stated article is the sum of citations from other articles in the journals catalogued by Thomson. Citations from other (non-source) items, including books and conference proceedings, are not counted.


Does this matter? Table 3 (below) shows the numbers for a large sample of articles for a leading research university. As much as a quarter of cited material in science is 'outside Thomson', and much more in engineering. We do not know the number of 'outside' items that cite into Thomson sources, but we might assume that it is proportionate across fields.

The conclusion must be that a significant number of citations to published material are not indexed, that this matters because it varies between fields, and that this should be taken into account in normalising data before aggregation.

Cites to multiple versions

Because of the accelerating pace of research, and because of the possibility of making material available electronically long before it is available in print, there is an emerging problem of 'version confusion'.

It has been a practice in engineering to make some papers available both as 'published proceedings' in well-established conference series and as journal articles. This is an unproblematic part of engineering culture, but it can reduce the count of citations to the journal version because researchers are already familiar with, and are citing, the proceedings version.

These cites to multiple versions could, in theory, be collated and a single reconciliation produced. In practice it is infeasible mechanically to distinguish between the identical and the similar. This problem of reconciliation becomes much worse with draft versions and preprints available on institutional websites. Titles, length and content may also differ slightly, but sometimes significantly, between these versions.

If this tendency proliferates, and because immediacy is valued it is likely to do so, then the citation tallies for journal articles may become less valid indicators of impact. This is not likely to happen at the same pace in all disciplines. Some, such as physics and computing, already make more frequent use of pre-print and electronic posting than do others, such as biology.

It is claimed12 that search engines will be able to collate citations to multiple versions and to categorise similar citations to different versions, but the relationship of this to research quality remains unproven.

This problem may be significant for automated metric systems but is less important for peer review systems, where the peers are able to recognise a submitted article that stands as a signifier of record for what may have been several interim versions.

The consequence is that citation metrics that are valid now will need to be re-evaluated regularly and across different fields to ensure that they remain valid as publication practices change.

Fractional assignment of cites to authors

It is a moot point as to whether all citation credit should be attributed to all authors. This is likely to be an increasingly important issue in a world where a rising proportion of research papers are collaborative and where other research shows that collaboration produces higher-value papers.

Table 3: Single higher education institution sample: material cited by Thomson articles and not within the database.

Unit of Assessment                Number of         Number of     Cited items that are
                                  source articles   cited items   not Thomson articles (%)
All subject areas                 62,965            890,876       38
1  Clinical laboratory sciences   13,125            219,613       20
14 Biological sciences            9,501             202,200       24
18 Chemistry                      5,149             96,768        27
30 Mechanical engineering         1,466             24,368        41

Source: Thomson Scientific®; Analysis: Evidence


The UK has recently increased the level of its international co-authorship, so that about 40 per cent of its journal articles have at least one non-UK co-author address. That is a fairly typical level of collaboration across the European Union. The United States is less collaborative (about 25 per cent of its papers have an international co-author) but is still linked to 170 of 192 countries analysed. The outcome has a marked impact on analyses of output that use fractional assignment. European Commission science and technology analyses (whole-article assignment) suggest higher European Union and United States citation counts than do National Science Foundation analyses (fractional assignment).

The National Science Foundation analysis (Figure 8a) shows the United States and the European Union retaining a smaller share of world citations, whereas the Office of Science and Innovation analyses (Figure 8b) suggest that they are now more similar.


For example, if an article is authored by two people rather than one, should they each be allotted half of the citations for the purposes of indexing their individual research excellence? If the authors are working in separate institutions, then should the institutional citation collations be reduced proportionately to take account of the multiple authorship? If there are three authors, two in one institution and one in a second, then should the citations be split proportionately to people or to institutions? What if one author has recently moved institutions and was only giving the address elsewhere as a courtesy?

There are extreme examples: what should be done with astronomy papers that have over 100 authors, from up to 50 countries and a greater number of institutions? If fractional assignment is applied to the RAE data, would it affect only the assignment of authorship within the UK or be applied internationally?
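A sketch of the simplest fractional scheme implied by these questions: split each paper's citations across author addresses, so that an institution's credit is proportional to its share of the authorship. The function and the figures are illustrative only:

    from collections import Counter

    def fractional_credit(citations, author_institutions):
        """Share a paper's citations across institutions by authorship."""
        n = len(author_institutions)
        shares = Counter(author_institutions)
        return {inst: citations * count / n for inst, count in shares.items()}

    # Three authors, two at institution X and one at Y (the case posed above):
    print(fractional_credit(30, ["X", "X", "Y"]))  # {'X': 20.0, 'Y': 10.0}
    # Whole-paper counting would instead credit all 30 citations to both X and Y.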

Should greater weighting be given to lead authors? How do we identify lead authors, when practice varies between disciplines so that the lead is sometimes the first named author and sometimes the last named?

What about fields where authorship is strictly alphabetical? In a recent study for one of the regional development agencies, Evidence analysed the 'lead institution' for highly-cited publications from select fields of policy interest. This appeared to show a disproportionate balance towards one institution in environmental sciences. It emerged that, by chance, staff in that institution in that area had surnames with alphabetical precedence. The outcome, apparently indicating a particular strength at one location, was spurious.

Fractional assignment and collaboration

If a procedure involving fractional assignment of citations were adopted, then could this affect the level of collaboration between researchers working within and across institutions?

There are implications for the RAE metrics. Fractional assignment of papers and their citations would work against collaboration: do not share your glory. Evidence has an example of this from a collaborative programme in Austria, where the lead institutions collaborated on many outputs but reserved the highest-impact articles for single-institution authorship.

Thomson has reviewed fractional citation assignment and concluded that it does not significantly affect the outcomes of large-scale analyses. If it does have an effect at a fine level, then that effect is itself open to interpretation for the reasons discussed.

For Evidence, our practice has been to avoid fractional assignment entirely and to allow all authors, institutions and countries the full benefit of the citations received. The arguments for treating the data otherwise are too complex to produce any wholly satisfactory general approach. Furthermore, it is clear that much of the data would need to be reviewed manually rather than algorithmically, and the emergent issues would then proliferate.

Figure 8a-b: Fractional assignment of citation counts to author institutions results in a lower value outcome for more collaborative entities.
[Charts: share of world citations (%) for the USA and the EU-15. Figure 8a (top): fractional assignment, 1992–2003 (source: NSF Science & Engineering Indicators 2006). Figure 8b (bottom): whole-article assignment, 1996–2005 (source: OSI PSA target Indicators 2006).]



Self-citation

Is a citation to one's own work different from a citation from someone else? Some people believe that there is a difference: that self-citation is an undesirable factor in citation analyses and that self-citation should be excluded prior to evaluation.

Citations are part of the process of building on prior knowledge. They establish a thread of development, a natural and appropriate part of which is reference back to one's earlier work. It is a cultural necessity.

A recent paper13 has raised new arguments about the influence of self-citation: "the more one cites oneself the more one is cited by other scholars … our models suggest that each additional self-citation increases the number of citations from others by about one after one year, and by about three after five years … there is no significant penalty for the most frequent self-citers".

The statistics are objective but raise deeper subjective issues about why self-citation might be inappropriate, whether multiple self-citers are doing anything different from the normal sociology of research, and why self-citation might be penalised – which most researchers might find a surprising proposal!

What is self-citation?

First, what is 'self-citation'? If Adams (2005) cites Adams (1996) then the case seems simple. But what if King and Adams (2007) cite May and Adams (1998)? Does the presence of Adams taint the reference? And what if someone in May's group at Oxford cites May: is that still self-citation because it is within a small and local team, or is the cross-reference now not 'self'?

If we take a large and research-active group, say a team at the Medical Research Council's Laboratory of Molecular Biology in Cambridge, then we see constant cross-references within the laboratory, often with overlapping authorship. The laboratory is unquestionably cutting edge: more Nobel prizes than any other institution in the world. It is entirely logical that it makes extensive reference to its own work, since that includes some of the most innovative current work.

It is infeasible to build up a significant body of self-citations without a significant body of peer-reviewed publications in the set of journals covered by Thomson. And if those self-cites have passed peer review then they must have been seen to be appropriate.

This is, we believe, why Fowler found a link between multiple self-citation and additional non-self cites14. These were high-profile, leading-edge groups referring to their own work and being 'trailed' by many others. To 'penalise' these self-citations would be antithetical to research.

Second, how do you remove self-citation? We need to check all the papers citing Adams (1996)15 and remove any citations from Adams over the period 1996 to the present. Of course, we need to make sure these are all from that Adams (in Leeds) and not another Adams elsewhere. There is a problem, because there were actually three 'J Adams' in Leeds during the period, and the target 'J Adams' was also employed at Imperial College for part of the period.

A machine check is therefore likely to be problematic. In practice, a researcher might reasonably ask for a detailed validation of all their citations to check that only the right ones were removed. It would be no good arguing that 'on average' a mechanical system would work. The individual is indifferent to the average and will rightly be perturbed at erroneous outcomes.

Third, what is the behavioural effect of removing self-citations? HEFCE would be sending a signal to the system that self-citation is wrong. Behaviour might then change as a result. People would alter the way they cited, perhaps reducing their own valid connections but seeking less valid cites from other sources. There would be a risk of a serious disruption to the underlying culture. In the short term, the reduction in self-citing would undermine the UK's international citation profile and institutions would fall in comparative league tables.
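The obvious machine test, sketched below, flags a citation as 'self' when the citing and cited author lists share a name. The Adams and King/May cases above show why the answer it returns is not always the one wanted, and the Leeds homonym problem shows why it can be plain wrong:

    def is_self_citation(citing_authors, cited_authors):
        """Naive test: any shared author name makes a citation 'self'."""
        return bool(set(citing_authors) & set(cited_authors))

    print(is_self_citation({"Adams J"}, {"Adams J"}))                     # True: simple case
    print(is_self_citation({"King D", "Adams J"}, {"May R", "Adams J"}))  # True: but is it 'self'?
    # Homonyms defeat the test: three different 'J Adams' in Leeds would all
    # match one another, so citations could be removed from the wrong person.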


Cites in terms of who or what is citing

It is argued that some citations are worth more than others because of the author who is citing or because of the journal from which the citation is made. This is akin to Google's PageRank™ algorithm (link counts are like citation counts), as we noted in a previous section.

For example, if an output from a highly-cited research group cites some prior article, then the fact that they are perceived as 'high impact' themselves should make the citation more significant than a random cross-reference. Similarly, a citation from an article in Science might count for more than a citation from an article in Scientometrics.

A simple way of adjusting incoming cites for 'value' would be to use journal impact factors, but these are themselves problematic because citation rates vary between fields, as noted. Molecular biology citations would be counted as more valuable than population biology citations, which might be seen as inequitable within schools of biology.

Evidence has not sought to make use of weighted citations because we remain unconvinced that utility can be revalued in the ways indicated. If an item is found useful enough to justify a citation, then that is a universal expression of value. We accept that something that contributes to cutting-edge research might be deemed particularly valuable, but we would not accept that a journal impact factor is the right scale to express such relative value. We would be prepared to revise this opinion, particularly in regard to open-access and archived publication databases, if future work showed that such weighted measures made sense in terms of researcher utility.


Negative citations

There is frequent concern that some papers accumulate significant citation counts 'because they are wrong'. There is little evidence of this. The bulk of ill-conceived work that does pass editorial scrutiny and reach publication probably remains uncited because it is also trivial. A well-known example of frequently cited 'wrong' work (Fleischmann and Pons on cold fusion) still had a significant and non-trivial impact on work in its field. Interestingly, however, it was not published in a conventional medium because of the researchers' concern that it would not pass the initial editorial hurdles.

4.4 The timeframe for bibliometric analysis

Timeframe affects bibliometric data and indices. Immediacy is provided by recent data, while smoothing is provided by longer time 'windows'. There are two operational issues:

- the time frame over which publications are sampled; and
- the time frame over which citations to those publications are counted.

It can be argued that although papers in a short, fixed-time window will not accumulate very high citation counts, they could demonstrate sufficient differentiation to categorise quality. We have shown (Figure 9, below) that early citation rates are a good predictor of longer-term performance. But, while this may be valid for large samples, it is not readily acceptable for individuals.




Figure 9
[Chart: citation count 1995–2002 plotted against first-year citation count (1993–1994), on log scales; total papers 1993: 923. Source: Thomson Scientific®; Analysis: Evidence.]

For large samples of papers, the citations accumulated in the first two years after publication are a good guide to the likely citation count for years three to ten. For individual papers the trajectory is less certain16.

There are different citation accumulation rates for different fields. Biochemists cite rapidly and move on, while ecologists rarely expect papers to be cited within 18 months of publication. A five-year window of analysis will work well for the former group but poorly for the latter, although both would fall in a single metrics assessment area in HEFCE's structure. Disciplines within physics, such as astronomy and solid-state physics, may be affected similarly.

‘Mayflies’ and ‘sleeping beauties’

There will be concern about whether a short assessment timeframe provides a valid picture of the impact of an individual 'outlier' publication. This may be of two kinds: the mayfly, which shows exceptional early impact; and the sleeping beauty, ignored for years and then found to be of critical value.

The Fleischmann and Pons (1989) work on cold fusion might be taken as a mayfly, because it stimulated much attention before being rebutted. Most mayflies are of a lesser order but, like cold fusion, present stimulating ideas even if the content proves less sound. There is no convincing evidence that there is a disproportionate number of papers that gain exceptional transitory acclaim.

There are also papers which suddenly attract many citations after going unnoticed for several years following publication. These sleeping beauties are no more frequent than would be expected by chance, however, given the underlying distribution of citation patterns17. It is unlikely that this affects many researchers, and the effect might be little different from the outcomes of peer review on unnoticed work.

4.5 The population for assessment

Population

'What is being assessed' is an issue that we refer to at several points. The metrics on research inputs and outputs are associated both with institutions and with people. When people move between institutions, should the 'credit' associated with their metrics move with them or should it retain its association with the institution where the activity occurred?

The new metrics could cover the population in terms of:

- discipline = 'all the chemists';
- management unit = 'all the staff in chemistry'; and
- staff grade = 'all the tenured, research-active staff in chemistry'.


Part of this rests on whether funding is there to support units within an institution, which would then put forward the staff they have to pay, or research strategy within an institution, which may then group researchers according to interdisciplinary and strategic programmes.

Research-active?

Is assessment of all the activity of an institution, or only of the research-active component? This is a non-trivial question because it links into the metrics methodology in two ways:

- what volume of material needs to be made available for assessment?
- how are outputs going to be associated with people?

And there is a behavioural outcome:

- if the assessment of outputs relates to the total or only to the research-active population, what effect does that then have on the population?

Who selects staff?

In the past, institutions have pre-selected staff for assessment. They are thought to have become more selective in doing so (volume was less in RAE2001 than in RAE1996) because outcomes affect reputation as well as funding. If institutions are to select staff, then there will need, for equity, to be some consideration of whether that selection needs to be policed and validated, and whether all institutions are expected (and seen) to submit broadly the same staff group or to select one of their own choice.

Definitions

HEFCE must define the population in theory and in practice. The present system takes a particular census date, and institutions then submit a checkable list of staff on stated grades employed at that date. This is not the total research population either at the census date or over the assessed period, but it is well understood and readily verified.

For the future, on a shorter cycle, there will have to be:

- an operational definition of what is to be included, in terms of institutional and individual research metrics;
- an operational definition of who is to be included;
- a definition of when this is done, in terms of cyclical census dates and staff mobility; and
- a protocol for validating the presumptive population.

4.6 Equal opportunities

Career trajectories would need to be reviewed in relation to publication profiles to assess properly the potential implications for equal opportunities (for early-career researchers and others). This would involve linking individual staff records to historical publication records, which is an intensive manual data-processing task.

It seems likely that there are issues to be addressed in this regard, but it is equally likely that the same issues would have arisen under the RAE peer review, as they are systematic rather than specific to bibliometric or other indicators.

HEFCE analysis

HEFCE reviewed the data from RAE200118 and concluded that this revealed issues not about the assessment process but about the selection process in institutions prior to assessment:

- staff aged over 30 were more likely to be selected;
- men (64 per cent selected) were more likely to be selected than women (46 per cent). Like-for-like comparisons showed that men had significantly higher selection rates than women over a middle-age range between 30 and 47;
- unadjusted comparisons showed selection rates of around 58 per cent to 60 per cent for staff from most ethnic groups;
- staff from black ethnic groups had a lower selection rate of 37 per cent. This was partly because a higher proportion of these staff were employed in departments which did not submit to RAE2001; and
- disability was not a significant factor in the propensity to be selected.

No bibliometric difference on past data

Evidence found that the bibliometric impact of selected staff who submitted outputs for assessment showed no great differences:

- between men and women, if the highly-cited staff are excluded; and
- between researchers from different ethnic groups.


Peers versus metrics

Equal opportunities considerations point to a number of issues where peer review is entirely capable of adjusting perceptions for particular cases but metric algorithms can make no such adjustment.

People who have taken career breaks (typically, women researchers with a family) may have a disjointed publication portfolio. This can be flagged, but a metric adjustment for it needs to be discussed.

Early-career researchers have a smaller research portfolio, are still on the learning curve in their discipline and are thus less likely to have a large or widely known body of work. They are likely to have lower citation counts relative to their field, though their trajectory would be entirely 'readable' for an appointing committee.

We have no evidence that new researchers have innately weaker outputs, and the use of fixed time-windows would put their citation accumulation on a par with established staff. As a counter-factor, their output might actually be greater, and they would be citing and boosting their own work. Furthermore, every profile will include some new researchers, who could be flagged appropriately by employers.

The problem of institutional responses

More systematic issues affecting gender and ethnicity are not made worse by indicators, except insofar as the metrics may accentuate differentials because the distributions are skewed. However, if there were a belief that new researchers systematically affected profiles, then that would in itself be problematic.

The important consideration here will not be about the metrics but about the attitude of institutional managers to metrics in relation to different groups of staff.

4.7 How the broad subject groups should be defined and constructed19

The aggregation of analysis has an effect on the assessment and on the outcome for institutions. In other words, it affects the way the data are handled and it affects the way the results are perceived.

Bibliometrics

Evidence established a methodology for aggregating research activity into subject groups at fine and coarse levels in work for HEFCE in 199720. That methodology has stood the test of time and has been widely employed since. It has recently been tested and validated in work for the research councils.

Customised clustering could be based on links between Thomson's output databases, RAE publication databases and information about the funding and location of researchers. It could be developed at the outset, or left until a later stage when the methodology has been agreed after consultation.

What is physics?

The approach Evidence took21 was to look at the use that different subject groups made of the literature. We can see that physics (unit of assessment 19) submits a given range of journals for RAE2001 assessment, whereas chemistry (unit of assessment 18) submits a different but overlapping range. Both are similar to materials (unit of assessment 32) but quite different to biology (unit of assessment 14).

We can therefore cluster physics (as seen by institutions for RAE purposes) with chemistry and materials, and draw a distinction with a separate biology cluster.

The EPSRC data, drawn from publications associated with researchers funded by different physics-related programmes (i.e. as seen by the research council for research purposes), map well onto this RAE analysis.

So, whether we look at university units or we look at research communities, we find a coherent and common association through the literature. Physics can be robustly defined through a set of journals, and this journal set identifies links to cognate disciplines and delimits boundaries with different subject areas.
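A sketch of that journal-overlap test: measure the similarity of the journal sets used by two units of assessment and cluster units whose overlap is high. Jaccard overlap is our choice of measure, and the journal sets shown are invented placeholders:

    def jaccard(a, b):
        """Overlap between two journal sets, 0 (disjoint) to 1 (identical)."""
        return len(a & b) / len(a | b)

    journal_sets = {
        "physics (UoA 19)": {"Phys Rev B", "J Phys Chem", "Acta Mater"},
        "chemistry (UoA 18)": {"J Phys Chem", "J Am Chem Soc", "Acta Mater"},
        "biology (UoA 14)": {"J Ecol", "Cell", "J Exp Biol"},
    }

    print(jaccard(journal_sets["physics (UoA 19)"],
                  journal_sets["chemistry (UoA 18)"]))  # 0.5: cluster together
    print(jaccard(journal_sets["physics (UoA 19)"],
                  journal_sets["biology (UoA 14)"]))    # 0.0: separate cluster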


Other units, other data

This journal-orientated proposal is deceptively simple. It should not be forgotten that the subject clustering has to work with a diversity of data and institutions. What would work (and has worked) readily for publication data in the older 'big civics' does not necessarily make sense for graduate schools, or for newly emergent and trans-disciplinary research areas, or for applied research in post-92 institutions.

4.8 What is assigned to subject groupings?

Staff coverage has been referred to above in the context of 'population'. This is likely to be a contentious area, and methodology must be linked to policy objectives.

Are we assessing the subject or staff within the subject?

A fundamental question is whether the evaluation of 'research quality' is about a group of staff (via their publications) or about a discipline (via the publications of staff in that subject). This is a practical issue as well as a conceptual one. Material needs to be assigned to the subject groups for assessment, and that can be done either by assigning evidence of 'research activity' or by assigning evidence linked to staff.

Select publications to match staff

For bibliometrics, selection of staff would be meaningless unless publications are also selected to match staff. The more direct or explicit the link between individual staff and specific publications, the more costly the methodology for HEFCE and the institutions.

Staff who move

A twist to the question of what is assessed comes at this point. When staff move between institutions, are their publications reassigned with them or do they constitute part of the legacy activity of the institution they are leaving?

Who links staff to articles?

Options for making explicit staff-output links are either through central, procedural assignment or through assignment by the institution:

- if assignment is made by HEFCE, then institutions should require validation of the assignment, so it seems unlikely that they would not be involved in some way; and
- if the lists of publications for assessment are provided by the institution, then that will require validation as well.

Level of analysis

This assignment of evidence may seem slightly arcane, but it is both fundamental and practical. The methodology will require some time to explore in proper detail. The outcome will affect, first, researcher confidence that what is assessed by the metrics is a true representation of their work and, second, the costs to the institutions of working with the system. Any proposals therefore need to be scrutinised with extreme caution and in fine detail.

For example, the pathway to identifying and analysing the impact of publications at higher education institution 'X' associated with chemistry research (a set of journals associated with chemistry as a discipline) is different to that required to identify and analyse the publications of the staff employed by 'X' within its school of chemistry.

By carrying out the analysis at the level of five to eight STEM subject categories, HEFCE will have reduced, but not removed, the difference between the subject and people analyses compared with an analysis at, for example, unit of assessment level. It will not have addressed staff mobility.

If there is a clear argument suggesting that bibliometric analyses at aggregate subject level are indistinguishable from analyses at staff level, then this would significantly reduce the costs of any subsequent part of this development. But this is unlikely to prove satisfactory from a researcher perspective.

4.9 Normalisation strategies and aggregation

Creating a basis for data-comparability within the five to eight broad subject areas will, we believe, be problematic. We noted above that the availability of funding can vary substantially between sub-fields, as does publication culture. There will have to be correcting (normalisation) factors to enable data to be brought together for comparison.

The clustering will not only have to be mapped to data (hence to the categorisation of data held by third-party sources) for funding, training and publications, but will also have to work with the appropriate reference systems for normalising all those data in a metrics structure.

If the funding data are taken as an example, then funding availability varies as much between sub-fields in biology as does the typical citation culture.

Compensation factors

There is no simple solution to clustering. No solution is likely to be particularly satisfactory. It will be essential to appreciate the finer granularity within any clustering, and then to ensure that sufficient account is taken of the fine-grained differences to build in compensating factors that adjust for differences in such factors as culture, funding and researcher flow.


Normalisation

Citation rates vary between fields (and sub-fields) and citation counts accumulate over time. At which level should bibliometric data be normalised?

- the broad subject field;
- fields below this (for example, units of assessment);
- Thomson journal categories; or
- the level of journals themselves?

There are differences between subject categories in rates of citation accumulation and in typical citation plateaus. For this reason, both time since publication and journal category are taken into account when normalising or 'rebasing' citation counts to enable indexing and comparison.

It would be inappropriate to aggregate data at the level of HEFCE's broad subject groups unless they are first made comparable by a satisfactory normalisation at some finer level. Over-normalisation will be as problematic as under-normalisation, because it will remove the subtle differences that the exercise seeks to identify. It is therefore critical to determine the appropriate level for normalisation.
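A minimal sketch of rebasing, with an invented baseline table: each article's count is divided by the world average for its journal category and publication year, so the same raw count reads very differently in different fields:

    # (journal category, publication year) -> world average citations per paper
    WORLD_BASELINE = {
        ("Biochemistry", 2003): 12.4,  # fast-citing field (illustrative value)
        ("Ecology", 2003): 4.1,        # slow-citing field (illustrative value)
    }

    def rebased_impact(citations, category, year):
        """Normalise a raw citation count against field and year."""
        return citations / WORLD_BASELINE[(category, year)]

    print(round(rebased_impact(8, "Biochemistry", 2003), 2))  # 0.65: below world average
    print(round(rebased_impact(8, "Ecology", 2003), 2))       # 1.95: well above it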

We tested some options to see the effects of these on quality rankings in psychology, biology and physics, and to explore differences. We calculated the normalised citation performance of UK research units for each of three levels of article-aggregation (journal, journal category, unit of assessment – where several categories map to each unit of assessment). We then compared this with the grade awarded to that unit in RAE2001. We found that the correlation between average normalised citation impact and peer-reviewed grade does indeed vary according to the selected level of granularity.

There is little difference between grade-related impact when citation counts are normalised at journal level. But higher-graded units had a statistically significantly higher impact when normalisation was relative to the Thomson journal category or to the journal sets mapping to the unit of assessment22.

In Figure 10, each point is the average citation count to the end of 2005 for the set of journal articles submitted by a stated institution within this unit of assessment, with impact for each article normalised against a world average. The data are grouped according to the RAE grade awarded to the institution.

Figure 10: Variation in citation impact for biological sciences (unit of assessment 14) using RAE2001 submitted articles and grades awarded.
[Chart: average rebased impact by RAE 2001 grade (4, 5, 5*), plotted as impact relative to the journal average, to the category average and to the unit of assessment average, shown against the world average. Source: RAE Manager, Thomson Scientific®; Analysis: Evidence.]


The implication is that the material submitted by grade 4 units is actually sourced from journals of lower average impact than the material submitted by the grade 5 units. Thus, when the level of analysis is relative to the journal, these items appear to be of similar impact relative to the medium in which they are published. When the viewpoint is zoomed out to the broader category level, the higher absolute citation count for the articles produced by the more highly graded units becomes apparent, and it is even more apparent at the unit of assessment level.

Normalisation and applied research

Another possibility is that a finer-scale normalisation would also separate clusters of applied but lower-cited research from frequently-cited fundamental research of topical interest23. Thus, the relative value of applied work is lifted at the journal scale but swamped at the field level.

Conclusions on normalisation

While the pattern varies between broad fields, an upper and lower boundary to the granularity of sensible normalisation is apparent. Above unit of assessment level, the differentiation between fields is lost. Below Thomson journal category, the differentiation between peer estimates of quality is lost. It will be vital that data are thoroughly reviewed and the right level of normalisation is set for each broad field in any metrics system.

There is a further note of caution. This analysis applies only to bibliometric data. A parallel analysis will be required for funding data and for training data if these are used elsewhere in the metrics system.

Aggregation

Pulling material together could follow a diversity of routes, but we suggest that a sound method would seek to follow the natural hierarchy of similarity within the source data. This is best reflected in the similarity of journal usage between cognate research areas.

Fields that have similar journal usage will be most amenable to similar treatment in bibliometric indexing (for example, reducing the need to use many different normalisation factors) and will have natural affiliations. We already know that chemistry, physics and materials science show strong affiliation as a natural 'physical science' group that also shows clear separation from a 'biological science' group24.

Other data

The final step in creating metrics will be the integration of bibliometric with other data. At this point there will need to be some weighting attached to the inputs from different sources: how much is the bibliometric indicator 'worth' compared to the funding and training metrics? The significance of these elements may vary between subject groups, and it should not be assumed that there is one correct weighting to apply.

4.10 Accommodating research differences between subjects

Account will need to be taken not only of differences between research areas but also of differences within them, particularly in the balance of mode of activity.

Engineering conference proceedings

Engineering and technology has culturally made more use of conference proceedings as a key output mode than is typical in the natural sciences. This is because conferences are a better route to communicate with the most likely user audience, in the private sector. There is a series of prestigious conferences the proceedings of which form a record of acknowledged quality.

However, engineers have made increasing use of journals over the last decade, as the data returned to RAE1996 and RAE2001 show. The differences between subjects are therefore less evident than they were historically, although computer studies is still very dependent on proceedings rather than journals.

Irrespective of this changing level of journal use, the principles of excellence should hold good within any subject area. The problem is that bibliometrics will provide only a partial analysis in these areas, whereas they will provide a more complete and comprehensive analysis in other sciences.

There is therefore likely to be less confidence in the outcomes of bibliometric analysis for engineering. For this reason, it may be appropriate to give somewhat greater weight in this subject group to the metrics on funding and on training. Such considerations may also apply in other areas.


Basic and applied research

Within subjects and between institutions there is a varying balance between basic and applied research. Because the target for any applied research is partly outside the research base, its impact is only partly measured by citations coming from within the research base in later publications.

It is therefore inevitable that applied research will tend to be cited less, but it will not necessarily have lower impact if impact includes economic as well as scientific factors. It will be necessary to consider how this can be addressed.

This is an issue which will be of significance for some stakeholders, particularly research users, who will expect the application of research to receive some parity with blue-skies research. The implications for national policy and for government innovation strategies are obvious. In a bibliometrics-based system there is no step at which non-academic user evaluation can be applied.

Variation between fields within subject groups

We noted that citation rates vary between areas in terms of time to first citation, citation accumulation and citation totals. Such differences can be addressed by using appropriate normalisation factors (as noted), and these can be applied between areas within macro-disciplines (such as the broad subject groups), but the differences need to be identified.

It will probably be necessary to adjust within-discipline clusters for factors which go beyond normalisation and may affect the time-frame for the analysis, and for other factors more readily identifiable to a subject specialist within peer review. To take the example of biology, a citation time-frame of five years would work well for molecular biologists but much less well for organismal biologists.

4.11 Interdisciplinary research

What is the argument for separate interdisciplinary assignment? The notion is that interdisciplinary work is treated and valued differently from other work. If so, it might be cited differently, and its impact value (citation count) might be systematically less than core disciplinary work. This is unproven, but the notion of different treatment does require some consideration of why and how a bibliometric evaluation might be applied.

The 'interdisciplinary' nature of research has no proper (testable) definition. Unless it is possible to assign research outputs to an interdisciplinary basket, they cannot be treated separately for assessment. Assignment cannot be made at a journal level, since some of the most prestigious journals (e.g. Nature, Science) are explicitly multi-disciplinary in content and clearly contain some interdisciplinary articles.

Identifying interdisciplinarity

Nature articles are assigned by Thomson to specific subject categories on the basis of their citations (i.e. an article with many references to physics journals is probably a physics article). A similar approach could be taken to a broader definition of interdisciplinarity. Those articles which cited prior knowledge across many subject categories would be objectively more interdisciplinary than those which cited only within one category.

Author definition of interdisciplinarity is probably not sensible. If a very large volume of work were to be so assigned, then it would create serious methodological challenges. Less than one per cent of outputs submitted to RAE2001 was flagged for evaluation across panels, but it would be reasonable to expect this to rise if it were known that a claim of interdisciplinarity led to peer review rather than metric assessment.
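One way of making that operational, sketched below with an invented measure (the text prescribes no statistic): score an article by how widely its references spread across subject categories, here via the Shannon entropy of the category distribution:

    import math
    from collections import Counter

    def category_spread(reference_categories):
        """Shannon entropy of the reference category distribution (bits)."""
        counts = Counter(reference_categories)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(category_spread(["physics"] * 20))                     # 0.0: monodisciplinary
    print(category_spread(["physics"] * 10 + ["biology"] * 10))  # 1.0: spread across fields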

Innovation at the margins of core subjects

Interdisciplinary work is often at the margins or interfaces between established subject domains. It is desirable to evaluate the general association between the indexed quality of outputs in the core and at the margins of the subject clusters. Innovation also occurs at the margins. Margins can be identified on the basis of the frequency of journal usage. However, if higher education institutions determine staff selection and then assign staff to atypical subject clusters, then the issues of interdisciplinarity might become marginal to the disruption to bibliometric profiling.

If the emergence of innovative research areas, which tend initially to be marginal until more widely accepted and adopted, is a 'core and margin' issue, then the methodology should be adapted to avoid incentives that reduce the current dynamism of the UK research base. A panel can respond to the nuances of work recognised as innovative but low impact, whereas data metrics on their own cannot.

Conclusion

A detriment to interdisciplinary work or to work at the margins of core disciplines is unproven, but if it exists and is reified in metrics assessment then that would be disadvantageous to UK research innovation. It is desirable that HEFCE should confirm that no such systematic detriment exists.


4.12 Data acquisition, collection and preparation for analysis

Bibliometric database

Evidence acquires, collects and processes all the UK publication and citation data held in Thomson databases every year as part of its normal business. Preparation for analysis and associated quality assurance will clearly be a key part of the development of the relevant methodology. It may be appropriate to consider the implementation of an assurance methodology once the basic process is agreed, but the need to have a process for which such assurance is feasible is an absolute requirement.

It will be essential to understand the nature of the data, problems with data processing and potential problems in year-to-year changes. Evidence has experienced many of these over the last few years. Users need to be aware that Thomson does not normally supply validating routines or documentation of changes.

Part of the process of analysis would be establishing the likely annual timetable for data collection and reconciliation. This could require the linking of staff data from the Higher Education Statistics Agency, bibliometric data from Thomson, local processing and on-line sign-off by institutions. These various elements will need to be explored and explained to institutions.

Factors that will need to be incorporated are: assurance on continuity of baseline data supply; the annual cycle; the time-frame for citation census and cut-offs; variation in data structure between products required for the UK specification and the global baseline; year-to-year variations in data compilation, structure, aggregation and content within core Thomson databases (e.g. journal additions and deletions); data conventions; and article types (article, review, editorial, note etc).

Assignment to higher education institutions

Conflicts between HEFCE's central data and the data provided or validated by institutions will also have to be addressed and accommodated.

Thomson catalogues about 100,000 articles every year that have one or more UK-based authors. Because of new address variants, Evidence spends significant staff time every year analysing address variations and determining the actual institution with which the author is associated. For example, by processing 25 years of legacy data we have increased the linkage of articles to the University of Oxford by 40 per cent compared with raw Thomson data assignment.

It is possible, indeed likely, that opportunities for data collection and preparation will change over the lifetime of the assessment methodology. For example, the shift to open access publishing may produce more comprehensive 'libraries' of outputs to which quality standards can be applied in a transparent and measurable way. This is not yet feasible, however, nor would many researchers accept that current methods (such as page ranking) are sufficiently well understood to serve as indicators for HEFCE's purpose.

A key issue will be whether a central publications database is held by HEFCE or whether publication data are supplied by higher education institutions. If the former, then institutions will need access to validate the correct assignment of records to be associated with their staff.
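The address problem reduces to maintaining a mapping from observed variants to canonical institutions, on the lines of the sketch below; the variant strings are invented examples of the general difficulty:

    VARIANT_TO_INSTITUTION = {
        "univ oxford": "University of Oxford",
        "oxford univ": "University of Oxford",
        "dept chem, univ oxford": "University of Oxford",
    }

    def canonical_institution(raw_address):
        """Map a raw address string to an institution, or None if unknown."""
        return VARIANT_TO_INSTITUTION.get(raw_address.lower().strip())

    print(canonical_institution("Oxford Univ"))     # University of Oxford
    print(canonical_institution("Ox. University"))  # None: flag for manual review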

Assignment to staff

As noted earlier, the article records will need to be linked to staff so that they can be linked to subject groups. Two issues arise:

- author names and addresses are not linked but grouped in separate fields, so the linkage has to be made manually; and
- author synonyms (two name variants, one person) and homonyms (two people, same name and initials) are incredibly common (for example, there are at least three unique Dr F Guillemots in UK data). These can only be distinguished manually.

The UK publishes about 100,000 articles a year. For the metrics system to work, these all need to be linked accurately to named individuals. In 2004, those articles had a total of 473,046 authors, not all of whom were in the UK. The numbers and diversity of co-authors are increasing.
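A sketch of why the homonym problem forces manual work: grouping records by (surname, initial) immediately exposes names, like the F Guillemot example above, that cannot be resolved automatically. The staff list is invented:

    from collections import defaultdict

    staff_records = [  # (surname, initial, employing institution) - invented
        ("Guillemot", "F", "HEI-1"),
        ("Guillemot", "F", "HEI-2"),
        ("Adams", "J", "HEI-1"),
    ]

    by_name = defaultdict(list)
    for surname, initial, institution in staff_records:
        by_name[(surname, initial)].append(institution)

    for name, institutions in by_name.items():
        if len(institutions) > 1:
            print(name, "is ambiguous across", institutions)  # needs manual resolution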


4.13 Potential costs and workload implications for HEFCE and institutions

The cost of indicators is a three-part calculation. This will involve:

- a licence from Thomson Scientific® to access the data;
- the expert cost of setting up the indicator system initially and of running the cyclical analysis periodically; and
- the institutional cost of supplying stated data to ensure matching and validation.

Only the first part of this can be readily stated at the outset, while the other two components depend largely on the specific methodology required.

Thomson data

To carry out any work to develop and test proposals for a methodology it is necessary to access appropriate data from Thomson. This involves a licence cost for the data. Once paid, that licence fee would cover any subsequent re-use of the same databases during the current calendar year (Thomson's financial year is January–December).

Some other background databases will also be required for benchmarking, including the National Science Foundation indicators, which are used to establish UK and world baselines. If a more detailed global approach is taken, to address self-cites for example, then the costs will rise proportionately.

System costs

If HEFCE decides to set up the indicator system centrally, then it will need to create and develop an expert unit to administer the system. The only function of this unit will be to run the cyclical metrics analysis, so the overhead cost of holding sufficient expertise in-house will be significant, because of the peak demand and the range of issues that may then need to be addressed.

If the data are held centrally, then secure institutional access for validation of address and researcher assignment will need to be developed. Frequent updating will also be required as researchers move between institutions (assuming that the data records follow individuals rather than sticking with institutions).

Citations

If self-citation is identified by HEFCE as a problem, and these citations are then removed from the system, the citations to each UK paper will also need to be analysed and the self-cites removed. There were about 8.5 citations per UK paper on average over the five-year period from 2001 to 2005, or about four million citations to about 500,000 papers.

It will be argued that self-cites can be identified automatically, but each researcher in a low-citing field may want assurance about the accuracy of this procedure, if not personal validation. Furthermore, the accuracy of creating a 'self-cite-free world benchmark' for normalisation will also be an issue.



Institutions will need a local system to work with their staff to ensure that these assignments are correct. In effect, the institution may need to be able to demonstrate through a physical reference base that each claimed article record is valid. This will not differ from the RA2 form archives currently held by most institutions, so will not reduce the workload. The metrics system will add the work of assignment, verification and validation.

If self-citation becomes an issue, then the institutional costs of checking will increase significantly, assuming that their researchers are not prepared to sign off an electronic verification system.

Other data types

There will also be costs associated with the non-bibliometric data, including the funding and training data. Since this is akin to the current RA3 and RA4 sections of the RAE returns there will be little reduction in institutional costs in that regard.

The linkage of these data to individuals will be an additional cost if the funding and training activity needs to be assigned ad personam rather than as a return for a unit as a whole.

Peer involvement

It is understood that the metrics outcomes would not stand alone but would continue to be subject to some degree of scrutiny by an expert panel. The range of information that such a panel would receive is unclear, but it has to be assumed that it would need more than the metrics data alone if it is to have any serious role. The alternatives are either the rubber-stamping of some opaque final tables falling out of the data analysis or a complex auditing and second-guessing of the same analyses.

The obvious reference material would be statements by the institutions about their recent research performance and strategy at the level of the broad subject groups, i.e. something along the lines of the RA5/RA6.

Overall, therefore, the metrics system will draw on the same data from RA2, RA3 and RA4 as at present (but with data reconciled to individuals listed in RA1) and will then go through a final peer scrutiny drawing on something like the RA5/6. One might reflect that, despite all the superficial change, it should have a reassuringly familiar feel!

5 Other issues

The Netherlands started to use bibliometric research indicators ahead of other countries. There is evidence of emergent behavioural changes amongst its researchers, with an exceptional rise in output and citation share. By contrast, the UK's share of output has recently fallen slightly, but with no detriment to the average research quality recorded in Office of Science and Innovation performance indicators.

If there are emergent effects then some can undoubtedly be addressed by adding modifications to the metrics, but this risks the development of an increasingly Byzantine and qualified system that loses not only simplicity (hence, ease of operation) but also transparency, and so leads to a loss of institutional and researcher confidence.

Possible behavioural changes

Research volume is a poor indicator because quality, not quantity, is the objective, and once volume is used as an indicator (as in RAE 1986) it begets spurious publications. The 'Dutch effect' is partial evidence of this at a national level.

Citation volume is equally a poor indicator because it is linked to output volume and does not of itself prove anything. Since citation rates vary between sub-fields there will always be differences in citation accumulation. If volume were an indicator then this would encourage lower-citing fields meaninglessly to emulate high-citing fields.

Emerging behavioural effects

There is a risk that any metrics exercise may be intrinsically self-defeating, because it depends on indicators as proxies for the activity of interest25. Once an indicator is made a target for policy, it starts to lose the information content that originally qualified it to play such a role. There is room for manipulation, there may be emergent behavioural effects, and the metrics only capture part of the research process and its benefits.

It is facile to pretend that all behavioural effects can be anticipated and modelled. The metrics system will be assaulted, from the day it is promulgated, by 50,000 intelligent and motivated individuals deeply suspicious of its outcomes. There will be consequences.

If citations per paper are used then this will potentially affect citation behaviour across the system. The Netherlands started to use bibliometric indicators much earlier than most other European Union nations, and this has helped to support the academic development of scientometrics in that country. But it had a wider effect on the publication and citation behaviour of the Dutch as well. Output relative to the rest of the world has gone up by a factor unmatched by any other European Union country. Citation share has increased as well, partly due to output growth and partly to awareness of citation metrics as an evaluation criterion.

Figure 11: Relative growth of output share in European Union nations, 1981-2006. [Line chart: output growth relative to world, vertical axis 0.8 to 1.6, for Denmark, France, Germany, the Netherlands and the UK. Source: Thomson Scientific®; analysis: Evidence.]
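The report does not define the plotted quantity exactly; a plausible reading of the axis description, offered here only as an interpretive sketch, is each country's share of world output indexed to its 1981 value:

    \[
    R_{c,y} = \frac{P_{c,y}/P_{\mathrm{world},y}}{P_{c,1981}/P_{\mathrm{world},1981}}
    \]

where \(P_{c,y}\) is the number of papers from country \(c\) in year \(y\), so every country starts at \(R = 1.0\) in 1981 and values above 1.0 indicate growth in share of world output.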


Differentiating excellence

How well do any of the indicators allow good to be separated from bad, and good from very good?

It is broadly agreed that normalised citation impact is a reasonably good way of differentiating research quality. But it does not work well on very small samples, where individual outlier papers may distort the result (as shown in our impact profile analysis). The example given earlier in this report, for research council data, shows why profiling is likely to be a better discriminator than averages. This requires further evaluation.
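A toy illustration, with hypothetical impact scores rather than data from this report, of why small-sample averages mislead where profiles do not: a single outlier paper can give a mostly ordinary group a higher mean than a consistently strong one, while a banded profile keeps the difference visible:

    # Normalised impact scores for two small groups (world average = 1.0).
    group_a = [0.5, 0.8, 1.0, 1.2, 0.9, 30.0]   # one extreme outlier paper
    group_b = [1.8, 2.0, 2.2, 1.9, 2.1, 2.0]    # consistently strong output

    def mean(xs):
        return sum(xs) / len(xs)

    def profile(xs, bands=(0.5, 1.0, 2.0, 4.0)):
        """Share of papers in each impact band defined by the thresholds."""
        counts = [0] * (len(bands) + 1)
        for x in xs:
            counts[sum(x >= b for b in bands)] += 1
        return [c / len(xs) for c in counts]

    print(mean(group_a), mean(group_b))   # ~5.73 vs 2.0: A 'wins' on averages
    print(profile(group_a))               # but five of A's six papers lie below 1.3
    print(profile(group_b))               # while all of B's sit near twice world average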

The strategy for normalisation of citation counts may be the most important influence on any differentiation or submergence of relative excellence within any research area. The examples given in this report show why the choice of level for normalisation and choice of strategy for data aggregation could have profound effects on the measured relative performance of sub-fields within broad subject groups. It is vital that this is properly evaluated before any action is taken.
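In the rebased form commonly used in bibliometrics (a standard construction, not a formula specified by this report), a paper's normalised impact divides its citation count by the world average for comparable papers, so the choice of normalisation level fixes the denominator:

    \[
    \mathrm{NI}_i = \frac{c_i}{\bar{c}(f_i,\, y_i,\, t_i)}
    \]

where \(c_i\) is the citation count of paper \(i\) and \(\bar{c}(f, y, t)\) is the world mean citation rate for field \(f\), publication year \(y\) and document type \(t\). Whether \(f\) is defined at sub-field or broad-group level changes \(\bar{c}\), and with it the apparent relative performance of low-citing sub-fields aggregated into high-citing groups.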

Benchmarking

It seems extremely unlikely that research metrics, which will tend to favour some modes of research more than others (e.g. basic over applied), will prove sufficiently comprehensive and acceptable to support quality assurance benchmarking for all institutions.

At present, almost all higher education institutions are content to let their research reputations rest – with caveats and qualifications – on RAE grades. Will they make a similar choice of RAE metrics?

If even a few institutions choose to present their research in terms other than those identified by HEFCE's metrics then this raises a challenge to credibility. Can both presentations be correct? If an institution sincerely believes that it is supporting research that is better evaluated through measures other than those the funding council is using, then that will presumably raise questions at researcher level about the 'equity' of the system. It may also raise questions about the 'fair' distribution of resources between institutions, and about the distribution of resources between researchers within institutions, if metrics are seen to work more favourably for some research programmes.

Journal impact factors are a poor indicator because of the variation in citation rate. If they were used, then there would be an erroneous competition to get any article into a high-impact journal, even if this were not the best medium for the output. Practitioner journals would certainly suffer, but there would also be disruption of coherence within fields as the existing assortment of material by medium was disturbed.
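For reference, the standard two-year impact factor published by Thomson takes the form:

    \[
    \mathrm{IF}_{J,Y} = \frac{C_Y(J,\,Y-1) + C_Y(J,\,Y-2)}{N_{J,\,Y-1} + N_{J,\,Y-2}}
    \]

where \(C_Y(J, y)\) is the number of citations received in year \(Y\) by items that journal \(J\) published in year \(y\), and \(N_{J,\,y}\) is the number of citable items published in year \(y\). Because citation counts for the individual articles within a journal are highly skewed, this journal-level average is a poor proxy for the quality of any one paper placed in it.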

Uncited papers are a poor index. The relative volume of uncited papers is interesting so long as it is seen as a partial and system-level measure. If used at a local level in a model, it would simply lead to a systematic tendency to ensure that every individual and institutional output was cited at least once, whether for good reason or not.

Output diversity is a potentially useful indicator, but could be disrupted by the generation of spurious diversity in publication patterns.

Removing self-citations from analyses could be one way of moderating the 'Dutch effect', but there are sound reasons not to attempt this. Self-citation is a normal part of research culture. If self-citation were actively penalised by HEFCE metrics then this could lead to a change in citation behaviour, with transitional drops in citation rates, a failure to track links in research programmes and a loss of international prestige. Certainly, Office of Science and Innovation indicators of the UK's relative international standing would drop.

Partitioning credit for collaborative papers would also be ill-advised. Collaboration is an increasingly important part of research and provides signal benefits. A significant part of the UK's best research output is internationally collaborative. To send messages to the system that there is a 'metrics cost' in collaboration would undermine the very things that the Office of Science and Innovation, the research councils and a recent House of Commons report are seeking to stimulate.
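As a minimal sketch of what 'partitioning credit' would mean in practice (the institutions and papers are hypothetical), compare whole counting, in which every participant receives full credit for a paper, with fractional counting, in which credit is split among participants:

    # Whole versus fractional counting of collaborative papers.
    papers = [
        {"institutions": ["Leeds"]},
        {"institutions": ["Leeds", "Delft"]},
        {"institutions": ["Leeds", "Delft", "MIT", "ETH"]},
    ]

    def whole_count(papers, inst):
        # Full credit for every paper the institution participated in.
        return sum(1.0 for p in papers if inst in p["institutions"])

    def fractional_count(papers, inst):
        # Credit divided equally among the participating institutions.
        return sum(1.0 / len(p["institutions"])
                   for p in papers if inst in p["institutions"])

    print(whole_count(papers, "Leeds"))       # 3.0
    print(fractional_count(papers, "Leeds"))  # 1.0 + 0.5 + 0.25 = 1.75

Under fractional counting the four-way international paper is worth a quarter of a sole-authored one, which is precisely the disincentive described above.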

In all this, it should be recognised that it will not be possible to detect changes in UK behaviour and outcomes for some years. By then, the UK may be set on a pathway from which it is difficult to extricate itself.


Notes

1 Bibliometrics are indicators of research performance based on data associated with journal articles. Research publications normally refer to (or cite) prior work which serves as an authority for established knowledge, methodologies and so on. Publications may then be cited by later outputs. Citations therefore provide a network of association between items within the accepted corpus of knowledge. Researchers generally agree that more frequently cited items have a greater significance than those that remain uncited. Eugene Garfield, the founder of the Institute for Scientific Information, proposed that citation counts should be seen as an index of 'impact' within the relevant field of research. That 'impact' has later been interpreted as related to quality, with highly cited papers having greater impact and therefore being of higher quality than less frequently cited items.

2 The Government has indicated that mathematics and statistics are not included in the first phase of the shift to metrics.

3 Goodhart C A E [1984] Monetary Theory and Practice, Macmillan: Basingstoke.

4 See Leydesdorff L and Bensman S [2006] 'Classification and Powerlaws: The Logarithmic Transformation', Journal of the American Society for Information Science and Technology, 57, 1470-1486; and Adams J, Gurney K and Marshall S [2007] 'Profiling citation impact: a new methodology', Scientometrics, 72, 325-344.

5 See Adams J [2007] 'Nobody expects the Spanish Inquisition', Research Fortnight, 25 April 2007, 18-19.

6 Evidence/ESRC [2004] Bibliometric profiles for selected Units of Assessment. http://www.esrcsocietytoday.ac.uk/ESRCInfoCentre/Images/Bibliometric%20Profiles%20for%20RAE%20Outputs%20in%20the%20Social%20Sciences_tcm6-18357.pdf

7 See also Van Raan A J [2006] 'Comparison of the Hirsch-index with standard bibliometric indicators and with peer judgment for 147 chemistry research groups', Scientometrics, 67, 491-502; and Editorial [2007] 'Achievement index climbs the ranks', Nature, 448, 737.

8 Garfield E [1955] 'Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas', Science, 122, 108-111.

9 Evidence/OSI [2007] PSA target metrics for the UK research base. http://www.evidence.co.uk/downloads/OSIPSATargetMetrics070326.pdf

10 See Adams et al. [2007] op cit.

11 This is derived from Adams et al. [2007] op cit.

12 Notably in extensive work by Professor S Harnad, University of Southampton.

13 Fowler J H and Aksnes D W [2007] 'Does self-citation pay?', Scientometrics, 72, 427-437.

14 Fowler and Aksnes [2007] op cit.

15 As did Fowler and Aksnes [2007] op cit.

16 See Adams J [2005] 'Early citation counts correlate with accumulated impact', Scientometrics, 65, 567-581.

17 Van Raan A [2004] 'Sleeping Beauties in Science', Scientometrics, 59, 467-472.

18 HEFCE [2006] Selection of staff for inclusion in RAE 2001, web-only issues paper, August 2006/32.

19 HEFCE has indicated a priori that it anticipates somewhere in the region of 5-8 groups to cover the sciences, engineering, technology and medicine.

20 Adams J [1998] 'Benchmarking international research', Nature, 396, 615-618.

21 Adams [1998] op cit and in a recent EPSRC report.

22 Adams J, Gurney K A and Jackson L [2007] 'Calibrating the zoom: a test of Zitt's hypothesis', Scientometrics, 75 (1), in press.

23 Zitt M, Ramanana-Rahary S and Bassecoulard E [2005] 'Relativity of citation performance and excellence measures: From cross-field to cross-scale effects of field-normalisation', Scientometrics, 63, 373-401.

24 Adams J, Bailey T, Jackson L, Scott P, Pendlebury D and Small H [1998] Benchmarking of research in England, report to HEFCE, CPSE, University of Leeds. ISBN 1 901981 04 5.

25 Goodhart C [1975] 'Problems of monetary management: the U.K. experience', in Courakis A (ed) [1981] Inflation, Depression and Economic Policy in the West (Totowa).


About Universities UK

This publication has been produced by Universities UK, which is the representative body for the executive heads of UK universities and is recognised as the umbrella group for the university sector. It works to advance the interests of universities and to spread good practice throughout the higher education sector.

Universities UK
Woburn House
20 Tavistock Square
London
WC1H 9HQ

telephone +44 (0)20 7419 4111

fax +44 (0)20 7388 8649

[email protected]

web www.UniversitiesUK.ac.uk

© Universities UK
ISBN 978 1 84036 165 4
October 2007
