Ranking influential communities in networks · David Selby is supported by EPSRC grant EP/M508184/1...

1
biomedical sciences chemistry & physics medicine zoology economics mathematics informatics statistics rheumatology textiles 0.5 1.0 1.5 2.0 1st 20th 40th 60th 80th 100th biomedical sciences chemistry & physics medicine neuroscience zoology geology economics mathematics psychology oncology informatics psychiatry parasitology environmental science agriculture surgery politics management sociology material science education astrophysics food science statistics energy radiology spectroscopy orthopaedics linguistics human geography particle physics veterinary medicine nephrology ecology biomaterials obstetrics operational research toxicology dentistry sports medicine opthalmology rheumatology complexity robotics rhinology marketing electronics law reviews social history anthropology urology communication dermatology logistics electrical engineering information systems bibliometrics tribology history of science computer graphics social anthropology psychometrics plastic surgery learning disabilities policy geotechnology social work international law logic philosophy vascular surgery social justice radioactivity tourism forensics mycology transfusion engineering management mathematical finance measurement Pharm Stats Stats in Medicine Pakistan J Stats Comp Stats ANZ J Stats Comm Stats: Sim & Comp Comp Stats & Data Analysis J Applied Stats JRSS-A JRSS-C J Time Series Appl Stoch Models in Business J Machine Learning Res J Multivariate Analysis Machine Learning JASA Stats & Comp Bernoulli Methods & Comp in Appl Prob J Agr Bio & Envir Stats Brazil J Prob & Stats Comm Stats: Theory & Methods Stats Papers Stats & Prob Letters Statistics Test J Korean SS Rev Colomb Estad J Stat Plan & Infer Ann Inst Stat Maths J Stats Comp & Sim Annals Biometrika EJ Stats ESAIM Prob & Stats Extremes JRSS-B Metrika Prob Eng & Inf Sci Scand Actuarial J Scan J Stats Seq Analysis Stoch Models Adv Data Analysis & Class J Stat Software Stat Methods & Appl Qual Rel Eng Int Technometrics Biometrics Stats in Biopharm Res Ann Appl Stats Stat Sci Environometrics Stat Meth Med Res Bayesian Analysis Biometrical J Biostatistics Int J Biostatistics J Biopharm Stats Lifetime Data Analysis Stats & Interface American Stat Envir & Eco Stats Int Stats Rev JCGS Stat Method Stat Model Canada J Stats J Nonpar Stats J Qual Tech Adv Stat Analysis REVSTAT Qual Tech & Quant Mgmt R Journal Stat Neerlandica SORT Qual Eng Statistics 77 journals 29,000 citations Ranking influential communities in networks Ranking journals How to measure influence using random walks The PageRank algorithm models the behaviour of a typical PhD student. 1. Open a random journal. 2. Pick a random reference and open the cited journal. 3. Repeat, ad nauseum. PageRank is the proportion of time spent reading each journal, i.e. the stationary distribution of an ergodic Markov chain. It is a measure of total influence. PageRank has a size bias: bigger journals have more/longer articles in them, attracting more citations. What if we want to measure prestige, rather than popularity? The Scroogefactor score, defined as PageRank per reference, controls for this size bias. It measures influence weight per outgoing citation. Like the Bradley–Terry model, journals are, in effect, penalised for being generous with citations and rewarded for being miserly. When the Bradley–Terry model fits exactly, a journal's Scroogefactor is exactly equal to its Bradley–Terry score. What can citation data reveal about the flow of influence in scientific fields? Web of Science 10,713 journals 112 communities 9.1 million citations Finding communities How to group journals into fields using random walks The Infomap algorithm (Rosvall & Bergstrom 2008; PNAS 105) looks for a clustering of nodes that gives the shortest possible description of a random walk around the network. Every node has a twolevel address, identifying a community and the node's position within that community. It is like a street address, comprising a street name (position) and city name (group). Street names are often reused between cities, but every street–city pair is unique. By aggregating the journals in each community into a single ‘superjournal’, we can model the exchange of citations between disciplines. Acknowledgements David Selby is supported by EPSRC grant EP/M508184/1 and David Firth is supported by the EPSRC ilike programme. Thanks to Thomson Reuters for providing Journal Citation Reports data in a convenient format. R code and full output from the analysis is available on GitHub: https://github.com/Selbosh/user2017 Above: a betweenfields ranking. Applied, general disciplines tend to have higher Bradley–Terry ability scores than highly specialised or theoretical ones. Below: a withinfield ranking for statistics journals. A high score implies the journal is influential within statistics, but not necessarily influential on other fields. JRSS-A American Stat JRSS-B Annals Biometrika JCGS Machine Learning Stats in Medicine Metrika Statistics R Journal 1 2 1st 20th 40th 60th Statistical model Given a set of paired comparisons, the Bradley–Terry model estimates an ability score for each object, such that for any pair of objects i and j. Citations between academic journals can be treated as paired comparisons: being cited means being an ‘exporter of intellectual influence’ (Stigler 1994; Statistical Science). Using ability scores, we can predict the probability that journal i cites journal j more than j cites i. Influential journals are more likely to be cited by other influential journals. P(i beats j ) 1 - P(i beats j ) = μ i μ j Right: the Web of Science network. Each node represents a community of journals. The edges represent interdisciplinary citations. Below: journals and citations within the statistics community. David Selby [email protected] David Firth [email protected]

Transcript of Ranking influential communities in networks · David Selby is supported by EPSRC grant EP/M508184/1...

Page 1: Ranking influential communities in networks · David Selby is supported by EPSRC grant EP/M508184/1 and David Firth is supported ...

biomedical sciences

chemistry & physics

medicine

zoology

economics

mathematicsinformatics

statisticsrheumatology

textiles0.5

1.0

1.5

2.0

1st20th40th60th80th100th

biomedical sciences

chemistry & physics

medicine

neuroscience

zoologygeology

economics

mathematics

psychology

oncology

informatics

psychiatry

parasitology

environmental science

agriculture

surgery

politics

management

sociology

material science

education

astrophysics

food science

statistics

energy

radiology

spectroscopy

orthopaedics

linguistics

human geography

particle physics

veterinary medicine

nephrology

ecology

biomaterials

obstetrics

operational research

toxicologydentistry

sports medicine

opthalmology

rheumatology

complexity robotics

rhinology

marketing

electronics

law reviews

social history

anthropology

urology

communication

dermatology

logistics

electrical engineering

information systems

bibliometrics

tribologyhistory of science

computer graphics

social anthropology

psychometrics

plastic surgery

learning disabilities

policy

geotechnology

social work

international law

logic

philosophy

vascular surgery

social justice

radioactivity

tourism

forensics

mycologytransfusion

engineering management

mathematical finance

measurement

Pharm Stats

Stats in Medicine

Pakistan J Stats

Comp Stats

ANZ J Stats

Comm Stats: Sim & Comp

Comp Stats & Data Analysis

J Applied Stats

JRSS-A

JRSS-C

J Time Series

Appl Stoch Models in Business

J Machine Learning Res

J Multivariate Analysis

Machine Learning JASA

Stats & Comp

Bernoulli

Methods & Comp in Appl Prob

J Agr Bio & Envir StatsBrazil J Prob & Stats

Comm Stats: Theory & Methods

Stats Papers

Stats & Prob Letters

Statistics

Test

J Korean SS

Rev Colomb Estad

J Stat Plan & Infer

Ann Inst Stat Maths

J Stats Comp & Sim

Annals

Biometrika

EJ Stats

ESAIM Prob & Stats

ExtremesJRSS-B

Metrika

Prob Eng & Inf SciScand Actuarial J

Scan J Stats

Seq Analysis

Stoch Models

Adv Data Analysis & Class

J Stat Software

Stat Methods & Appl

Qual Rel Eng Int

Technometrics

Biometrics

Stats in Biopharm Res

Ann Appl Stats

Stat Sci

Environometrics

Stat Meth Med Res

Bayesian Analysis

Biometrical J

Biostatistics

Int J Biostatistics

J Biopharm Stats

Lifetime Data Analysis

Stats & Interface

American Stat

Envir & Eco Stats

Int Stats Rev

JCGS

Stat Method

Stat Model

Canada J Stats

J Nonpar Stats

J Qual Tech

Adv Stat Analysis

REVSTAT

Qual Tech & Quant Mgmt

R Journal

Stat Neerlandica

SORT

Qual Eng

Statistics77 journals29,000 citations

Ranking influential communities in networks

Ranking journalsHow to measure influence using random walks

The PageRank algorithm models the behaviour of a typical PhD student.

1. Open a random journal.

2. Pick a random reference and open the cited journal.

3. Repeat, ad nauseum.

PageRank is the proportion of time spent reading each journal, i.e. the stationary distribution of an ergodic Markov chain. It is a measure of total influence.

PageRank has a size bias: bigger journals have more/longer articles in them, attracting more citations. What if we want to measure prestige, rather than popularity?

The Scroogefactor score, defined as PageRank per reference, controls for this size bias. It measures influence weight per outgoing citation.

Like the Bradley–Terry model, journals are, in effect, penalised for being generous with citations and rewarded for being miserly. When the Bradley–Terry model fits exactly, a journal's Scroogefactor is exactly equal to its Bradley–Terry score.

What can citation data reveal about the flow of influence in scientific fields?

Web of Science10,713 journals

112 communities9.1 million citations

Finding communitiesHow to group journals into fields using random walks

The Infomap algorithm (Rosvall & Bergstrom 2008; PNAS 105) looks for a clustering of nodes that gives the shortest possible description of a random walk around the network.

Every node has a two­level address, identifying a community and the node's position within that community. It is like a street address, comprising a street name (position) and city name (group). Street names are often re­used between cities, but every street–city pair is unique.

By aggregating the journals in each community into a single ‘super­journal’, we can model the exchange of citations between disciplines.

AcknowledgementsDavid Selby is supported by EPSRC grant EP/M508184/1 and David Firth is supported by the EPSRC i­like programme.

Thanks to Thomson Reuters for providing Journal Citation Reports data in a convenient format.

R code and full output from the analysis is available on GitHub:https://github.com/Selbosh/user2017

Above: a between­fields ranking. Applied, general disciplines tend to have higher Bradley–Terry ability scores than highly specialised or theoretical ones.

Below: a within­field ranking for statistics journals. A high score implies the journal is influential within statistics, but not necessarily influential on other fields.

JRSS-AAmerican StatJRSS-B

AnnalsBiometrika

JCGSMachine Learning

Stats in Medicine

MetrikaStatistics

R Journal

1

2

1st20th40th60th

Statistical modelGiven a set of paired comparisons, the Bradley–Terry model estimates an ability score for each object, such that

for any pair of objects i and j.

Citations between academic journals can be treated as paired comparisons: being cited means being an ‘exporter of intellectual influence’ (Stigler 1994; Statistical Science).

Using ability scores, we can predict the probability that journal i cites journal j more than j cites i. Influential journals are more likely to be cited by other influential journals.

P(i beats j)1− P(i beats j)

=µi

µj

Right: the Web of Science network. Each node represents a community of journals. The edges represent interdisciplinary citations.

Below: journals and citations within the statistics community.

David [email protected]

David [email protected]