Overview of Bibliometrics - IAP Course version 1.1

Overview of Citation Analysis Clickstream Data Yields High-Resolution Maps of Science. By Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Luis Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. Public Library of Science ONE, March 11, 2009. Version: 5/6/14

description

Whose articles cite a body of work? Is this a high-impact journal? How might others assess my scholarly impact? Citation analysis is one of the primary methods used to answer these questions.

Transcript of Overview of Bibliometrics - IAP Course version 1.1

Page 1: Overview of Bibliometrics - IAP Course version 1.1

Overview of Citation Analysis

Clickstream Data Yields High-Resolution Maps of Science. By Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Luis Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. Public Library of Science ONE, March 11, 2009.

Version: 5/6/14

Page 2: Overview of Bibliometrics - IAP Course version 1.1

Micah Altman
Director of Research, MIT Libraries

Sean Thomas
Program Manager for Scholarly Repository Services and the Product Manager of DSpace@MIT

Prepared for

IAPril

MIT

April 2014

Page 3: Overview of Bibliometrics - IAP Course version 1.1

DISCLAIMER: These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.

Secondary disclaimer:

“It’s tough to make predictions, especially about the future!”

-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.

Page 4: Overview of Bibliometrics - IAP Course version 1.1

Collaborators & Co-Conspirators

• Thanks to:
– Michael Noga
– Peter Cohn
– Courtney Crummett

Page 5: Overview of Bibliometrics - IAP Course version 1.1

Related Work

• K. Smith-Yoshimura, et al., 2014, Registering Researchers in Authority Files, OCLC Research.
• Liz Allen, Jo Scott, Amy Brand, Marjorie M.K. Hlava, Micah Altman (Forthcoming), Beyond authorship: recognising the contributions to research; Nature.
• Data Synthesis Task Group. 2014. Joint Principles for Data Citation.
• CODATA Data Citation Task Group, 2013. Out of Cite, Out of Mind: The Current State of Practice, Policy and Technology for Data Citation. Data Science Journal. 2013;12:1–75.

Slides and reprints available from: informatics.mit.edu

Page 6: Overview of Bibliometrics - IAP Course version 1.1

And now, a word from our sponsor… The Libraries @ MIT

The MIT Libraries provide support for all researchers at MIT:

• Research consulting, including: bibliographic information management; literature searches; subject-specific consultation
• Data management, including: data management plan consulting; data archiving; metadata creation
• Data acquisition and analysis, including: database licensing; statistical software training; GIS consulting, analysis & data collection
• Scholarly publishing: open access publication & licensing

libraries.mit.edu

Page 7: Overview of Bibliometrics - IAP Course version 1.1

Roadmap

* Background *
* Metrics *
* Data *
* Tools *
* Data Processing *
* Putting it all together *
* Resources *

Page 8: Overview of Bibliometrics - IAP Course version 1.1

Background (Why?) (What?) (Which?)

Page 9: Overview of Bibliometrics - IAP Course version 1.1

What are bibliometrics? (Simple Definition)

Bibliometrics are measures of scholarly outputs.

Page 10: Overview of Bibliometrics - IAP Course version 1.1

Scholarly output affects the reputation, ranking, and funding of the discipline, the institution, and the individual scholar

“We initially use bibliometric analysis to look at the top institutions, by publications and citation count for the past ten years…”

“Universities are ranked by several indicators of academic or research performance, including… highly cited researchers…”

“Citations… are the best understood and most widely accepted measure of research strength.”

Page 11: Overview of Bibliometrics - IAP Course version 1.1

Then

Clarke, Beverly L. "Multiple authorship trends in scientific papers." Science 143.3608 (1964): 822-824.

Page 12: Overview of Bibliometrics - IAP Course version 1.1

Now

Page 13: Overview of Bibliometrics - IAP Course version 1.1

Now is More

Page 14: Overview of Bibliometrics - IAP Course version 1.1

What are bibliometrics? (Extended Definition)

• Analysis of characteristics of/relationships among research/scholarly outputs/publications
– Analysis includes: lists, descriptive statistics, visualization, inference
– Outputs include: grants, articles, books, databases, software, patents

Page 15: Overview of Bibliometrics - IAP Course version 1.1

Which questions are bibliometrics being used to answer?

Some examples:
• What are the most influential journals in a particular field?
• How influential is this scholar?
• Where is interdisciplinary research occurring?
• Which groups of people effectively collaborate?
• Which institutions are using funding most productively?

Page 16: Overview of Bibliometrics - IAP Course version 1.1

Data

(Leading Databases) (Subject-Specific) (MIT Internal) (Selection)

Page 17: Overview of Bibliometrics - IAP Course version 1.1

Google Scholar

Data Sources
• Unspecified coverage, but…
• Wide coverage of books, preprints, conference proceedings, non-English work, working papers, patents, institutional repositories

Built-in Metrics
• Journal h-index
• Author profiles
– Total & five-year counts
– i10-index and h-index
– Yearly citations
• Limited filtering

scholar.google.com

Page 18: Overview of Bibliometrics - IAP Course version 1.1

Scopus

Data
• Frequently updated/current
• Covers journal articles published after 1995
• Wide disciplinary coverage
• Includes theses and patents, and citations from these
• Includes some institutional repositories
• Commercial

Metrics
• Citation lists & counts
• Author impact & articles
– Statistics
– Metrics
– Graphs
• Journal impact
– Statistics
– Metrics
– Graphs

scopus.com

Page 19: Overview of Bibliometrics - IAP Course version 1.1

Web of Science

Data
• Journal coverage after 1899
• Many conference proceedings since 1990
• Many books since 2005
• Limited coverage of non-English works
• Doesn’t index institutional repositories and e-print servers
• Commercial

Metrics
• Citation lists & counts
• Author impact & articles
– Statistics
– Metrics
– Graphs
• Journal impact
– Statistics
– Metrics
– Graphs

apps.webofknowledge.com

Page 20: Overview of Bibliometrics - IAP Course version 1.1

Major Subject-Specific Catalogs With Citation Metrics

• SciFinder: chemical abstracts
  scifinder.cas.org
• PsycInfo: psychological literature
  www.apa.org/pubs/databases/psycinfo/
• Business Source Complete: business articles
  www.ebscohost.com/academic/business-source-complete
• arXiv: physics, mathematics, nonlinear sciences, computer science, quantitative biology, quantitative finance, statistics (integrates w/NASA-ADS and INSPIRE)
  arxiv.org
• MathSciNet: Mathematical Reviews. Computes collaboration distances.
  www.ams.org/mathscinet/
• IEEE Digital Library: content published by the IEEE, including citing references
• USPTO: find patents that are cited by/cite others
  uspto.gov/patft/
• ACM Digital Library: full text and citations of ACM articles and proceedings
  dl.acm.org

VERA: owens.mit.edu/sfx_local/az/mit_db

Page 21: Overview of Bibliometrics - IAP Course version 1.1

APIs for Scholarly Resources

What are APIs?
• Application programming interfaces (APIs) are tools used to expose raw data, query interfaces, or other functions to other software applications
• Typically more flexible than interactive interfaces

Challenges
• Requires programming
• Requires data manipulation and reorganization
• Variety of interfaces, coverage, results, and terms of service

libguides.mit.edu/apis

Page 22: Overview of Bibliometrics - IAP Course version 1.1

Using APIs

Choosing tools
• Recommend Python or R
• Many resources, such as PubMed, Dataverse, and arXiv, are accessible through the OAI-PMH protocol
• More in the tools section and resources section

Example: Harvesting arXiv with pyoai

    from oaipmh.client import Client
    from oaipmh.metadata import MetadataRegistry
    from lxml import etree

    URL = 'http://export.arxiv.org/oai2'
    registry = MetadataRegistry()

    class Reader(object):
        # Return each record's metadata element as pretty-printed XML text
        def __call__(self, element):
            return etree.tostring(element, pretty_print=True, encoding='unicode')

    # Use the reader above for Dublin Core (oai_dc) metadata
    registry.registerReader('oai_dc', Reader())

    client = Client(URL, registry)

    # Each record is a (header, metadata, about) tuple; deleted records have no metadata
    for count, record in enumerate(client.listRecords(metadataPrefix='oai_dc')):
        header = record[0]
        metadata = record[1] or ''
        print(header.identifier())
        print(metadata)
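The pyoai harvest wraps a simple exchange: an HTTP request carrying an OAI-PMH verb, and an XML response holding records plus an optional resumption token for the next batch. The following is a minimal sketch of that exchange using only the Python standard library. The helper names (`list_records_url`, `parse_records`) and the sample response are made up for illustration; they are not part of pyoai or of arXiv's service.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", token=None):
    """Build a ListRecords request URL (hypothetical helper).

    When continuing a harvest, the resumption token is sent instead of
    the metadata prefix.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) parsed from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier",
                                       default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text for c in rec.iterfind(".//dc:creator", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


# A made-up response, shaped like an OAI-PMH ListRecords reply.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A made-up title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_records(SAMPLE)
print(list_records_url("http://export.arxiv.org/oai2"))
print(records, token)
```

A full harvest would fetch `list_records_url(...)`, parse the reply, and repeat with the returned token until none is left. pyoai does this looping for you, which is one reason to prefer it in practice.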

Page 23: Overview of Bibliometrics - IAP Course version 1.1

MIT Internal Data

Institute Data (Restricted Use)
• IS&T Data Warehouse: data from administrative systems, e.g. MIT people, organizations, grants and awards
  ist.mit.edu/warehouse
• Office of the Provost – Institutional Research: provides analytical and research support to the Provost, academic departments, research laboratories and centers
  web.mit.edu/ir/

Libraries Data
• DSpace@MIT: lists of publications in DSpace by author/department
  dspace.mit.edu
• Barton: lists of MIT theses by author/advisor
  library.mit.edu

Page 24: Overview of Bibliometrics - IAP Course version 1.1

Comparing Databases

Coverage
• Years
• Disciplines
• Publishers/sources
• Venue – journals/conferences/working papers/IR/personal web sites
• Documentation of coverage
• Completeness

Characteristics
• Internal vs. external
• Free vs. fee-based
• API vs. interactive
• Open data vs. restrictively licensed
• Structured vs. unstructured
• Full text vs. metadata

Page 25: Overview of Bibliometrics - IAP Course version 1.1

Selecting a Database

• Free, quick, and useful → Google Scholar
• Extract data for further simple analysis → Scholarometer (Google Scholar extract), Scopus, WOS
• More complete coverage → use multiple databases
• Specialized subject/single article → disciplinary database/API
• Extract data for network analysis → API

(Options run from “Free & Easy” at the top to “$$ and/or programmatic” at the bottom.)

Page 26: Overview of Bibliometrics - IAP Course version 1.1

Measures

(Article Metrics) (Author Impact) (Journal Impact) (Collaboration) (Network Analysis)

Page 27: Overview of Bibliometrics - IAP Course version 1.1

Article Metrics: Overview

What are article-level metrics?
• Measures on specific published articles
• Typically used in construction of literature reviews, or as building blocks for other measures

Common measures
• Citations list
• Citation counts
• References
• Captures/bookmarks
• Downloads
• Mentions
• Likes
• Views
• Readers

sparc.arl.org/sites/default/files/sparc-alm-primer.pdf

Page 28: Overview of Bibliometrics - IAP Course version 1.1

Article Metrics: Using Google Scholar

Steps
1. Go to scholar.google.com
2. Search (Full Text + Metadata)
– Unstructured keyword search
OR
– “Advanced” fielded search
3. Sort
– By relevance
OR
– By date
4. Filter
– By date range
AND/OR
– By corpus (case law, patents)

Results
• Number of citations to the article indexed by Google Scholar
• List of citing articles
• Article text (sometimes)

Page 29: Overview of Bibliometrics - IAP Course version 1.1

Article Metrics: Example – Google Scholar

Page 30: Overview of Bibliometrics - IAP Course version 1.1

Article Metrics: Altmetrics

Types
• Captures/bookmarks
• Downloads
• Mentions
• Likes
• Views
• Readers

Sources
• Social media
• Reference management (e.g. CiteULike, Mendeley)
• Indexes/searches (e.g. Scopus)

Sources
• PLOS article metrics
  article-level-metrics.plos.org
• Plum Analytics
  plumanalytics.com
• ImpactStory
  impactstory.org

Page 31: Overview of Bibliometrics - IAP Course version 1.1

Article Metrics: Database Comparison

Database | Coverage | Measures
Google Scholar, Scopus, WOS | Wide variety | Citation count, citation list
PLOS (PLOS articles only) | PLOS articles | Citation count, citation list, views, downloads, mentions, bookmarks, comments
PlumX | Wide variety | Citation count, citation list, views, downloads, mentions, bookmarks, comments

Page 32: Overview of Bibliometrics - IAP Course version 1.1

‘Impact’ Factors: Overview

What are impact factors?
• Descriptive statistics
• Usually based on citations
• Commonly treated as a proxy for the level of influence of an article, person, or journal

Common measures
• ISI Journal Impact Factor: The frequency with which the “average article” in a journal has been cited in a particular year. It is the number of citations received that year by items the journal published in the two preceding years, divided by the number of citable items it published in those two years. It is only supplied for journals indexed by ISI in the Web of Science.
• Article Citation Count: Total number of citations received by the target article from other articles.
• H-Index: The largest number h such that h articles have each received at least h citations.

libraries.mit.edu/scholarly/publishing/impact-factors/
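The citation-based measures above are simple enough to compute by hand from a list of citation counts. The following is a minimal sketch in Python; the function names and the sample numbers are illustrative assumptions, and the i10-index shown is the Google Scholar variant (number of articles with at least ten citations).

```python
def h_index(citations):
    """Largest h such that h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of articles with at least 10 citations."""
    return sum(1 for cites in citations if cites >= 10)


def two_year_impact_factor(citations_this_year, citable_items):
    """Citations received this year to items from the two preceding years,
    divided by the number of citable items published in those two years."""
    if citable_items == 0:
        raise ValueError("no citable items in the two-year window")
    return citations_this_year / citable_items


# Made-up citation counts for one author's seven articles.
papers = [25, 12, 8, 5, 3, 1, 0]
print(h_index(papers))                    # 4
print(i10_index(papers))                  # 2
print(two_year_impact_factor(300, 120))   # 2.5
```

Sorting in descending order makes the h-index a single pass: the answer is the last rank at which the citation count still meets or exceeds the rank.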

Page 33: Overview of Bibliometrics - IAP Course version 1.1

Author Impact: Example – Google Scholar

Page 34: Overview of Bibliometrics - IAP Course version 1.1

Author Impact: Example – Exporting Data with Scholarometer

Page 35: Overview of Bibliometrics - IAP Course version 1.1

Author Impact: Example – Web of Science

Page 36: Overview of Bibliometrics - IAP Course version 1.1

Author Impact: Database Comparison

Feature | Google Scholar | Scholar + Scholarometer | Scopus | Web of Science
Select any author | Only w/profiles | Yes | Yes | Yes
Export data | No | Yes | Yes | Yes
Exclude articles | No | Yes | Yes | Yes
Metrics | H-index, i10, num cites | H-index, i10, num cites | H-index, … | H-index
Visualization | Minimal | Minimal | Yes | Yes
Longitudinal | Minimal | Minimal | Yes | Yes

Page 37: Overview of Bibliometrics - IAP Course version 1.1

Journal Impact: Using Online Services

Google Scholar
1. Go to scholar.google.com
2. Click on METRICS
3. Google rank and journal h5-index displayed
4. Filter by country & field

Scopus
1. Go to scopus.com
2. Click on Journal Analyzer
3. Select journal
4. Select statistics

Web of Science
1. Go to admin-apps.webofknowledge.com/JCR/
2. Select field and year + SUBMIT
3. Select subject + SUBMIT

Page 38: Overview of Bibliometrics - IAP Course version 1.1

Journal Impact: Example – Google Scholar

Page 39: Overview of Bibliometrics - IAP Course version 1.1

Journal Impact: Example – Web of Science

Page 40: Overview of Bibliometrics - IAP Course version 1.1

Journal Impact: Example – Scopus

Page 41: Overview of Bibliometrics - IAP Course version 1.1

Journal Impact: Database Comparison

Feature | Google Scholar | Scopus | Web of Science
Journals covered | Top 100 ranked in each language | Mostly English-language | Many (selected) journals
Metrics | H5, median | Many | Impact factor, many others
Visualization | No | Yes | Yes
Longitudinal analysis | No | Yes | Yes
Discipline rankings | No | No | Yes

Page 42: Overview of Bibliometrics - IAP Course version 1.1

Network Analysis

What is network analysis?
• Study of objects and interactions modeled as an induced network (or graph)
• Units of observation form nodes
• Relationships form edges

Common measures
• Community detection
– Modularity
– Clustering
– Clique
• Centrality
– Betweenness
– Degree
– Closeness
• Diameter
• Visualization
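To make a few of these measures concrete, here is a minimal sketch that computes degree centrality, closeness centrality, and diameter on a tiny undirected network using only the Python standard library. The graph, node names, and helper functions are made up for illustration, and the sketch assumes a connected graph. In practice a library such as NetworkX provides these measures (and betweenness) directly.

```python
from collections import deque


def build_graph(edges):
    """Adjacency sets for an undirected graph given as (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distances(graph, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def degree_centrality(graph):
    """Share of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def closeness_centrality(graph):
    """(n - 1) divided by the sum of distances to all other nodes."""
    n = len(graph)
    result = {}
    for node in graph:
        total = sum(bfs_distances(graph, node).values())
        result[node] = (n - 1) / total if total else 0.0
    return result


def diameter(graph):
    """Longest shortest path between any pair of nodes."""
    return max(max(bfs_distances(graph, node).values()) for node in graph)


# Hypothetical co-authorship ties among five authors.
g = build_graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])
print(degree_centrality(g)["B"])      # 0.75
print(closeness_centrality(g)["A"])   # 0.5
print(diameter(g))                    # 3
```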

Page 43: Overview of Bibliometrics - IAP Course version 1.1

Network Analysis: Example – CitNetExplorer

Page 44: Overview of Bibliometrics - IAP Course version 1.1

Network Analysis: Example – CitNetExplorer

1. Use WOS to locate records
2. Add records to “marked list”
3. Click “marked list”
4. Check “cited references”
5. Save to other file formats
6. Select Windows tab-delimited
7. Open in CitNetExplorer

Page 45: Overview of Bibliometrics - IAP Course version 1.1

CoAuthorship Analysis Example – Using R and JSTOR – Part 1

Page 46: Overview of Bibliometrics - IAP Course version 1.1

CoAuthorship Analysis Example – Using R and JSTOR – Part 2

% cut -d"," -f 1-11 citations.CSV > areastudies2003.csv

# Read the 11-column file written by cut above
R> areastudies.df <- read.table(file="areastudies2003.csv", row.names=NULL, sep=",", quote="", stringsAsFactors=F, header=T)
# Authors within a record are tab-separated
R> authorList <- strsplit(areastudies.df$author, perl=TRUE, split="\t")
# Distribution of the number of authors per paper
R> plot(table(sapply(authorList, length)))

Page 47: Overview of Bibliometrics - IAP Course version 1.1

CoAuthorship Analysis Example – Using R and JSTOR – Part 3

createCoauthorlist <- function(pl) {
  coauthors <- list()
  # Merge one paper's authors into the coauthor set kept for author `co`
  updateCoauthor <- function(co, paperAuthors) {
    tmp <- unlist(coauthors[co])
    tmp <- union(tmp, unlist(paperAuthors))
    coauthors[[co]] <<- tmp
  }
  # For every paper, update the set of each of its authors
  sapply(pl, function(x) sapply(x, function(y) updateCoauthor(y, x)))
  return(coauthors)
}

R> coa <- createCoauthorlist(authorList)
R> plot(table(sapply(coa, length)))

Note: Results are biased downward if only a sample of records is used!
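For readers working in Python rather than R, the same two steps (authors per paper, then distinct coauthors per author) can be sketched with the standard library. This is an illustrative equivalent, not the presenters' code: the author lists are made up, and unlike the R function this version leaves each author out of their own coauthor set.

```python
from collections import Counter, defaultdict


def authors_per_paper(author_lists):
    """How many papers have 1, 2, 3, ... authors."""
    return Counter(len(authors) for authors in author_lists)


def coauthor_sets(author_lists):
    """Map each author to the set of distinct people they have published with."""
    coauthors = defaultdict(set)
    for authors in author_lists:
        for author in authors:
            coauthors[author].update(a for a in authors if a != author)
    return dict(coauthors)


# Made-up author lists, one per paper.
papers = [["Lee", "Kim"], ["Lee", "Park", "Cho"], ["Kim"]]

print(authors_per_paper(papers))
print({name: len(others) for name, others in coauthor_sets(papers).items()})

# The sampling caveat in miniature: with only the first paper,
# Lee appears to have one coauthor instead of three.
print(len(coauthor_sets(papers[:1])["Lee"]))
```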

Page 48: Overview of Bibliometrics - IAP Course version 1.1

Variations: Retrieving Authors from PLOS

library(rplos)
options(PlosApiKey = "YOURKEY")

# Page through PLOS search results in batches and accumulate them in one data frame
fetchPlosResults <- function(qstring, fstring, start=0) {
  moreResults <- TRUE
  results.df <- NULL
  batStart <- start
  batSize <- 999
  while (moreResults) {
    tmp.df <- try(silent=TRUE,
      searchplos(terms="*:*", toquery=qstring, fields=fstring, start=batStart, limit=batSize))
    # Stop on an error or an empty batch
    if (inherits(tmp.df, "try-error")) {
      moreResults <- FALSE
    } else if (is.null(dim(tmp.df))) {
      moreResults <- FALSE
    } else if (dim(tmp.df)[1] == 0) {
      moreResults <- FALSE
    } else {
      # First batch: nothing to merge with yet
      results.df <- if (is.null(results.df)) tmp.df else merge(tmp.df, results.df, all=TRUE)
      batStart <- batStart + batSize
      cat(paste(batStart, date(), "\n"))
      # Checkpoint the partial results after every batch
      save(results.df, file="/tmp/plosTMP.RData")
    }
  }
  return(results.df)
}

plosRes.df <- fetchPlosResults(
  qstring = 'publication_date:[2012-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]',
  fstring = "id,author,journal,publication_date,subject,subject_level_1,references,article_type")

Page 49: Overview of Bibliometrics - IAP Course version 1.1

Limitations

Limitations of data
• Citation differs systematically from sharing, reading, or ‘use’
• Relationships signaled by citation are heterogeneous: citations may indicate evidentiary support, definitions, disagreement, kudos,…
• Cited objects are heterogeneous – e.g. journals include letters, comments, reviews and original research
• Databases may have limited or inconsistent coverage of publishers, fields, years, types of publications (e.g. conference proceedings), or types of objects (databases, software, books, articles)
• Some types of objects are often used without being cited

Limitations of measures
• Most measures are vulnerable to self-citation and other sorts of manipulation
• Most measures are descriptive estimates – they are not forecasts or causal inferences
• Few studies of the external validity of measures
• Few studies on error and bias in estimators

Page 50: Overview of Bibliometrics - IAP Course version 1.1

Tools

(Built-in Tools) (Analysis Tools)

Page 51: Overview of Bibliometrics - IAP Course version 1.1

Built-in Tools

• Database portals have built-in tools: Google Scholar; Scholarometer; Web of Science …
• Typical restrictions of built-in tools
– Single database
– Number of records
– Usually single-author/single-journal metrics
– Lacks statistical forecasting/causal models
– Limited data-cleaning options
– Simple visualizations

Page 52: Overview of Bibliometrics - IAP Course version 1.1

External Tools

Feature sets
• Data retrieval
• Data processing (next section)
• Core statistics
• Visualization
• Exploratory network analysis
• Network modeling

Choosing a tool
• Open vs. closed source
• Free vs. commercial
• GUI vs. CLI
• Scalability
• Single platform/multi-platform
• Feature set
• Maintenance/support

Page 53: Overview of Bibliometrics - IAP Course version 1.1

Publish or Perish

• Automatic data retrieval
– MS Academic Search
– Google Scholar
• Standard single-author metrics
– Total number of papers and total number of citations
– Average citations per paper, citations per author, papers per author, and citations per year
– Hirsch's h-index and related parameters and variations
• Data export to CSV

www.harzing.com/pop.htm
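Once citation data has been exported to CSV, the simpler summary metrics listed above take only a few lines to recompute. The following is a rough sketch under stated assumptions: the column names (`Cites`, `Year`, `Authors`), the semicolon-separated author field, and the "years active" convention (current year minus first publication year, plus one) are all illustrative choices, not the documented export format or formulas of any particular tool.

```python
import csv
import io


def summary_metrics(rows, current_year):
    """Basic per-author summary statistics from exported citation rows."""
    cites = [int(row["Cites"]) for row in rows]
    years = [int(row["Year"]) for row in rows]
    author_counts = [len(row["Authors"].split(";")) for row in rows]
    papers = len(rows)
    total = sum(cites)
    # Assumed convention: count the first publication year as year one.
    years_active = current_year - min(years) + 1
    return {
        "papers": papers,
        "citations": total,
        "cites_per_paper": total / papers,
        "cites_per_year": total / years_active,
        "authors_per_paper": sum(author_counts) / papers,
    }


# A made-up export with hypothetical column names.
SAMPLE_CSV = """Cites,Year,Authors
10,2010,Lee; Kim
4,2012,Lee
0,2013,Lee; Park; Cho
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
metrics = summary_metrics(rows, current_year=2014)
print(metrics)
```

Recomputing the numbers yourself is mainly useful after cleaning the export, for example after removing duplicate or misattributed records.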

Page 54: Overview of Bibliometrics - IAP Course version 1.1

Scholarometer

Data
• Google Scholar
• Crowd-sourced tags (disciplines) – available through API
• Data export to CSV

Metrics
• Single/combined author citation count/h-index rank
• Discipline rank
• Author network visualization
• Discipline network visualization

scholarometer.indiana.edu

Page 55: Overview of Bibliometrics - IAP Course version 1.1

Pajek

Analysis
• Network visualization
• Supports complex networks: multi-relational, longitudinal, 2-mode
• Layout control
• Clustering
• Community detection

pajek.imfm.si

Source: www.public.asu.edu/~majansse/pubs/SupplementIHDP.htm

Page 56: Overview of Bibliometrics - IAP Course version 1.1

CitNetExplorer

Features
• Citation/bibliometric-specific tool
• Web of Science import
• Pajek export
• Large networks (millions of publications)
• Simple network visualizations
• Network measures: connected components, clusters, core publications …

citnetexplorer.nl

Page 57: Overview of Bibliometrics - IAP Course version 1.1

CiteSpace

Features
• Citation/bibliometric tool
• Import from WOS, arXiv, NSF, ADS, PubMed
• Export to CSV, GraphML, Pajek
• Time slicing
• Network measures: connected components, clusters, core publications …
• Topic clustering

cluster.cis.drexel.edu/~cchen/citespace

Page 58: Overview of Bibliometrics - IAP Course version 1.1

SciMat

Features
• Workflow support
• Network visualization
• Data processing and cleanup
• Longitudinal analysis
• Metrics: h-index

sci2s.ugr.es/scimat/

Page 59: Overview of Bibliometrics - IAP Course version 1.1

Gephi

Analysis
• Network graphs & layout
• Dynamic filtering (including time-sliders)
• Clustering
• SNA: betweenness, closeness, diameter, PageRank, HITS,…
• Community detection (modularity)

gephi.org

Page 60: Overview of Bibliometrics - IAP Course version 1.1

Sci2 Tool

Analysis and Visualization
• Temporal – burst detection
• Geospatial
• Topical
• Networks – trees and graphs

Additional Benefits
• Parsers for citation data
• Bibliometric analysis tools
• Portable output files
• Direct connections to R and Gephi

http://sci2.cns.iu.edu

Page 61: Overview of Bibliometrics - IAP Course version 1.1

Command-Line Tools

Using Python
• SciPy: scientific data processing, statistics, visualization
  scipy.org
• NLTK: text processing and analysis
  nltk.org
• NetworkX: network measures (descriptive)
  networkx.github.io
• Bibtools: parse WOS data and identify communities of co-citation
  www.sebastian-grauwin.com/?page_id=492
• PythonOAI: retrieve bibliographic metadata from OAI sources, such as arXiv
  pypi.python.org/pypi/pyoai/

Using R
• tm: simple text processing and analysis
  cran.r-project.org/web/packages/tm/
• StatNet: network measures (descriptive); social network analysis (forecasting, causal); visualization
  statnet.org
• CITAN: citation analysis
  cran.r-project.org/web/packages/CITAN
• rplos: retrieve citation data from PLOS
  http://cran.r-project.org/web/packages/rplos/
• RMendeley: retrieve citation data from Mendeley
  http://ropensci.org/packages/rmendeley.html
• RISmed: retrieve data from NCBI
  http://cran.r-project.org/web/packages/RISmed/index.html
• OAIHarvester: retrieve data from OAI-PMH sources
  cran.r-project.org/web/packages/OAIHarvester/

Web integration for interactive visualization: d3js.org

Page 62: Overview of Bibliometrics - IAP Course version 1.1

Characteristics of Tools

• Built-in vs. external
• Free vs. fee-based
• Command line vs. interactive
• Open source vs. closed source
• Domain
– Data extraction, retrieval, integration
– Data cleaning and manipulation
– Network visualization
– Advanced measures
– Statistical analysis

Page 63: Overview of Bibliometrics - IAP Course version 1.1

Choosing Tools

• Simple standard impact → built-in database tools; Publish or Perish; Scholarometer
• Messy data → OpenRefine + …
• Network analysis measures
– Network measures → Sci2, SciMat, Pajek
– Visualizations → Gephi, Pajek, CitNet, SciMat
• Need to estimate complex statistical (predictive, statistical) models → R
• Need maximum software flexibility, integration with software → Python

(Options run from “Quick Start” at the top to “Power Tools” at the bottom.)

Page 64: Overview of Bibliometrics - IAP Course version 1.1

Data Processing

(Reorganizing Data) (Cleaning Data) (Matching Names)

Page 65: Overview of Bibliometrics - IAP Course version 1.1

Open Refine

• Spreadsheet/database combination
– Ease of use of spreadsheets
– Reporting and manipulative power of databases
• Filters, facets, and clustering
– Allow granular overview of what’s in your data
– Easily see occurrence distribution of values
– Easily make global corrections
• Supports both row-level and record-level (multi-row) operations

openrefine.org

Page 66: Overview of Bibliometrics - IAP Course version 1.1

Open Refine – Reorganize Data

Reorganizing Data
• Splitting/joining multi-valued cells
• Transposing rows/columns
• Supports logic-based transformation
– Google Refine Expression Language (GREL)
– Clojure
– Jython

openrefine.org

Page 67: Overview of Bibliometrics - IAP Course version 1.1

Open Refine – Cleaning Data

Cleaning Data
• Duplicate detection
• Common data transformations
– Trimming whitespace
– Normalizing text case
• Cluster/edit for matching and normalization

Additional Benefits
• Perform mass edits efficiently
• Revision history allows for roll-back to earlier state
• Transformations recorded as JSON
– Portable for future data sets
• Browser-based

openrefine.org
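The cleaning steps above (trim whitespace, normalize case, then cluster values that collapse to the same key) can be sketched outside OpenRefine as well. The following is a simplified illustration in the spirit of OpenRefine's key-collision "fingerprint" clustering, not its actual implementation; the function names and sample values are made up.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so that trivially different variants share one key.

    Steps: trim, lowercase, drop ASCII punctuation, split on whitespace,
    de-duplicate and sort the tokens, rejoin with single spaces.
    """
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint; keep only groups with distinct variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


names = ["Altman, Micah", "Micah Altman", " micah altman ", "Thomas, Sean"]
print(fingerprint("Altman, Micah"))   # altman micah
print(cluster(names))
```

Each cluster is a candidate for a mass edit: a person reviews the variants and picks the one spelling that should replace the rest.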

Page 68: Overview of Bibliometrics - IAP Course version 1.1

Open Refine – Matching Names

Matching names
• Create filters to navigate larger datasets
• Create facets to see all unique values/occurrences
• Auto-detect variant entries
• Cluster/edit for matching and normalization
• Reconciliation services against external data for normalization/aggregation

openrefine.org

Page 69: Overview of Bibliometrics - IAP Course version 1.1

Name Disambiguation

Methods
• Dictionary-based entity matching
• Phonetic matching
• Rules-based linkage
• Probability-based linking
– Edit distance
– Fellegi-Sunter algorithm
– Machine learning

Tools
• Febrl
  sourceforge.net/projects/febrl/
• RecordLinkage (for R)
  cran.r-project.org/web/packages/RecordLinkage/
• Link-King (for SAS)
  the-link-king.com

Source: en.wikipedia.org/wiki/Record_linkage
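Two of the methods listed above are easy to illustrate in a few lines. The sketch below shows Levenshtein edit distance (with a length-normalized similarity) and the per-field agreement weight used in Fellegi-Sunter style linkage. The function names and the m/u probabilities are illustrative assumptions; production linkage tools such as those listed add blocking, parameter estimation, and decision thresholds on top of this.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance rescaled to 0..1, where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def field_weight(m, u, agrees):
    """Fellegi-Sunter weight for one compared field.

    m: probability the field agrees when two records are a true match.
    u: probability the field agrees when they are not.
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


print(edit_distance("Fellegi", "Felligi"))    # 1
print(round(similarity("Fellegi", "Felligi"), 3))
# Assumed probabilities for a surname field: agreement is strong evidence
# of a match, disagreement is evidence against.
print(field_weight(0.9, 0.1, True), field_weight(0.9, 0.1, False))
```

In a Fellegi-Sunter comparison the weights of all compared fields are summed, and the total is compared against thresholds to label a record pair as a match, a non-match, or a case for manual review.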

Page 70: Overview of Bibliometrics - IAP Course version 1.1

Matching Names – Author Identifiers

What are Author Identifiers?

Author identifiers give you a way to reliably and unambiguously connect your name(s) with your work throughout your career, including your papers, data, biographical information, etc. This can be helpful in a number of ways:
• Provides a means to distinguish between you and other authors with identical or similar names.
• Links together all of your works even if you have used different names over the course of your career.
• Makes it easy for others (grant funders, other researchers, etc.) to find your research output.
• Ensures that your work is clearly attributed to you.

Getting started with ORCID...
• ORCID (Open Researcher and Contributor ID) is a non-proprietary, non-profit, community-based registry of research identifiers.
• Links authors to their datasets and other works in addition to articles.
• Authors can control what information in their ORCID profile they share. Only the ORCID ID is automatically shared. (See their privacy policy.)
• It is easy to import research output from other sources (including ResearcherID, Scopus Author ID, and DataCite Metadata Store) to your ORCID profile. (See ORCID's import works page.)
• Many organizations and publishers have created integrations with ORCID, including Nature Publishing Group, Elsevier, and the American Physical Society.
• Free, private, 30-second registration: orcid.org/register

libguides.mit.edu/content.php?pid=573578&sid=4729602
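To show why identifiers help with matching, here is a small sketch that checks the check digit of an ORCID iD (ORCID uses the ISO 7064 MOD 11-2 scheme for the final character) and then groups records by iD rather than by name string. The records and titles are made up; the iD used is the one ORCID publishes for its fictitious sample researcher, Josiah Carberry. The function names are illustrative, not part of any ORCID library.

```python
from collections import defaultdict


def orcid_check_digit(base_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an iD."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if the iD has 16 characters (ignoring hyphens) and a correct check digit."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


def group_by_orcid(records):
    """Collect titles under each iD, regardless of how the name was written."""
    works = defaultdict(list)
    for record in records:
        works[record["orcid"]].append(record["title"])
    return dict(works)


# Made-up records: the same person under two name variants.
records = [
    {"name": "J. Carberry", "orcid": "0000-0002-1825-0097", "title": "Paper one"},
    {"name": "Josiah S. Carberry", "orcid": "0000-0002-1825-0097", "title": "Paper two"},
]

print(is_valid_orcid("0000-0002-1825-0097"))   # True
print(group_by_orcid(records))
```

A check-digit test only catches typing errors; it does not confirm that an iD is registered or that it belongs to the named author.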

Page 71: Overview of Bibliometrics - IAP Course version 1.1

Application

(Combining External and Internal Sources) (Co-authorship Analysis) (Visualization)

Page 72: Overview of Bibliometrics - IAP Course version 1.1

Citation analysis – export citations

Question: For a given paper’s citing articles, what other articles were frequently cited?
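Once the citing articles and their reference lists have been exported, this question reduces to counting. The following is a minimal sketch assuming the export has already been reshaped into a mapping from each citing article to the list of works it cites; the identifiers and the function name are made up for illustration.

```python
from collections import Counter


def frequently_co_cited(reference_lists, target, top=3):
    """Count which other works appear alongside `target` in reference lists.

    reference_lists maps each citing article to the works it cites.
    Only articles that cite the target are considered, and each article
    counts a given reference once.
    """
    counts = Counter()
    for references in reference_lists.values():
        if target in references:
            counts.update(set(references) - {target})
    return counts.most_common(top)


# Hypothetical data: P1..P3 cite the target paper "T"; P4 does not.
reference_lists = {
    "P1": ["T", "A", "B"],
    "P2": ["T", "A"],
    "P3": ["T", "B", "C", "A"],
    "P4": ["A", "C"],
}

print(frequently_co_cited(reference_lists, "T", top=2))   # [('A', 3), ('B', 2)]
```

In practice most of the effort goes into the step before this one: cleaning the exported reference strings so that variants of the same cited work are counted together.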

Page 73: Overview of Bibliometrics - IAP Course version 1.1

Citation analysis – Open Refine

Page 74: Overview of Bibliometrics - IAP Course version 1.1

Citation analysis – Open Refine

Page 75: Overview of Bibliometrics - IAP Course version 1.1

Citation analysis – Open Refine

Page 76: Overview of Bibliometrics - IAP Course version 1.1

Overview of Citation Analysis

Resources

(Readings)(Software)(Data)

(Glossary)

Page 77: Overview of Bibliometrics - IAP Course version 1.1

Recommended Reading

• Data Processing – General

– Getting Started: programminghistorian.org/lessons/cleaning-data-with-openrefine

– References: Verborgh, Ruben, and Max De Wilde. Using OpenRefine. Packt Publishing Ltd, 2013.

– Tutorials: github.com/OpenRefine/OpenRefine/wiki/External-Resources

• Data Processing – Dealing with Names

– Getting Started, author identifiers guide: libguides.mit.edu/content.php?pid=573578&sid=4729602

– References: Winkler 2012; Name Matching and Record Linkages, U.S. Census, http://www.census.gov/srd/papers/pdf/rr93-8.pdf

Page 78: Overview of Bibliometrics - IAP Course version 1.1

Recommended Reading (Continued)

• Bibliometric Analysis

– Tutorials:

Anne-Wil Harzing, 2011, The Publish or Perish Book, Part 3: Doing bibliometric research with Google Scholar, Tarma software press

Wouter De Nooy et al., 2011, Exploratory Social Network Analysis with Pajek, 2nd Edition, Cambridge University Press

Author identifiers guide: libguides.mit.edu/content.php?pid=573578&sid=4729602

Article-level metrics: sparc.arl.org/sites/default/files/sparc-alm-primer.pdf

– References: Eric D. Kolaczyk, 2009, Statistical Analysis of Network Data: Methods and Models, Springer.

Page 79: Overview of Bibliometrics - IAP Course version 1.1

Available Databases & APIs

• Scholarly APIs: libguides.mit.edu/apis

• Google Scholar: scholar.google.com

• Scopus: scopus.com

• Web of Science: admin-apps.webofknowledge.com

• Author identifiers: libguides.mit.edu/content.php?pid=573578&sid=4729602

• List of MIT-licensed Databases: owens.mit.edu/sfx_local/az/mit_db

• Altmetrics

– PLOS article metrics: article-level-metrics.plos.org

– Plum Analytics: plumanalytics.com

– ImpactStory: impactstory.org

Page 80: Overview of Bibliometrics - IAP Course version 1.1

Additional Selected Tools

• OpenRefine: openrefine.org

• Publish or Perish: www.harzing.com/pop.htm

• Scholarometer: scholarometer.indiana.edu

• CitNet: citnetexplorer.nl

• CiteSpace: cluster.cis.drexel.edu/~cchen/citespace

• Gephi: gephi.org

• Sci2: sci2.cns.iu.edu

• Pajek: pajek.imfm.si

• Scimat: sci2s.ugr.es/scimat/

• R Packages:

– tm: cran.r-project.org/web/packages/tm/

– StatNet: statnet.org

– CITAN: cran.r-project.org/web/packages/CITAN

– Rplos: cran.r-project.org/web/packages/rplos/

– Rmendeley: ropensci.org/packages/rmendeley.html

– RISmed: cran.r-project.org/web/packages/RISmed

– OAIHarvester: cran.r-project.org/web/packages/OAIHarvester/l

• Python Packages:

– scipy: scipy.org

– nltk: nltk.org

– networkx: networkx.github.io

– bibtools: www.sebastian-grauwin.com/?page_id=492

– pyOAI: pypi.python.org/pypi/pyoai/

Page 81: Overview of Bibliometrics - IAP Course version 1.1

Glossary of Metrics

• Author H-Index:

The maximum number of articles h such that each has received at least h citations
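The definition above translates directly into a short computation: sort the citation counts in descending order and find the last rank at which the count is still at least the rank. A minimal sketch follows; the function name and the sample citation counts are made up for illustration.

```python
def h_index(citations):
    """Return the largest h such that h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` articles all have at least `rank` citations
        else:
            break
    return h


# Hypothetical author with five articles.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

Here four articles have at least 4 citations each, but there are not five articles with at least 5, so h = 4.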

• Centrality: A measure of the importance of a node in the network, based on a selected abstract model of influence or flow across the network. Centrality measures include degree centrality (number of connections); closeness centrality (distance of the node to other nodes in the network); and betweenness centrality (proportion of information that must pass through the node to go from one part of the network to another).
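To make two of these measures concrete, here is a minimal pure-Python sketch of degree and closeness centrality on a small undirected network. It assumes a connected, unweighted graph stored as a dictionary of neighbour sets; the toy co-authorship network and the function names are hypothetical. Closeness is computed here as (n - 1) divided by the sum of shortest-path distances, which is one common normalization. Betweenness is omitted for brevity; network libraries such as networkx provide all three.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path distances (in hops) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def degree_centrality(graph):
    """Number of connections of each node."""
    return {node: len(nbrs) for node, nbrs in graph.items()}


def closeness_centrality(graph):
    """(reachable nodes - 1) / sum of distances; assumes a connected graph."""
    result = {}
    for node in graph:
        dist = bfs_distances(graph, node)
        total = sum(dist.values())
        result[node] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical co-authorship network: author "a" has written with everyone.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"a", "c"},
}

print(degree_centrality(graph))
print(closeness_centrality(graph))
```

Author "a" scores highest on both measures in this toy network, since "a" is directly connected to every other author.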

• (ISI Journal) Impact Factor: The frequency with which the “average article” in a journal has been cited in a particular year. It is based on citations received in that year by articles the journal published in the two preceding years. It is only supplied for journals indexed by ISI in the Web of Science.
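As a worked illustration, the two-year impact factor for a given year is usually computed as the citations received in that year by items the journal published in the two preceding years, divided by the number of citable items published in those two years. The sketch below uses made-up numbers for a hypothetical journal; the function and parameter names are illustrative only.

```python
def impact_factor(citations_to_prev_two_years, items_in_prev_two_years):
    """Two-year impact factor: citations per item from the two preceding years."""
    if items_in_prev_two_years <= 0:
        raise ValueError("the journal must have published at least one item")
    return citations_to_prev_two_years / items_in_prev_two_years


# Hypothetical journal, 2013 impact factor:
# citations received in 2013 by items published in 2011 and 2012
citations_2013 = 150 + 90
# citable items published in 2011 and 2012
items_2011_2012 = 60 + 60

print(impact_factor(citations_2013, items_2011_2012))  # 2.0
```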

• Clustering: Methods that partition n observations into k clusters based on the characteristics of the observations. Clusters are defined either by a set of heuristics for forming the clusters, or according to a solution concept that the clusters will satisfy.

One common algorithm, K-Means, assigns each observation to one of a fixed number K of clusters such that each observation belongs to the cluster whose mean is closest to that observation.
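The K-Means idea can be sketched in a few lines for one-dimensional data, such as citation counts. This is a simplified illustration of the standard iterative procedure (assign each value to the nearest mean, then recompute the means), assuming the starting centers are supplied by the caller; the function name and the sample values are hypothetical.

```python
def k_means_1d(values, centers, max_iter=100):
    """Cluster 1-D values around len(centers) means; returns (centers, clusters)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each value joins the cluster with the nearest mean.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: recompute each mean (an empty cluster keeps its old center).
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, clusters


# Made-up citation counts: a few rarely cited and a few highly cited papers.
citation_counts = [1, 2, 3, 40, 42, 44]
centers, clusters = k_means_1d(citation_counts, centers=[1, 44])
print(centers)   # [2.0, 42.0]
print(clusters)  # [[1, 2, 3], [40, 42, 44]]
```

The result depends on the starting centers, which is why practical implementations typically try several initializations.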

• Network community structure measures: The detection of highly interconnected groups of nodes within a network. Methods include hierarchical clustering; information maximization; modularity; and clique detection.

• Network Diameter: The greatest distance (shortest-path length) between any two nodes in the network.
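For a small network, the diameter can be found by running a breadth-first search from every node and keeping the largest shortest-path distance seen. The sketch below assumes a connected, unweighted, undirected graph stored as an adjacency dictionary; the function names and the toy network are hypothetical.

```python
from collections import deque


def eccentricity(graph, source):
    """Largest shortest-path distance (in hops) from `source` to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return max(dist.values())


def diameter(graph):
    """Greatest shortest-path distance between any two nodes (connected graph assumed)."""
    return max(eccentricity(graph, node) for node in graph)


# Hypothetical chain of four nodes: a - b - c - d
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(diameter(path))  # 3
```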

• PageRank: A family of iteratively calculated, recursive impact factors in which citations from other journals are weighted by the impact of those journals.
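The recursive weighting can be illustrated with a basic power-iteration version of PageRank on a tiny journal citation network. This is a simplified sketch, not the exact formula used by any particular journal ranking: it assumes a damping factor of 0.85, a fixed number of iterations, and that every journal appears as a key in the input. The journal names and function name are made up.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    links: dict mapping each node to the list of nodes it cites.
    Every node must appear as a key (use an empty list if it cites nothing).
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes that cite nothing is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes its rank, split equally, to the nodes it cites.
        for node in nodes:
            if links[node]:
                share = damping * rank[node] / len(links[node])
                for target in links[node]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical journals: J1, J2, and J4 all cite J3; J3 cites J1.
links = {"J1": ["J3"], "J2": ["J3"], "J3": ["J1"], "J4": ["J3"]}
ranks = pagerank(links)
for journal, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(journal, round(score, 3))
```

J3 ranks highest because it is cited most, and J1 outranks J2 and J4 even though each is cited at most once, because J1's single citation comes from the high-ranked J3.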

Page 82: Overview of Bibliometrics - IAP Course version 1.1

Questions?

E-mail: [email protected]

informatics.mit.edu