
Combining Full-text analysis & Bibliometric Indicators

a pilot study

Patrick Glenisson 1

Wolfgang Glänzel 1,2

Olle Persson 3

1. Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium)

2. Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary)

3. Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)

Introduction

Goal: mapping of scientific processes

• Map of scientific papers

• Characterization of emerging clusters

• Extraction of new search keys

Using bibliometric as well as lexical indicators of ‘relatedness’

Full-text analysis

Overview

• Data sources and questions asked

• Text mining ingredients

• Text-based relational analysis of documents

• Contrasts with bibliometric analysis

• Term extraction from full-text

• Conclusion

Data source

19 full-text papers from:

Scientometrics, Vol 60, Issue 3 (2004)

special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China)

Validation setup: manual assignment of the papers to various classes

Data source

Section code | Section name | Papers
I | Advances in Scientometrics | Havemann et al. (2004); Moed and Garfield (2004); Small (2004); Yue and Wilson (2004)
II | Policy relevant issues | Negishi et al. (2004); Shelton and Holdridge (2004); Markusova et al. (2004); Wu et al. (2004)
III | Bibliometric approaches to collaboration in science | Beaver (2004); Kretschmer (2004); Persson et al. (2004); Yoshikane and Kageura (2004)
IV | Advances in Informetrics and Webometrics | Lamirel et al. (2004); Qiu and Chen (2004); Tang and Thelwall (2004); Vaughan and Wu (2004)
V | Mathematical models in Informetrics and Scientometrics | Egghe (2004); Glänzel (2004); Shan et al. (2004)

Research questions

Comparison text-based mapping vs. expert classification

Extracted keywords

Comparison with bibliometric mapping




Methodology

Given a set of documents, compute a representation, called an index, to retrieve, summarize, classify or cluster them. For example, each document becomes a vector over the vocabulary:

<1 0 0 1 0 1>

<1 1 0 0 0 1>

<0 0 0 1 1 0>

Methodology

Document processing

• Remove punctuation & grammatical structure (‘bag of words’)

• Define a vocabulary:
  - Identify multi-word terms, e.g., tumor suppressor (phrases)
  - Eliminate low-content words, e.g., and, thus, ... (stopwords)
  - Map words with the same meaning (synonyms)
  - Strip plurals, conjugations, ... (stemming)

• Define a weighting scheme and/or transformations (tf-idf, svd, ...)
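The processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the suffix-stripping `stem` function and the `phrases` argument are hypothetical stand-ins, not the tools used in the study, and synonym mapping is left out.

```python
import re

# Hypothetical, tiny stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "of", "the", "thus", "in", "is", "to"}

def stem(word):
    """Crude suffix stripping (plurals, -ing, -ed); stands in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text, phrases=()):
    """Return {term: count} after phrase detection, stopword removal and stemming."""
    text = text.lower()
    # Join known multi-word terms with an underscore so they survive tokenization.
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    tokens = re.findall(r"[a-z_]+", text)  # drops punctuation and digits
    counts = {}
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        term = tok if "_" in tok else stem(tok)  # phrases are kept unstemmed
        counts[term] = counts.get(term, 0) + 1
    return counts

bow = bag_of_words("The tumor suppressor genes, and the tumors.",
                   phrases=["tumor suppressor"])
print(bow)  # {'tumor_suppressor': 1, 'gene': 1, 'tumor': 1}
```

Word order and grammar are discarded on purpose: only the term counts feed the weighting step.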

Methodology

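Compute index of textual resources:

[Figure: documents as vectors in the vocabulary space, with term axes T1, T2 and T3]

Similarity between documents (Salton’s cosine):

$$
\mathrm{sim}(d_1, d_2) = \frac{\sum_i d_{1i}\, d_{2i}}{\sqrt{\sum_i d_{1i}^2}\;\sqrt{\sum_i d_{2i}^2}}
$$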
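As a small worked example, Salton's cosine can be computed directly for the three binary index vectors shown on the Methodology slide. The function name `cosine` is an illustrative choice, and using 1 minus the cosine as a distance is a common convention assumed here, not something the slides state.

```python
import math

def cosine(d1, d2):
    """Salton's cosine between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm1 * norm2)

docs = [
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

sim_12 = cosine(docs[0], docs[1])  # two shared terms out of three each: 2/3
sim_23 = cosine(docs[1], docs[2])  # no shared terms: 0

# Assumed convention: turn similarities into a distance matrix with 1 - cosine.
distance = [[1.0 - cosine(a, b) for b in docs] for a in docs]
```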




Results – Term statistics

19 papers

3610 retained terms (including ~400 bigrams)

Distance matrix (19x19)

• Apply MDS

• Apply clustering
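The slides do not say which MDS variant was applied. As a minimal sketch, assuming classical (Torgerson) MDS, a distance matrix can be turned into 2-D map coordinates as follows. The function `classical_mds` and the toy points are illustrative; the eigenvectors are found by plain power iteration to keep the example dependency-free, which assumes the leading eigenvalues are positive and well separated.

```python
import math

def classical_mds(dist, dims=2, iters=500):
    """Classical MDS of a symmetric distance matrix via power iteration."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Deterministic, non-constant start vector, normalised to length 1.
        v = [math.sin(i + 1.0 + k) for i in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue belonging to v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Toy check: four corners of a 3 x 1 rectangle are recovered up to rotation.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 1.0), (3.0, 1.0)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points] for p in points]
coords = classical_mds(dist)
```

For the 19x19 matrix of this study the same call would return one (x, y) pair per paper, ready for plotting.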

Results – MDS

[Figure: MDS map of the 19 papers, with groups labelled Policy, Mathematical approaches and Webometrics]

Results – Clustering

• Hierarchical clustering: Ward method, cut-off k=4

• Optimal parameters? ‘Stability-based method’

• Quantified correspondence with expert assignments? ‘Rand index’
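To make the clustering step concrete, here is a minimal pure-Python sketch of agglomerative clustering with Ward's criterion, cut at k clusters. It uses the Lance-Williams update on squared distances; the function name `ward_clusters` and the toy data are illustrative, and the study itself presumably relied on a standard statistics package.

```python
def ward_clusters(dist, k):
    """Agglomerative clustering with Ward's criterion; returns one label per item."""
    n = len(dist)
    # Ward's Lance-Williams update operates on squared distances.
    d = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    members = {i: [i] for i in range(n)}
    next_id = n
    while len(members) > k:
        # Merge the pair of clusters with the smallest Ward distance.
        (a, b), d_ab = min(d.items(), key=lambda kv: kv[1])
        na, nb = len(members[a]), len(members[b])
        new = next_id
        next_id += 1
        for c in members:
            if c in (a, b):
                continue
            nc = len(members[c])
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            d[(c, new)] = ((na + nc) * d_ac + (nb + nc) * d_bc - nc * d_ab) / (na + nb + nc)
        # Drop every entry that still refers to the two merged clusters.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        members[new] = members.pop(a) + members.pop(b)
    labels = [0] * n
    for label, cluster in enumerate(members.values(), start=1):
        for item in cluster:
            labels[item] = label
    return labels

# Toy data: six points on a line forming three obvious groups.
xs = [0, 1, 2, 50, 51, 90]
toy_dist = [[abs(p - q) for q in xs] for p in xs]
labels = ward_clusters(toy_dist, k=3)
```

With the 19x19 document distance matrix and `k=4`, the returned labels would correspond to the cut-off used on this slide.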

Results – Peer evaluation

Clusters (rows) vs. expert classes (columns):

Cluster   I   II   III   IV   V
1         3    4     1    0   0
2         0    0     0    3   0
3         0    0     1    0   3
4         1    0     2    1   0

[Figure annotations: Policy, Mathematical approaches, Webometrics]

Rand index = 0.778; p-value (with respect to permuted data) < 10^-3; significant
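The reported Rand index can be reproduced from the cluster-by-class counts on this slide. The sketch below counts the pairs of papers on which the two partitions agree. The permutation test is only one plausible reading of "permuted data" (shuffling the expert labels) and is an assumption; the authors' exact procedure is not given.

```python
import random
from math import comb

def rand_index(table):
    """Rand index from a cluster-by-class contingency table of counts."""
    n = sum(sum(row) for row in table)
    pairs = comb(n, 2)
    same_both = sum(comb(x, 2) for row in table for x in row)
    same_cluster = sum(comb(sum(row), 2) for row in table)
    same_class = sum(comb(sum(col), 2) for col in zip(*table))
    # Agreements = pairs together in both partitions + pairs separated in both.
    return (pairs + 2 * same_both - same_cluster - same_class) / pairs

def permutation_p_value(table, n_perm=2000, seed=0):
    """Share of label shufflings whose Rand index reaches the observed one."""
    clusters = [i for i, row in enumerate(table) for x in row for _ in range(x)]
    classes = [j for row in table for j, x in enumerate(row) for _ in range(x)]
    observed = rand_index(table)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(classes)
        perm = [[0] * len(table[0]) for _ in table]
        for c, k in zip(clusters, classes):
            perm[c][k] += 1
        if rand_index(perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

table = [
    [3, 4, 1, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 1, 0, 3],
    [1, 0, 2, 1, 0],
]
ri = rand_index(table)
print(round(ri, 3))  # 0.778 (133 agreeing pairs out of 171)
p = permutation_p_value(table)
```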




Results – Reference age

Histograms per paper

Results – Reference age

Histograms aggregated by expert class

Results – Ref Age vs. % Serial

Scatter plot of expert classes: mean reference age vs. percentage of serials
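The two indicators on this plot can be illustrated with a short sketch. The definitions used here are assumptions, since the slides do not spell them out: reference age is taken as the citing year minus the cited year, and the percentage of serials as the share of references pointing to serial publications. The reference list is made-up toy data.

```python
def reference_profile(citing_year, references):
    """references: list of (cited_year, is_serial) pairs for one paper."""
    ages = [citing_year - year for year, _ in references]
    mean_age = sum(ages) / len(ages)
    pct_serial = 100.0 * sum(1 for _, is_serial in references if is_serial) / len(references)
    return mean_age, pct_serial

# Hypothetical reference list of a paper published in 2004.
refs = [(2001, True), (1998, True), (1990, False), (2003, True)]
mean_age, pct_serial = reference_profile(2004, refs)
print(mean_age, pct_serial)  # 6.0 75.0
```

Averaging these two values over the papers of each expert class would give one point per class in the scatter plot.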




Results – Term extraction

Calculation of seminal keywords for each article

Using the TF-IDF weighting scheme

Normalized to unit norm (norm 1) to account for document length
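A minimal sketch of this keyword-extraction step, assuming the common tf x log(N/df) form of TF-IDF (the slides do not state which variant was used). The function names and the toy term counts are hypothetical.

```python
import math

def tfidf_unit_vectors(bags):
    """bags: list of {term: raw count}. Returns {term: weight} dicts with L2 norm 1."""
    n_docs = len(bags)
    df = {}
    for bag in bags:
        for term in bag:
            df[term] = df.get(term, 0) + 1
    vectors = []
    for bag in bags:
        vec = {t: tf * math.log(n_docs / df[t]) for t, tf in bag.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Dividing by the norm removes the effect of document length.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def top_terms(vec, k=10):
    """The k highest-weighted terms of one document."""
    return sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy counts for three documents; 'citat' occurs everywhere, so its weight is 0.
bags = [
    {"co_author": 4, "collabor": 3, "citat": 2},
    {"synchronous": 5, "age": 2, "citat": 3},
    {"department": 6, "inlink": 4, "citat": 1},
]
vectors = tfidf_unit_vectors(bags)
best = top_terms(vectors[0], k=2)  # co_author (about 0.8), then collabor (about 0.6)
```

Terms that occur in every paper receive zero weight, which is why generic vocabulary drops out of the keyword lists.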

Author(s): Persson et al.
Inflationary bibliometric values: the role of scientific collaboration and the need for relative indicators in evaluative studies

co_author 0.417794
collabor* 0.287652
domest* 0.208460
self_citat* 0.185298
explan* 0.170916
Growth 0.154099
reference_list 0.151925
intern*_collabor* 0.151925
reference_behaviour 0.151468
inflationari 0.151468

Author(s): Glänzel
Towards a model for diachronous and synchronous citation analyses

diachronous_prospect 0.492265
synchronous 0.377403
synchronous_retrospect 0.360994
age 0.250921
diachronous_prospect 0.238375
technic*_reliabl* 0.180497
citat*_process 0.150553
life_time 0.147679
impact_measur* 0.125460
random_select* 0.114862

Author(s): Moed and Garfield
In basic science the percentage of 'authoritative' references decreases as bibliographies become shorter

research_field 0.358836
authorit*_docum* 0.281942
authorit* 0.241017
docum* 0.197558
referenc* 0.179418
percent_most 0.179418
refer*_list 0.176746
refer* 0.165171
frequent*_cite 0.156779
persuasion 0.153787

Author(s): Shelton and Holdridge
The US-EU race for leadership of science and technology, Qualitative and quantitative indicators

EU 0.638957
WTEC 0.346503
panel 0.224208
output_indic* 0.142678
NAS 0.142678
leadership 0.142678
world 0.119689
input 0.114998
row 0.102220
panelist 0.101913

Author(s): Tang and Thelwall
Class: IV

department 0.420497
intern*_inlink 0.315920
gTLD 0.273798
public_impact 0.189552
disciplin* 0.148494
psychologi 0.145234
command 0.145234
region 0.135706
histori 0.123676
disciplinari_differ* 0.105307

Results – Full-text vs Abstract

Is a full-text analysis warranted for term extraction? For mapping purposes?

Results – Full-text vs Abstract

With abstracts only:

• Less structure

• Less overlap with expert classes: Rand index = 0.6257; p-value = 0.464; not significant

Full-text is an interesting source for additional keywords and improved mapping

Conclusion

• Keyword approach may be naïve

• But applied in a systematic framework, in combination with the ‘right’ algorithms, it provides interesting clues

• Complementary to bibliometric approaches

• Weak indications of benefits from using full-text articles

• Future: extension of this pilot to larger samples

References

• Bibliometrics; homepage Wolfgang Glänzel: http://www.steunpuntoos.be/wg.html

• Bibliometrics; homepage Olle Persson: http://www.umu.se/inforsk/Staff/olle.htm

• Text & Data mining; PhD thesis Patrick Glenisson: ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf

• Optimal k in clustering; Stability method