Combining Full-text analysis & Bibliometric Indicators a pilot study Patrick Glenisson 1 Wolfgang Glänzel 1,2 Olle Persson 3 1. Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium) 2. Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary) 3. Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)

Page 2

Introduction

• Goal: mapping of scientific processes
    Map of scientific papers
    Characterization of emerging clusters
    Extraction of new search keys

• Using bibliometric as well as lexical indicators of ‘relatedness’

• Full-text analysis

Page 3

Overview

• Data sources and questions asked
• Text mining
    Ingredients
    Text-based relational analysis of documents
    Contrasts with bibliometric analysis
    Term extraction from full-text
• Conclusion

Page 5

Data source

• 19 full-text papers from Scientometrics, Vol 60, Issue 3 (2004): special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China)

• Validation setup: manual assignment of the papers to classes

Page 6

Data source

Section code | Section name | Papers
I | Advances in Scientometrics | Havemann et al. (2004); Moed and Garfield (2004); Small (2004); Yue and Wilson (2004)
II | Policy relevant issues | Negishi et al. (2004); Shelton and Holdridge (2004); Markusova et al. (2004); Wu et al. (2004)
III | Bibliometric approaches to collaboration in science | Beaver (2004); Kretschmer (2004); Persson et al. (2004); Yoshikane and Kageura (2004)
IV | Advances in Informetrics and Webometrics | Lamirel et al. (2004); Qiu and Chen (2004); Tang and Thelwall (2004); Vaughan and Wu (2004)
V | Mathematical models in Informetrics and Scientometrics | Egghe (2004); Glänzel (2004); Shan et al. (2004)

Page 7

Research questions

Comparison text-based mapping vs. expert classification

Extracted keywords

Comparison with bibliometric mapping

Page 11

Methodology

Given a set of documents, compute a representation, called an index, to retrieve, summarize, classify or cluster them.

Example index vectors:

<1 0 0 1 0 1>
<1 1 0 0 0 1>
<0 0 0 1 1 0>
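A minimal sketch of how such an index could be built, assuming a plain binary term-presence vector per document. The toy documents and the function name `build_index` are illustrative assumptions, not taken from the study.

```python
def build_index(documents):
    """Return (vocabulary, vectors) with one binary term vector per document."""
    # Whitespace tokenization only; real preprocessing is more involved.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        present = set(tokens)
        vectors.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, vectors


# Hypothetical toy documents, for illustration only.
docs = ["citation analysis model", "citation web links", "web links model"]
vocabulary, vectors = build_index(docs)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary term, so documents that share terms have ones in the same positions.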

Page 12

Methodology

Document processing

• Remove punctuation & grammatical structure (‘bag of words’)

• Define a vocabulary
    Identify multi-word terms (e.g., tumor suppressor) (phrases)
    Eliminate low-content words (e.g., and, thus, ...) (stopwords)
    Map words with the same meaning (synonyms)
    Strip plurals, conjugations, ... (stemming)

• Define weighting scheme and/or transformations (tf-idf, SVD, ...)
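The processing steps above can be sketched roughly as follows. The stopword list, phrase list, synonym map and the crude plural-stripping rule are all hypothetical stand-ins chosen for illustration; the slides do not say which resources or stemmer were used.

```python
import re

# Tiny illustrative resources (assumptions, not the study's actual lists).
STOPWORDS = {"and", "thus", "the", "of", "a", "in", "is", "to"}
PHRASES = {("tumor", "suppressor")}
SYNONYMS = {"tumour": "tumor"}


def naive_stem(word):
    """Crude plural stripping; a stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    # Bag of words: lowercase, keep alphabetic runs, drop punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Join known multi-word terms into a single token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    # Remove stopwords, then stem, then map synonyms.
    kept = [t for t in merged if t not in STOPWORDS]
    stemmed = [naive_stem(t) for t in kept]
    return [SYNONYMS.get(t, t) for t in stemmed]


terms = preprocess("Tumor suppressor genes, and thus tumours.")
print(terms)
```

The order of the steps is a design choice in this sketch: phrases are detected before stopword removal so that multi-word terms are matched on adjacent words of the original text.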

Page 13

Methodology

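Compute index of textual resources:

[Figure: documents as vectors in the vocabulary space, with term axes T1, T2, T3]

Similarity between documents (Salton’s cosine):

$$\mathrm{sim}(d_1, d_2) = \frac{\sum_i d_{1i}\, d_{2i}}{\sqrt{\sum_i d_{1i}^2}\,\sqrt{\sum_i d_{2i}^2}}$$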
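As a small worked example of Salton's cosine, the sketch below applies it to the three example index vectors shown on the Methodology slides. The function name `cosine` and the choice to return 0 for an all-zero vector are assumptions made here.

```python
import math


def cosine(d1, d2):
    """Salton's cosine between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm1 * norm2)


d1 = [1, 0, 0, 1, 0, 1]
d2 = [1, 1, 0, 0, 0, 1]
d3 = [0, 0, 0, 1, 1, 0]

print(cosine(d1, d2))  # two shared terms out of three each: 2/3
print(cosine(d2, d3))  # no shared terms: 0.0
```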

Page 15

Results – Term statistics

• 19 papers

• 3610 retained terms (including ~400 bigrams)

• Distance matrix (19x19)
    Apply MDS
    Apply clustering
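A hedged sketch of the distance-matrix step: the slides do not state how similarities were turned into distances, so the conversion `1 - cosine` used below is an assumption, as are the function names. The resulting square matrix is the kind of input that MDS and clustering routines expect.

```python
import math


def cosine(d1, d2):
    """Salton's cosine between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def distance_matrix(vectors):
    """Symmetric matrix of pairwise distances, assumed to be 1 - cosine."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - cosine(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Three toy documents give a 3x3 matrix; 19 papers would give 19x19.
toy_vectors = [[1, 0, 0, 1, 0, 1], [1, 1, 0, 0, 0, 1], [0, 0, 0, 1, 1, 0]]
D = distance_matrix(toy_vectors)
for row in D:
    print([round(value, 3) for value in row])
```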

Page 17

Results – MDS

[Figure: MDS map of the papers, with annotated groups: Policy, Mathematical approaches, Webometrics]

Page 18

Results – Clustering

• Hierarchical clustering
    Ward method
    Cut-off k=4

• Optimal parameters?
    ‘Stability-based method’

• Quantified correspondence with expert assignments?
    ‘Rand index’ ..
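To illustrate the idea of Ward clustering with a cut-off at k clusters, here is a deliberately simple sketch. It works on toy 2-D points rather than on the document distance matrix used in the study, and the function name `ward_clusters` and the brute-force search are assumptions made for readability, not the authors' implementation.

```python
def ward_clusters(points, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][j] for i in members) / len(members)
                for j in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        squared = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * squared

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical points forming four well-separated pairs.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6),
              (10, 0), (10, 1), (10, 10), (10, 11)]
result = ward_clusters(toy_points, k=4)
print(result)
```

At each step the pair of clusters whose merge adds the least within-cluster variance is joined, which is what distinguishes Ward's method from single or complete linkage.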

Page 19

Results – Peer evaluation

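Number of papers per cluster (rows) and expert class (columns):

Cluster | I | II | III | IV | V
1 | 3 | 4 | 1 | 0 | 0
2 | 0 | 0 | 0 | 3 | 0
3 | 0 | 0 | 1 | 0 | 3
4 | 1 | 0 | 2 | 1 | 0

Cluster annotations: Policy, Mathematical approaches, Webometrics

Rand index = 0.778; p-value (w.r.t. permuted data) < 10^-3 (significant)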
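The reported Rand index can be re-derived from the cluster-by-class counts above. The sketch below expands the table into one label pair per paper and counts agreeing pairs. The permutation test is only one plausible reading of "permuted data" (shuffling the class labels), and its permutation count and seed are arbitrary assumptions.

```python
import random
from itertools import combinations

CLASSES = ["I", "II", "III", "IV", "V"]
# Counts from the slide: cluster -> papers per expert class I..V.
CONTINGENCY = {
    1: [3, 4, 1, 0, 0],
    2: [0, 0, 0, 3, 0],
    3: [0, 0, 1, 0, 3],
    4: [1, 0, 2, 1, 0],
}


def labels_from_table(table):
    """Expand the count table into parallel cluster and class label lists."""
    clusters, classes = [], []
    for cluster, counts in table.items():
        for cls, count in zip(CLASSES, counts):
            clusters += [cluster] * count
            classes += [cls] * count
    return clusters, classes


def rand_index(a, b):
    """Share of item pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def permutation_p_value(a, b, n_perm=999, seed=0):
    """Assumed test: shuffle b and count Rand values at least as large."""
    rng = random.Random(seed)
    observed = rand_index(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rand_index(a, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


clusters, classes = labels_from_table(CONTINGENCY)
print(round(rand_index(clusters, classes), 3))  # 133 of 171 pairs agree: 0.778
print(permutation_p_value(clusters, classes))
```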

Page 21

Results – Reference age

Histograms per paper

Page 22

Results – Reference age

Histograms aggregated by expert class

Page 23

Results – Ref Age vs. % Serial

Scatter plot of expert classes: mean reference age vs. percentage of serials
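For readers unfamiliar with the two indicators, a minimal sketch follows. It assumes the common definitions (reference age as citing year minus cited year, and percentage of serials as the share of references that point to serials such as journals); the slides do not spell out the exact definitions, and the reference records below are invented for illustration.

```python
def mean_reference_age(citing_year, references):
    """Average of (citing year - cited year) over a reference list."""
    ages = [citing_year - ref["year"] for ref in references]
    return sum(ages) / len(ages)


def percent_serials(references):
    """Share of references that point to serials, in percent."""
    serial_count = sum(1 for ref in references if ref["is_serial"])
    return 100.0 * serial_count / len(references)


# Hypothetical reference list of one paper published in 2004.
example_refs = [
    {"year": 2001, "is_serial": True},
    {"year": 1998, "is_serial": True},
    {"year": 1990, "is_serial": False},
    {"year": 2003, "is_serial": True},
]
print(mean_reference_age(2004, example_refs))  # (3 + 6 + 14 + 1) / 4 = 6.0
print(percent_serials(example_refs))           # 3 of 4 references: 75.0
```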

Page 25

Results – Term extraction

Calculation of seminal keywords for each article

• Using TF-IDF weighting scheme

• Normalized to norm 1 to account for document length
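A minimal sketch of TF-IDF keyword weighting with unit-length normalization. The slides do not give the exact formula, so the variant used here (raw term frequency times log(N/df), followed by Euclidean normalization) is an assumption, and the toy documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, scaled to Euclidean norm 1."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


def top_keywords(weights, n=3):
    """Highest-weighted terms of one document."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical, already preprocessed documents.
toy_docs = [
    ["co_author", "collabor", "citat", "citat"],
    ["synchronous", "age", "citat"],
    ["department", "gtld", "citat"],
]
weighted = tfidf_vectors(toy_docs)
for weights in weighted:
    print(top_keywords(weights, n=2))
```

A term that occurs in every document, such as `citat` here, gets weight zero, which is why the extracted keywords tend to be terms specific to a single paper.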

Page 26

Author(s): Persson et al.
Title: Inflationary bibliometric values: the role of scientific collaboration and the need for relative indicators in evaluative studies

Term | Weight
co_author | 0.417794
collabor* | 0.287652
domest* | 0.208460
self_citat* | 0.185298
explan* | 0.170916
Growth | 0.154099
reference_list | 0.151925
intern*_collabor* | 0.151925
reference_behaviour | 0.151468
inflationari | 0.151468

Author(s): Glänzel
Title: Towards a model for diachronous and synchronous citation analyses

Term | Weight
diachronous_prospect | 0.492265
synchronous | 0.377403
synchronous_retrospect | 0.360994
age | 0.250921
diachronous_prospect | 0.238375
technic*_reliabl* | 0.180497
citat*_process | 0.150553
life_time | 0.147679
impact_measur* | 0.125460
random_select* | 0.114862

Author(s): Moed and Garfield
Title: In basic science the percentage of 'authoritative' references decreases as bibliographies become shorter

Term | Weight
research_field | 0.358836
authorit*_docum* | 0.281942
authorit* | 0.241017
docum* | 0.197558
referenc* | 0.179418
percent_most | 0.179418
refer*_list | 0.176746
refer* | 0.165171
frequent*_cite | 0.156779
persuasion | 0.153787

Author(s): Shelton and Holdridge
Title: The US-EU race for leadership of science and technology, Qualitative and quantitative indicators

Term | Weight
EU | 0.638957
WTEC | 0.346503
panel | 0.224208
output_indic* | 0.142678
NAS | 0.142678
leadership | 0.142678
world | 0.119689
input | 0.114998
row | 0.102220
panelist | 0.101913

Author(s): Tang and Thelwall
Class: IV

Term | Weight
department | 0.420497
intern*_inlink | 0.315920
gTLD | 0.273798
public_impact | 0.189552
disciplin* | 0.148494
psychologi | 0.145234
command | 0.145234
region | 0.135706
histori | 0.123676
disciplinari_differ* | 0.105307

Page 27

Results – Full-text vs Abstract

Is a full-text analysis warranted
• for term extraction?
• for mapping purposes?

Page 28

Results – Full-text vs Abstract

Using abstracts only:

• Less structure

• Less overlap with expert classes: Rand index = 0.6257; p-value = 0.464 (not significant)

Full-text is an interesting source for additional keywords and improved mapping

Page 29

Conclusion

• Keyword approach may be naïve

• But applied in a systematic framework, in combination with the ‘right’ algorithms, it provides interesting clues

• Complementary to bibliometric approaches

• Weak indications of benefits from using full-text articles

• Future: extension of this pilot to larger samples

Page 30

References

• Bibliometrics; homepage Wolfgang Glänzel: http://www.steunpuntoos.be/wg.html

• Bibliometrics; homepage Olle Persson: http://www.umu.se/inforsk/Staff/olle.htm

• Text & Data mining; PhD thesis Patrick Glenisson: ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf

• Optimal k in clustering; Stability method