Combining Full-text analysis & Bibliometric Indicators
a pilot study
Patrick Glenisson 1
Wolfgang Glänzel 1,2
Olle Persson 3
1. Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium)
2. Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary)
3. Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)
Introduction
Goal: mapping of scientific processes
• Map of scientific papers
• Characterization of emerging clusters
• Extraction of new search keys
Using bibliometric as well as lexical indicators of ‘relatedness’
Full-text analysis
Overview
• Data sources and questions asked
• Text mining ingredients
• Text-based relational analysis of documents
• Contrasts with bibliometric analysis
• Term extraction from full-text
• Conclusion
Data source
• 19 full-text papers from Scientometrics, Vol 60, Issue 3 (2004), the special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China)
Validation setup
• Manual assignment of the papers to various classes ..
Data source
Section code | Section name | Papers
I | Advances in Scientometrics | Havemann et al. (2004); Moed and Garfield (2004); Small (2004); Yue and Wilson (2004)
II | Policy relevant issues | Negishi et al. (2004); Shelton and Holdrige (2004); Markusova et al. (2004); Wu et al. (2004)
III | Bibliometric approaches to collaboration in science | Beaver (2004); Kretschmer (2004); Persson et al. (2004); Yoshikane and Kageura (2004)
IV | Advances in Informetrics and Webometrics | Lamirel et al. (2004); Qiu and Chen (2004); Tang and Thelwall (2004); Vaughan and Wu (2004)
V | Mathematical models in Informetrics and Scientometrics | Egghe (2004); Glänzel (2004); Shan et al. (2004)
Research questions
Comparison text-based mapping vs. expert classification
Extracted keywords
Comparison with bibliometric mapping
Methodology
Given a set of documents, compute a representation, called an index, to retrieve, summarize, classify or cluster them.
Example index vectors, one per document:
<1 0 0 1 0 1>
<1 1 0 0 0 1>
<0 0 0 1 1 0>
Methodology
Document processing:
• Remove punctuation & grammatical structure (‘bag of words’)
Define a vocabulary:
• Identify multi-word terms, e.g., tumor suppressor (phrases)
• Eliminate low-content words, e.g., and, thus, .. (stopwords)
• Map words with the same meaning (synonyms)
• Strip plurals, conjugations, ... (stemming)
Define a weighting scheme and/or transformations (tf-idf, SVD, ..)
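The processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the phrase list and the plural-stripping rule are toy placeholders (a real pipeline would use a proper stemmer and curated lists), and none of the function names come from the original study.

```python
import re

# Toy resources, assumed purely for illustration.
STOPWORDS = {"and", "thus", "the", "of", "a", "in", "is", "are", "to"}
PHRASES = {("tumor", "suppressor")}  # multi-word terms, in stemmed form


def crude_stem(word):
    """Very rough plural stripping; a real stemmer does much more."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stopwords, stem, then join known phrases."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def binary_index(token_lists):
    """Build a sorted vocabulary and one 0/1 presence vector per document."""
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        present = set(tokens)
        vectors.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, vectors
```

For example, `preprocess("Tumor suppressors and genes, thus.")` yields `["tumor_suppressor", "gene"]`. Synonym mapping is left out of the sketch because it needs a domain thesaurus.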
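Methodology
Compute index of textual resources:
[Figure: documents as vectors in a space whose axes are vocabulary terms T1, T2, T3]
Similarity between documents, Salton’s cosine:
$$\mathrm{sim}(d_1, d_2) = \frac{\sum_i d_{1,i}\, d_{2,i}}{\sqrt{\sum_i d_{1,i}^2}\,\sqrt{\sum_i d_{2,i}^2}}$$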
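As a small worked example, Salton's cosine can be evaluated on the three binary index vectors shown on the earlier Methodology slide. The conversion to a distance as 1 minus the cosine is an assumption made here for illustration; the slides do not state how the distance matrix was derived.

```python
import math


def cosine(d1, d2):
    """Salton's cosine between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention for empty documents
    return dot / (norm1 * norm2)


# The example index vectors from the Methodology slide.
DOCS = [
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
]


def distance_matrix(docs):
    """Pairwise distances, assuming distance = 1 - cosine similarity."""
    return [[1.0 - cosine(a, b) for b in docs] for a in docs]
```

The first two vectors share two of their three terms, so their cosine is 2/3; the second and third share none, so their cosine is 0.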
Results – Term statistics
• 19 papers
• 3610 retained terms (including ~400 bigrams)
• Distance matrix (19x19)
• Apply MDS
• Apply clustering
Results – MDS
[Figure: MDS map of the 19 papers, with labelled groups: Policy, Mathematical approaches, Webometrics]
Results – Clustering
• Hierarchical clustering: Ward method, cut-off k=4
• Optimal parameters? ‘Stability-based method’
• Quantified correspondence with expert assignments? ‘Rand index’ ..
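A minimal sketch of agglomerative clustering with Ward's criterion, cut at k clusters, is given below. It assumes the documents are available as coordinate vectors and uses the Lance-Williams update on squared Euclidean distances; the function name and the toy data are hypothetical, and the slides do not say which software or distance the authors used.

```python
def ward_clusters(points, k):
    """Merge clusters by Ward's criterion until k clusters remain.

    Returns a sorted list of clusters, each a sorted list of point indices.
    """
    n = len(points)
    clusters = {i: [i] for i in range(n)}
    # Start from squared Euclidean distances between single points.
    dist = {}
    for i in range(n):
        for j in range(i + 1, n):
            dist[(i, j)] = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    next_id = n
    while len(clusters) > k:
        i, j = min(dist, key=dist.get)  # pair with the smallest Ward distance
        d_ij = dist[(i, j)]
        n_i, n_j = len(clusters[i]), len(clusters[j])
        updated = {}
        for m in clusters:
            if m in (i, j):
                continue
            n_m = len(clusters[m])
            d_mi = dist[tuple(sorted((m, i)))]
            d_mj = dist[tuple(sorted((m, j)))]
            # Lance-Williams update for Ward's method.
            updated[(m, next_id)] = (
                (n_i + n_m) * d_mi + (n_j + n_m) * d_mj - n_m * d_ij
            ) / (n_i + n_j + n_m)
        merged = clusters.pop(i) + clusters.pop(j)
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        dist.update(updated)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())


# Toy 2-D data with three visually separate groups (illustrative only).
TOY_POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
```

Calling `ward_clusters(TOY_POINTS, 3)` groups the indices as `[[0, 1, 2], [3, 4], [5, 6]]`. Choosing k itself (the ‘stability-based method’ on the slide) is not covered by this sketch.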
Results – Peer evaluation
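Rows: clusters 1-4; columns: expert classes I-V
Cluster | I | II | III | IV | V
1 | 3 | 4 | 1 | 0 | 0
2 | 0 | 0 | 0 | 3 | 0
3 | 0 | 0 | 1 | 0 | 3
4 | 1 | 0 | 2 | 1 | 0
[Figure labels: Policy, Mathematical approaches, Webometrics]
Rand index = 0.778; p-value (w.r.t. permuted data) < 10^-3; significant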
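The reported Rand index can be recomputed from the cluster-by-class counts above: of the 171 paper pairs, 133 are treated consistently by both partitions, giving 133/171 ≈ 0.778. The sketch below does this computation and adds a simple permutation test. The permutation scheme (shuffling class labels, add-one correction, fixed seed) is an assumption, since the slides do not describe how the permuted data were generated.

```python
import random
from math import comb

# Cluster (rows 1-4) by expert class (columns I-V) counts from the slide.
TABLE = [
    [3, 4, 1, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 1, 0, 3],
    [1, 0, 2, 1, 0],
]


def rand_index_from_table(table):
    """Rand index computed directly from a contingency table."""
    n = sum(map(sum, table))
    total_pairs = comb(n, 2)
    same_both = sum(comb(c, 2) for row in table for c in row)
    same_cluster = sum(comb(sum(row), 2) for row in table)
    same_class = sum(comb(sum(col), 2) for col in zip(*table))
    agreements = total_pairs + 2 * same_both - same_cluster - same_class
    return agreements / total_pairs


def labels_from_table(table):
    """Expand the table into per-paper cluster and class labels."""
    clusters, classes = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            clusters += [i] * count
            classes += [j] * count
    return clusters, classes


def rand_index(a, b):
    """Rand index from two label lists, by counting agreeing pairs."""
    pairs = agree = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            pairs += 1
            agree += (a[i] == a[j]) == (b[i] == b[j])
    return agree / pairs


def permutation_p_value(clusters, classes, n_perm=2000, seed=0):
    """Share of label shufflings whose Rand index reaches the observed one."""
    rng = random.Random(seed)
    observed = rand_index(clusters, classes)
    shuffled = list(classes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rand_index(clusters, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

The estimated p-value depends on the number of permutations and the seed, so it should be read as indicative rather than as a reproduction of the reported figure.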
Results – Reference age
Histograms per paper
Results – Reference age
Histograms aggregated by expert class
Results – Ref Age vs. % Serial
Scatter plot of expert classes: mean reference age vs. percentage of serials
Results – Term extraction
Calculation of seminal keywords for each article
• Using TF-IDF weighting scheme
• Normalized to norm 1 to account for document length
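A minimal sketch of this weighting follows, assuming raw term counts for TF and log(N/df) for IDF. The slides do not specify the exact TF-IDF variant, so these choices and the function names are illustrative only.

```python
import math
from collections import Counter


def tfidf_unit_vectors(docs):
    """TF-IDF weights per document, scaled to Euclidean norm 1.

    docs is a list of token lists; returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:  # leave all-zero vectors untouched
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


def top_terms(weights, k=10):
    """The k highest-weighted terms of one document, best first."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy corpus (illustrative tokens, not taken from the 19 papers).
TOY_DOCS = [
    ["citation", "citation", "age", "model"],
    ["web", "link", "model"],
    ["policy", "indicator", "model"],
]
```

In the toy corpus, "model" occurs in every document and therefore gets weight 0, while "citation" becomes the top term of the first document. Unit-norm scaling keeps long documents from dominating simply because they contain more tokens.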
Author(s): Persson et al.
Title: Inflationary bibliometric values: the role of scientific collaboration and the need for relative indicators in evaluative studies
Term | Weight
co_author | 0.417794
collabor* | 0.287652
domest* | 0.208460
self_citat* | 0.185298
explan* | 0.170916
Growth | 0.154099
reference_list | 0.151925
intern*_collabor* | 0.151925
reference_behaviour | 0.151468
inflationari | 0.151468

Author(s): Glänzel
Title: Towards a model for diachronous and synchronous citation analyses
Term | Weight
diachronous_prospect | 0.492265
synchronous | 0.377403
synchronous_retrospect | 0.360994
age | 0.250921
diachronous_prospect | 0.238375
technic*_reliabl* | 0.180497
citat*_process | 0.150553
life_time | 0.147679
impact_measur* | 0.125460
random_select* | 0.114862
Author(s): Moed and Garfield
Title: In basic science the percentage of 'authoritative' references decreases as bibliographies become shorter
Term | Weight
research_field | 0.358836
authorit*_docum* | 0.281942
authorit* | 0.241017
docum* | 0.197558
referenc* | 0.179418
percent_most | 0.179418
refer*_list | 0.176746
refer* | 0.165171
frequent*_cite | 0.156779
persuasion | 0.153787

Author(s): Shelton and Holdrige
Title: The US-EU race for leadership of science and technology, Qualitative and quantitative indicators
Term | Weight
EU | 0.638957
WTEC | 0.346503
panel | 0.224208
output_indic* | 0.142678
NAS | 0.142678
leadership | 0.142678
world | 0.119689
input | 0.114998
row | 0.102220
panelist | 0.101913
Author(s): Tang and Thelwall
Class: IV
Term | Weight
department | 0.420497
intern*_inlink | 0.315920
gTLD | 0.273798
public_impact | 0.189552
disciplin* | 0.148494
psychologi | 0.145234
command | 0.145234
region | 0.135706
histori | 0.123676
disciplinari_differ* | 0.105307
Results – Full-text vs Abstract
Is a full-text analysis warranted
• for term extraction?
• for mapping purposes?
Results – Full-text vs Abstract
Mapping based on abstracts only:
• Less structure
• Less overlap with expert classes: Rand index = 0.6257; p-value = 0.464; not significant
Full-text is an interesting source for additional keywords and improved mapping
Conclusion
• Keyword approach may be naïve
• But applied in a systematic framework, in combination with the ‘right’ algorithms, it provides interesting clues
• Complementary to bibliometric approaches
• Weak indications of benefits from using full-text articles
• Future: extension of this pilot to larger samples
References
• Bibliometrics; homepage Wolfgang Glänzel: http://www.steunpuntoos.be/wg.html
• Bibliometrics; homepage Olle Persson: http://www.umu.se/inforsk/Staff/olle.htm
• Text & Data mining; PhD thesis Patrick Glenisson: ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf
• Optimal k in clustering; Stability method