Advanced citation matching and large-scale cited reference extraction
Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
EXCITE Workshop 2017: “Challenges in Extracting and Managing References”
Cologne, Germany, March 30, 2017
Outline
• Citation matching
– Comparison of the accuracy of the Web of Science, CWTS, and iFQ citation matching algorithms
• Cited reference extraction
– Assessment of the accuracy of cited references in Web of Science based on Elsevier ScienceDirect data
Accuracy of the WoS, CWTS, and iFQ citation matching algorithms
Citation matching problem
…
References
[1] Hirsch, JE (2005) PNAS, 102, p.16569
[2] Egghe, L (2006) Scientist, 20, p.15
…

Bibliographic database (WoS, Scopus)

An index to quantify an individual's scientific research output
Hirsch, JE
PNAS, 102(46), p.16569-72
UT: 000233462900010
Abstract…

How to improve the h-index
Egghe, L
The Scientist, 20(3), p.15
UT: 000235634200013
Abstract…
Why is citation matching difficult?
• ‘Big data’ problem
– No. of publications: 50 million
– No. of cited references: 1 billion
• Little data available on cited references in WoS
– First author (last name and initials)
– Source title (abbreviated)
– Publication year
– Volume number
– First page number
– (DOI)
• Errors in data
– Citation extraction errors
  • OCR errors
  • Interpretation errors due to different citation styles
– Typos and other human errors
/A Olensky, M/Y 2015/W J ASS INFORM SCI TEC/V 67/P 2550
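The string above looks like a cited reference stored as slash-tagged fields. A minimal sketch of splitting such a string into named fields follows. The tag meanings (A = first author, Y = publication year, W = abbreviated source title, V = volume, P = first page) are an assumption read off the field list on this slide; the slide itself does not define them, and `parse_tagged_reference` is a hypothetical helper.

```python
import re

# Assumed meaning of the one-letter tags; the slide does not define them.
TAG_NAMES = {
    "A": "first_author",
    "Y": "year",
    "W": "source",
    "V": "volume",
    "P": "first_page",
}

def parse_tagged_reference(text):
    """Split a '/A .../Y ...' style string into a dict of named fields."""
    fields = {}
    # Each field is a slash, one capital letter, a space, then the value
    # running up to the next slash (or the end of the string).
    for tag, value in re.findall(r"/([A-Z]) ([^/]*)", text):
        fields[TAG_NAMES.get(tag, tag)] = value.strip()
    return fields

ref = parse_tagged_reference(
    "/A Olensky, M/Y 2015/W J ASS INFORM SCI TEC/V 67/P 2550"
)
print(ref)
```

Note that this simple pattern would break on values that themselves contain a slash; a real parser would need to know the exact export format.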
Citation matching algorithm of WoS
• Little is known about the citation matching algorithm used in WoS
• Larsen et al. (2007) concluded from their investigation of missed matches in WoS that the algorithm is quite conservative and does not allow for any variations
Citation matching algorithm of CWTS
• The aim is to overcome the problem of missed citation matches in WoS
• Iterative, rule-based algorithm:
1. Preprocessing
2. Start with the most restrictive matching rules
3. Continue with less restrictive matching rules
• Less restrictive matching rules allow for various types of inaccuracies in the cited reference data
Example matching rules
• Most restrictive matching rule:
– Exact match on
  • first author
  • publication year
  • publication name
  • volume number
  • starting page number
  • DOI
• Less restrictive matching rule:
– Match on
  • Soundex encoding of the last name of the first author
  • publication year plus or minus one
  • volume number
  • starting page number
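The two example rules can be sketched as a small rule cascade. This is an illustration under stated assumptions, not the CWTS implementation: the record layout (dicts with `first_author`, `year`, `source`, `volume`, `first_page`, `doi`), the helper names, and the choice to accept a rule only when it yields exactly one candidate are all hypothetical. The Soundex function follows the standard American Soundex scheme.

```python
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"), "L": "4",
    **dict.fromkeys("MN", "5"), "R": "6",
}

def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":          # H and W do not separate equal codes
            continue
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code            # a vowel resets prev to ""
    return (result + "000")[:4]

def last_name(author):
    return author.split(",")[0].strip()

def strict_rule(ref, pub):
    # Exact match on every field; two missing DOIs (None) count as equal here.
    keys = ("first_author", "year", "source", "volume", "first_page", "doi")
    return all(ref.get(k) == pub.get(k) for k in keys)

def relaxed_rule(ref, pub):
    return (soundex(last_name(ref["first_author"]))
            == soundex(last_name(pub["first_author"]))
            and abs(ref["year"] - pub["year"]) <= 1
            and ref["volume"] == pub["volume"]
            and ref["first_page"] == pub["first_page"])

def match(ref, pubs, rules=(strict_rule, relaxed_rule)):
    """Try rules from most to least restrictive; accept a unique hit only."""
    for rule in rules:
        hits = [p for p in pubs if rule(ref, p)]
        if len(hits) == 1:
            return hits[0], rule.__name__
    return None, None

pubs = [
    {"ut": "000233462900010", "first_author": "Hirsch, JE", "year": 2005,
     "source": "PNAS", "volume": "102", "first_page": "16569", "doi": None},
    {"ut": "000235634200013", "first_author": "Egghe, L", "year": 2006,
     "source": "SCIENTIST", "volume": "20", "first_page": "15", "doi": None},
]
# A cited reference with a misspelled author name and a publication year off by one.
ref = {"first_author": "Hirsh, JE", "year": 2006, "source": "PNAS",
       "volume": "102", "first_page": "16569", "doi": None}
hit, rule_name = match(ref, pubs)
print(hit["ut"], rule_name)
```

In this sketch the misspelled reference fails the strict rule and is picked up by the relaxed one, which is the behaviour the cascade is meant to provide: strict rules get the first chance, and looser rules only see what is still unmatched.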
Citation matching algorithm of iFQ
• Iterative, rule-based algorithm
• Allows non-unique matches of a single cited reference with several target articles
Data collection (1)
• Builds on data collected by Olensky (2015)
• Sample of 300 publications (cited pubs)
– 2 science domains
– 6 disciplines
– 2 languages
– 2 publication years
• 3975 corresponding cited references in WoS
– Times cited used to find cited references that are linked in WoS
– Cited reference search used to find cited references that are not linked in WoS
Data collection (2)
(Diagram: 300 cited pubs and 3975 citing pubs, with each citation either linked in WoS or not linked in WoS.)
Results
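Precision (all matches)

| | WoS # | WoS % | CWTS # | CWTS % | iFQ # | iFQ % |
| --- | --- | --- | --- | --- | --- | --- |
| Correct matches | 3664 | 99.2 | 3855 | 98.8 | 3856 | 98.5 |
| Incorrect matches | 29 | 0.8 | 45 | 1.2 | 57 | 1.5 |

Recall (all citations)

| | WoS # | WoS % | CWTS # | CWTS % | iFQ # | iFQ % |
| --- | --- | --- | --- | --- | --- | --- |
| Correct matches | 3664 | 93.8 | 3855 | 98.6 | 3856 | 98.7 |
| Missed matches | 244 | 6.2 | 53 | 1.4 | 52 | 1.3 |

| | WoS | CWTS | iFQ |
| --- | --- | --- | --- |
| F1 score (%) | 96.4 | 98.7 | 98.6 |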
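The percentages in these tables follow from the counts using the usual definitions: precision is correct matches over all matches made, recall is correct matches over all true citations, and F1 is their harmonic mean. A short sketch that recomputes them from the counts on the slide (the function name `scores` is just for illustration):

```python
def scores(correct, incorrect, missed):
    """Return (precision, recall, F1) as fractions between 0 and 1."""
    precision = correct / (correct + incorrect)  # share of made matches that are right
    recall = correct / (correct + missed)        # share of true citations that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# (correct, incorrect, missed) counts taken from the tables on this slide.
counts = {
    "WoS": (3664, 29, 244),
    "CWTS": (3855, 45, 53),
    "iFQ": (3856, 57, 52),
}

for name, (correct, incorrect, missed) in counts.items():
    p, r, f = scores(correct, incorrect, missed)
    print(f"{name}: precision {100 * p:.1f}, recall {100 * r:.1f}, F1 {100 * f:.1f}")
```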
Qualitative analysis of missed matches (WoS: 244; CWTS: 53; iFQ: 52)
Changes in CWTS citation matching algorithm
• Introduction of matching rules in which:
1. Volume and issue number are interchanged
2. Volume and first page number are interchanged
• Small change in the order in which the matching rules are applied
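The slides do not spell out the exact form of the two new rules. One plausible reading, sketched below, is that the cited reference carries the issue number in its volume field, or has volume and first page swapped. The function names, the record layout, and the requirement of an exact match on first author and year are assumptions made for this illustration only.

```python
def same_author_and_year(ref, pub):
    return ref["first_author"] == pub["first_author"] and ref["year"] == pub["year"]

def issue_in_volume_field(ref, pub):
    """The cited reference gives the issue number where the volume should be."""
    return (same_author_and_year(ref, pub)
            and ref["volume"] == pub["issue"]
            and ref["first_page"] == pub["first_page"])

def volume_and_page_swapped(ref, pub):
    """The cited reference has volume and first page number interchanged."""
    return (same_author_and_year(ref, pub)
            and ref["volume"] == pub["first_page"]
            and ref["first_page"] == pub["volume"])

# Target record based on the Hirsch example earlier in the deck: PNAS, 102(46), p.16569.
pub = {"first_author": "Hirsch, JE", "year": 2005,
       "volume": "102", "issue": "46", "first_page": "16569"}

# Two hypothetical erroneous cited references pointing at that record.
ref_issue = {"first_author": "Hirsch, JE", "year": 2005,
             "volume": "46", "first_page": "16569"}
ref_swapped = {"first_author": "Hirsch, JE", "year": 2005,
               "volume": "16569", "first_page": "102"}

print(issue_in_volume_field(ref_issue, pub))
print(volume_and_page_swapped(ref_swapped, pub))
```

Rules this loose are usually placed late in the cascade, after stricter rules have had their chance, which fits the note about changing the order in which the rules are applied.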
Improved results
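Precision (all matches)

| | CWTS (current) # | CWTS (current) % | CWTS (revised) # | CWTS (revised) % |
| --- | --- | --- | --- | --- |
| Correct matches | 3888 | 99.6 | 3906 | 99.8 |
| Incorrect matches | 16 | 0.4 | 9 | 0.2 |

Recall (all citations)

| | CWTS (current) # | CWTS (current) % | CWTS (revised) # | CWTS (revised) % |
| --- | --- | --- | --- | --- |
| Correct matches | 3888 | 98.7 | 3906 | 99.1 |
| Missed matches | 53 | 1.3 | 35 | 0.9 |

| | CWTS (current) | CWTS (revised) |
| --- | --- | --- |
| F1 score (%) | 99.1 | 99.4 |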
Conclusions
• A significant number of citation matches are missing in WoS
• Substantial improvement in recall is possible, but at the cost of a small decrease in precision
• Citation matching algorithm of CWTS performs quite well
• During the analysis, various problems were detected in WoS cited reference extraction
Accuracy of WoS cited reference extraction
Introduction
• Aim: To determine the accuracy of WoS cited references data
• Approach: Comparison of the cited references extracted from the full text of Elsevier publications with the cited references available in WoS
Data
• Elsevier full-text data
– ScienceDirect API
– Subscription-based journal publications in the period 1987-2016
• WoS metadata
– Document types ‘article’ and ‘review’
• Matching of Elsevier full-text data and WoS metadata at the level of individual publications
Validation of missing cited references in WoS
• Publication year: 2015
• Number of missing cited references: 73,536
• Sample size: 60
– Missing cited reference: 33 (55.0%)
– Incorrect cited reference: 10 (16.7%)
– Error in metadata of cited reference (e.g., incorrect publication year or incorrect volume number): 16 (26.7%)
– Correct cited reference: 1 (1.7%)
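The shares on this slide are simple proportions of the 60 sampled references. The sketch below recomputes them and, as a rough illustration only, scales each share up to the 73,536 references reported as missing for 2015. That scale-up assumes the sample is representative, which the slide does not state, and the resulting figures are not results from the talk.

```python
# Counts from the validation sample on the slide (60 sampled references).
sample = {
    "missing cited reference": 33,
    "incorrect cited reference": 10,
    "error in metadata of cited reference": 16,
    "correct cited reference": 1,
}
total = sum(sample.values())
reported_missing = 73_536  # number of missing cited references, publication year 2015

for label, n in sample.items():
    share = n / total
    # Naive scale-up: assumes the sample is representative of all reported cases.
    estimate = share * reported_missing
    print(f"{label}: {n}/{total} = {share:.1%}, roughly {estimate:,.0f} of {reported_missing:,}")
```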
Missing cited references in WoS (1) to (9)
(Nine slides of examples, shown as images only.)
Incorrect cited references in WoS (1) to (5)
(Five slides of examples, shown as images only.)
Incorrect cited references in WoS (6)
| WoS cited reference | Original cited reference in publication |
| --- | --- |
| WANG J, 2006, CHINESE CHEM LETT, V17, P49 | J. Wang, J.K. Carson, M.F. North, D.J. Cleland, Int. J. Heat Mass Transfer 49 (17) (2006) 3075–3083. |
| KANBER B, 2013, CEREBROVASC DIS S2, V35, P21 | Kanber B, Hartshorne TC, Horsfield MA, Naylor AR, Robinson TG, Ramnarine KV. Dynamic variations in the ultrasound gray-scale median of carotid artery plaques. Cardiovasc Ultrasound 2013a;11:21. |
| EVANS P, 2010, TLS-TIMES LIT S 0326, P30 | Evans PD, Chowdhury MJA. Photoprotection of wood using polyester-type UV absorbers derived from the reaction of 2 hydroxy-4(2,3-epoxypropoxy)-benzophenone with dicarboxylic acid anhydrides. J Wood Chem Technol 2010;30:186–204. |
Incorrect cited references in WoS (7)
| WoS cited reference | Original cited reference in publication |
| --- | --- |
| CAO X, 2010, IEEE GLOBECOMM 2010, V2010, P1 | Cao, X., Zong, Z., Ju, X., Sun, Y., Dai, C., Liu, Q., Jiang, J., 2010. Molecular cloning, characterization and function analysis of the gene encoding HMG-CoA reductase from Euphorbia Pekinensis Rupr. Mol. Biol. Rep. 37, 1559–1567. |
| LI XY, 2013, NANJING NONGYE DAXUE, V36, P36 | X. Li, S. Wang, Y. Chen, G. Liu, X. Yang, Overexpression of CD40 in sacral chordomas and its correlation with low tumor recurrence, Onkologie 36 (10) (2013) 567–571 |
| ZHANG K, 2014, IEEE T PATTERN ANAL, V1, P1 | K. Zhang, H. Chen, G. Wu, K. Chen, H. Yang, High expression of SPHK1 in sacral chordoma and association with patients’ poor prognosis, Med. Oncol. 31 (11) (2014) 247. |
More cited references in WoS (1) to (6)
(Six slides of examples, shown as images only.)
Conclusions
• About 0.3% of cited references are missing in WoS
• About 0.2% of cited references in WoS have minor errors (e.g., incorrect publication year or volume number)
• About 0.1% of cited references in WoS have major errors (i.e., reference to completely incorrect target document)
• WoS does a good job in handling references pointing to multiple target documents
• These results are based on Elsevier publications only; publications from other publishers may yield different outcomes
Thank you for your attention!