Automatic Term Identification for Bibliometric Mapping
Nees Jan van Eck, Ludo Waltman
Erasmus University Rotterdam, The Netherlands
{nvaneck,lwaltman}@few.eur.nl
Ed Noyons, Renald Buter
Centre for Science and Technology Studies, Leiden University, The Netherlands
{noyons,buter}@cwts.leidenuniv.nl
10th International Conference on Science and Technology Indicators
Vienna, September 18, 2008
Bibliometric mapping
• Similarity measure
– Direct: Jaccard, Cosine, Association strength, …
– Indirect: Pearson correlation, Cosine, …
• Unit of analysis: Authors, Journals, Words/terms, Web pages, …
• Mapping technique
– Distance based: MDS, VxOrd, VOS, …
– Graph based: Pajek, Pathfinder networks, …
Research problem
• Important authors or journals in a field can be identified relatively easily based on number of citations (i.e., frequency of occurrence in reference lists)
• Identification of important terms based on frequency of occurrence gives poor results, with many very general terms
• Terms are therefore usually identified manually based on expert judgment. This has the disadvantage of being
– subjective
– labor-intensive
• We propose a method for (semi-)automatic term identification
Method (1)
• General overview of the proposed method:
corpus → Step 1: Calculation of unithood → linguistic units → Step 2: Calculation of termhood → terms
• Step 1 involves:
– part-of-speech tagging
– lemmatizing (stemming)
– identifying noun phrases (linguistic filter)
– identifying linguistic units (statistical filter; Dunning, 1993)
• Step 1 results in a list of linguistic units (noun phrases) that may or may not be terms
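The slides cite Dunning (1993) for the statistical filter but do not show the computation. As a rough sketch only, the log-likelihood ratio for a two-word candidate can be computed from a 2×2 table of bigram counts. The function name, the table layout, and the example counts are assumptions made for illustration, not details taken from the presentation.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning-style G^2 statistic for a 2x2 contingency table.

    Assumed layout (illustrative):
      k11: bigrams (w1, w2)          k12: bigrams (w1, not w2)
      k21: bigrams (not w1, w2)      k22: bigrams (not w1, not w2)
    """
    observed = ((k11, k12), (k21, k22))
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

# Made-up counts: the two words co-occur far more often than chance predicts.
strong = log_likelihood_ratio(50, 5, 5, 1000)
# Made-up counts: the two words are statistically independent.
independent = log_likelihood_ratio(10, 10, 10, 10)
print(strong > independent)
```

A high value suggests that the words form a unit rather than a chance combination. Any threshold for accepting a candidate would have to be chosen separately.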
Method (2)
• Step 2 is based on the following idea: a linguistic unit whose occurrences in a corpus of scientific texts are biased toward one or more topics is likely to refer to a domain-specific concept and, consequently, to be a term
• Example (number of occurrences per topic):
                Bibliometrics   Webometrics   Information retrieval
Hirsch index    93              8             2
recall          7               12            156
Web site        14              85            67
result          326             267           291
Method (3)
• How can different topics be identified in a corpus of scientific texts?
• We use a statistical latent class model called probabilistic latent semantic analysis (PLSA; Hofmann, 2001)
• PLSA provides a kind of fuzzy clustering of the linguistic units occurring in a corpus
• Each cluster corresponds with a topic
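To make the fuzzy clustering concrete, here is a minimal sketch of PLSA fitted with the EM algorithm in NumPy. It assumes a matrix of linguistic-unit counts per document and the symmetric parameterization P(u, d) = Σ_z P(z) P(u|z) P(d|z). The slides do not state which data PLSA is fitted to or how it is initialized, so the function name, the input layout, and the toy data are all illustrative assumptions.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit PLSA by EM. counts has shape (n_units, n_docs).

    Returns P(z), P(u|z) with shape (n_units, n_topics),
    and P(d|z) with shape (n_docs, n_topics).
    """
    rng = np.random.default_rng(seed)
    n_units, n_docs = counts.shape
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_u_z = rng.random((n_units, n_topics))
    p_u_z /= p_u_z.sum(axis=0)
    p_d_z = rng.random((n_docs, n_topics))
    p_d_z /= p_d_z.sum(axis=0)
    for _ in range(n_iter):
        # E-step: P(z|u,d) is proportional to P(z) P(u|z) P(d|z).
        joint = p_z[None, None, :] * p_u_z[:, None, :] * p_d_z[None, :, :]
        post = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M-step: re-estimate parameters from expected counts.
        weighted = counts[:, :, None] * post          # (units, docs, topics)
        topic_mass = weighted.sum(axis=(0, 1))        # (topics,)
        p_u_z = weighted.sum(axis=1) / np.maximum(topic_mass, 1e-12)
        p_d_z = weighted.sum(axis=0) / np.maximum(topic_mass, 1e-12)
        p_z = topic_mass / topic_mass.sum()
    return p_z, p_u_z, p_d_z

# Toy data (made up): units 0-1 occur in documents 0-2, units 2-3 in documents 3-5.
toy_counts = np.array([
    [5, 4, 6, 0, 0, 0],
    [3, 5, 4, 0, 0, 0],
    [0, 0, 0, 6, 5, 4],
    [0, 0, 0, 4, 6, 5],
], dtype=float)
p_z, p_u_z, p_d_z = plsa(toy_counts, n_topics=2)

# Soft topic membership of each unit: P(z|u) is proportional to P(u|z) P(z).
p_z_u = p_u_z * p_z
p_z_u /= p_z_u.sum(axis=1, keepdims=True)
print(np.round(p_z_u, 3))
```

Each row of `p_z_u` gives one unit's distribution over topics, which is the kind of per-topic profile an entropy-based termhood score can be computed from.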
Method (4)
• The termhood of a linguistic unit is determined using an entropy-like criterion
Number of occurrences per topic:
                Bibliometrics   Webometrics   Information retrieval
Hirsch index    93              8             2
recall          7               12            156
Web site        14              85            67
result          326             267           291
Proportion of occurrences per topic, with the resulting entropy:
                Bibliometrics   Webometrics   Information retrieval   Entropy
Hirsch index    0.903           0.078         0.019                   0.529
recall          0.040           0.069         0.891                   0.600
Web site        0.084           0.512         0.404                   1.323
result          0.369           0.302         0.329                   1.580
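The entropy values in this example are consistent with Shannon entropy in base 2, applied to each unit's share of occurrences per topic. The sketch below recomputes them from the occurrence counts. The slide calls the criterion "entropy-like", so the authors' exact formula may differ; the function and variable names are illustrative.

```python
import math

# Occurrence counts per topic, taken from the example:
# (Bibliometrics, Webometrics, Information retrieval)
counts = {
    "Hirsch index": [93, 8, 2],
    "recall": [7, 12, 156],
    "Web site": [14, 85, 67],
    "result": [326, 267, 291],
}

def topic_entropy(occurrences):
    """Shannon entropy (in bits) of a unit's distribution over topics."""
    total = sum(occurrences)
    proportions = [n / total for n in occurrences]
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Low entropy means occurrences are concentrated in few topics,
# which under the stated idea suggests a domain-specific term.
for unit, occurrences in sorted(counts.items(), key=lambda kv: topic_entropy(kv[1])):
    print(f"{unit:14s} {topic_entropy(occurrences):.3f}")
```

Computed from unrounded proportions, the results agree with the Entropy column to roughly two decimal places. With three topics the maximum possible entropy is log2(3) ≈ 1.585, so "result" is almost evenly spread across topics, while "Hirsch index" and "recall" are strongly biased toward a single topic.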
Application
• The proposed method is used to construct a term map of the operations research (OR) field
• The map is based on 7492 abstracts of papers published in OR journals between 2001 and 2005
• A two-step approach is taken:
– First, terms are identified using the proposed method
– Second, the relations between terms are visualized using the VOS method
• The proposed method is evaluated in two ways:
– Evaluation of the terms based on the criteria of precision and recall
– Evaluation of the term map based on a survey among OR experts
Precision and recall
• The proposed method (‘PLSA’) outperforms both a simple variant without PLSA (‘No PLSA’) and a naïve method based on frequency of occurrence (‘Frequency’)
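For reference, precision and recall of a set of identified terms can be computed against a manually verified term list as sketched below. The slides do not say how the reference list was built, and the example terms are made up for illustration.

```python
def precision_recall(identified, reference):
    """Precision and recall of identified terms against a reference term list."""
    identified, reference = set(identified), set(reference)
    correct = len(identified & reference)
    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical example: 4 units identified, 3 terms in the reference list.
identified_units = ["linear programming", "queueing", "result", "paper"]
reference_terms = ["linear programming", "queueing", "inventory control"]
precision, recall = precision_recall(identified_units, reference_terms)
print(precision, recall)
```

Precision is the share of identified units that are real terms; recall is the share of real terms that were identified.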
Survey
• Until now, 3 OR experts have responded (2 assistant professors and 1 full professor)
Strong points
• Good visualization of the structure of the field
• Clusters correspond quite well with subfields
• Some experts learned something new from the map
Weak points
• General terms in the center of the map
• A few important terms are missing
• Closely related terms are sometimes not very close in the map
Conclusions
• The results of the proposed method for (semi-)automatic term identification seem promising
• For accurate results, manual verification of the identified terms remains necessary
• The proposed method should be seen as a first step toward more accurate term maps for science policy decision making