Automatic Term Identification for Bibliometric Mapping


Nees Jan van Eck, Ludo Waltman
Erasmus University Rotterdam, The Netherlands
{nvaneck,lwaltman}@few.eur.nl

Ed Noyons, Renald Buter
Centre for Science and Technology Studies, Leiden University, The Netherlands
{noyons,buter}@cwts.leidenuniv.nl

10th International Conference on Science and Technology Indicators

Vienna, September 18, 2008


Bibliometric mapping

• Similarity measure
  – Direct: Jaccard, Cosine, Association strength, …
  – Indirect: Pearson correlation, Cosine, …

• Unit of analysis: Authors, Journals, Words/terms, Web pages, …

• Mapping technique
  – Distance based: MDS, VxOrd, VOS, …
  – Graph based: Pajek, Pathfinder networks, …
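
As a rough illustration of the direct similarity measures listed above, the sketch below computes them from co-occurrence counts. The formulas are the forms commonly used in the bibliometric literature, not definitions given on the slide. The function names and example counts are hypothetical, and association strength is shown without any constant scaling factor.

```python
import math

# c_ij: number of documents in which items i and j co-occur
# s_i, s_j: number of documents in which item i (resp. j) occurs

def jaccard(c_ij, s_i, s_j):
    """Co-occurrences divided by documents containing at least one of the items."""
    return c_ij / (s_i + s_j - c_ij)

def cosine(c_ij, s_i, s_j):
    """Co-occurrences divided by the geometric mean of the occurrence counts."""
    return c_ij / math.sqrt(s_i * s_j)

def association_strength(c_ij, s_i, s_j):
    """Co-occurrences divided by the product of the occurrence counts."""
    return c_ij / (s_i * s_j)

# Hypothetical counts, for illustration only.
c_ij, s_i, s_j = 4, 8, 10
for measure in (jaccard, cosine, association_strength):
    print(f"{measure.__name__}: {measure(c_ij, s_i, s_j):.4f}")
```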


Research problem

• Important authors or journals in a field can be identified relatively easily based on number of citations (i.e., frequency of occurrence in reference lists)

• Identifying important terms by frequency of occurrence gives poor results, because the most frequent terms include many very general ones

• Terms are therefore usually identified manually based on expert judgment. This has the disadvantage of being
  – subjective
  – labor-intensive

• We propose a method for (semi-)automatic term identification


Method (1)

• General overview of the proposed method:

[Figure: corpus → Step 1: Calculation of unithood → linguistic units → Step 2: Calculation of termhood → terms]

• Step 1 involves:
  – part-of-speech tagging
  – lemmatizing (stemming)
  – identifying noun phrases (linguistic filter)
  – identifying linguistic units (statistical filter; Dunning, 1993)

• Step 1 results in a list of linguistic units (noun phrases) that may or may not be terms
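
The statistical filter in step 1 cites Dunning (1993), whose log-likelihood ratio is widely used to test whether two words co-occur more often than chance. The sketch below computes that statistic for a 2×2 contingency table of bigram counts. The slides do not say which counts or threshold the authors use, so the function name, the table layout, and the example counts are assumptions.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    Assumed layout for a candidate bigram (w1, w2):
      k11: w1 followed by w2        k12: w1 followed by another word
      k21: another word before w2   k22: neither w1 first nor w2 second
    """
    def sum_xlogx(*counts):
        # x * ln(x), with the convention 0 * ln(0) = 0
        return sum(k * math.log(k) for k in counts if k > 0)

    n = k11 + k12 + k21 + k22
    cells = sum_xlogx(k11, k12, k21, k22)
    rows = sum_xlogx(k11 + k12, k21 + k22)
    cols = sum_xlogx(k11 + k21, k12 + k22)
    # Equivalent to 2 * sum(observed * ln(observed / expected)) over the four cells.
    return 2.0 * (cells - rows - cols + n * math.log(n))

# Hypothetical counts, for illustration only.
score = log_likelihood_ratio(30, 20, 25, 9925)
print(f"log-likelihood ratio: {score:.2f}")
```

A higher score is stronger evidence that the two words form a unit. Under independence the statistic is zero.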


Method (2)

• Step 2 is based on the following idea:

A linguistic unit whose occurrences in a corpus of scientific texts are biased toward one or more topics is likely to refer to a domain-specific concept and, consequently, to be a term

• Example:

Unit            Bibliometrics   Webometrics   Information retrieval
Hirsch index               93             8                       2
recall                      7            12                     156
Web site                   14            85                      67
result                    326           267                     291

Method (3)

• How can different topics be identified in a corpus of scientific texts?

• We use a statistical latent class model called probabilistic latent semantic analysis (PLSA; Hofmann, 2001)

• PLSA provides a kind of fuzzy clustering of the linguistic units occurring in a corpus

• Each cluster corresponds with a topic
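
To make the fuzzy-clustering idea concrete, here is a minimal sketch of PLSA fitted by expectation-maximization. It is not the authors' implementation. It assumes a unit-by-document count matrix as input, the symmetric parameterization P(u, d) = Σ_z P(z) P(u|z) P(d|z), and a fixed number of iterations. The function name `plsa` and the toy counts are hypothetical.

```python
import random

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM and return P(topic | unit) for every unit.

    counts[u][d] is the number of occurrences of unit u in document d
    (an assumed input format).
    """
    rng = random.Random(seed)
    n_units, n_docs = len(counts), len(counts[0])

    def random_distributions(size):
        # One random probability distribution of the given size per topic.
        result = []
        for _ in range(n_topics):
            weights = [rng.random() + 1e-3 for _ in range(size)]
            total = sum(weights)
            result.append([w / total for w in weights])
        return result

    p_z = [1.0 / n_topics] * n_topics
    p_u_z = random_distributions(n_units)  # p_u_z[z][u] = P(u | z)
    p_d_z = random_distributions(n_docs)   # p_d_z[z][d] = P(d | z)

    for _ in range(n_iter):
        new_u = [[0.0] * n_units for _ in range(n_topics)]
        new_d = [[0.0] * n_docs for _ in range(n_topics)]
        for u in range(n_units):
            for d in range(n_docs):
                n = counts[u][d]
                if n == 0:
                    continue
                # E-step: posterior P(z | u, d) for this cell.
                joint = [p_z[z] * p_u_z[z][u] * p_d_z[z][d] for z in range(n_topics)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_topics):
                    expected = n * joint[z] / total
                    new_u[z][u] += expected
                    new_d[z][d] += expected
        # M-step: renormalize the expected counts.
        mass = [sum(new_u[z]) for z in range(n_topics)]
        grand_total = sum(mass)
        for z in range(n_topics):
            if mass[z] > 0.0:
                p_u_z[z] = [x / mass[z] for x in new_u[z]]
                p_d_z[z] = [x / mass[z] for x in new_d[z]]
            p_z[z] = mass[z] / grand_total

    # Bayes' rule: P(z | u) is proportional to P(z) * P(u | z).
    topic_given_unit = []
    for u in range(n_units):
        weights = [p_z[z] * p_u_z[z][u] for z in range(n_topics)]
        total = sum(weights)
        topic_given_unit.append(
            [w / total if total > 0 else 1.0 / n_topics for w in weights]
        )
    return topic_given_unit

# Toy data: units 0-1 occur only in documents 0-1, units 2-3 only in documents 2-3.
toy_counts = [
    [5, 4, 0, 0],
    [4, 6, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 6, 4],
]
topic_given_unit = plsa(toy_counts, n_topics=2)
for unit, distribution in enumerate(topic_given_unit):
    print(unit, [round(p, 3) for p in distribution])
```

Each row of the returned matrix is a distribution over topics for one linguistic unit, which is the kind of distribution a termhood criterion can then be applied to.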

Method (4)

• The termhood of a linguistic unit is determined using an entropy-like criterion


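Number of occurrences:

Unit            Bibliometrics   Webometrics   Information retrieval
Hirsch index               93             8                       2
recall                      7            12                     156
Web site                   14            85                      67
result                    326           267                     291

Normalized distribution and entropy:

Unit            Bibliometrics   Webometrics   Information retrieval   Entropy
Hirsch index            0.903         0.078                   0.019     0.529
recall                  0.040         0.069                   0.891     0.600
Web site                0.084         0.512                   0.404     1.323
result                  0.369         0.302                   0.329     1.580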
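
The entropy values in the example can be reproduced approximately with the short sketch below. It normalizes each unit's occurrence counts and computes the base-2 Shannon entropy of the result. This assumes plain Shannon entropy over the three example fields; the slides only call the actual criterion "entropy-like", and in the proposed method the distribution would be over PLSA topics. The function name is hypothetical. The results agree with the slide values to within about 0.002, since the slide appears to use the rounded probabilities.

```python
import math

def topic_entropy(frequencies):
    """Base-2 Shannon entropy of the normalized frequency distribution."""
    total = sum(frequencies)
    probabilities = [f / total for f in frequencies]
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Occurrence counts from the example
# (Bibliometrics, Webometrics, Information retrieval).
occurrences = {
    "Hirsch index": [93, 8, 2],
    "recall": [7, 12, 156],
    "Web site": [14, 85, 67],
    "result": [326, 267, 291],
}

# Lower entropy means occurrences are more concentrated in few topics.
ranking = sorted(occurrences, key=lambda unit: topic_entropy(occurrences[unit]))
for unit in ranking:
    print(f"{unit}: {topic_entropy(occurrences[unit]):.3f}")
```

Units whose occurrences are concentrated in one field ("Hirsch index", "recall") get low entropy and so look more term-like, while "result", which is spread almost evenly, gets the highest entropy.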


Application

• The proposed method is used to construct a term map of the operations research (OR) field

• The map is based on 7492 abstracts of papers published in OR journals between 2001 and 2005

• A two-step approach is taken:
  – First, terms are identified using the proposed method
  – Second, the relations between terms are visualized using the VOS method

• The proposed method is evaluated in two ways:
  – Evaluation of the terms based on the criteria of precision and recall
  – Evaluation of the term map based on a survey among OR experts


Precision and recall

• The proposed method (‘PLSA’) outperforms both a simple variant without PLSA (‘No PLSA’) and a naïve method based on frequency of occurrence (‘Frequency’)
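
For reference, precision and recall for a set of identified terms can be computed as in the sketch below. The slides do not describe how the reference set of correct terms was obtained, so the function name and both term lists are hypothetical.

```python
def precision_recall(identified, relevant):
    """Return (precision, recall) of identified terms against a reference set."""
    identified, relevant = set(identified), set(relevant)
    hits = len(identified & relevant)
    precision = hits / len(identified) if identified else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical term lists, for illustration only.
identified_terms = ["integer programming", "result", "supply chain"]
reference_terms = ["integer programming", "supply chain", "queueing model", "heuristic"]

precision, recall = precision_recall(identified_terms, reference_terms)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Precision is the share of identified units that really are terms. Recall is the share of the reference terms that the method found.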


Survey

• So far, three OR experts have responded (two assistant professors and one full professor)

Strong points
• Good visualization of the structure of the field
• Clusters correspond quite well with subfields
• Some experts learned something new from the map

Weak points
• General terms in the center of the map
• A few important terms are missing
• Closely related terms are sometimes not very close in the map


Conclusions

• The results of the proposed method for (semi-)automatic term identification seem promising

• For accurate results, manual verification of the identified terms remains necessary

• The proposed method should be seen as a first step toward more accurate term maps for science policy decision making
