Automatic Term Identification for Bibliometric Mapping

Nees Jan van Eck, Ludo Waltman (Erasmus University Rotterdam, The Netherlands; {nvaneck,lwaltman}@few.eur.nl) and Ed Noyons, Renald Buter (Centre for Science and Technology Studies, Leiden University, The Netherlands; {noyons,buter}@cwts.leidenuniv.nl). 10th International Conference on Science and Technology Indicators.

Page 1: Automatic Term Identification for Bibliometric Mapping

Automatic Term Identification for Bibliometric Mapping

Nees Jan van Eck, Ludo Waltman
Erasmus University Rotterdam, The Netherlands
{nvaneck,lwaltman}@few.eur.nl

Ed Noyons, Renald Buter
Centre for Science and Technology Studies, Leiden University, The Netherlands
{noyons,buter}@cwts.leidenuniv.nl

10th International Conference on Science and Technology Indicators
Vienna, September 18, 2008

Page 2: Automatic Term Identification for Bibliometric Mapping

Bibliometric mapping

• Similarity measure
– Direct: Jaccard, Cosine, Association strength, …
– Indirect: Pearson correlation, Cosine, …

• Unit of analysis: Authors, Journals, Words/terms, Web pages, …

• Mapping technique
– Distance based: MDS, VxOrd, VOS, …
– Graph based: Pajek, Pathfinder networks, …

Page 4: Automatic Term Identification for Bibliometric Mapping

Research problem

• Important authors or journals in a field can be identified relatively easily based on number of citations (i.e., frequency of occurrence in reference lists)

• Identification of important terms based on frequency of occurrence gives poor results, with many very general terms

• Terms are therefore usually identified manually based on expert judgment. This has the disadvantage of being
– subjective
– labor-intensive

• We propose a method for (semi-)automatic term identification

Page 5: Automatic Term Identification for Bibliometric Mapping

Method (1)

• General overview of the proposed method:

corpus → Step 1: Calculation of unithood → linguistic units → Step 2: Calculation of termhood → terms

• Step 1 involves:
– part-of-speech tagging
– lemmatizing (stemming)
– identifying noun phrases (linguistic filter)
– identifying linguistic units (statistical filter; Dunning, 1993)

• Step 1 results in a list of linguistic units (noun phrases) that may or may not be terms
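
The statistical filter in Step 1 cites Dunning (1993), whose log-likelihood ratio test is commonly used to check whether two words co-occur more often than chance would predict. The slides do not say how the test is applied, so the following is only a minimal sketch for a two-word candidate. The function name `log_likelihood_ratio` and all counts are hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of bigram counts.

    k11: occurrences of "w1 w2"
    k12: occurrences of w1 followed by a word other than w2
    k21: occurrences of w2 preceded by a word other than w1
    k22: all remaining bigrams
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing (0 * log 0 := 0)
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Illustrative counts only: a strongly associated pair versus a pair whose
# co-occurrence is exactly what independence would predict.
strong = log_likelihood_ratio(30, 5, 4, 10000)
independent = log_likelihood_ratio(1, 99, 99, 9801)
print(strong > independent)
```

A higher score suggests that the word pair behaves as a single unit; any threshold on the score would have to be chosen by the analyst.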

Page 6: Automatic Term Identification for Bibliometric Mapping

Method (2)

• Step 2 is based on the following idea: a linguistic unit whose occurrences in a corpus of scientific texts are biased toward one or more topics is likely to refer to a domain-specific concept and, consequently, to be a term

• Example (number of occurrences per topic):

              Bibliometrics  Webometrics  Information retrieval
Hirsch index  93             8            2
recall        7              12           156
Web site      14             85           67
result        326            267          291

Page 7: Automatic Term Identification for Bibliometric Mapping

Method (3)

• How can different topics be identified in a corpus of scientific texts?

• We use a statistical latent class model called probabilistic latent semantic analysis (PLSA; Hofmann, 2001)

• PLSA provides a kind of fuzzy clustering of the linguistic units occurring in a corpus

• Each cluster corresponds to a topic
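
To make the idea of fuzzy clustering concrete, here is a toy sketch of PLSA fitted with the EM algorithm in plain Python. It is not the authors' implementation. It assumes a unit-by-document count matrix and the symmetric parameterization P(u, d) = Σ_z P(z) P(u|z) P(d|z). The names `plsa`, `topic_given_unit` and `toy_counts` are hypothetical.

```python
import random


def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit a small PLSA model with EM.

    counts[u][d] is how often unit u occurs in document d.
    Returns (p_z, p_u_z, p_d_z) with p_u_z[z][u] = P(u | z).
    """
    rng = random.Random(seed)
    n_units, n_docs = len(counts), len(counts[0])

    def random_dist(size):
        weights = [rng.random() + 0.1 for _ in range(size)]
        total = sum(weights)
        return [w / total for w in weights]

    p_z = [1.0 / n_topics] * n_topics
    p_u_z = [random_dist(n_units) for _ in range(n_topics)]
    p_d_z = [random_dist(n_docs) for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z = [0.0] * n_topics
        new_u_z = [[0.0] * n_units for _ in range(n_topics)]
        new_d_z = [[0.0] * n_docs for _ in range(n_topics)]
        for u in range(n_units):
            for d in range(n_docs):
                n_ud = counts[u][d]
                if n_ud == 0:
                    continue
                # E-step: posterior P(z | u, d) for this cell
                joint = [p_z[z] * p_u_z[z][u] * p_d_z[z][d]
                         for z in range(n_topics)]
                norm = sum(joint)
                if norm == 0.0:
                    continue
                for z in range(n_topics):
                    r = n_ud * joint[z] / norm
                    new_z[z] += r
                    new_u_z[z][u] += r
                    new_d_z[z][d] += r
        # M-step: renormalize the expected counts
        total = sum(new_z)
        for z in range(n_topics):
            if new_z[z] > 0.0:
                p_u_z[z] = [v / new_z[z] for v in new_u_z[z]]
                p_d_z[z] = [v / new_z[z] for v in new_d_z[z]]
            p_z[z] = new_z[z] / total
    return p_z, p_u_z, p_d_z


def topic_given_unit(p_z, p_u_z, u):
    """Fuzzy cluster membership P(z | u), obtained with Bayes' rule."""
    joint = [p_z[z] * p_u_z[z][u] for z in range(len(p_z))]
    norm = sum(joint)
    return [j / norm for j in joint]


# Toy data: units 0-1 occur only in documents 0-1, units 2-3 only in 2-3.
toy_counts = [[5, 4, 0, 0], [3, 6, 0, 0], [0, 0, 4, 5], [0, 0, 6, 3]]
p_z, p_u_z, p_d_z = plsa(toy_counts, n_topics=2)
for u in range(len(toy_counts)):
    print(u, [round(p, 3) for p in topic_given_unit(p_z, p_u_z, u)])
```

Each row printed at the end is the membership of one unit across the topics, which is the distribution that a termhood criterion can then be applied to.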

Page 8: Automatic Term Identification for Bibliometric Mapping

Method (4)

• The termhood of a linguistic unit is determined using an entropy-like criterion

Number of occurrences per topic:

              Bibliometrics  Webometrics  Information retrieval
Hirsch index  93             8            2
recall        7              12           156
Web site      14             85           67
result        326            267          291

Share of occurrences per topic, with the resulting entropy:

              Bibliometrics  Webometrics  Information retrieval  Entropy
Hirsch index  0.903          0.078        0.019                  0.529
recall        0.040          0.069        0.891                  0.600
Web site      0.084          0.512        0.404                  1.323
result        0.369          0.302        0.329                  1.580
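
The entropy column can be approximately reproduced from the occurrence counts. The sketch below assumes plain Shannon entropy with a base-2 logarithm, which matches the slide's values up to small differences in the third decimal (the slide's shares are rounded). The slide only calls the criterion "entropy-like", so the authors' exact formula may differ. The names `topic_shares`, `entropy` and `occurrences` are hypothetical.

```python
import math


def topic_shares(counts):
    """Convert occurrence counts per topic into shares that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def entropy(shares):
    """Shannon entropy in bits; zero shares contribute nothing."""
    return -sum(p * math.log2(p) for p in shares if p > 0)


# Occurrence counts from the example table
# (Bibliometrics, Webometrics, Information retrieval).
occurrences = {
    "Hirsch index": [93, 8, 2],
    "recall": [7, 12, 156],
    "Web site": [14, 85, 67],
    "result": [326, 267, 291],
}

scores = {unit: entropy(topic_shares(c)) for unit, c in occurrences.items()}

# Lower entropy means a stronger bias toward few topics.
ranked = sorted(scores, key=scores.get)
for unit in ranked:
    print(f"{unit:12s} {scores[unit]:.3f}")
```

Read this way, a low entropy indicates occurrences concentrated in few topics and therefore higher termhood, while a general word such as "result" gets an entropy close to the maximum of log2(3) ≈ 1.585 for three topics.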

Page 9: Automatic Term Identification for Bibliometric Mapping

Application

• The proposed method is used to construct a term map of the operations research (OR) field

• The map is based on 7492 abstracts of papers published in OR journals between 2001 and 2005

• A two-step approach is taken:
– First, terms are identified using the proposed method
– Second, the relations between terms are visualized using the VOS method

• The proposed method is evaluated in two ways:
– Evaluation of the terms based on the criteria of precision and recall
– Evaluation of the term map based on a survey among OR experts

Page 10: Automatic Term Identification for Bibliometric Mapping

Precision and recall

• The proposed method (‘PLSA’) outperforms both a simple variant without PLSA (‘No PLSA’) and a naïve method based on frequency of occurrence (‘Frequency’)
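
For reference, precision and recall of a list of identified terms can be computed against a reference list of accepted terms. The slides do not describe how the reference list was built, so the sketch below uses made-up lists. The function name `precision_recall` and both lists are hypothetical.

```python
def precision_recall(identified, reference):
    """Precision and recall of identified terms against a reference set."""
    identified, reference = set(identified), set(reference)
    true_positives = len(identified & reference)
    precision = true_positives / len(identified) if identified else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up lists for illustration only.
identified_terms = ["linear programming", "queueing model", "result", "paper"]
reference_terms = ["linear programming", "queueing model", "integer programming"]

precision, recall = precision_recall(identified_terms, reference_terms)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the share of identified units that are real terms, and recall is the share of real terms that were identified. Comparing methods at several cut-off points on a ranked list gives a precision-recall curve.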

Page 11: Automatic Term Identification for Bibliometric Mapping

Page 12: Automatic Term Identification for Bibliometric Mapping

Page 13: Automatic Term Identification for Bibliometric Mapping

Survey

• So far, 3 OR experts have responded (2 assistant professors and 1 full professor)

Strong points
• Good visualization of the structure of the field
• Clusters correspond quite well with subfields
• Some experts learned something new from the map

Weak points
• General terms in the center of the map
• A few important terms are missing
• Closely related terms are sometimes not very close in the map

Page 14: Automatic Term Identification for Bibliometric Mapping

Conclusions

• The results of the proposed method for (semi-)automatic term identification seem promising

• For accurate results, manual verification of the identified terms remains necessary

• The proposed method should be seen as a first step toward more accurate term maps for science policy decision making
