
Page 1: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

[email protected]

Workshop on Scholarly Big Data: Challenges and Ideas. IEEE BigData 2013

Page 2: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Intro

• What are Big Scholarly Information Systems?

Page 3: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Intro

• What are bibliometric-enhanced IR models?

– a set of methods to quantitatively analyze scientific and technological literature

– e.g. citation analysis (h-index)

– CiteSeer was a pioneering bibliometric-enhanced IR system

Page 4: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Background

• DFG-funded (2009–2013): projects IRM I and IRM II

– IRM = Information Retrieval Mehrwertdienste (value-added IR services)

• Goal: implementation and evaluation of value-added IR services for digital library systems

• Main idea: applying scholarly (science) models to IR

– co-occurrence analysis of controlled vocabularies (thesauri)

– bibliometric analysis of core journals (Bradford's law)

– centrality in author networks (betweenness)

• In IRM I we concentrated on the basic evaluation

• In IRM II we concentrate on the implementation of reusable (web) services

http://www.gesis.org/en/research/external-funding-projects/archive/irm/

Page 5: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Search Term Recommender (Petras 2006)

Search Term Service: recommends strongly associated terms from a controlled vocabulary, as sketched below.
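A minimal sketch of the co-occurrence idea behind such a recommender, assuming a toy corpus and a simple normalized co-occurrence score (the data, names, and the association measure here are illustrative, not the IRM implementation):

```python
from collections import Counter
from itertools import product

# Toy corpus: each document pairs free-text terms with
# controlled-vocabulary descriptors (illustrative data only).
docs = [
    ({"unemployment", "youth"}, {"labor market", "adolescence"}),
    ({"unemployment", "policy"}, {"labor market", "social policy"}),
    ({"youth", "crime"}, {"adolescence", "deviant behavior"}),
]

cooc, desc_freq = Counter(), Counter()
for terms, descriptors in docs:
    desc_freq.update(descriptors)
    cooc.update(product(terms, descriptors))  # count term/descriptor pairs

def recommend(term, k=3):
    """Rank descriptors by co-occurrence with `term`,
    normalized by descriptor frequency."""
    scores = {d: cooc[(term, d)] / desc_freq[d]
              for d in desc_freq if cooc[(term, d)] > 0}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("unemployment"))
```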

Page 6: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Bradfordizing (White 1981, Mayr 2009)

Bradford's Law of Scattering (Bradford 1948), idealized example for 450 articles:

– nucleus/core: 150 papers in 3 journals

– zone 2: 150 papers in 9 journals

– zone 3: 150 papers in 27 journals

Ranking by Bradfordizing: sorting the core journal papers / core books on top.

A bradfordized list of journals in informetrics can be applied to monographs by using the publisher as the sorting criterion.
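A minimal sketch of the Bradfordizing re-ranking under these assumptions: journal productivity is counted within the result set itself, and papers from the most productive (core) journals are sorted to the top (toy data, not the IRM service):

```python
from collections import Counter

def bradfordize(results):
    """Re-rank (doc_id, journal) pairs so that papers from the
    most productive journals in this result set come first."""
    freq = Counter(journal for _, journal in results)
    # Journals ordered by productivity: core journals get rank 0, 1, ...
    rank = {j: i for i, (j, _) in enumerate(freq.most_common())}
    return sorted(results, key=lambda r: rank[r[1]])

hits = [("d1", "J-Rare"), ("d2", "J-Core"), ("d3", "J-Core"),
        ("d4", "J-Mid"), ("d5", "J-Core"), ("d6", "J-Mid")]
for doc, journal in bradfordize(hits):
    print(doc, journal)  # J-Core papers first, J-Rare last
```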

Page 7: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Author Centrality (Mutschke 2001, 2004)

Ranking by author centrality: sorting the papers of central authors on top.
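A minimal sketch of the betweenness-based ranking: build a co-author graph from the result set and sort each paper by its most central author (networkx and the toy data are assumptions; the IRM analysis runs on the giant component of the network):

```python
import itertools
import networkx as nx  # assumption: networkx is available

# Toy result set: each hit lists its authors (illustrative data).
papers = {
    "p1": ["A", "B"], "p2": ["B", "C"],
    "p3": ["C", "D"], "p4": ["D"], "p5": ["B", "D"],
}

# Build the co-author network from the result set.
G = nx.Graph()
for authors in papers.values():
    G.add_nodes_from(authors)
    G.add_edges_from(itertools.combinations(authors, 2))

# Betweenness centrality, the measure named on the Background slide.
centrality = nx.betweenness_centrality(G)

# Sort papers by their most central author.
ranking = sorted(papers,
                 key=lambda p: max(centrality[a] for a in papers[p]),
                 reverse=True)
print(ranking)
```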

Page 8: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Scenarios for combined ranking services

Iterative use: Result Set → Core Journal Papers → Central Author Papers → Relevant Papers

Simultaneous use: Result Set → Central Author Papers + Core Journal Papers
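A minimal sketch of the two combination scenarios, with hypothetical filter functions standing in for the real services (the field names and the threshold are assumptions for illustration):

```python
def core_journal_papers(results):
    # Stand-in for Bradfordizing: keep core-zone papers.
    return [r for r in results if r["zone"] == "core"]

def central_author_papers(results):
    # Stand-in for author centrality: keep papers whose best
    # author centrality exceeds an assumed threshold.
    return [r for r in results if r["centrality"] > 0.5]

def iterative(results):
    """One service applied after the other."""
    return central_author_papers(core_journal_papers(results))

def simultaneous(results):
    """Both services applied to the full result set,
    outputs merged without duplicates (order preserved)."""
    seen, fused = set(), []
    for r in core_journal_papers(results) + central_author_papers(results):
        if r["id"] not in seen:
            seen.add(r["id"])
            fused.append(r)
    return fused
```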

Page 9: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Prototype

http://multiweb.gesis.org/irsa/IRMPrototype

Page 10: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Evaluation

Page 11: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Main Research Issue:

Contribution to retrieval quality and usability

• Precision:

– Do central authors (core journals) provide more relevant hits?

– Do highly associated co-words have any positive effects?

• Value-adding effects:

– Do central authors (core journals) provide OTHER relevant hits?

– Do co-word relationships provide OTHER relevant search terms?

• Mashup effects:

– Do combinations of the services enhance these effects?

Page 12: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Evaluation Design

• Precision in existing evaluation data:

– CLEF 2003–2007: 125 topics; 65,297 SOLIS documents

– KoMoHe 2007: 39 topics; 31,155 SOLIS documents

• Plausibility tests:

– author centrality / journal coreness ↔ precision

– Bradfordizing ↔ author centrality

• Precision tests with users (online assessment tool)

• Usability tests with users (acceptance)

Page 13: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Evaluation of Bradfordizing on CLEF Data (Mayr 2013)

Precision across Bradford zones (core, zone 2, zone 3):

                   core   zone 2   zone 3
2003 articles      0.29    0.22     0.16
2004 articles      0.23    0.18     0.13
2005 articles      0.31    0.24     0.17
2006 articles      0.29    0.27     0.24
2007 articles      0.28    0.26     0.22
2005 monographs    0.21    0.16     0.19
2006 monographs    0.28    0.28     0.24
2007 monographs    0.24    0.21     0.23

Journal articles: significant improvement of precision from zone 3 to the core.

Monographs: only a slight improvement in the precision distribution across the three zones.

Page 14: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Evaluation of Author Centrality on CLEF Data

• moderate positive relationship between the rate of networking and precision

• precision of TF-IDF rankings (0.60) is significantly higher than that of author-centrality-based rankings (0.31), BUT:

• very little overlap of documents at the top of the ranking lists: 90% of the relevant hits provided by author centrality did not appear at the top of the TF-IDF rankings

→ added precision of 28%

[Scatter plot: giant size vs. precision per topic. Correlation Precision10 – giant size: 0.25]

• author centrality seems to favor OTHER relevant documents than traditional rankings

• value-adding effect: another view of the information space

avg. number of docs: 517
avg. number of authors: 664
avg. number of co-authors: 302
avg. giant size: 24
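A minimal sketch of the plausibility check behind the scatter plot: correlating per-topic giant-component size with precision (the numbers below are toy values; the study reports a correlation of 0.25 for Precision10):

```python
import statistics

# Per-topic giant-component sizes and precision values (toy data).
giant_size = [10, 24, 40, 55, 80]
precision  = [0.20, 0.30, 0.30, 0.50, 0.60]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson(giant_size, precision), 2))
```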

Page 15: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Result: overlap

Intersection of the suggested top n=10 documents over all topics and services (Mutschke et al. 2011): the top-10 result lists overlap only marginally!
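A minimal sketch of how such a top-10 overlap can be computed for two ranked lists (toy document IDs; the study averages the intersection over all topics and service pairs):

```python
def top_k_overlap(list_a, list_b, k=10):
    """Fraction of documents shared by the top-k of two rankings."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k

# Toy rankings from two services (illustrative IDs only).
tfidf_rank      = [f"d{i}" for i in range(1, 11)]   # d1 .. d10
centrality_rank = [f"d{i}" for i in range(9, 19)]   # d9 .. d18
print(top_k_overlap(tfidf_rank, centrality_rank))   # 0.2
```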

Page 16: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

IRSA


Page 17: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems


IRSA: Workflow

Page 18: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Analysis


Page 19: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Output


Returning suggestions for any query term

Page 20: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Integration


www.sowiport.de is using query suggestions from IRSA

Page 21: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

IRM & Modeling Science

Bibliometric-enhanced services (structural attributes of the science system):

– measuring the contribution of bibliometric-enhanced services to retrieval quality

– deeper insights into the structure & functioning of science

– a way towards a formal model of science

Page 22: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

References

• Mutschke, P., Mayr, P., Schaer, P., & Sure, Y. (2011). Science models as value-added services for scholarly information systems. Scientometrics, 89(1), 349–364. doi:10.1007/s11192-011-0430-x

• Lüke, T., Schaer, P., & Mayr, P. (2013). A framework for specific term recommendation systems. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’13) (pp. 1093–1094). New York, NY, USA: ACM Press. doi:10.1145/2484028.2484207

• Mayr, P. (2013). Relevance distributions across Bradford zones: Can Bradfordizing improve search? In J. Gorraiz, E. Schiebel, C. Gumpenberger, M. Hörlesberger, & H. Moed (Eds.), 14th International Society of Scientometrics and Informetrics Conference (pp. 1493–1505). Vienna, Austria. http://arxiv.org/abs/1305.0357

• Hienert, D., Schaer, P., Schaible, J., & Mayr, P. (2011). A novel combined term suggestion service for domain-specific digital libraries. In S. Gradmann, F. Borri, C. Meghini, & H. Schuldt (Eds.), International Conference on Theory and Practice of Digital Libraries (TPDL) (pp. 192–203). Berlin: Springer. doi:10.1007/978-3-642-24469-8_21

Page 24: Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems

Thank you!

Dr Philipp Mayr

GESIS Leibniz Institute for the Social Sciences

Unter Sachsenhausen 6-8

50667 Cologne

Germany

[email protected]
