ala 2010 -- johan bollen

30
The MESUR project: an overview and update Johan Bollen Johan Bollen Indiana University School of Informatics and Computing Center for Complex Networks and System Research [email protected] Acknowledgements: Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Ryan Chute (LANL), Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL), Luis Bettencourt (LANL) Research supported by the NSF and Andrew W. Mellon Foundation. Indiana University November 5 th , 2009 School of Informatics and Computing

Upload: bisg

Post on 28-Nov-2014

2.218 views

Category:

Education


2 download

DESCRIPTION

 

TRANSCRIPT

Page 1: ALA 2010 -- Johan Bollen

The MESUR project: an overview and update

Johan BollenJohan BollenIndiana University

School of Informatics and ComputingCenter for Complex Networks and System Researchp y

[email protected]

Acknowledgements:Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Ryan Chute (LANL),

Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL), Luis Bettencourt (LANL)

Research supported by the NSF and Andrew W. Mellon Foundation.

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 2: ALA 2010 -- Johan Bollen

When the obvious is staring you in the face

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 3: ALA 2010 -- Johan Bollen

The scientific process: the importance of early indicators

(Egghe & Rousseau, 2000; Wouters, 1997)(Brody, Harnad, & Carr 2006),

Citation: final products• Publication delays

Usage data• Scale, cf. Elsevier downloads

(+1B) vs Wos citations (650M)• Focus on publications• Focus on authors

(+1B) vs. Wos citations (650M)• Immediate, early stages• Variety of resources and actors

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 4: ALA 2010 -- Johan Bollen

What is MESUR?

MESUR IS: Scientific project to study Science itself from REAL-TIME indicators.

Foundations:• Very large-scale, representative usage data (10^9)• Network science and social network analysis• Complex systems complex networksComplex systems, complex networksOutcomes• Surveys of novel impact metrics and their properties• Network models of Science

MESUR IS NOT:Commercial endeavor, end-user service development, promoter of

Unrelated picture of my daughter (who wants to be a scientist)

particular impact metrics, enemy nor friend of particular metrics, advocacy group, …

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 5: ALA 2010 -- Johan Bollen

Timeline and development

• 2006-2008: o Andrew W. Mellon Foundationo Digital Library Research and Prototyping team Los Alamos Nationalo Digital Library Research and Prototyping team, Los Alamos National

Laboratoryo Collection of large-scale usage data from some of world’s most

significant publishers, aggregators and institutional consortiao Feasibility: Usage data, usage-based network models of science,

usage-based impact metrics• 2009 – infinity and beyond:

NSF f di (S iSIP 2009 2012)o NSF funding (SciSIP, 2009-2012)o Indiana University, School of Informatics and Computing

• 2010: Andrew W. Mellon foundationContinuation of MESUR data collection and scientific worko Continuation of MESUR data collection and scientific work

o Investigate evolving to sustainable, open, community-supported infrastructure

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 6: ALA 2010 -- Johan Bollen

Presentation structure

1. MESUR’s Usage reference data setg2. Mapping scientific activity3. Metrics survey4. Future research4. Future research5. Discussion

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 7: ALA 2010 -- Johan Bollen

Creating the MESUR usage reference data set

1B

2006-2008: Collaborating publishers, aggregators and institutional consortia:• BMC, Blackwell, UC, CSU (23), EBSCO, ELSEVIER, EMERALD, INGENTA, JSTOR,BMC, Blackwell, UC, CSU (23), EBSCO, ELSEVIER, EMERALD, INGENTA, JSTOR,

LANL, MIMAS/ZETOC, THOMSON, UPENN (9), UTEXAS• Scale:

o > 1,000,000,000 usage events, and growing…50M ti l 100 000 i lo +50M articles, +-100,000 serials

• Period: 2002-2007, but mostly 2006

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 8: ALA 2010 -- Johan Bollen

Data normalization and ingestion

Minimal requirements for all usage data• Unique usage events (article level)• Fields: unique session ID, date/time, unique document ID and/or metadata,

request typerequest type• Note difference with usage statistics

2007 9 1 0 0 1 CFA cffoe A172080.N1.Vanderbilt.Edu unknown AST A 1996SPIE.2828..64S http://foe.edu/abs/1996SPIE.2828..64S http://www.google.com2007 9 1 0 0 1 CFA cffoe 210.94.41.89 unknown PHY A 2007ApPhL.90a2120C http://foe.edu/abs/2007ApPhL.90a2120C http://www.google.co.kr2007 9 1 0 0 1 CFA cffoe 24-196-228-125.dhcp.gwnt.ga.charter.com unknown AST A 2000ASPC.213.333S http://foe.edu/abs/2000bioa.conf.333S http://scholar.go2007 9 1 0 0 4 CFA cffoe 163.152.35.114 4700387eae PHY A 1993WRR..29.133S http://foe.edu/abs/1993WRR..29.133S http://scholar.google.com2007 9 1 0 0 6 CFA cffoe pd9e980fc dip0 t ipconnect de 45f0c69881 AST X 2007AN 328 841H http://arXiv org/abs/0708 1863 http://foe edu2007 9 1 0 0 6 CFA cffoe pd9e980fc.dip0.t-ipconnect.de 45f0c69881 AST X 2007AN..328.841H http://arXiv.org/abs/0708.1863 http://foe.edu2007 9 1 0 0 1 CFA cffoe A172080.N1.Vanderbilt.Edu unknown AST A 1996SPIE.2828..64S http://foeabs.edu/abs/1996SPIE.2828..64S http://www.google.com2007 9 1 0 0 1 CFA cffoe 210.94.41.89 unknown PHY A 2007ApPhL.90a2120C http://foeabs.edu/abs/2007ApPhL.90a2120C http://www.google.co.kr2007 9 1 0 0 1 CFA cffoe 24-196-228-125.dhcp.gwnt.ga.charter.com unknown AST A 2000ASPC.213.333S http://foeabs.edu/abs/2000bioa.conf.333S http://schola2007 9 1 0 0 4 CFA cffoe 163.152.35.114 4700387eae PHY A 1993WRR..29.133S http://foeabs.edu/abs/1993WRR..29.133S http://scholar.google.com2007 9 1 0 0 6 CFA cffoe pd9e980fc.dip0.t-ipconnect.de 45f0c69881 AST X 2007AN..328.841H http://arXiv.org/abs/0708.1863 http://foeabs.edu2007 9 1 0 0 6 CFA cffoe foel25144.4u.com.gh 47002f8eda PHY A 2002AGUFM.S21A0965M http://foeabs.edu/abs/2002AGUFM.S21A0965M http://www.goo2007 9 1 0 0 6 CFA cffoe 66-215-171-214.dhcp.ccmn.ca.charter.com 4681d22a6f AST A 2001P&SS..49.657R http://foeabs.edu/cgi-bin/bib_query?bibcode=2001P%2007 9 1 0 0 7 CFA cffoe nat ptouser3 uspto gov unknown PHY A 2005ApPhL 86g2106M http://foeabs edu/abs/2005ApPhL 86g2106M http://www google com2007 9 1 0 0 7 CFA cffoe nat-ptouser3.uspto.gov unknown PHY A 2005ApPhL.86g2106M http://foeabs.edu/abs/2005ApPhL.86g2106M http://www.google.com2007 9 1 0 0 7 CFA cffoe cpe-71-65-25-115.ma.res.rr.com unknown PHY A 1980SPIE.205.153S http://foeabs.edu/abs/1980SPIE.205.153S http://www.google.com2007 9 1 0 0 7 CFA cffoe customer3491.pool1.unallocated-106-0.orangehomedsl.co.uk unknown PHY A 1983ElL..19.883V http://foeabs.edu/abs/1983ElL..19.883V2007 9 1 0 0 8 CFA cffoe Uranus.seas.ucla.edu 46672d96b2 PHY A 1966Phy..32.385K http://foeabs.edu/abs/1966Phy..32.385K http://www.google.com2007 9 1 0 0 9 CFA cffoe 75-121-173-37.dyn.centurytel.net 46cf1fd8a6 AST D 1984ApJS..56.257J http://vizier.cfa.edu/viz-bin/VizieR?-source=III/92/ http://foe2007 9 1 0 0 13 CFA cffoe foel17-18.kln.forthnet.gr unknown AST A 1987cosm.book...C http://foeabs.edu/abs/1987cosm.book...C http://www.google.gr2007 9 1 0 0 15 CFA cffoe hades.astro.uiuc.edu 46f707564d PRE A 2007arXiv0707.3146N http://foeabs.edu/abs/2007arXiv0707.3146N http://foeabs.edu2007 9 1 0 0 17 CFA cffoe ool-43554752.dyn.optonline.net unknown PHY A 2000PhTea.38.132K http://foeabs.edu/abs/2000PhTea.38.132K http://www.google.com2007 9 1 0 0 17 CFA cffoe c 68 33 176 222 hsd1 md comcast net unknown GEN A 1994RSPSB 256 177M http://foeabs edu/abs/1994RSPSB 256 177M http://w2007 9 1 0 0 17 CFA cffoe c-68-33-176-222.hsd1.md.comcast.net unknown GEN A 1994RSPSB.256.177M http://foeabs.edu/abs/1994RSPSB.256.177M http://w2007 9 1 0 0 19 CFA cffoe 74-36-139-46.dr02.brvl.mn.frontiernet.net unknown AST T 2002SPIE.4767.114W http://foeabs.edu/cgi-bin/nph-abs_connect?bibcode=202007 9 1 0 0 19 CFA cffoe c-76-16-53-120.hsd1.il.comcast.net 46f667b71b AST F 1916PA...24.613L http://articles.foeabs.edu/cgi-bin/nph-iarticle_query?1916PA2007 9 1 0 0 20 CFA cffoe 74-39-37-62.nas03.roch.ny.frontiernet.net unknown PHY E 2007JSTEd.tmp..29B http://dx.doi.org/10.1007/s10972-007-9067-2 http://fo2007 9 1 0 0 22 ANU bio-mirror uatu-virtual1.anu.edu.au 46f9e8f87f AST A 2006ApJ..647.128E http://foe.grangenet.net/abs/2006ApJ..647.128E http://foe2007 9 1 0 0 22 CFA cffoe fw.hia.nrc.ca 46f1531d59 AST A 2002P&SS..50.745H http://foeabs.edu/abs/2002P%26SS..50.745H http://foeabs.edu2007 9 1 0 0 22 CFA cffoe 24-117-0-220.cpe.cableone.net unknown AST A 1984BITA..15.268S http://foeabs.edu/abs/1984BITA..15.268S http://www.google.com2

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 9: ALA 2010 -- Johan Bollen

Presentation structure

1. MESUR’s Usage reference data setg2. Mapping scientific activity3. Metrics survey4. Future research4. Future research5. Discussion

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 10: ALA 2010 -- Johan Bollen

Data set: subset of MESUR

• Common time period:po March 1st 2006 - February 1st 2007 o Thomson Scientific (Web of Science),

Elsevier (Scopus), JSTOR, Ingenta, University of Texas (9 campuses, 6 y ( p ,health institutions), and California State University (23 campuses)

• 346 312 045 usage events346,312,045 usage events• 97,532 serials (many of which not

journals)

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 11: ALA 2010 -- Johan Bollen

How to generate a usage network.

Same session ~ documents relatedness• Same session, same user: common interest

F f d f• Frequency of co-occurrence = degree of relationship

• Normalized: conditional probability

Usage data is on article level:• Works for journals and articles• Anything for which usage was recorded

Note: not something we invented: association rule learning in data mining.Beer and diapers!Beer and diapers!

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 12: ALA 2010 -- Johan Bollen

Johan Bollen, Herbert Van de Sompel, Aric Hagberg,Luis Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. Clickstream data yields high-resolution maps of science. PLoS One, February 2009.

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 13: ALA 2010 -- Johan Bollen

Network science for impact metrics.

PageRankPageRank

PR(vi): PageRank of node vi

O(vj): out-degree of journal vj

N: number of nodes in networkL: dampening factor

Betweenness centrality

: Number of geodesics between vi and vj

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 14: ALA 2010 -- Johan Bollen

Presentation structure

1. MESUR’s Usage reference data setg2. Mapping scientific activity3. Metrics survey4. Future research4. Future research5. Discussion

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 15: ALA 2010 -- Johan Bollen

A variety of impact metrics

Note:• Metrics can be calculated

both on citation andboth on citation and usage data

• “Frequentist”o Citation and usage

rates• “Structural”

o Citation graph, e.g. 2005 JCR2005 JCR

o Usage graph, e.g. created by MESUR

• H-index, G-index, SJR, tWh t d th MEAN? etcWhat do they MEAN?

What facets of impact do they represent?Which are best suited?

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 16: ALA 2010 -- Johan Bollen

Set of metrics calculated on MESUR data set

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 17: ALA 2010 -- Johan Bollen

The MESUR Metrics Map

BETWEENNESS

PAGERANK(S)

USAGE METRICS

TOTAL CITES

Johan Bollen, Herbert Van de Sompel, Aric Hagberg and Ryan Chute. A Principal g g y pComponent Analysis of 39 Scientific Impact Measures. PLoS ONE, June 2009. URL: http://dx.plos.org/10.1371/journal.pone.0006022.

RATE METRICS

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 18: ALA 2010 -- Johan Bollen

Presentation structure

1. MESUR’s Usage reference data setg2. Mapping scientific activity3. Metrics survey4. Future research4. Future research5. Discussion

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 19: ALA 2010 -- Johan Bollen

Samples of future work (can be skipped)

• Longitudinal studies:o Network changes over time: collaboration with Carl Bergstrom (UW)

P di ti f i ti i d lk d lo Prediction of innovation using random walk models• Logistics:

o Expand existing data set: focus on standardization, repeatabilityo Establish continued funding, good home for projecto Establish continued funding, good home for projecto “Center” model: rather than data->scientists, scientists->data

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 20: ALA 2010 -- Johan Bollen

Animated maps: tracing bursts of scientific activityp g y

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 21: ALA 2010 -- Johan Bollen

Coordinated bursts

23

2

1

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 22: ALA 2010 -- Johan Bollen

MESUR Mapping and ranking services

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 23: ALA 2010 -- Johan Bollen

MESUR Mapping and ranking services

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 24: ALA 2010 -- Johan Bollen

MESUR Mapping and ranking services

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 25: ALA 2010 -- Johan Bollen

MESUR: the good ...

After 3 years of MESUR:

• Scientific exploration of metrics for scholarly evaluation• Creation of large-scale reference data set• Mapping science from the viewpoint of users: there is structure!pp g p• Variety of metrics that cover various aspects of scholarly impact and prestige• MESUR dataset contains many more pearls for future research• Foundation for future continued research program:p g

• Longitudinal studies• Models of collective behavior of scientists

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 26: ALA 2010 -- Johan Bollen

MESUR: the bad and the ugly …

Scalability of the approach:• Lengthy negotiations to obtain log data• No infrastructure standards (yet): Recording, aggregating,No infrastructure standards (yet): Recording, aggregating,

normalization, ingestion, de-duplication,…• No generally accepted policies: privacy, property, …• No census data: when is a sample large and representative enough?

Quality control:• Bots, Crawlers (detectable but never perfect)

Cheating manipulation (easier with usage statistics than network• Cheating, manipulation (easier with usage statistics than network metrics)

Acceptance:p• Network-based usage metrics require session information. This is

overlooked! As a result, will we end up with usage-based statistics only?• “As simple as possible, but not more simple!”

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 27: ALA 2010 -- Johan Bollen

Moving towards community involvement

“Registration is now open for "Scholarly Evaluation Metrics: Opportunities and

Challenges", a one-day NSF-funded workshop that will take place in the Renaissance W hi t DC H t l W d d D b 16th 2009 P ti i ti i thi k hWashington DC Hotel on Wednesday, December 16th 2009. Participation in this workshop is limited to 50 people. Registration is free at http://informatics.indiana.edu/scholmet09/registration.html.

The topic of the workshop is the future of scholarly assessment approaches, including p p y pp , gorganizational, infrastructural, and community issues. The overall goal is to identify requirements for novel assessment approaches, several of which have been proposed in recent years, to become acceptable to community stakeholders including scholars, academic and research institutions, and funding agencies. The impressive group of speakers and panelists for the workshop includes representatives from each of these constituencies.

Further details are available at http://informatics.indiana.edu/scholmet09/announcement.html

Workshop organizers:Johan Bollen ([email protected]),Herbert Van de Sompel ([email protected]) andYing Ding ([email protected])

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 28: ALA 2010 -- Johan Bollen

Moving towards community involvement

Planning process underway to establish sustainable, open, community supported infrastructure.New support from Andrew W. Mellon foundation to figure it all out.

Logistics:Data aggregationNormalizationData-related servicesData management

Science:MetricsAnalysisPrediction

Data management

ServicesRankingAssessmentMapping

=More than sum of parts:• Each component supports the other• Various business and funding models• Generate added value on all levels

Can fundamentally change scholarly communication

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 29: ALA 2010 -- Johan Bollen

Some relevant publications.Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Luis Bettencourt, Ryan Chute,

Marko A. Rodriguez, Lyudmila Balakireva. Clickstream data yields high-resolution maps of science. PLoS One, March 2009 (In Press)

Johan Bollen, Herbert Van de Sompel, Aric HagBerg, Ryan Chute. A principal component analysis of 39 scientific impact measures.arXiv.org/abs/0902.2183

Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org:cs.DL/0601030)( g )

Johan Bollen, Herbert Van de Sompel, and Marko A. Rodriguez. Towards usage-based impact metrics: first results from the MESUR project. In Proceedings of the Joint Conference on Digital Libraries, Pittsburgh, June 2008

Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital Libraries Vancouver June 2007Libraries, Vancouver, June 2007

Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (cs.DL/0610154)

Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.

Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics,Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.

Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing

Page 30: ALA 2010 -- Johan Bollen

Presentation structure

1. MESUR’s Usage reference data setg2. Mapping scientific activity3. Metrics survey4. Future research4. Future research5. Discussion

Indiana UniversityNovember 5th, 2009

School of Informatics and Computing