Cornell CS 502 Scholarly Communication II Citation Analysis and Reference Linking CS 502 – 20020430 Carl Lagoze – Cornell University



Recalling the themes

• Basic assumptions are broken
  – Expensive distribution
  – Distinctions between publishers, authors, readers

• Basic assumptions remain
  – Need for quality
  – Need for people to make money
  – Reward system: tenure and promotion

• Changing context
  – Readers' habits
  – Increase of scholarly output
  – Computing and network power

Unbundling content from services

Acks: P. Ginsparg


Signs of Change - Readers

… there’s a sense in which the journal articles prior to the inception of the electronic abstracting and indexing database may as well not exist, because they are so difficult to find. Now that we are starting to see … full-text showing up online, I think we are very shortly going to cross a sort of critical mass boundary where those publications that are not instantly available in full-text will become kind of second-rate in a sense, not because their quality is low, but just because people will prefer the accessibility of things they can get right away.

Clifford Lynch 1997


Signs of Change - Publishers

• Electronic versions of existing journals
• Licensing arrangements to libraries
  – http://campusgw.library.cornell.edu/cgi-bin/dj.cgi?section=ejournal&URL=SerialsSearch
• Problems
  – License bundling
    • Inflate costs and maintain economic model
    • Force libraries to subscribe regardless of interest
  – Longevity dependent on license continuity


Signs of Change - Publishers

• Electronic Journals
  – D-Lib Magazine – http://www.dlib.org
  – Journal of Digital Information (JODI) – http://journals.ecs.soton.ac.uk/jodi/
  – Journal of Electronic Publishing (JEP) – http://www.press.umich.edu/jep/

• The economic models are not established


Signs of Change – Publishers and Libraries

• JSTOR
  – http://www.jstor.org
• Recognition of reality
  – Archival journal storage is expensive for libraries
    • Shelf space crisis forces libraries to choose between
      – Keeping archival issues of serials
      – Continuing subscriptions for new issues
      – Building expensive new buildings
  – Archival copies have limited economic value to publishers
• Cooperative non-profit model among publishers/foundation (Mellon)/libraries
• Sliding window to digitize old issues of serials and provide ready access services


Signs of Change – Libraries & Professional Societies

• HighWire Press – http://highwire.stanford.edu
• Realities
  – Many professional societies and journals are “Mom & Pop” operations
  – Technical and economic cost of electronic publishing is often prohibitively high
• Solution
  – HighWire acts as a brokering service to provide electronic publishing technology for small professional societies and journals
  – Pooling technology allows creation of higher level services (e.g., reference linking amongst journals)


Signs of Change - Scholars

• Eprint repositories
  – Author self-archiving gives scholars control over their intellectual output
  – Harnad’s “subversive proposal”
  – Direct descendant of traditional pre-print sharing in print form among scholars
• Examples
  – arXiv – http://arxiv.org
  – ePrints – http://www.eprints.org
  – California Digital Library scholarly publishing archive – http://repositories.cdlib.org/
• Related Issues
  – Publisher agreements – some journals refuse to publish anything that has been posted as an eprint


Signs of Change – Computer Scientists

• Automatic creation of traditional journal services
  – Citation analysis
  – Reviewing


Concepts – References and Citations

[Diagram: Doc1, Doc2, Doc3]

Doc1 references: (Doc2, Doc3)

Doc1 citations: (Doc2)


Concepts – References and Citations

• # of references of a document is finite, stable, and easy to determine/compute

• # of citations of a document is dynamic, impossible to compute (infinite)

• Generally, references are at the work, or manifestation level, NOT at the item level
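
The distinction can be sketched in a few lines. A document's references are fixed once it is published, whereas its citations can only be derived by inverting the reference lists of whatever corpus is known so far, so the answer changes as the corpus grows. The corpus below is hypothetical; it assumes that Doc2 references Doc1, which matches the Doc1 example given earlier.

```python
from collections import defaultdict

# Hypothetical corpus: each document maps to the documents it references.
references = {
    "Doc1": ["Doc2", "Doc3"],
    "Doc2": ["Doc1"],
    "Doc3": [],
}

def citations_of(references):
    """Invert reference lists: cited document -> sorted list of citing documents."""
    cited_by = defaultdict(list)
    for source, targets in references.items():
        for target in targets:
            cited_by[target].append(source)
    # Only documents in the known corpus are reported; unseen citers are invisible.
    return {doc: sorted(cited_by[doc]) for doc in references}

citations = citations_of(references)
print(citations["Doc1"])
```

Adding a new document to `references` can change the citation list of any existing document, while the existing reference lists stay untouched.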


Citation Analysis

• Understanding citation patterns among scholarly journals
  – Quality metric
  – Cost/benefit analysis – what “basic” journals should a library have in its holdings
• Eugene Garfield – “Father of citation analysis”
• Science Citation Index
  – Origins circa 1950s
  – Hand analysis of printed journals showing patterns of citations into and out from journals
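
As a rough illustration of what counting citations "into and out from" journals means, the sketch below tallies both directions from a list of journal pairs. The journal names and the pair list are invented for the example; this is not the Science Citation Index procedure itself.

```python
from collections import Counter

# Hypothetical data: one (citing journal, cited journal) pair per reference.
pairs = [
    ("Journal A", "Journal B"),
    ("Journal A", "Journal C"),
    ("Journal B", "Journal C"),
    ("Journal C", "Journal C"),
]

def journal_flows(pairs):
    """Return (citations out of each journal, citations into each journal)."""
    out_counts = Counter(citing for citing, _ in pairs)
    in_counts = Counter(cited for _, cited in pairs)
    return out_counts, in_counts

out_counts, in_counts = journal_flows(pairs)
print(in_counts.most_common(1))
```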


Results of citation analysis

acks. Garfield, Science, 1972


Citation analysis in the digital age

• Automatic citation linking among papers in arXiv
  – Citebase (Open Citations Project)
  – http://citebase.eprints.org/cgi-bin/search?submit=1&author=Hawking%2C%20S%20W%20
• Scientometrics - Automation of methods reveals lots of data
  – Longevity of interest in paper
  – Journal and ePrint citation patterns
• Automatic citation analysis as a reviewing tool?


Are papers downloaded then cited or cited then downloaded? (2)

• If all these time differences are plotted, the graph below is produced.

[Figure: “What came first the Citation or the Download”. x-axis: Age of Paper at Download minus Age of Paper at Citation (-300 to 2700); y-axis: Frequency (0 to 7000).]

Acks: S. Harnad


Citation Latencies

• The raw data show that the latency of the citation peak has been reducing over the period of the archive

[Figure: “Frequency of Citation Latencies: 1992-1999”. x-axis: Time Difference/Months (0 to 96); y-axis: Citations (0 to 5000); one series per year, 92 through 99.]

Acks: S. Harnad
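
A latency histogram of this kind could be built roughly as follows. The slide does not define latency precisely, so the sketch assumes it is the number of whole calendar months between the deposit of the cited paper and the date of the citing paper; the dates are made up.

```python
from collections import Counter
from datetime import date

def months_between(earlier, later):
    """Calendar-month difference, ignoring the day of the month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def latency_histogram(events):
    """events: iterable of (deposit date of cited paper, date of citing paper)."""
    return Counter(months_between(deposit, cited) for deposit, cited in events)

# Hypothetical citation events.
events = [
    (date(1995, 1, 10), date(1995, 7, 2)),
    (date(1995, 1, 10), date(1996, 1, 20)),
    (date(1998, 3, 1), date(1998, 9, 15)),
]
histogram = latency_histogram(events)
print(sorted(histogram.items()))
```

Grouping the events by deposit year before calling `latency_histogram` would give one curve per year, as in the figure.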


Author Impact Quartiles

• High impact authors update more than medium or low
• High and medium impact authors deposit more papers than low

Quartile   Total    % Total   Citations   Papers   Citations/Author/Paper   Deposits   Mean Updates/Author
High 25%   798      2.09%     240,092     2,732    0.11                     6,720      0.48
Med 50%    9,262    24.20%    733,272     37,318   0.00212                  93,671     0.37
Low 25%    28,211   73.71%    251,925     67,951   0.000131                 165,971    0.27

Acks: S. Harnad


Citation Quality

• Papers generally cite papers of like impact

[Figure: “Do Papers Cite Papers of Like Impact”. Axes: Source Impact (Low, Medium, High), Dest. Impact (High, Medium, Low), No of Citations (0 to 140000).]

Acks: S. Harnad


Citation Spread

• A small number of papers receive a very large number of citations

[Figure: “Histogram of Citations per Paper (author impact)”; 30,000 papers were by authors with no citation. x-axis: No citations, 1 Citation, 2/3 Citations, 4/5/6 Citations, 7/8/9/10 Citations, 11 or more Citations; y-axis: Papers (0 to 40000); series: High (2.53%), Medium (34.55%), Low (62.92%).]

Acks: S. Harnad


How Paper Impact Affects Usage

• Higher impact papers have a longer download life expectancy.

[Figure: “All Papers”. x-axis: Age of paper (days), 0 to 2398; y-axis: Frequency Density (0 to 0.0025); series: High (2.0%), Medium (7.7%), Low (46.5%), Unknown (39.6%).]

Acks: S. Harnad


What is the correlation between citations and downloads?

• There is a significant positive correlation between citations and downloads for high impact papers.

Download type                  r          n
All Papers                     0.11155    63671
High Impact Papers (2.0%)      0.27293    1981
Medium Impact Papers (7.7%)    0.01288    5937
Low Impact Papers (46.5%)      -0.01412   30163

Acks: S. Harnad
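
The r column is a correlation coefficient over n papers. The slide does not say which coefficient was used; the sketch below assumes the usual Pearson product-moment correlation and runs it on invented per-paper counts, so its output says nothing about the real data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (needs nonzero variance)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical per-paper counts.
citations = [0, 2, 5, 9, 14]
downloads = [10, 25, 30, 80, 95]
print(round(pearson_r(citations, downloads), 3))
```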


full text

reference linking


Who is Who

Books in Print

Amazon.com

extended services


Static Linking

• Fixed URLs in references
• All the associated problems with URLs
• Persistent link through document “footprint”
  – Robust URLs – Berkeley


General Idea: Enhance URLs with “signatures”

• Add to a URL a “signature”, a small piece of document content.

• When “traditional” (i.e., address-based) dereferencing fails, do “signature-based” (i.e., content-based) dereferencing:
  – Pass the signature to some search service, and hope that the target will be prominent among a very small result set.
• Two issues:
  – Computing small, yet effective and robust signatures
  – Adding them innocuously to hyperlinks

Acks: Phelps/Wilensky
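
The two-step dereferencing described above can be sketched as a fallback. This is a minimal illustration, not the Phelps/Wilensky implementation: `fetch` and `search` are hypothetical callables supplied by the caller (for example, an HTTP client and a search-engine wrapper), so the sketch makes no network calls itself.

```python
from typing import Callable, List, Optional

def dereference(
    url: str,
    signature: List[str],
    fetch: Callable[[str], Optional[str]],
    search: Callable[[str], List[str]],
) -> Optional[str]:
    """Return a URL for the target: the original if it still resolves,
    otherwise the top hit of a content-based search on the signature."""
    # Step 1: traditional, address-based dereferencing.
    if fetch(url) is not None:
        return url
    # Step 2: signature-based dereferencing via some search service.
    hits = search(" ".join(signature))
    return hits[0] if hits else None
```

Taking the first hit is the optimistic case; a real client would probably show the small result set to the reader instead.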


Computing Small, Robust Signatures

• “Lexical” signatures: The top n words of a document chosen for rarity, subject to heuristic filters to aid robustness.
  – “a TF-IDF-like” measure
• Easy to compute and use.
• Question: How big a signature is needed to locate a document more or less uniquely on the Web?
  – Inktomi says there are approximately 1 billion web pages now.

Acks: Phelps/Wilensky
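
A TF-IDF-like selection might look like the sketch below. The scoring formula, the minimum word length used as a stand-in for the "heuristic filters", and the document-frequency table are all assumptions made for illustration; the slides do not give the actual measure.

```python
import math
import re
from collections import Counter

def lexical_signature(text, doc_freq, total_docs, n=5):
    """Pick the n words with the highest tf * log(N / df) score.

    doc_freq maps a word to the number of documents containing it;
    unknown words are treated as occurring in one document.
    """
    words = re.findall(r"[a-z]+", text.lower())
    # Crude robustness filter (assumption): ignore very short words.
    term_freq = Counter(word for word in words if len(word) >= 4)

    def score(word):
        return term_freq[word] * math.log(total_docs / doc_freq.get(word, 1))

    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(term_freq, key=lambda word: (-score(word), word))[:n]

# Hypothetical document and document-frequency table.
text = "The Endeavour expedition amplifies the charting of data data"
doc_freq = {"data": 900, "endeavour": 2, "expedition": 5, "amplifies": 3, "charting": 4}
signature = lexical_signature(text, doc_freq, total_docs=1000, n=3)
print(signature)
```

The repeated but common word "data" loses to rarer words that occur only once, which is the point of weighting by rarity.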


Answer: 5 words!

• I.e., a signature of 5 words will, in most cases, cause search engines to return the target document within the top few hits.

• Actually, a smaller signature will probably do just to locate exact matches, but length helps provide robustness and room for growth.

• Martin and Holte (1998) and Bharat and Broder (1998) demonstrate summary queries and strong queries, resp., which use rarity of words (and, possibly, phrases) to locate specific documents.

– Our variation focuses on robustness + rarity.

Acks: Phelps/Wilensky


Some Examples

• Signature for Randy Katz’s home page was
  – “Californa ISRG Culler rimmed gaunt”

• Here is what happens when we feed this signature to HotBot:

Gives same result after correcting typo.

Acks: Phelps/Wilensky


Another Example

• Signature for Endeavour home page is
  – “amplifies Endeavour leverages Charting Expedition”

• Here is what happens when we feed this signature to Google:

Acks: Phelps/Wilensky


Why Does This Work?

• If terms were distributed independently, the probability of 5 even moderately common terms occurring in more than one document is very small.
  – In fact, picking 3 terms restricted to those occurring in 100,000 documents works pretty well.
  – Many documents contain very infrequently used words.
  – There is lots of room for independence to be off, and to play with term selection for robustness, etc.

Acks: Phelps/Wilensky
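
The independence argument can be checked with back-of-the-envelope arithmetic. The sketch assumes the figure of about 1 billion web pages quoted earlier and treats every term as occurring in exactly 100,000 documents; real term distributions are not independent, so this is only an order-of-magnitude estimate.

```python
def expected_other_matches(total_docs, doc_freq, num_terms):
    """Expected number of *other* documents containing all the terms,
    assuming each term appears independently in doc_freq of total_docs pages."""
    probability_per_term = doc_freq / total_docs
    return (total_docs - 1) * probability_per_term ** num_terms

WEB_PAGES = 1_000_000_000   # approximate web size quoted in the slides
TERM_DOCS = 100_000         # each term occurs in 100,000 documents

three_terms = expected_other_matches(WEB_PAGES, TERM_DOCS, 3)
five_terms = expected_other_matches(WEB_PAGES, TERM_DOCS, 5)
print(three_terms, five_terms)
```

With three such terms the expected number of accidental matches is on the order of one in a thousand, and with five it is many orders of magnitude smaller, which leaves the slack for non-independence that the slide mentions.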


Persistent Linking

• CrossRef
  – Uses Digital Object Identifiers (DOIs)
    • A type of URN (Handle)
  – Cooperative agreement among publishers
• Publishers control the resolution mechanism
  – Can go to full-text, other services, or charging mechanism
• Example
  – http://www.crossref.org/demos/springer/springer.htm
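
From the citing side, a DOI-based link is just the identifier handed to a resolver. The sketch below assumes the public DOI proxy at https://doi.org/ and uses an illustrative DOI; where the link finally lands (full text, other services, or a charging mechanism) is decided by the publisher's registration, not by this code.

```python
from urllib.parse import quote

# Assumption: the public DOI proxy; the publisher controls the redirect target.
DOI_PROXY = "https://doi.org/"

def doi_to_url(doi: str) -> str:
    """Build a resolvable link from a bare DOI string."""
    doi = doi.strip()
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode anything unsafe in a URL path, but keep the '/' separator.
    return DOI_PROXY + quote(doi, safe="/")

link = doi_to_url("10.1000/182")  # illustrative DOI
print(link)
```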


Dynamic & Context Sensitive Linking

• Problem – Link behavior should not be the same for all
• Solution - Link contains metadata rather than an identifier
• OpenURL – standard for incorporating metadata into a URL
  – http://sfx.aaa.edu/menu?genre=article&issn=1234-5678&volume=12&issue=3&spage=1&epage=8&date=1998&aulast=Smith&aufirst=Paul
• SFX
  – System for locally resolving an OpenURL to extended and localized services
  – http://www.sfxit.com/
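
Because an OpenURL carries metadata rather than a fixed address, the same citation can be pointed at whichever resolver suits the reader. The sketch below splits the example OpenURL into its resolver and metadata with the standard library and re-targets it; `retarget` and the second resolver address are hypothetical names for illustration, not part of SFX.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def parse_openurl(url):
    """Split an OpenURL into (resolver base URL, metadata dict)."""
    parts = urlsplit(url)
    resolver = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # parse_qs returns lists; keep the first value of each key.
    metadata = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return resolver, metadata

def retarget(url, local_resolver):
    """Send the same citation metadata to a different (local) resolver."""
    _, metadata = parse_openurl(url)
    return f"{local_resolver}?{urlencode(metadata)}"

example = (
    "http://sfx.aaa.edu/menu?genre=article&issn=1234-5678&volume=12&issue=3"
    "&spage=1&epage=8&date=1998&aulast=Smith&aufirst=Paul"
)
resolver, metadata = parse_openurl(example)
local_link = retarget(example, "http://resolver.example.org/menu")
print(resolver, metadata["issn"], metadata["aulast"])
```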


Researchindex – automatic interlinking on the web

• http://researchindex.org
• Selective web crawling to gather CS resources
• Heuristics and AI techniques to establish services
  – Searching
  – Reference linking
• Why do we need metadata for textual documents?


Automatic Reviewing Techniques

• Traditional Collaborative Filtering
  – Estimate what score a reviewer might give to an item that he/she has not scored yet
  – Frequently used by recommender systems
• Collaborative quality filtering
  – http://www.cs.berkeley.edu/~tracyr/project/
  – Attempts to automatically determine which reviewers are "good" in an open reviewing system, in order to provide the same (or better) benefits as peer review
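
Traditional collaborative filtering can be sketched as a similarity-weighted average: the score a reviewer would give an unseen item is estimated from the scores of reviewers who have agreed with them in the past. The similarity measure and the data below are assumptions chosen for brevity; production recommender systems typically use more refined measures.

```python
def similarity(mine, theirs):
    """Agreement on co-scored items: 1 / (1 + mean absolute difference)."""
    shared = set(mine) & set(theirs)
    if not shared:
        return 0.0
    mean_abs_diff = sum(abs(mine[i] - theirs[i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + mean_abs_diff)

def predict(scores, reviewer, item):
    """Estimate reviewer's score for item, or None if nobody comparable scored it."""
    weighted_sum = total_weight = 0.0
    for other, theirs in scores.items():
        if other == reviewer or item not in theirs:
            continue
        weight = similarity(scores[reviewer], theirs)
        weighted_sum += weight * theirs[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

# Hypothetical scores: reviewer -> {item: score}.
scores = {
    "ann": {"p1": 5, "p2": 1},
    "bob": {"p1": 5, "p2": 1, "p3": 4},
    "cat": {"p1": 1, "p2": 5, "p3": 2},
}
print(predict(scores, "ann", "p3"))
```

Here the estimate for ann is pulled toward bob's score because the two agreed on every item they both scored.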


Collaborative Quality Filtering Algorithm

• Assume true value of an item is the asymptotic average of review scores

• Good reviewers are those who consistently predict this average

• Normalize according to # of reviews of an item, # of reviews by reviewer, review latency

• Adjust by “expertise”
  – Use similarity of term vectors of items reviewed
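
The first two steps of the algorithm can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the current mean score stands in for the asymptotic average, reviewer quality is taken as 1 / (1 + mean absolute error), and the normalization and expertise adjustments are left out. The review data is invented.

```python
from statistics import mean

def item_values(reviews):
    """Estimate each item's value as the mean of its review scores."""
    by_item = {}
    for _reviewer, item, score in reviews:
        by_item.setdefault(item, []).append(score)
    return {item: mean(item_scores) for item, item_scores in by_item.items()}

def reviewer_quality(reviews):
    """Score reviewers by how closely they track the item means (higher is better)."""
    values = item_values(reviews)
    errors = {}
    for reviewer, item, score in reviews:
        errors.setdefault(reviewer, []).append(abs(score - values[item]))
    return {reviewer: 1.0 / (1.0 + mean(errs)) for reviewer, errs in errors.items()}

# Hypothetical (reviewer, item, score) triples.
reviews = [
    ("r1", "a", 4), ("r2", "a", 4), ("r3", "a", 1),
    ("r1", "b", 2), ("r2", "b", 2), ("r3", "b", 5),
]
quality = reviewer_quality(reviews)
print(sorted(quality, key=quality.get, reverse=True))
```

Note that each reviewer's own score contributes to the item mean here; with few reviews per item that biases the result, which is one reason the slide calls for normalizing by the number of reviews.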