Repository Statistics
Peter Millington
Technical Development Officer
SHERPA, University of Nottingham
Overview
Introduction
Global statistics
The what & why of repository statistics
Benchmarks & data sources
Compilation methods
Web usage logging tools
Google Analytics demo
Problems and solutions
Group session – Key issues
Global Repository Statistics
Data Sources – Global lists of repositories
• OpenDOAR - http://www.opendoar.org/
• ROAR - http://roar.eprints.org/
• Repository66 - http://www.repository66.org/
May be useful for advocacy work
Examples of types of chart & presentation
ROAR – Individual Growth Charts
ROAR – Individual Source Data
Month Records Archives
200407 12
200408 34
200409 77
200410 106
200411 149
200412 164
200501 187
200502 212
200503 272
200504 324
200505 389
200506 426
200507 446
200508 492
200509 547
200510 607
200511 631
200512 750
200601 794
200602 860
200603 1019
200604 1090
200605 1128
200606 1307
200607 1347
200608 1405
200609 1469
200610 1530
200611 1610
200612 1705
200701 1768
200702 1853
200703 1934
200704 2042
200705 2169
200706 2239
200707 2264
200708 2352
200709 2374
200710 2400
200711 2438
200712 2484
200801 2540
200802 2573
200803 2611
200804 2643
200805 2681
200806 2689
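A growth chart can be turned into a deposition-rate series by differencing the monthly totals. The following is a minimal sketch, assuming the single figure given for each month is a cumulative total (the figures only ever increase); the function name `monthly_increments` is illustrative and not part of ROAR.

```python
# First six rows of the source data above, read as (month, cumulative total).
cumulative = [
    ("200407", 12),
    ("200408", 34),
    ("200409", 77),
    ("200410", 106),
    ("200411", 149),
    ("200412", 164),
]


def monthly_increments(rows):
    """Return (month, increase over the previous month) for every month after the first."""
    return [
        (month, total - previous_total)
        for (_, previous_total), (month, total) in zip(rows, rows[1:])
    ]


increments = monthly_increments(cumulative)
for month, added in increments:
    print(month, added)
```

Plotting the increments rather than the totals makes the effect of an advocacy event or a mandate easier to spot than a steadily rising cumulative curve.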
Delegates’ What and Why of Statistics
Rate of growth
• For advocacy
• Measure of success – for our paymasters
Rate of usage
• Targeting weak areas – departments
• Measure of success
• Justifying funding
Most downloaded author/paper
• Promotes interest and engagement from authors
Delegates’ What and Why of Statistics
Where are visitors coming from – referrers
• Curiosity – is it being seen by the right people?
Citation statistics
• To demonstrate the beneficial impact of repositories
Drilling down for more detail
• For a sense of reality
Steep slopes, animation, etc.
• Glitzy marketing
Individual Repositories - Content
Growth & Deposition rates
• Measure of progress
• Impact of advocacy events
• Impact of mandatory deposition
Types of document or item
• Trend-watching?
Breakdown by department and/or author
• How much is everyone contributing?
Proportion of full text v metadata only
• Measure of usefulness
Item types: Universidade do Minho
Individual Repositories - Performance
Proportion of publications deposited
• How comprehensive is the archive?
Proportion of authors who are depositing
• Are they complying with local mandates?
Compliance with funders’ mandates
• Are you meeting your obligations?
Repository administration
• Are your turn-round times acceptable?
Compliance with the CERN Mandate
Compliance Benchmarks
Counting publications
• Institution-wide bibliographies
• e.g. Maintained by research managers
• Publication lists on departmental web pages
• Public/Commercial databases – ISI, Medline, etc.
Counting authors
• Who qualifies as an author?
• Academic staff, Research students, Managers
• University Calendars & Departmental staff lists
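Once a benchmark count is available, the compliance measures reduce to a simple proportion. The sketch below is illustrative only: the function name `deposit_rate` and all of the figures are hypothetical and do not come from any real repository.

```python
def deposit_rate(deposited, benchmark_total):
    """Percentage of the benchmark (publications or authors) represented in the repository."""
    if benchmark_total <= 0:
        raise ValueError("benchmark_total must be positive")
    return 100.0 * deposited / benchmark_total


# Hypothetical figures for illustration only.
# 150 deposited items against 600 publications listed in an institutional bibliography.
publications_rate = deposit_rate(deposited=150, benchmark_total=600)
# 40 depositing authors against 320 people on departmental staff lists.
authors_rate = deposit_rate(deposited=40, benchmark_total=320)

print(f"Publications deposited: {publications_rate:.1f}%")
print(f"Authors depositing: {authors_rate:.1f}%")
```

The result is only as good as the benchmark: a different answer to "who qualifies as an author?" changes the denominator and therefore the rate.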
Individual Repositories - Usage
Rates of usage
• Measure of usefulness
• Impact of news-related items
Most downloaded items
• Identifying research(ers) with most impact?
• Engendering competition between authors?
Downloads according to author
• Performance reviews?
Geographical distribution of users
• Are you reaching your intended audience?
Sources of Data
Repository’s own database
OAI-PMH
Server’s access log
Remote logging
Compilation Methods
Repository’s own database
• Copying from the human interface
• Interactive SQL commands
Copying from the Human Interface
Interactive SQL Commands
mysql> SELECT type,COUNT(*) FROM eprint GROUP BY type;
+-----------------+----------+
| type            | COUNT(*) |
+-----------------+----------+
| article         |      456 |
| book            |        5 |
| book_section    |       39 |
| conference_item |      173 |
| exhibition      |        1 |
| monograph       |       18 |
| other           |        3 |
| thesis          |        4 |
+-----------------+----------+
8 rows in set (0.00 sec)
[Pie chart of item types: article 64%, book 1%, book_section 6%, conference_item 25%, exhibition 0%, monograph 3%, other 0%, thesis 1%]
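The same GROUP BY query can be run from a script instead of being typed interactively, so that the counts and percentages are regenerated on demand. The following is a minimal sketch that uses an in-memory SQLite database as a stand-in for the repository's MySQL database. The table name `eprint` and column `type` follow the query above; the `eprintid` column and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the repository database (assumption: the real one is MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eprint (eprintid INTEGER PRIMARY KEY, type TEXT)")

# Hypothetical sample rows, not real repository data.
sample_types = [
    "article", "article", "article",
    "book",
    "conference_item", "conference_item",
    "thesis",
]
conn.executemany("INSERT INTO eprint (type) VALUES (?)", [(t,) for t in sample_types])

# Same query as on the slide, with an ORDER BY so the output order is stable.
rows = conn.execute(
    "SELECT type, COUNT(*) FROM eprint GROUP BY type ORDER BY type"
).fetchall()
conn.close()

# Print each type with its count and share of the total.
total = sum(count for _, count in rows)
for item_type, count in rows:
    print(f"{item_type:<16} {count:>4} {100 * count / total:5.1f}%")
```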
Compilation Methods
Repository’s own database
• Copying from the human interface
• Interactive SQL commands
OAI-PMH
• Harvesting programs – e.g. ROAR’s Celestial
OAI-PMH ListIdentifiers
OAI-PMH ListRecords
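ListIdentifiers returns only record headers (identifier and datestamp), whereas ListRecords returns the full metadata for each record. The following sketch builds a ListIdentifiers request URL and extracts the headers from a response. The base URL and the sample response are hypothetical, no network request is made, and resumption tokens (used to page through large result sets) are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListIdentifiers request URL for an OAI-PMH base URL."""
    query = urlencode({"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_headers(xml_text):
    """Return (identifier, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    return [
        (
            header.findtext("oai:identifier", namespaces=OAI_NS),
            header.findtext("oai:datestamp", namespaces=OAI_NS),
        )
        for header in root.iterfind(".//oai:header", OAI_NS)
    ]


# Hypothetical, heavily trimmed response for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:repository.example.org:1</identifier>
      <datestamp>2007-06-18</datestamp>
    </header>
    <header>
      <identifier>oai:repository.example.org:2</identifier>
      <datestamp>2007-06-25</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

url = list_identifiers_url("http://repository.example.org/oai")
headers = parse_headers(SAMPLE_RESPONSE)
print(url)
print(headers)
```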
ROAR - Celestial
date identifier url
20070618 oai:bora.uib.no:1956/2270 Department of Earth Science
20070625 oai:bora.uib.no:1956/2272 Department of History
20070625 oai:bora.uib.no:1956/2273 Department of the History of Religions
20070626 oai:bora.uib.no:1956/2274 Section for Endocrinology
20070626 oai:bora.uib.no:1956/2275 Department of the History of Religions
20070626 oai:bora.uib.no:1956/2276 Department of the History of Religions
20070626 oai:bora.uib.no:1956/2277 Department of the History of Religions
20070626 oai:bora.uib.no:1956/2278 Department of the History of Religions
20070626 oai:bora.uib.no:1956/2279 Department of Oral Sciences
20070626 oai:bora.uib.no:1956/2281 Department of the History of Religions
20070626 oai:bora.uib.no:1956/2282 Department of Sociology
20070626 oai:bora.uib.no:1956/2283 Else Æyen
20070628 oai:bora.uib.no:1956/2284 Section for Art History
20070629 oai:bora.uib.no:1956/2285 Section for Russian
20070629 oai:bora.uib.no:1956/2286 Department of Geography
20070629 oai:bora.uib.no:1956/2287 Department of Greek, Latin and Egyptology
20070702 oai:bora.uib.no:1956/2288 Section for Spanish
20070702 oai:bora.uib.no:1956/2289 Department of Mathematics
20070702 oai:bora.uib.no:1956/2290 Department of Geography
20070702 oai:bora.uib.no:1956/2291 Department of Geography
20070702 oai:bora.uib.no:1956/2292 Department of Biology
20070703 oai:bora.uib.no:1956/2293 Department of Biology
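A harvested listing like this can be tallied by month and by department. The sketch below uses a handful of rows copied from the listing above. The third column is treated as a free-text label, because the listing shows department names there even though the header reads "url".

```python
from collections import Counter

# A few rows copied from the Celestial listing above: (date, identifier, label).
rows = [
    ("20070618", "oai:bora.uib.no:1956/2270", "Department of Earth Science"),
    ("20070625", "oai:bora.uib.no:1956/2272", "Department of History"),
    ("20070625", "oai:bora.uib.no:1956/2273", "Department of the History of Religions"),
    ("20070626", "oai:bora.uib.no:1956/2275", "Department of the History of Religions"),
    ("20070629", "oai:bora.uib.no:1956/2286", "Department of Geography"),
    ("20070702", "oai:bora.uib.no:1956/2290", "Department of Geography"),
    ("20070702", "oai:bora.uib.no:1956/2291", "Department of Geography"),
]

# The first six characters of the date (YYYYMM) identify the month.
per_month = Counter(date[:6] for date, _, _ in rows)
per_label = Counter(label for _, _, label in rows)

for month, count in sorted(per_month.items()):
    print(month, count)
for label, count in per_label.most_common():
    print(label, count)
```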
Compilation Methods
Repository’s own database
• Copying from the human interface
• Interactive SQL commands
OAI-PMH
• Harvesting programs – e.g. ROAR’s Celestial
Server’s access log
• Web usage statistics tools
Raw Web Access Logs
209.237.238.179 - - [10/Apr/2005:05:34:06 +0100] "GET /portfolio.css HTTP/1.0" 200 816 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:16:27 +0100] "GET /DAWN_Index.htm HTTP/1.0" 200 8392 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:17:44 +0100] "GET /Eric.htm HTTP/1.0" 200 6975 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:21:12 +0100] "GET /Library_Form.htm HTTP/1.0" 200 7709 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:22:48 +0100] "GET /cleansing.htm HTTP/1.0" 200 11016 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:25:02 +0100] "GET /index.htm HTTP/1.0" 200 7613 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:28:19 +0100] "GET /integration.htm HTTP/1.0" 200 8027 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:31:35 +0100] "GET /merging.htm HTTP/1.0" 200 9132 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:07:34:39 +0100] "GET /publication.htm HTTP/1.0" 200 5327 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:08:22:38 +0100] "GET /ABACUS_Index.htm HTTP/1.0" 200 5421 "-" "ia_archiver"
209.237.238.179 - - [10/Apr/2005:08:27:34 +0100] "GET /limitations.htm HTTP/1.0" 200 3781 "-" "ia_archiver"
210.173.179.17 - - [20/Dec/2004:13:22:03 +0000] "GET /robots.txt HTTP/1.1" 404 - "-" "gazz/5.0 ([email protected])"
210.173.179.17 - - [20/Dec/2004:13:23:51 +0000] "GET / HTTP/1.1" 200 7613 "-" "gazz/5.0 ([email protected])"
210.173.179.17 - - [20/Dec/2004:13:25:34 +0000] "GET /Logo.gif HTTP/1.1" 200 3838 "-" "gazz/5.0 ([email protected])"
210.173.179.17 - - [20/Dec/2004:13:27:17 +0000] "GET /contact.htm HTTP/1.1" 200 4626 "-" "gazz/5.0 ([email protected])"
210.173.179.17 - - [20/Dec/2004:13:29:00 +0000] "GET /profile.htm HTTP/1.1" 200 10533 "-" "gazz/5.0 ([email protected])"
210.173.179.17 - - [20/Dec/2004:13:37:35 +0000] "GET /index.htm HTTP/1.1" 200 7613 "-" "gazz/5.0 ([email protected])"
210.173.179.17 - - [20/Dec/2004:13:47:55 +0000] "GET /publication.htm HTTP/1.1" 200 5327 "-" "gazz/5.0 ([email protected])"
210.173.179.17 - - [20/Dec/2004:13:49:39 +0000] "GET /InsideInfo.jpg HTTP/1.1" 200 19372 "-" "gazz/5.0
Recorded fields include:
• IP Address of the computer requesting a file
• Date & time transaction completed
• Name of file requested
• Success code – usually 200 for “successfully completed”
• File size in bytes
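The fields listed above can be pulled out of each log line with a regular expression. This is a minimal sketch, assuming the log follows the layout shown in the sample lines (IP address, two unused fields, timestamp, request, status, size, referrer, user agent); the pattern and the function name `parse_line` are illustrative, and lines that do not match are returned as `None`.

```python
import re
from datetime import datetime

# Pattern for the log layout shown in the sample lines above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Split one access-log line into a dict of fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A size of "-" means that no body was sent; treat it as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# First line of the sample log above.
line = (
    '209.237.238.179 - - [10/Apr/2005:05:34:06 +0100] '
    '"GET /portfolio.css HTTP/1.0" 200 816 "-" "ia_archiver"'
)
entry = parse_line(line)
print(entry["ip"], entry["time"].isoformat(), entry["path"], entry["status"], entry["size"])
```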
Web Usage Statistics Tools
Analog
• http://www.analog.cx/
Webalizer
• http://www.mrunix.net/webalizer/
AWStats
• http://awstats.sourceforge.net/
etc.
Sample output from the Analog Statistics Package
Sample output from the Webalizer Statistics Package
Sample output from the AWStats Statistics Package
Compilation Methods
Repository’s own database
• Copying from the human interface
• Interactive SQL commands
OAI-PMH
• Harvesting programs – e.g. ROAR’s Celestial
Server’s access log
• Web usage statistics tools
Remote logging
• Google Analytics
Google Analytics
http://www.google.com/analytics
Sign up to a Google Account
Specify the URL to be logged
Obtain snippet of JavaScript code
Insert snippet into HTML of pages to be logged
• Ideally into a template file
• Make sure the modified pages are live!
Logging starts automatically
Log in to your account to view the analytics
Google Analytics
JavaScript snippet
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-3477654-3");
pageTracker._initData();
pageTracker._trackPageview();
</script>
Find URL Containing/Excluding
• String
• e.g. “pdf”
• Regular expressions
• e.g. /[0-9]*/ for EPrints IDs
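The containing/excluding filters can be reproduced offline on a list of request paths. In the sketch below the paths are hypothetical, and the assumption that an EPrints abstract page has a path of the form `/<number>/` is made for illustration only. The pattern uses `[0-9]+` and anchors, because `[0-9]*` also matches an empty string and so would match every URL in a partial-match search.

```python
import re

# Hypothetical request paths for illustration only.
paths = [
    "/123/",
    "/123/1/paper.pdf",
    "/view/year/2008.html",
    "/style/main.css",
    "/45/",
]


def containing(items, text):
    """Keep the paths that contain the given string."""
    return [p for p in items if text in p]


def excluding(items, text):
    """Keep the paths that do not contain the given string."""
    return [p for p in items if text not in p]


# Assumed layout: an abstract page is a path consisting only of a numeric EPrints ID.
EPRINT_PAGE = re.compile(r"^/[0-9]+/$")

pdf_requests = containing(paths, "pdf")
non_pdf_requests = excluding(paths, "pdf")
eprint_pages = [p for p in paths if EPRINT_PAGE.match(p)]

print(pdf_requests)
print(eprint_pages)
```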
Problems
Web bots and crawlers
• Inflating usage volume
• Skewing usage time series
Auxiliary files & non-eprint pages
• CSS style sheet files
• Image files – jpeg, gif, etc.
• Index pages
Linking URLs to bibliographic references
• What does that eprint number mean?
Problems and Solutions
Web bots and crawlers
• Use robots.txt & meta robots tags to prevent crawling
• Filtering out known bots
• Still leaves maverick hackers’ & students’ bots
Auxiliary files & non-eprint pages
• Configuring & tuning the analysis tool
• Filter using ‘regular expressions’
Linking URLs to bibliographic references
• Programmatic concordance
• e.g. IRStats
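The first two solutions can be sketched as a pair of filters applied before any counting. The bot markers below include the two user agents seen in the sample access log (`ia_archiver` and `gazz`) plus some generic substrings; the marker list, the extension list and the browser user-agent string are assumptions for illustration, not a complete or authoritative list. Unknown bots still get through, as the slide notes.

```python
import posixpath

# Assumed markers: two agents from the sample log plus generic substrings.
KNOWN_BOT_MARKERS = ("ia_archiver", "gazz", "bot", "crawler", "spider")
# Assumed list of auxiliary file extensions that should not count as usage.
AUXILIARY_EXTENSIONS = {".css", ".gif", ".jpg", ".jpeg", ".png", ".js", ".ico"}


def is_bot(agent):
    """True if the user-agent string contains any known bot marker."""
    agent = agent.lower()
    return any(marker in agent for marker in KNOWN_BOT_MARKERS)


def is_auxiliary(path):
    """True for style sheets, images and similar supporting files."""
    extension = posixpath.splitext(path)[1].lower()
    return extension in AUXILIARY_EXTENSIONS or path == "/robots.txt"


# (path, user agent) pairs; the browser agent string is hypothetical.
requests = [
    ("/portfolio.css", "ia_archiver"),
    ("/publication.htm", "gazz/5.0"),
    ("/publication.htm", "Mozilla/5.0 (hypothetical browser)"),
    ("/Logo.gif", "Mozilla/5.0 (hypothetical browser)"),
    ("/123/1/paper.pdf", "Mozilla/5.0 (hypothetical browser)"),
]

# Keep only requests that are neither from a known bot nor for an auxiliary file.
counted = [
    path for path, agent in requests
    if not is_bot(agent) and not is_auxiliary(path)
]
print(counted)
```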
Over to Chris for DSpace statistics…
What are your priorities for statistics?
Peter Millington