Robot hunter, or, precisely what I thought I wouldn't be doing when I became a librarian


Upload: conul-conference

Post on 14-Apr-2017


Page 1: Robot Hunter, or, precisely what I thought I wouldn't be doing when I became a librarian - Joseph Greene

Leabharlann UCD
An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire

UCD Library
University College Dublin, Belfield, Dublin 4, Ireland

Robot hunter
Or, precisely what I thought I wouldn’t be doing when I became a librarian

Joseph Greene
Research Repository Librarian
[email protected]
http://researchrepository.ucd.ie

Page 2

Counting downloads

• Open Access repositories make science and scholarship accessible, and we need to demonstrate our value
• Simple question: how often are these papers used? How many times have they been downloaded?

Page 3

Enter the Robot

• At least 18% of web requests are from robots
• Less than half can be accounted for by the five main search engines
• At Research Repository UCD, two-thirds of our repository’s downloads are marked as coming from web robots

Page 4

What are you talking about?

Internet robot, Web robot, automated agent, crawler, spider, bot: any programme that visits websites and systematically retrieves information from them

Page 5

Good and bad

• Good: search engines, link verifiers, computer science experiments
• Bad: gathering content for spam, phishing and copycat sites, artificially improving a website’s ranking (spamdexing), looking for security holes, DDoS attacks…

Page 6

‘And the noisy, nasty nuisance grew, ‘til the villagers cried, “What can we do?”’

Detection methods:
• Blocking robots in real time: Turing tests
• Detecting later and removing from statistics

Page 7

Appropriate, but problematic methods for repositories

• Excluding known robots by user-agent name
  – Easily faked or omitted
• Excluding by IP address
  – DHCP reassigns addresses, and the list is growing exponentially
• Usage pattern analysis: query rate and resources requested
  – Expensive to automate
• Machine learning: training decision trees, neural nets and/or statistical systems
  – Did you say expensive???
• Combined approaches
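The first two methods on this slide, exclusion by user-agent name and by IP address, can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: the `Request` record, the agent-name fragments and the IP addresses (taken from the reserved documentation ranges) are hypothetical and are not the lists used by any particular repository platform.

```python
from dataclasses import dataclass

# Hypothetical sample lists; a real deployment would load maintained lists.
KNOWN_ROBOT_AGENT_FRAGMENTS = ("bot", "crawler", "spider")
KNOWN_ROBOT_IPS = {"192.0.2.10", "198.51.100.7"}  # documentation-range addresses


@dataclass
class Request:
    ip: str
    user_agent: str  # empty string when the client sent no User-Agent header


def is_known_robot(request: Request) -> bool:
    """Flag a request whose agent name or IP address is on a known-robot list."""
    agent = request.user_agent.lower()
    if any(fragment in agent for fragment in KNOWN_ROBOT_AGENT_FRAGMENTS):
        return True
    return request.ip in KNOWN_ROBOT_IPS


def human_requests(requests):
    """Return the requests that neither list identifies as robots."""
    return [r for r in requests if not is_known_robot(r)]


log = [
    Request("203.0.113.5", "Mozilla/5.0 (Windows NT 10.0)"),
    Request("203.0.113.6", "ExampleBot/1.0"),            # caught by agent name
    Request("192.0.2.10", "Mozilla/5.0 (X11; Linux)"),   # caught by IP list
    Request("203.0.113.9", ""),                          # omitted agent name slips through
]
kept = human_requests(log)
```

The last request shows the weakness noted on the slide: a robot that omits or fakes its agent name, and whose address is not yet on the list, is counted as a human download.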

Page 8

Effectiveness, and out-of-the-box repository strategies

Strength (rows run from highest to lowest recall)

Robots detected by            Recall (%)  Precision (%)
No images requested                98.34          75.48
No referring site                  96.27          52.25
List of IP addresses               69.29          99.40
HEAD method to access site         32.37         100.00
Agent name declared                26.56         100.00
Access only at night               24.48          50.43
Robots.txt file accessed           17.01         100.00
Time, σ (3s)                        2.49         100.00
Time, average (1s)                  2.49          75.00

DSpace uses IP addresses of known agents: much weaker than in the benchmarking study
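For readers unfamiliar with the two columns of the benchmarking table: recall is the share of actual robots that a method catches, and precision is the share of flagged sessions that really are robots. The sketch below computes both under the standard definitions; the session identifiers and the function name are invented for illustration and do not come from the benchmarking study.

```python
def recall_and_precision(flagged, actual_robots):
    """Return (recall %, precision %) for a set of flagged session IDs."""
    flagged = set(flagged)
    actual = set(actual_robots)
    true_positives = len(flagged & actual)
    # Recall: what fraction of the real robots did the method flag?
    recall = true_positives / len(actual) if actual else 0.0
    # Precision: what fraction of the flagged sessions were real robots?
    precision = true_positives / len(flagged) if flagged else 0.0
    return recall * 100, precision * 100


# Hypothetical data: six real robot sessions, four sessions flagged by a method.
actual_robots = {"s1", "s2", "s3", "s4", "s5", "s6"}
flagged = {"s1", "s2", "s3", "s9"}  # s9 is a human wrongly flagged

recall, precision = recall_and_precision(flagged, actual_robots)
```

Read this way, a method with high recall and low precision (such as "No referring site" in the table) removes most robots but also discards many genuine downloads, while a method with 100% precision and low recall never discards a genuine download but misses most robots.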

Page 9

Effectiveness, and out-of-the-box repository strategies

(Benchmarking table repeated from page 8.)

Eprints filters based on the number of hits from an IP address per day, similar to the time-based strategies in the benchmarking study

Page 10

Effectiveness, and out-of-the-box repository strategies

(Benchmarking table repeated from page 8.)
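Four of the heuristics named in the benchmarking table (no images requested, no referring site, HEAD method used, robots.txt accessed) can be expressed as simple per-session checks. The sketch below is only an illustration: the `Hit` and `Session` records, the image suffix list and the example paths are assumptions, not the data model of the benchmarking study or of any repository platform.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    method: str         # HTTP method, e.g. "GET" or "HEAD"
    path: str           # requested path
    referrer: str = ""  # empty when no referring site was sent


@dataclass
class Session:
    hits: list = field(default_factory=list)


IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg")  # assumed list


def heuristic_flags(session: Session) -> dict:
    """Evaluate four table heuristics; True means 'looks like a robot'."""
    paths = [h.path.lower() for h in session.hits]
    return {
        "no_images_requested": not any(p.endswith(IMAGE_SUFFIXES) for p in paths),
        "no_referring_site": all(h.referrer == "" for h in session.hits),
        "head_method_used": any(h.method.upper() == "HEAD" for h in session.hits),
        "robots_txt_accessed": any(p == "/robots.txt" for p in paths),
    }


# Hypothetical sessions with made-up paths.
crawler = Session([
    Hit("GET", "/robots.txt"),
    Hit("HEAD", "/handle/123/1"),
    Hit("GET", "/bitstream/123/1/paper.pdf"),
])
browser = Session([
    Hit("GET", "/handle/123/1", "https://example.org/search"),
    Hit("GET", "/static/logo.png", "https://example.org/handle/123/1"),
    Hit("GET", "/bitstream/123/1/paper.pdf", "https://example.org/handle/123/1"),
])

crawler_flags = heuristic_flags(crawler)
browser_flags = heuristic_flags(browser)
```

Given the precision figures in the table, the first two flags are best treated as hints rather than proof; the last two are reported there as precise but catch only a minority of robots.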

Page 11

Centralised strategy: IRUS-UK

• Collects and filters statistics from 84 DSpace and Eprints repositories
• COUNTER-compliant usage statistics
• Robot exclusion:
  – The COUNTER list of agent names
  – All downloads from IP addresses where there are more than 200 downloads in a day from a repository
  – Most downloads from IP addresses where there are more than 100 downloads in a day from a repository
• Work commissioned to investigate the feasibility of, and an approach to, adaptive filtering based on usage behaviour
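The 200-downloads-per-day rule can be sketched as a grouping and counting step. This is an illustration under stated assumptions, not IRUS-UK's actual implementation: the `Download` record and the function name are hypothetical, and the second rule ("most downloads" above 100 per day) is left out because the slide does not say which downloads are kept.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

DAILY_LIMIT = 200  # threshold quoted on the slide


@dataclass(frozen=True)
class Download:
    repository: str
    ip: str
    day: date
    item: str


def exclude_heavy_ips(downloads, daily_limit=DAILY_LIMIT):
    """Drop every download from an IP with more than daily_limit
    downloads from one repository on one day."""
    counts = Counter((d.repository, d.ip, d.day) for d in downloads)
    return [
        d for d in downloads
        if counts[(d.repository, d.ip, d.day)] <= daily_limit
    ]


# Hypothetical data: one IP with 201 downloads in a day, another with 3.
day = date(2017, 1, 1)
heavy = [Download("repo-a", "192.0.2.1", day, f"item-{i}") for i in range(201)]
light = [Download("repo-a", "192.0.2.2", day, f"item-{i}") for i in range(3)]

kept = exclude_heavy_ips(heavy + light)
```

Counting per repository, IP address and day follows the wording of the slide; an IP with exactly 200 downloads in a day is kept, because the rule applies only above 200.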

Page 12

Sources by slide

1 Bill Gosper's Glider Gun in action, a variation of Conway's Game of Life. Johan G. Bontes. <https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life#/media/File:Gospers_glider_gun.gif>

3, 6, 7 Doran, D.; Gokhale, S.S. Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery (2011) 22:183-210. DOI:10.1007/s10618-010-0180-z

4 http://pixabay.com/static/uploads/photo/2015/05/31/12/09/wooden-791421_640.jpg

5 Bad Robot Productions logo. 2001-2008. <https://en.wikipedia.org/wiki/Bad_Robot_Productions#/media/File:Bad_Robot_Productions_logo.jpg>

6 Burroway, J., Lord, J. V. The Giant Jam Sandwich. 1972, Houghton Mifflin Harcourt.

8, 9, 10 Nick Geens, Johan Huysmans, Jan Vanthienen. Evaluation of Web Robot Discovery Techniques: A Benchmarking Study. Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining. Lecture Notes in Computer Science 4065, pp 121-130, 2006. DOI:10.1007/11790853_10

8 Diggory, Mark. SOLR Statistics. DSpace Wiki. <https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics>

9 Joint, Nicholas. [EP-tech] Re: Please change the way IRstats works. Eprints_tech mailing list 2011-10-13 <http://www.eprints.org/tech.php/15695.html>

11 IRUS-UK. <http://www.irus.mimas.ac.uk/participants/>

Page 13

Thank you!