Crawl Operators’ Workshop Roger G. Coram

Post on 06-Feb-2016

Page 1: Crawl Operators’ Workshop

Crawl Operators’ Workshop

Roger G. Coram

Page 2: Crawl Operators’ Workshop

www.bl.uk 2

Topics

• ExternalGeoLocationDecideRule

• Sheets

– IpAddressSetDecideRule

Page 3: Crawl Operators’ Workshop

ExternalGeoLocationDecideRule

• Legal Deposit legislation passed in April 2013.

• The Legal Deposit Libraries (Non-Print Works) Regulations 2013:

– 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:

• “(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”

Page 4: Crawl Operators’ Workshop

Geolocation

• ExternalGeoLocationDecideRule requires:

– A list of ISO 3166-1 country-codes to be included in the crawl

• GB, FR, DE, etc.

– An implementation of ExternalGeoLookupInterface.
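
As a rough illustration of how these two inputs combine, the Python sketch below accepts a URI when the country code looked up for its IP address is in the configured list. It is not the Heritrix code: `geo_decision`, the `lookup` callable and the sample addresses are hypothetical, and returning NONE (no opinion) for a non-match is an assumption about how such a rule behaves.

```python
from typing import Callable, Iterable, Optional

ACCEPT = "ACCEPT"
NONE = "NONE"  # assumed "no opinion" result when the rule does not match


def geo_decision(
    ip: str,
    lookup: Callable[[str], Optional[str]],
    country_codes: Iterable[str],
) -> str:
    """Return ACCEPT if the IP's country is in the configured list, else NONE."""
    wanted = {code.upper() for code in country_codes}
    country = lookup(ip)  # stands in for an ExternalGeoLookupInterface implementation
    if country is not None and country.upper() in wanted:
        return ACCEPT
    return NONE


# Hypothetical lookup table using documentation-only IP ranges.
SAMPLE_COUNTRIES = {"192.0.2.1": "GB", "198.51.100.7": "FR"}

if __name__ == "__main__":
    for address in ("192.0.2.1", "198.51.100.7", "203.0.113.9"):
        print(address, geo_decision(address, SAMPLE_COUNTRIES.get, ["GB"]))
```

Passing the lookup in as a callable mirrors the split on this slide: the rule only holds the list of country codes, and the database access lives in a separate component.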

Page 5: Crawl Operators’ Workshop

ExternalGeoLookupInterface

• Our implementation is based on MaxMind’s GeoLite2 database.

• Freely available under ‘Creative Commons Attribution-ShareAlike 3.0 Unported License’.

• Only ~30MB; can be held in memory.

Page 6: Crawl Operators’ Workshop

crawler-beans.cxml

Configuration example:

<!-- GEO-LOOKUP: specifying location of external database. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <property name="database" value="/dev/shm/geoip-city.mmdb"/>
</bean>

<!-- ... ACCEPT those in the UK... -->
<bean id="externalGeoLookupRule" class="org.archive.crawler.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup"/>
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>

Page 7: Crawl Operators’ Workshop

Results

• Short test crawl (1,000,000 seeds) produced:

– 89,500,755 URLs in total.

– 26,072 non-UK URLs which would not otherwise have been in scope.

• 137 distinct hosts.
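
To put these figures in proportion, the short calculation below works out what share of the crawl the geolocation rule added and how many of those URLs there were per host on average. It uses only the numbers on this slide; the variable names are illustrative.

```python
total_urls = 89_500_755   # URLs in the test crawl
geo_only_urls = 26_072    # non-UK URLs brought into scope by geolocation
geo_only_hosts = 137      # distinct hosts those URLs came from

share_percent = 100 * geo_only_urls / total_urls
urls_per_host = geo_only_urls / geo_only_hosts

print(f"{share_percent:.3f}% of all URLs")
print(f"{urls_per_host:.1f} URLs per host on average")
```

That is roughly 0.03% of the crawl, or about 190 URLs per host.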

Page 8: Crawl Operators’ Workshop

IP-based Sheets

“Hi,

“I'm a senior system administrator for Webfusion / 123-reg.

“We're currently experiencing lots of requests from crawler1.bl.uk to sites hosted on 81.21.76.62, this is part of our Parking platform, which links into Yahoo to allow customers to park domains and earn money.”

• Large number of hosts on a single machine.

• Need a way to reduce the load on a specific IP address.
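
One way to see which addresses are affected is to group hosts by the IP address they resolve to and flag any address serving many hosts. The sketch below assumes a host-to-IP mapping has already been gathered from some source; `busy_ips`, the threshold and the sample hostnames are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Mapping


def busy_ips(host_to_ip: Mapping[str, str], min_hosts: int = 3) -> Dict[str, List[str]]:
    """Return the IPs that serve at least `min_hosts` hosts, with those hosts sorted."""
    hosts_by_ip: Dict[str, List[str]] = defaultdict(list)
    for host, ip in host_to_ip.items():
        hosts_by_ip[ip].append(host)
    return {
        ip: sorted(hosts)
        for ip, hosts in hosts_by_ip.items()
        if len(hosts) >= min_hosts
    }


# Hypothetical sample: three parked domains share the address from the email.
SAMPLE_HOSTS = {
    "parked-a.example": "81.21.76.62",
    "parked-b.example": "81.21.76.62",
    "parked-c.example": "81.21.76.62",
    "standalone.example": "192.0.2.10",
}

if __name__ == "__main__":
    for ip, hosts in busy_ips(SAMPLE_HOSTS).items():
        print(ip, len(hosts), "hosts")
```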

Page 9: Crawl Operators’ Workshop

Sheets

• “Sheets provide the ability to replace default settings on a per domain basis.”

– Allow you to change any value on any named bean for a specific set of URLs.

• Actually quite flexible:

– SurtPrefixesSheetAssociation:

• Applied by matching SURT prefixes.

– DecideRuledSheetAssociation:

• Applied by a series of DecideRules.

– IpAddressSetDecideRule
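
The overlay idea can be sketched as follows: start from the default settings and, for each association whose SURT prefix matches the URL, lay the named sheets on top. This is an illustration only. `to_surt` is a simplified conversion, the default values are made up, and letting later associations win is an assumption rather than a statement about Heritrix's ordering.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlsplit

Settings = Dict[str, object]


def to_surt(url: str) -> str:
    """Simplified SURT form: http://www.example.co.uk/a -> http://(uk,co,example,www,)/a"""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    return f"{parts.scheme}://({','.join(reversed(labels))},){parts.path or '/'}"


def effective_settings(
    url: str,
    defaults: Settings,
    sheets: Dict[str, Settings],
    associations: List[Tuple[str, List[str]]],
) -> Settings:
    """Overlay every sheet whose SURT prefix matches the URL onto the defaults."""
    surt = to_surt(url)
    settings = dict(defaults)
    for prefix, sheet_names in associations:
        if surt.startswith(prefix):
            for name in sheet_names:
                settings.update(sheets[name])  # assumed: later sheets win
    return settings


# Illustrative values only; not necessarily the crawler's real defaults.
DEFAULTS: Settings = {"disposition.delayFactor": 5.0, "disposition.minDelayMs": 3000}
SHEETS: Dict[str, Settings] = {
    "extraPolite": {"disposition.delayFactor": 8.0, "disposition.minDelayMs": 10000},
}
ASSOCIATIONS = [("http://(uk,co,example,", ["extraPolite"])]

if __name__ == "__main__":
    print(effective_settings("http://www.example.co.uk/page", DEFAULTS, SHEETS, ASSOCIATIONS))
    print(effective_settings("http://www.example.org/page", DEFAULTS, SHEETS, ASSOCIATIONS))
```

Because the prefix ends at a comma, it matches example.co.uk and all of its subdomains without also matching unrelated hosts that merely begin with the same characters.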

Page 10: Crawl Operators’ Workshop

1. crawler-beans.cxml

Configuration example:

<bean id="extraPolite" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="disposition.delayFactor" value="8.0"/>
      <entry key="disposition.minDelayMs" value="10000"/>
      <entry key="disposition.maxDelayMs" value="60000"/>
      <entry key="disposition.respectCrawlDelayUpToSeconds" value="60"/>
    </map>
  </property>
</bean>

<bean id="crawlLimited" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="quotaEnforcer.serverMaxFetchResponses" value="25"/>
    </map>
  </property>
</bean>
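
To see what the extraPolite values mean in practice, the sketch below computes a wait time under one common reading of these settings: wait delayFactor times the duration of the last fetch, clamped between the minimum and maximum delay, and extend the wait to a robots.txt Crawl-delay up to the configured ceiling. The function `politeness_delay_ms` is hypothetical and this reading is an assumption; it does not reproduce the crawler's own code, and the crawlLimited quota is not modelled.

```python
from typing import Optional


def politeness_delay_ms(
    fetch_ms: int,
    delay_factor: float = 8.0,
    min_delay_ms: int = 10_000,
    max_delay_ms: int = 60_000,
    robots_crawl_delay_s: Optional[float] = None,
    respect_crawl_delay_up_to_s: int = 60,
) -> int:
    """Assumed politeness calculation using the extraPolite sheet values as defaults."""
    # Scale the last fetch duration, then clamp to the [min, max] window.
    delay = int(delay_factor * fetch_ms)
    delay = max(delay, min_delay_ms)
    delay = min(delay, max_delay_ms)
    # Honour a robots.txt Crawl-delay, but only up to the configured ceiling.
    if robots_crawl_delay_s is not None:
        honoured_ms = int(min(robots_crawl_delay_s, respect_crawl_delay_up_to_s) * 1000)
        delay = max(delay, honoured_ms)
    return delay


if __name__ == "__main__":
    for fetch in (500, 3_000, 20_000):
        print(fetch, "ms fetch ->", politeness_delay_ms(fetch), "ms wait")
```

Under these assumptions a fast 500 ms response still triggers the 10-second minimum, while a slow 20-second response is capped at the 60-second maximum.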

Page 11: Crawl Operators’ Workshop

2. crawler-beans.cxml

Configuration example:

<bean class="org.archive.crawler.spring.DecideRuledSheetAssociation">
  <property name="rules">
    <bean class="org.archive.modules.deciderules.IpAddressSetDecideRule">
      <property name="ipAddresses">
        <set>
          <value>81.21.76.62</value>
        </set>
      </property>
      <property name="decision" value="ACCEPT"/>
    </bean>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>extraPolite</value>
      <value>crawlLimited</value>
    </list>
  </property>
</bean>
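
Read as logic, this association says: if a host resolves to one of the listed IP addresses, apply the target sheets to it. The sketch below shows that flow with a stubbed resolver. `sheets_for_host`, the `resolve` callable and the hostnames are hypothetical, and the real rule evaluates URIs inside the crawler rather than bare hostnames.

```python
from typing import Callable, Iterable, List, Optional, Set


def sheets_for_host(
    host: str,
    resolve: Callable[[str], Optional[str]],
    ip_addresses: Set[str],
    target_sheet_names: Iterable[str],
) -> List[str]:
    """Return the sheets to apply when the host's IP is in the configured set."""
    ip = resolve(host)  # stands in for the crawler's own DNS result
    if ip is not None and ip in ip_addresses:
        return list(target_sheet_names)
    return []


# Hypothetical DNS results; only the IP address comes from the configuration above.
SAMPLE_DNS = {
    "parked-a.example": "81.21.76.62",
    "standalone.example": "192.0.2.10",
}

if __name__ == "__main__":
    for name in ("parked-a.example", "standalone.example", "unknown.example"):
        print(name, sheets_for_host(name, SAMPLE_DNS.get, {"81.21.76.62"},
                                    ["extraPolite", "crawlLimited"]))
```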

Page 12: Crawl Operators’ Workshop

Thank you

GitHub: https://github.com/ukwa/bl-heritrix-modules

MaxMind: http://dev.maxmind.com/geoip/geoip2/geolite2/