Scrapy talk at DataPhilly


DESCRIPTION

Tour of the architecture and components of Scrapy.

TRANSCRIPT

Page 1: Scrapy talk at DataPhilly

Scrapy

Patrick O’Brien | @obdit

DataPhilly | 20131118 | Monetate

Page 2: Scrapy talk at DataPhilly

Steps of data science

● Obtain
● Scrub
● Explore
● Model
● iNterpret

Page 3: Scrapy talk at DataPhilly

Steps of data science

● Obtain (Scrapy)
● Scrub
● Explore
● Model
● iNterpret

Page 4: Scrapy talk at DataPhilly

About Scrapy

● Framework for collecting data
● Open source - 100% Python
● Simple and well documented
● OS X, Linux, Windows, BSD

Page 5: Scrapy talk at DataPhilly

Some of the features

● Built-in selectors
● Generating feed output
  ○ Format: json, csv, xml
  ○ Storage: local, FTP, S3, stdout
● Encoding and autodetection
● Stats collection
● Control via a web service
● Handle cookies, auth, robots.txt, user-agent

Page 6: Scrapy talk at DataPhilly

Scrapy Architecture

Page 7: Scrapy talk at DataPhilly

Data flow

1. Engine opens, locates the Spider, and schedules the first URL as a Request
2. Scheduler sends a URL to the Engine, which sends it to the Downloader
3. Downloader sends the completed page as a Response through the middleware to the Engine
4. Engine sends the Response to the Spider through middleware
5. Spider sends Items and new Requests to the Engine
6. Engine sends Items to the Item Pipeline and Requests to the Scheduler
7. GOTO 2

Page 8: Scrapy talk at DataPhilly

Parts of Scrapy

● Items
● Spiders
● Link Extractors
● Selectors
● Requests
● Responses

Page 9: Scrapy talk at DataPhilly

Items

● Main container of structured information
● dict-like objects

from scrapy.item import Item, Field

class Product(Item):
    name = Field()
    price = Field()
    stock = Field()
    last_updated = Field(serializer=str)

Page 10: Scrapy talk at DataPhilly

Items

>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)
>>> product['name']
'Desktop PC'
>>> product.get('name')
'Desktop PC'
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]

Page 11: Scrapy talk at DataPhilly

Spiders

● Define how to move around a site
  ○ which links to follow
  ○ how to extract data
● Cycle
  ○ Initial request and callback
  ○ Store parsed content
  ○ Subsequent requests and callbacks

Page 12: Scrapy talk at DataPhilly

Generic Spiders

● BaseSpider
● CrawlSpider
● XMLFeedSpider
● CSVFeedSpider
● SitemapSpider
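The next slides cover BaseSpider and CrawlSpider in detail. As a taste of the feed-oriented spiders, here is a minimal XMLFeedSpider sketch (not from the talk), assuming a hypothetical feed URL and a hypothetical Product item, using the scrapy.contrib import paths of 2013-era Scrapy:

from scrapy.contrib.spiders import XMLFeedSpider
from scrapy.item import Item, Field

class Product(Item):
    # hypothetical item, for illustration only
    name = Field()
    price = Field()

class ProductFeedSpider(XMLFeedSpider):
    name = 'productfeed'
    start_urls = ['http://www.example.com/products.xml']  # hypothetical feed
    itertag = 'product'  # parse_node is called once per <product> element

    def parse_node(self, response, node):
        # node is a Selector scoped to a single <product> element
        item = Product()
        item['name'] = node.xpath('name/text()').extract()
        item['price'] = node.xpath('price/text()').extract()
        return item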

Page 13: Scrapy talk at DataPhilly

BaseSpider

● Every other spider inherits from BaseSpider
● Two jobs
  ○ Request `start_urls`
  ○ Callback `parse` on the resulting response

Page 14: Scrapy talk at DataPhilly

BaseSpider...

# imports for the Scrapy versions current at the time of the talk (~0.20);
# MyItem is a project-defined Item (a scrapy.item.Item subclass)
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request

class MySpider(BaseSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = Selector(response)
        for h3 in sel.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in sel.xpath('//a/@href').extract():
            yield Request(url, callback=self.parse)

● Sends Requests for example.com/[1-3].html
● Yields a title Item for each <h3>
● Yields a new Request for each link

Page 15: Scrapy talk at DataPhilly

CrawlSpider

● Provide a set of rules for which links to follow
  ○ `link_extractor`
  ○ `callback`

rules = (
    # Extract links matching 'category.php' (but not matching 'subsection.php')
    # and follow links from them (since no callback means follow=True by default).
    Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

    # Extract links matching 'item.php' and parse them with the spider's method parse_item
    Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
)
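In context, the rules tuple above sits on a CrawlSpider subclass. A minimal sketch, reusing the hypothetical example.com URLs and adding a made-up PageItem, with 2013-era scrapy.contrib import paths:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item, Field

class PageItem(Item):
    # hypothetical item, for illustration only
    title = Field()
    url = Field()

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # Follow category pages (no callback, so follow=True by default)
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
        # Send item pages to parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = PageItem()
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item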

Page 16: Scrapy talk at DataPhilly

Link Extractors

● Extract links from Response objects
● Parameters include:
  ○ allow | deny
  ○ allow_domains | deny_domains
  ○ deny_extensions
  ○ restrict_xpaths
  ○ tags
  ○ attrs
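A link extractor can also be run on its own against a Response. A small self-contained sketch (the page body and parameter values are made up; old scrapy.contrib import path):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import HtmlResponse

response = HtmlResponse(
    url='http://www.example.com/index.php',
    body='<div id="content"><a href="/item.php?id=1">Item 1</a></div>')

lx = SgmlLinkExtractor(
    allow=('item\.php', ),                       # keep only matching URLs
    restrict_xpaths=('//div[@id="content"]', ),  # only search this region
    tags=('a', 'area'),                          # tags to scan (the defaults)
    attrs=('href', ),                            # attributes to read (the default)
)

# Link objects carry .url, .text, .fragment and .nofollow
for link in lx.extract_links(response):
    print link.url, link.text   # http://www.example.com/item.php?id=1 Item 1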

Page 17: Scrapy talk at DataPhilly

Selectors

● Mechanism for extracting data from HTML
● Built on top of the lxml library
● Two methods
  ○ XPath: sel.xpath('//a[contains(@href, "image")]/@href').extract()
  ○ CSS: sel.css('a[href*=image]::attr(href)').extract()
● A Response object is wrapped in a Selector
  ○ sel = Selector(response)
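A self-contained example of both selector styles, building a Response by hand rather than receiving one from the Downloader (the URL and HTML are made up):

from scrapy.http import HtmlResponse
from scrapy.selector import Selector

response = HtmlResponse(
    url='http://www.example.com',
    body='<html><body><a href="image1.html">Image 1</a></body></html>')

sel = Selector(response)
print sel.xpath('//a[contains(@href, "image")]/@href').extract()  # [u'image1.html']
print sel.css('a[href*=image]::attr(href)').extract()             # [u'image1.html']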

Page 18: Scrapy talk at DataPhilly

Request

● Generated in the Spider, sent to the Downloader
● Represents an HTTP request
● FormRequest subclass performs an HTTP POST
  ○ useful to simulate a user login
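A sketch of the login use case, closely following the FormRequest example from the Scrapy docs (the URL, form field names and credentials are placeholders):

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class LoginSpider(BaseSpider):
    name = 'login-example'
    start_urls = ['http://www.example.com/users/login.php']  # placeholder

    def parse(self, response):
        # fill in and submit the login form found on the page (an HTTP POST)
        return FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login)

    def after_login(self, response):
        if 'authentication failed' in response.body:
            self.log('Login failed')
            return
        # ... continue crawling as a logged-in user from here ...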

Page 19: Scrapy talk at DataPhilly

Response

● Comes from the Downloader, sent to the Spider
● Represents an HTTP response
● Subclasses
  ○ TextResponse
  ○ HtmlResponse
  ○ XmlResponse
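Responses are normally created by the Downloader, but the subclasses can also be constructed directly, which is handy for experimenting. A small sketch with made-up URLs and bodies:

from scrapy.http import HtmlResponse, XmlResponse

html = HtmlResponse(
    url='http://www.example.com',
    body='<html><body><h1>Hi</h1></body></html>')
print html.status             # 200 by default
print html.encoding           # encoding detected from headers/body
print html.body_as_unicode()  # body decoded with that encoding

xml = XmlResponse(
    url='http://www.example.com/feed.xml',
    body='<?xml version="1.0"?><items/>')
print xml.url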

Page 20: Scrapy talk at DataPhilly

Advanced Scrapy

● Scrapyd
  ○ application to deploy and run Scrapy spiders
  ○ deploy projects and control them with a JSON API
● Signals
  ○ get notified when events occur
  ○ hook into the Signals API for advanced tuning
● Extensions
  ○ custom functionality loaded at Scrapy startup
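As an illustration of Signals and Extensions together, here is a sketch of a tiny extension that counts scraped items and logs the total when the spider closes (the class name and settings entry are hypothetical):

from scrapy import signals

class ItemCountExtension(object):
    def __init__(self):
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # hook into the Signals API
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def item_scraped(self, item, spider):
        self.items_scraped += 1

    def spider_closed(self, spider):
        spider.log('%d items scraped' % self.items_scraped)

# enabled from the project settings, e.g.
# EXTENSIONS = {'myproject.extensions.ItemCountExtension': 500}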

Page 21: Scrapy talk at DataPhilly

More information

● http://doc.scrapy.org/
● https://twitter.com/ScrapyProject
● https://github.com/scrapy
● http://scrapyd.readthedocs.org

Page 22: Scrapy talk at DataPhilly

Demo