Capturing the Web with Scrapy
Gabriel Freitas
What is a web crawler?
• "A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner." (http://pt.wikipedia.org/wiki/Web_crawler)
What is it for?
Basic structure of a web crawler
• Building HTTP requests
• Handling the response
  – Building result objects
  – Building new requests
• Persisting the data
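The three building blocks above can be sketched by hand with nothing but the standard library. This is a minimal sketch, not a production crawler; `LinkExtractor`, `fetch`, and `crawl` are illustrative names, not an existing API:

```python
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag -- the 'new requests' step."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    """Build the HTTP request and return the response body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(start_url, max_pages=5):
    """Request, parse, and queue new requests, up to max_pages."""
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        pages[url] = parser.links   # 'persistence' step (in memory here)
        queue.extend(parser.links)
    return pages
```

In practice each step grows error handling, throttling, retries, and real persistence, which is exactly the boilerplate the rest of the deck hands over to Scrapy.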
Crawling in Python
• Common technologies:
  – urllib, httplib2, requests
  – beautifulsoup or lxml
  – json, mysql, xml, csv, sqlite, etc.
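For the persistence side, the standard library alone covers several of the formats listed. A sketch with made-up fighter records (the data below is illustrative, not scraped):

```python
import csv
import json
import sqlite3

fighters = [
    {"name": "Ronda Rousey", "record": "12-0"},  # illustrative data
    {"name": "Jose Aldo", "record": "25-1"},     # illustrative data
]

# JSON
with open("fighters.json", "w") as f:
    json.dump(fighters, f, ensure_ascii=False, indent=2)

# CSV
with open("fighters.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "record"])
    writer.writeheader()
    writer.writerows(fighters)

# SQLite
conn = sqlite3.connect("fighters.db")
conn.execute("CREATE TABLE IF NOT EXISTS fighter (name TEXT, record TEXT)")
conn.executemany("INSERT INTO fighter VALUES (:name, :record)", fighters)
conn.commit()
conn.close()
```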
But what is the problem?
• Having to solve everything by hand:
  – What if there is authentication?
  – Working with sessions, cookies...
  – Badly formed HTML
  – Concurrent requests
  – More points of failure
  – Etc.
The solution?
Installing…
• $ pip install Scrapy
Choosing the target…
Creating the project
• $ scrapy startproject <project name>
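The command lays out a project skeleton roughly like the following (the exact files vary by Scrapy version; the project name `ufc_crawler` is illustrative):

```
ufc_crawler/
    scrapy.cfg            # deploy/configuration file
    ufc_crawler/
        __init__.py
        items.py          # item definitions
        pipelines.py      # post-processing of scraped items
        settings.py       # project settings
        spiders/
            __init__.py   # spiders live in this package
```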
Locating the data
• http://www.ufc.com/fighter
http://www.ufc.com/fighter/ronda-Rousey
Defining the items
• Items are the fields you are going to capture
Identifying the XPaths
Testing XPaths
• $ scrapy shell http://www.ufc.com/fighter/ronda-Rousey
  – $ sel.xpath('//div[@id="fighter-breadcrumb"]/span/h1/text()').extract()
  – [u'Ronda Rousey']
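The same XPath can also be sanity-checked offline with lxml (one of the libraries listed earlier) against a minimal snippet modeled on the page. The HTML here is an assumption for illustration, not the site's real markup:

```python
from lxml import html

# Stripped-down stand-in for the fighter page's breadcrumb markup
snippet = """
<div id="fighter-breadcrumb">
  <span><h1>Ronda Rousey</h1></span>
</div>
"""

tree = html.fromstring(snippet)
names = tree.xpath('//div[@id="fighter-breadcrumb"]/span/h1/text()')
```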
Generating the spider
• $ scrapy genspider ufc http://ufc.com
• Spider types:
  – basic
  – crawl
  – csvfeed
  – xmlfeed
Defining the XPaths in the spider
Running the crawler
• $ scrapy crawl ufc
Exporting the results
• As JSON
  – $ scrapy crawl ufc -o lutadores.json -t json
• As CSV
  – $ scrapy crawl ufc -o lutadores.csv -t csv
• As XML
  – $ scrapy crawl ufc -o lutadores.xml -t xml
References
• Nataliel Vasconcelos – Python Beach
• http://pypix.com/python/build-website-crawler-based-upon-scrapy/
• http://www.slideshare.net/previa/scrapyfordummies-15277988
• http://www.slideshare.net/TheVirendraRajput/web-scraping-in-python
• http://www.slideshare.net/obdit/data-philly-scrapy
• http://trumae.blogspot.com.br/2014/01/scrapy-bem-facinho.html