From Python to Dragon - Continuing from Level 1 to Keep Building a Crawler

=>

Half hour of code:

Joe @ Taichung.py 2016.01.09

• PyConTW HoC

• HTTP / HTML / CSS / JS python

•

• DEMO

•

Crawler

CRAWLER

Crawler •

• JS

•

• JS

•

• JS

•

• JS

•

• BUG

•

•

HoC

C 299

# # …

STEP 1: • 1.1

• whois

•

• Python whois module

• online service

• cmd tools

STEP 1: • 1.2

• robots.txt and sitemap.xml

• ….

•

• HTTP GET

• Parse it with the Python robotparser module (urllib.robotparser in Python 3)

•
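As a minimal sketch of step 1.2, the standard library's robots.txt parser can answer "may I fetch this URL?" before any crawling starts. The robots.txt content, the `MyCrawler` agent name, and the example.com URLs below are made up for illustration; a real crawler would first download the file with an HTTP GET. `site_maps()` assumes Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content (an assumption for this sketch).
# A real crawler would fetch https://<site>/robots.txt first.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# "MyCrawler" is a hypothetical user-agent name.
allowed_home = rules.can_fetch("MyCrawler", "https://example.com/index.html")
allowed_private = rules.can_fetch("MyCrawler", "https://example.com/private/data.html")
delay_seconds = rules.crawl_delay("MyCrawler")
sitemaps = rules.site_maps()  # Python 3.8+
```

To work against a live site instead, `RobotFileParser.set_url(...)` followed by `read()` downloads and parses the file in one go.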

STEP 1: • 1.3

•

•

• Python builtwith module

• Browser

STEP 1:

• 1.4 (optional)

•

•

• google: Kali Linux

https://speakerdeck.com/achudars/28-web-crawlers

STEP 2: • 2.0

• XD

• HoC

•

• Python requests module

• curl … httpie

• API

https://www.youtube.com/channel/UCHLnNgRnfGYDzPCCH8qGbQw
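The download step can be sketched as follows. The slide names the third-party requests module, where the equivalent call is roughly `requests.get(url, headers=...)`; this sketch swaps in the standard library's urllib.request so it runs without installing anything. The user-agent string is a hypothetical value, and `fetch` is defined but not called here because it needs network access.

```python
import urllib.request

# Hypothetical identifier for this sketch; pick one that describes your crawler.
USER_AGENT = "hoc-crawler-demo/0.1"


def build_request(url):
    """Create a GET request that identifies the crawler via User-Agent."""
    return urllib.request.Request(
        url, headers={"User-Agent": USER_AGENT}, method="GET"
    )


def fetch(url, timeout=10.0):
    """Download a page and return its body as text (performs a real HTTP GET)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch("http://example.com/")` would return the page's HTML as a string, ready for the extraction steps that follow.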

STEP 3: •

•

• pattern …

• regular expression

• Python re module; regex101 for testing patterns

•

•

http://regex101.com/
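A minimal sketch of pattern-based extraction with the re module. The HTML snippet and the two patterns are assumptions made up for illustration; regular expressions like these work on simple, predictable markup and tend to break on anything more irregular.

```python
import re

# Made-up HTML used only to exercise the patterns.
SAMPLE_HTML = """
<html><head><title>Demo page</title></head>
<body>
<a href="/page/1">First</a>
<a href='/page/2'>Second</a>
</body></html>
"""

# Non-greedy match of whatever sits between <title> and </title>.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Capture the href value of <a> tags, with either quote style.
HREF_RE = re.compile(r"""<a\s+[^>]*?href=["']([^"']+)["']""", re.IGNORECASE)


def extract_title(html):
    """Return the page title, or None when there is no <title> element."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def extract_links(html):
    """Return every href value found in <a> tags, in document order."""
    return HREF_RE.findall(html)
```

Patterns like these are easy to try out interactively on regex101 before putting them into code.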

STEP 3: •

•

• Parse with BeautifulSoup or lxml

• HoC

• BeautifulSoup parsers: html.parser / lxml / lxml-xml / html5lib

•

• parser

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
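To illustrate parser-based extraction without installing anything, the sketch below uses Python's built-in html.parser, the standard-library module behind BeautifulSoup's "html.parser" option. It is not BeautifulSoup itself; the `LinkCollector` class and `collect_links` helper are hypothetical names for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # href of the <a> we are inside, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def collect_links(html):
    """Parse an HTML string and return its links in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Unlike a regular expression, the parser copes with extra attributes and nested tags inside the link. With BeautifulSoup installed, a comparable result comes from `BeautifulSoup(html, "html.parser").find_all("a")`.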

STEP 3:

•

• BJ4

• http -b www.google.com | hxnormalize -x | hxselect -c 'title'

STEP 3: •

•

• scrapely “train / learn”

• scrapy => crawling framework

• scapy => packet manipulation tool (not a crawler)

•

•

… train http lib

•

• HTTP

• parse

STEP 1:

•

•

• View Source Code vs Element View (chrome)

STEP 2: • python

pyquery

– JS render

STEP 2 : •

•

• javascript implement

•

• JS render

• WebView

•

• headless

STEP 2 : • JS render

• WebView

• Python Binding

• PyQt or PySide …

•

• Selenium Python

• headless

• PhantomJS (CasperJS), SlimerJS …

•

•

•

…

• solution

CAPTCHA =>

@

•

• img alt

• OCR

• pytesseract or pytesser

• xx learning + ….

• XD captcha

•

IP =>

@

• proxy

• Python
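One way to route requests through proxies from Python can be sketched with the standard library alone. The proxy addresses below are hypothetical placeholders, and round-robin rotation is an assumption of this sketch rather than something the slide specifies.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses; replace with proxies you are allowed to use.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


def rotating_openers(proxies):
    """Yield (proxy, opener) pairs forever, cycling through the proxy list."""
    for proxy in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)


openers = rotating_openers(PROXIES)
```

Each `next(openers)` returns the next proxy together with an opener whose `open(url)` sends the request through that proxy, so consecutive requests leave from different addresses.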

DER =>

@

• Python threading / multiprocessing / coroutine modules

• browser automation

• cookies handoff
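Speeding up a crawler with threads can be sketched with concurrent.futures, the standard library's pool wrapper around the threading module. `crawl_all` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a real download so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on a small thread pool.

    Results come back in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real HTTP download (an assumption for this sketch).
    return "body of " + url
```

Threads suit this job because a crawler spends most of its time waiting on the network. Keeping `max_workers` small also limits the load placed on the target site.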

BLOCK @

@IE ONLY

@

SPIDER TRAP@

HEADLESS MODE JS EVENT

• crawler

•

• scrapy

XD

<=

1/24 14:00 GLIACLOUD DAVID

X

•

• K-12

• /

• /

• /

•

• GAE (python)

• backbone.js / react.js

• AWS

• SCRUM
