Half hour of code:
Joe @ Taichun.py 2016.01.09
• HTTP / HTML / CSS / JS, Python
• DEMO
STEP 1: • 1.2
• robots.txt, sitemap.xml
• ….
• HTTP GET
• Python robotparser module (urllib.robotparser in Python 3) to parse robots.txt
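A minimal sketch of the robots.txt check, assuming Python 3's urllib.robotparser. The robots.txt text, the crawler names and the example.com URLs are made up for illustration, and the file is inlined so the snippet runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""


def build_parser(robots_text):
    """Feed robots.txt text (already downloaded) to the parser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this user agent allowed to request this URL?
bad_allowed = parser.can_fetch("BadCrawler", "http://example.com/")
good_allowed = parser.can_fetch("GoodCrawler", "http://example.com/")
trap_allowed = parser.can_fetch("GoodCrawler", "http://example.com/trap")
print(bad_allowed, good_allowed, trap_allowed)
```

Against a live site, set_url() followed by read() would download the file instead of parse().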
STEP 1: • 1.3
• Python builtwith module
• Browser
STEP 3:
• pattern …
• regular expression
• Python re module, regex101 (online regex tester)
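A small sketch of the regular-expression approach with the re module. The HTML fragment and the class names in it are invented for illustration; a real page needs its own pattern, which is easiest to work out in a tester such as regex101 first.

```python
import re

# Hypothetical fragment of a downloaded page.
SAMPLE_HTML = """
<tr id="area"><td class="label">Area: </td><td class="value">244,820 square kilometres</td></tr>
<tr id="population"><td class="label">Population: </td><td class="value">62,348,447</td></tr>
"""

# Non-greedy group: capture everything up to the first closing </td>.
VALUE_PATTERN = re.compile(r'<td class="value">(.*?)</td>')

values = VALUE_PATTERN.findall(SAMPLE_HTML)
print(values)

# Anchoring on the row id makes the match less sensitive to row order.
AREA_PATTERN = re.compile(r'<tr id="area">.*?<td class="value">(.*?)</td>')
area = AREA_PATTERN.search(SAMPLE_HTML).group(1)
print(area)
```

Patterns like these break as soon as the markup changes (extra attributes, different quoting), which is the usual argument for moving on to a real HTML parser.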
STEP 3:
• Parse with BeautifulSoup / lxml
• HoC
• BeautifulSoup parsers: html.parser / lxml / lxml-xml / html5lib
• parser
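A short sketch of parsing with BeautifulSoup, assuming the bs4 package is installed. The sample HTML is made up. The second argument selects the parser: "html.parser" ships with Python, while "lxml", "lxml-xml" and "html5lib" need their own packages installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page content.
SAMPLE_HTML = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='value'>62,348,447</p></body></html>"
)

# The parser name is the second argument; swap in "lxml" or "html5lib"
# here if those packages are available.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title = soup.title.string
value = soup.find("p", class_="value").get_text()
print(title, value)
```

The parsers can build different trees from malformed HTML, so it is worth pinning the parser name explicitly rather than relying on the default.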
STEP 3:
• BJ4
• http -b www.google.com | hxnormalize -x | hxselect -c 'title'
STEP 3:
• scrapely “train / learn”
• scrapy =>
• scapy =>
STEP 1:
• View Source vs. Elements panel (Chrome)
STEP 2:
• JS render
• WebView
• Python Binding
• PyQt or PySide …
• Selenium Python
• headless
• PhantomJS (CasperJS), SlimerJS …
• img alt
• OCR
• pytesseract or pytesser
• xx learning + ….
• XD captcha
• Python threading / multiprocessing / coroutine modules
• browser automation
• cookies handoff
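For the threading bullet above, a minimal sketch of downloading several pages concurrently with the threading module. fake_download is a hypothetical stand-in that sleeps instead of issuing an HTTP GET, and the URLs and the max_threads value are illustrative assumptions.

```python
import queue
import threading
import time


def fake_download(url):
    """Hypothetical stand-in for an HTTP GET; sleeps instead of using the network."""
    time.sleep(0.05)
    return "<html>%s</html>" % url


def threaded_crawl(urls, max_threads=4):
    """Download every URL using a small pool of worker threads."""
    todo = queue.Queue()
    for url in urls:
        todo.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        # Each worker pulls URLs until the shared queue is empty.
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return
            html = fake_download(url)
            with lock:
                results[url] = html

    threads = [threading.Thread(target=worker) for _ in range(max_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


urls = ["http://example.com/page/%d" % i for i in range(8)]
pages = threaded_crawl(urls)
print(len(pages))
```

Threads suit this workload because the time goes into waiting on the network. multiprocessing or coroutines follow the same queue-and-workers shape with different machinery.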
1/24 14:00 GLIACLOUD DAVID