WEB SCRAPING
By Ben Keith
Quoin Inc.
Oct. 13, 2016
WHAT IS SCRAPING?
Programmatically collecting useful data from a website that does not have a machine-optimized interface
Scraping is the inverse of rendering an HTML template with data
Normal Flow: data + template -> renderer -> HTML
Scraping: HTML -> scraper -> data + [template]
Look for APIs first
CRAWLING
Traversing a website for the purpose of mirroring/scraping
Done by a crawler/spider
SIMPLEST CASE
Well-structured static HTML served directly by the web server
List/detail pattern
Requirements for doing it in code:
HTTP client
HTML parser (or regex if it's very simple)
CSS/XPath query library (very helpful)
Example: Quoin website news
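A minimal sketch of the list/detail pattern, using only the Python standard library. The sample markup, the `LinkCollector` class, the `detail_urls` helper, and the example.com base URL are all hypothetical; a real scraper would first download the list page with an HTTP client and would likely use a CSS/XPath query library instead of a hand-written parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical list page; a real scraper would download this with an HTTP client.
LIST_HTML = """
<ul class="news">
  <li><a href="/news/first-post">First post</a></li>
  <li><a href="/news/second-post">Second post</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def detail_urls(html, base_url):
    """Return absolute detail-page URLs and their link text from a list page."""
    parser = LinkCollector()
    parser.feed(html)
    return [(urljoin(base_url, href), text) for href, text in parser.links]
```

Each returned URL would then be fetched and parsed the same way to pull the fields out of the detail page.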
POTENTIAL PROBLEMS
Browsers are very permissive with the markup they accept
Solution: use an equally permissive HTML parsing library
MORE DIFFICULT CASE
Well-structured dynamic HTML generated by JavaScript
Initial page request won't return data to scrape
Requirements:
Download/render page using JS/CSS
Parse/query resulting HTML
Basically what a browser does
Example: Vine trends page
If it's JS-rendered, you can quite possibly use AJAX endpoints
Bypass a lot of scraping
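When a page is filled in from an AJAX endpoint, the data often arrives as JSON and no HTML parsing is needed. The sketch below assumes a hypothetical response body with a `records` array; the field names and the `extract_records` helper are invented for illustration and do not describe any real site's API. The HTTP request itself is left out.

```python
import json

# Hypothetical JSON body, standing in for what an AJAX endpoint might return.
SAMPLE_RESPONSE = (
    '{"records": ['
    '{"title": "Clip A", "loops": 1200, "internal_id": "x1"}, '
    '{"title": "Clip B", "loops": 950, "internal_id": "x2"}'
    ']}'
)


def extract_records(body, fields=("title", "loops")):
    """Parse a JSON response body and keep only the requested fields."""
    payload = json.loads(body)
    return [
        {name: record.get(name) for name in fields}
        for record in payload.get("records", [])
    ]
```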
PROBLEM: POORLY-STRUCTURED DATA
Needs custom logic to get high quality data
https://www.ncsu.edu/directory/departmental/
ADVANCED TECHNIQUES
Use AI tools to automatically scrape important content from pages
Data mining
Find patterns of repeating structure
What differs between them
What is boilerplate
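One simple way to separate boilerplate from varying content is to diff two pages built from the same template. This is only a line-level sketch using the standard library's `difflib`, not one of the AI tools mentioned above; the `split_boilerplate` helper and the sample pages are hypothetical.

```python
import difflib

# Two hypothetical pages rendered from the same template.
PAGE_A = "<h1>Directory</h1>\n<p>Alice, Physics</p>\n<footer>Contact us</footer>"
PAGE_B = "<h1>Directory</h1>\n<p>Bob, Chemistry</p>\n<footer>Contact us</footer>"


def split_boilerplate(page_a, page_b):
    """Return (shared lines, list of differing line groups) for two pages."""
    lines_a = page_a.splitlines()
    lines_b = page_b.splitlines()
    matcher = difflib.SequenceMatcher(a=lines_a, b=lines_b, autojunk=False)
    boilerplate, varying = [], []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            # Lines present in both pages are treated as template boilerplate.
            boilerplate.extend(lines_a[a1:a2])
        else:
            # Everything else is a candidate for real data.
            varying.append((lines_a[a1:a2], lines_b[b1:b2]))
    return boilerplate, varying
```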
SEMANTIC WEB (WEB 3.0)
POPULAR TOOLS/LIBRARIES
Scrapy
Python lib
Runs requests in parallel
Automatically follow all links
Basically adds the parallelization framework on top of plain Python
abot
.NET scraper
Commercial version
Mozenda (Windows desktop app)
Many more
GUI TOOLS
Portia
GUI that lets you select elements to scrape
Select sample pages to give it a pattern
Predefined or regex extractors
Seems to work for simple, very well-structured data
PLAIN OLD PYTHON
This seems to be very common
requests, lxml.html.soupparser
CRAWLING AS A SERVICE
Many companies
Some provide the crawling infrastructure and let you submit your own code (ScrapingHub)
Describe what you want and they'll write/run scraper
~$300 for setup and $40/month for data updates (DataHut, others)
Handle any breaking changes
Custom scraper is going to give best data quality
ANTI-CRAWLING TECHNIQUES
ROBOTS.TXT
Ask politely
User-agent: *
Disallow: /
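For a crawler that does want to honor these rules, the Python standard library ships a parser. The sketch below feeds it the rules shown above as a string; the `is_allowed` helper, the "MyCrawler" agent name, and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide: every user agent is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Check whether robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the rules above, `is_allowed(ROBOTS_TXT, "MyCrawler", "https://example.com/page")` returns `False`.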
ROBOTS.TXT: COUNTERMEASURES
Ignore it
Isn't legally binding
HONEYPOTS
Create pages that normal users would never access
e.g. CSS-hidden links
Block IP address of host accessing the honeypot page
HONEYPOTS: COUNTERMEASURES
Use real rendering engine and test for visibility before following links
Much slower than processing static HTML
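A cheaper but much weaker alternative to a real rendering engine is a static check of each link's own attributes. The sketch below only catches links hidden with an inline `style` or the `hidden` attribute; it will miss anything hidden through a stylesheet, a parent element, or JavaScript, which is exactly what a rendering engine would catch. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleLinkCollector(HTMLParser):
    """Collects hrefs of links that are not obviously hidden inline."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Normalize the inline style so "display: none" and "display:none" match.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attributes
            or "display:none" in style
            or "visibility:hidden" in style
        )
        href = attributes.get("href")
        if href and not hidden:
            self.links.append(href)


def visible_links(html):
    """Return hrefs that pass the inline-visibility heuristic."""
    parser = VisibleLinkCollector()
    parser.feed(html)
    return parser.links
```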
TEXT OBFUSCATION
Images/sprites to render text
Very bad for SEO, accessibility, and performance
TEXT OBFUSCATION: COUNTERMEASURES
OCR of screenshots of page
Would make scraping very difficult
But has so many disadvantages to the site owner
BEHAVIOR ANALYSIS
Look for methodical/unnatural HTTP request patterns
Bots crawl sites as if they were traversing a tree (breadth or depth-first)
Bots tend to access pages in rapid succession
Missing requests for things that normal browsers would download
Machine learning useful here
Captchas to mitigate false positives (Google does this)
Large sites may have many legitimate users on only a few IP addresses
Ban IP if suspicious
Somehow Bing avoided this
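The "rapid succession" signal can be sketched as a sliding-window request counter per IP address. This is a deliberately simple illustration, not the machine-learning approach mentioned above; the `RateMonitor` class and its thresholds are hypothetical, and a real system would combine it with other signals before banning anyone.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flags an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def record(self, ip, timestamp):
        """Record a request; return True if the IP now looks suspicious."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

Because many legitimate users can share one IP address, a flagged address is better answered with a Captcha than with an immediate ban.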
BEHAVIOR ANALYSIS: COUNTERMEASURES
Add randomness to crawl order
Don't crawl too fast from single IP addr
Use multiple IP addresses (see below)
Fetch cluster
Load one page with one host and move on to another
HTTP proxy service
Thousands of IP addresses
Multiple countries
Ban detection and auto retry
Basically a (presumably) legal botnet
Crawlera
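The first two countermeasures, randomizing crawl order and not crawling too fast, can be sketched as a crawl plan with a shuffled URL list and a random delay before each request. The `plan_crawl` helper and its default delay bounds are assumptions chosen for illustration, not recommended values.

```python
import random


def plan_crawl(urls, min_delay=2.0, max_delay=10.0, seed=None):
    """Return (url, delay_seconds) pairs in random order with jittered delays."""
    rng = random.Random(seed)
    shuffled = list(urls)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return [(url, rng.uniform(min_delay, max_delay)) for url in shuffled]
```

A crawler would walk the plan in order, calling `time.sleep(delay)` before fetching each URL.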
REGISTRATION/CAPTCHA
Require user registration/Captcha to even view data
Easy to trace usage for individual accounts and ban
Use strong Captcha to prevent bot registrations
Require phone/credit card verification (e.g. Facebook)
Very poor user experience
Lost revenue
REGISTRATION/CAPTCHA: COUNTERMEASURES
Captcha solving services
Hire the poor in Southeast Asia
Prices seem to average ~$1 per 1000 captchas in bulk
~10 sec response time
For phone: disposable SIM cards/disposable debit cards
Very burdensome
CHANGE PAGE STRUCTURE
Periodically change the markup structure of your pages
Change CSS class names
Change DOM IDs
Change hierarchy
Somewhat burdensome on developer resources
Supposedly one of the most effective means
CHANGE PAGE STRUCTURE: COUNTERMEASURES
Change scraping logic
Increases maint. cost of scraping
Use AI techniques to scrape data independent of markup
IS IT LEGAL?
Many overlapping legal issues
Copyright violations
HTML may be copyrightable (Media.net v. Netseer)
Sufficiently creative
Facebook v. Power.com
Ephemeral, unauthorized downloads of copyrighted material
Boilerplate text, videos, photos, etc.
Had to download whole page that had copyrighted material
Rejected Connect API over license
DMCA
Breach of Contract (Terms of Use)
Browsewrap vs clickwrap
Browsewrap is legally very weak (Cvent v. Eventbrite)
Clickwrap impractical for public sites
Computer Fraud and Abuse Act (CFAA; federal law)
Passed in 1986 as criminal law
Amendment in 1994 allowed civil actions
Exceeding authorization
A site owner telling you to stop accessing the site revokes authorization
Terms of Use (Browsewrap)
Must cause damage or loss directly related to computer access
e.g. by excess server load that causes lost traffic or server crash
Must be intentional
Does not include loss from use of stolen data
Common to see multiple claims in one lawsuit to maximize chances of success
"Proxy defenses" usually don't work
Other charges include:
Trespass to Chattels (not used as much anymore)
Interference with another's possession of property (e.g. servers)
Unjust enrichment
EBAY V. BIDDER'S EDGE
First major scraping case from 2000
BE was an auction site aggregator
In 1999, eBay allowed BE to crawl site for 90 days
Failed to formalize license agreement
eBay wanted on-demand crawling
BE wanted periodic crawling
At end of 90 days, BE continued crawling despite no agreement
EBAY V. BIDDER'S EDGE (CONT.)
IPs were blocked but BE used proxies
eBay sought preliminary injunction to block BE from accessing site
Claimed 8 different causes
eBay was granted injunction based on trespass to chattels claim
Based on hypothetical harm to servers if anyone could freely crawl
This reasoning later rejected in separate case
Settled out of court
AA V. FARECHASE
From 2003
Farechase sold software that specifically scraped AA.com
AA sent cease & desist
Scraping violated terms of use
AA got injunction barring software sales
Based primarily on trespass to chattels
CRAIGSLIST V. 3TAPS
Federal court in California in 2012
Scraped CL house listings to resell
CL made all posts copyrighted for three weeks
CL sent cease-and-desist letter to 3Taps and blocked IPs
3Taps ignored it and used proxies
Court ruled copyright claims were enforceable and that CFAA was violated
$1M settlement and permanent injunction
The court likened its decision to allowing a store to open itself to the public but also to ban a disruptive person if it needed to.
QVC V. RESULTLY
Resultly: web retail aggregator
Had no prior relationship to QVC
Overloaded QVC's servers for two days
Claimed $2M in lost revenue
Peaked at ~400 reqs/sec
QVC sought preliminary injunction to freeze Resultly's assets based on CFAA claim
Denied due to lack of intentionality
Resultly couldn't have known that QVC's servers would crash
CONCLUSION
If the site tells you to stop and you keep doing it, you're probably in a bad position in court
Scraping is ubiquitous but of dubious legality
THE END