WEB SCRAPING
By Ben Keith
Quoin Inc.
Oct. 13, 2016
WHAT IS SCRAPING?
Programmatically collecting useful data from a website that does not have a machine-optimized interface
Scraping is the inverse of rendering an HTML template with data
Normal Flow: data + template -> renderer -> HTML
Scraping: HTML -> scraper -> data + [template]
Look for APIs first
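
To make the "inverse of rendering" idea concrete, here is a minimal sketch using only the Python standard library. The template, the record, and the regex are all invented for illustration; a real page would rarely be regular enough for a single regex.

```python
import re
from string import Template

# Hypothetical template and record, invented for this illustration.
TEMPLATE = Template('<li class="item"><b>$name</b>: $price</li>')
record = {"name": "Widget", "price": "9.99"}

# Normal flow: data + template -> renderer -> HTML
html = TEMPLATE.substitute(record)

# Scraping: HTML -> scraper -> data. The pattern mirrors the template,
# with each placeholder replaced by a named capture group.
PATTERN = re.compile(
    r'<li class="item"><b>(?P<name>[^<]+)</b>: (?P<price>[^<]+)</li>'
)


def scrape(page: str) -> dict:
    """Recover the fields that were substituted into the template."""
    match = PATTERN.search(page)
    return match.groupdict() if match else {}


recovered = scrape(html)
```

The scraper is effectively the template run backwards: whatever the renderer filled in, the pattern captures.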
CRAWLING
Traversing a website for the purpose of mirroring/scraping
Done by a crawler/spider
SIMPLEST CASE
Well-structured static HTML served directly by the web server
List/detail pattern
Requirements for doing it in code:
HTTP client
HTML parser (or regex if it's very simple)
CSS/XPath query library (very helpful)
Example: Quoin website news
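
A rough sketch of the list/detail pattern, assuming a list page whose links carry a `news-item` class and detail pages with an `<h1>` title. The class name, URLs, and canned pages are hypothetical and are not the Quoin site's actual markup. To stay dependency-free it uses the standard library's `html.parser` instead of a CSS/XPath library, and `fetch` is a stand-in for a real HTTP client.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DetailLinkParser(HTMLParser):
    """Collects absolute URLs of <a class="news-item"> links on a list page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "news-item" in classes and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))


class TitleParser(HTMLParser):
    """Collects the text inside <h1> on a detail page."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data


def scrape_news(list_url, fetch):
    """Visit the list page, then each detail page it links to."""
    lister = DetailLinkParser(list_url)
    lister.feed(fetch(list_url))
    results = []
    for url in lister.links:
        detail = TitleParser()
        detail.feed(fetch(url))
        results.append({"url": url, "title": detail.title.strip()})
    return results


# Canned pages standing in for real HTTP responses (hypothetical content).
PAGES = {
    "https://example.com/news/": (
        '<ul><li><a class="news-item" href="a.html">A</a></li>'
        '<li><a class="news-item" href="b.html">B</a></li>'
        '<li><a href="/about">About</a></li></ul>'
    ),
    "https://example.com/news/a.html": "<html><body><h1>First story</h1></body></html>",
    "https://example.com/news/b.html": "<h1> Second story </h1>",
}

news = scrape_news("https://example.com/news/", PAGES.__getitem__)
```

In a live scraper, `fetch` would wrap the HTTP client of your choice and return the response body as text.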
POTENTIAL PROBLEMS
Browsers are very permissive with the markup they accept
Solution: use an equally permissive HTML parsing library
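
As a small illustration of what "permissive" means here, the standard library's `html.parser` does not reject unclosed tags. The broken markup below is invented for the example, and this parser is only one of several tolerant options.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gathers non-empty text chunks, ignoring how well the tags nest."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Unclosed <li>, <b>, and <p> tags: invalid markup that browsers still render.
BROKEN = "<ul><li>one<li>two <b>bold<li>three</ul><p>tail"

collector = TextCollector()
collector.feed(BROKEN)
collector.close()
```

A strict XML parser would raise an error on this input; a tolerant one hands back whatever text it can find.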
MORE DIFFICULT CASE
Well-structured dynamic HTML generated by JavaScript
Initial page request won't return data to scrape
Requirements:
Download/render page using JS/CSS
Parse/query resulting HTML
Basically what a browser does
Example: Vine trends page
If it's JS-rendered, you can quite possibly use AJAX endpoints
Bypass a lot of scraping
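
When a page is filled in by JavaScript, the data often arrives as JSON from an AJAX endpoint, and reading that JSON directly can replace HTML parsing altogether. The sketch below assumes a made-up response shape (`records`, `title`, `next`); a real endpoint's structure has to be discovered by inspecting the browser's network traffic.

```python
import json

# Hypothetical JSON body, shaped like what an AJAX endpoint might return.
SAMPLE_RESPONSE = (
    '{"records": [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}],'
    ' "next": null}'
)


def extract_titles(body: str) -> list[str]:
    """Pull the title of every record out of the JSON payload."""
    payload = json.loads(body)
    return [record["title"] for record in payload.get("records", [])]


titles = extract_titles(SAMPLE_RESPONSE)
```

No selectors and no rendering engine are needed, because the data is already structured.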
PROBLEM: POORLY-STRUCTURED DATA
Needs custom logic to get high quality data
https://www.ncsu.edu/directory/departmental/
ADVANCED TECHNIQUES
Use AI tools to automatically scrape important content from pages
Data mining
Find patterns of repeating structure
What differs between them
What is boilerplate
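
A deliberately naive sketch of the "what differs vs. what is boilerplate" idea: text that appears on every page is treated as boilerplate, and everything else as content. Real tools are far more sophisticated. The sample pages are invented, and each page is assumed to be already reduced to a list of text lines.

```python
def split_boilerplate(pages: list[list[str]]) -> tuple[set[str], list[list[str]]]:
    """Lines present on every page are boilerplate; the rest is per-page content."""
    boilerplate = set(pages[0]).intersection(*map(set, pages[1:]))
    content = [[line for line in page if line not in boilerplate] for page in pages]
    return boilerplate, content


# Hypothetical pages sharing a header, a nav bar, and a footer.
PAGES = [
    ["Acme Directory", "Home | About", "Alice Smith", "Physics", "Copyright Acme"],
    ["Acme Directory", "Home | About", "Bob Jones", "Chemistry", "Copyright Acme"],
    ["Acme Directory", "Home | About", "Carol Lee", "Biology", "Copyright Acme"],
]

boilerplate, content = split_boilerplate(PAGES)
```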
SEMANTIC WEB (WEB 3.0)
POPULAR TOOLS/LIBRARIES
Scrapy
Python lib
Runs requests in parallel
Automatically follows all links
Basically adds the parallelization framework on top of plain Python
abot
.NET scraper
Commercial version
Mozenda (Windows desktop app)
Many more
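
To show what "requests in parallel" buys you without pulling in a framework, here is a standard-library sketch. It is not Scrapy's API and makes no claim about how Scrapy schedules requests internally; `fake_fetch` is a placeholder for a real HTTP call.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Fetch every URL concurrently and return a {url: body} mapping."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    """Placeholder for a real HTTP request; returns a canned body."""
    return f"<html>{url}</html>"


pages = fetch_all(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
)
```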
GUI TOOLS
Portia
GUI that lets you select elements to scrape
Select sample pages to give it a pattern
Predefined or regex extractors
Seems to work for simple, very well-structured data
PLAIN OLD PYTHON
This seems to be very common
requests, lxml.html.soupparser
CRAWLING AS A SERVICE
Many companies
Some provide the crawling infrastructure and let you submit your own code (ScrapingHub)
Describe what you want and they'll write/run scraper
~$300 for setup and $40/month for data updates (DataHut, others)
Handle any breaking changes
Custom scraper is going to give best data quality
ANTI-CRAWLING TECHNIQUES
ROBOTS.TXT
Ask politely
User-agent: *
Disallow: /
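
For a crawler that does want to honor the request, Python's standard library ships a robots.txt parser. The sketch below feeds it rules as text rather than fetching them; the bot name and URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The rule from the slide: every crawler is asked to stay out entirely.
BLOCK_ALL = "User-agent: *\nDisallow: /"
# A narrower, hypothetical rule for comparison.
BLOCK_PRIVATE = "User-agent: *\nDisallow: /private/"

blocked = allowed(BLOCK_ALL, "examplebot", "https://example.com/page")
private_blocked = allowed(BLOCK_PRIVATE, "examplebot", "https://example.com/private/x")
public_ok = allowed(BLOCK_PRIVATE, "examplebot", "https://example.com/public")
```

Nothing enforces the answer: the parser only reports what the site asked for.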
ROBOTS.TXT: COUNTERMEASURES
Ignore it
Isn't legally binding
HONEYPOTS
Create pages that normal users would never access
e.g. CSS-hidden links
Block IP address of host accessing the honeypot page
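
From the site owner's side, the honeypot logic can be sketched in a few lines. The trap path and IP addresses are made up, and a production setup would more likely live in the web server or a firewall than in application code.

```python
# Hypothetical path that is only reachable through a CSS-hidden link.
HONEYPOT_PATHS = {"/trap/catalog-full"}


class HoneypotFilter:
    """Blocks any IP address that has ever requested a honeypot path."""

    def __init__(self, honeypot_paths):
        self.honeypot_paths = set(honeypot_paths)
        self.blocked_ips = set()

    def allow(self, ip: str, path: str) -> bool:
        """Return False if the request should be rejected."""
        if path in self.honeypot_paths:
            self.blocked_ips.add(ip)
        return ip not in self.blocked_ips


guard = HoneypotFilter(HONEYPOT_PATHS)
first = guard.allow("203.0.113.7", "/products")            # ordinary request
trapped = guard.allow("203.0.113.7", "/trap/catalog-full")  # follows hidden link
after = guard.allow("203.0.113.7", "/products")             # now blocked
other = guard.allow("198.51.100.2", "/products")            # unrelated visitor
```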
HONEYPOTS: COUNTERMEASURES
Use real rendering engine and test for visibility before following links
Much slower than processing static HTML
TEXT OBFUSCATION
Images/sprites to render text
Very bad for SEO, accessibility, and performance
TEXT OBFUSCATION: COUNTERMEASURES
OCR of screenshots of page
Would make scraping very difficult
But has so many disadvantages to the site owner
BEHAVIOR ANALYSIS
Look for methodical/unnatural HTTP request patterns
Bots crawl sites as if they were traversing a tree (breadth or depth-first)
Bots tend to access pages in rapid succession
Missing requests for things that normal browsers would download
Machine learning useful here
Captchas to mitigate false positives (Google does this)
Large sites may have many legitimate users on only a few IP addresses
Ban IP if suspicious
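
The "rapid succession" signal is the easiest of these to sketch: count each IP's requests inside a sliding time window. The threshold and window below are arbitrary illustration values, and a real detector would combine this with the other signals listed above.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flags an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def is_suspicious(self, ip: str, timestamp: float) -> bool:
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop requests that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


monitor = RateMonitor(max_requests=5, window_seconds=1.0)

# A bot-like client: ten requests, 50 ms apart.
bot_flags = [monitor.is_suspicious("203.0.113.7", t * 0.05) for t in range(10)]
# A human-like client: ten requests, 3 s apart.
human_flags = [monitor.is_suspicious("198.51.100.2", t * 3.0) for t in range(10)]
```

Because many legitimate users can share one IP address, a flag like this is better used to trigger a Captcha than an outright ban.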
Somehow Bing avoided this
BEHAVIOR ANALYSIS: COUNTERMEASURES
Add randomness to crawl order
Don't crawl too fast from single IP addr
Use multiple IP addresses (see below)
Fetch cluster
Load one page with one host and move on to another
HTTP proxy service
Thousands of IP addresses
Multiple countries
Ban detection and auto retry
Basically a (presumably) legal botnet
Crawlera
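
The first two items on this slide, random crawl order and a slower pace, can be sketched as a crawl plan. The delay bounds are arbitrary assumptions, and the URLs are placeholders; pacing requests this way also keeps load on the target server low.

```python
import random


def plan_crawl(urls, min_delay=1.0, max_delay=5.0, seed=None):
    """Return (url, delay_seconds) pairs in shuffled order with jittered waits."""
    rng = random.Random(seed)
    order = list(urls)
    rng.shuffle(order)
    return [(url, rng.uniform(min_delay, max_delay)) for url in order]


URLS = [f"https://example.com/page/{n}" for n in range(5)]
plan = plan_crawl(URLS, seed=42)
```

A crawler would then sleep for each `delay_seconds` before requesting the corresponding URL.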
REGISTRATION/CAPTCHA
Require user registration/Captcha to even view data
Easy to trace usage for individual accounts and ban
Use strong Captcha to prevent bot registrations
Require phone/credit card verification (e.g. Facebook)
Very poor user experience
Lost revenue
REGISTRATION/CAPTCHA: COUNTERMEASURES
Captcha solving services
Hire the poor in Southeast Asia
Prices seem to average ~$1 per 1000 captchas in bulk
~10 sec response time
For phone: disposable SIM cards/disposable debit cards
Very burdensome
CHANGE PAGE STRUCTURE
Periodically change the markup structure of your pages
Change CSS class names
Change DOM IDs
Change hierarchy
Somewhat burdensome on developer resources
Supposedly one of the most effective means
CHANGE PAGE STRUCTURE: COUNTERMEASURES
Change scraping logic
Increases maintenance cost of scraping
Use AI techniques to scrape data independent of markup
IS IT LEGAL?
Many overlapping legal issues
Copyright violations
HTML may be copyrightable (Media.net v. Netseer)
Sufficiently creative
Facebook v. Power.com
Ephemeral, unauthorized downloads of copyrighted material
Boilerplate text, videos, photos, etc.
Had to download whole page that had copyrighted material
Rejected Connect API over license
DMCA
Breach of Contract (Terms of Use)
Browsewrap vs clickwrap
Browsewrap is legally very weak (Cvent v. Eventbrite)
Clickwrap impractical for public sites
Computer Fraud and Abuse Act (CFAA; Federal law)
Passed in 1986 as criminal law
Amendment in 1994 allowed civil actions
Exceeding authorization
Site owner telling you to stop accessing the site revokes authorization
Terms of Use (Browsewrap)
Must cause damage or loss directly related to computer access
e.g. by excess server load that causes lost traffic or server crash
Must be intentional
Does not include loss from use of stolen data
Common to see multiple claims in one lawsuit to maximize chances of success
"Proxy defenses" usually don't work
Other charges include:
Trespass to Chattels (not used as much anymore)
Interfering with another's possession of property (e.g. servers)
Unjust enrichment
EBAY V. BIDDER'S EDGE
First major scraping case from 2000
BE was an auction site aggregator
In 1999, eBay allowed BE to crawl site for 90 days
Failed to formalize license agreement
eBay wanted on-demand crawling
BE wanted periodic crawling
At end of 90 days, BE continued crawling despite no agreement
EBAY V. BIDDER'S EDGE (CONT.)
IPs were blocked but BE used proxies
eBay sought preliminary injunction to block BE from accessing the site
Claimed 8 different causes
eBay was granted injunction based on trespass to chattels claim
Based on hypothetical harm to servers if anyone could freely crawl
This reasoning later rejected in separate case
Settled out of court
AA V. FARECHASE
From 2003
Farechase sold software that specifically scraped AA.com
AA sent cease & desist
Scraping violated terms of use
AA got injunction barring software sales
Based primarily on trespass to chattels
CRAIGSLIST V. 3TAPS
Federal court in California in 2012
Scraped CL house listings to resell
CL made all posts copyrighted for three weeks
CL sent cease-and-desist letter to 3Taps and blocked IPs
3Taps ignored it and used proxies
Court ruled copyright claims were enforceable and that CFAA was violated
$1M settlement and permanent injunction
The court likened its decision to allowing a store to open itself to the public but also to ban a disruptive person if it needed to.
QVC V. RESULTLY
Resultly: Web retail aggregator
Had no prior relationship to QVC
Overloaded QVC's servers for two days
Claimed $2M in lost revenue
Peaked at ~400 reqs/sec
QVC sought preliminary injunction to freeze Resultly's assets based on CFAA claim
Denied due to lack of intentionality
Resultly couldn't have known that QVC's servers would crash
CONCLUSION
If the site tells you to stop and you keep doing it, you're probably in a bad position in court
Scraping is ubiquitous but of dubious legality
THE END