web scraping - d2wakvpiukf49j.cloudfront.net · scrapy python lib runs requests in parallel automa...

WEB SCRAPING

By Ben Keith

Quoin Inc.

Oct. 13, 2016

WHAT IS SCRAPING?

Programmatically collecting useful data from a website that does not have a machine-optimized interface

Scraping is the inverse of rendering an HTML template with data

Normal Flow: data + template -> renderer -> HTML

Scraping: HTML -> scraper -> data + [template]

Look for APIs first

CRAWLING

Traversing a website for the purpose of mirroring/scraping
Done by a crawler/spider

SIMPLEST CASE

Well-structured static HTML served directly by the web server

List/detail pattern
Requirements for doing it in code:

HTTP client
HTML parser (or regex if it's very simple)
CSS/XPath query library (very helpful)

Example: Quoin website news
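A minimal sketch of the list/detail pattern, using only the Python standard library so it runs offline. The markup in `LIST_PAGE`, the `example.com` base URL, and the names `LinkCollector` and `detail_urls` are illustrative assumptions, not taken from the Quoin site. In practice the list page would come from an HTTP client, and a CSS/XPath library would replace the hand-written parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical list page; a real scraper would download this with an HTTP client.
LIST_PAGE = """
<ul class="news">
  <li><a href="/news/1">First story</a></li>
  <li><a href="/news/2">Second story</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def detail_urls(list_html, base_url):
    """Return absolute detail-page URLs and their titles from a list page."""
    parser = LinkCollector()
    parser.feed(list_html)
    return [(urljoin(base_url, href), text) for href, text in parser.links]


print(detail_urls(LIST_PAGE, "https://example.com/news/"))
```

Each returned URL would then be fetched and parsed the same way to extract the detail fields.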

POTENTIAL PROBLEMS

Browsers are very permissive with the markup they accept

Solution: use an equally permissive HTML parsing library
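To illustrate the difference, the sketch below feeds the same sloppy markup to a strict XML parser and to a tolerant HTML parser, both from the Python standard library. The `SLOPPY` string and the helper names are made up for this example. Libraries such as lxml.html or BeautifulSoup are more capable tolerant parsers, but are not used here so the code stays dependency-free.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Markup a browser would accept: unclosed <li> and <p>, unquoted attribute values.
SLOPPY = "<ul><li class=item>One<li class=item>Two</ul><p>Trailing text"


def strict_parse_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class ItemText(HTMLParser):
    """Tolerant parser: collects the text of each <li>, closed or not."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag in ("li", "ul"):
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


def tolerant_items(markup):
    parser = ItemText()
    parser.feed(markup)
    parser.close()
    return parser.items


print(strict_parse_ok(SLOPPY))
print(tolerant_items(SLOPPY))
```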

MORE DIFFICULT CASE

Well-structured dynamic HTML generated by JavaScript

Initial page request won't return data to scrape
Requirements:

Download/render page using JS/CSS
Parse/query resulting HTML
Basically what a browser does

Example: Vine trends page
If it's JS-rendered, you can quite possibly use AJAX endpoints

Bypass a lot of scraping
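A rough sketch of the AJAX-endpoint shortcut: instead of rendering the page, request the JSON the page's own JavaScript would load and read the fields directly. The payload shape (`records`, `title`, `views`) and the function names are assumptions for illustration. A real endpoint has to be discovered in the browser's network inspector, and `fetch_json` is defined but not called here so the example runs offline.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """Fetch a JSON endpoint, as the page's own JavaScript would."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_titles(payload):
    """Pull the titles out of a (hypothetical) payload of records."""
    return [item["title"] for item in payload.get("records", [])]


# Stand-in for what a hypothetical endpoint might return.
SAMPLE = '{"records": [{"title": "A", "views": 10}, {"title": "B", "views": 7}]}'

print(extract_titles(json.loads(SAMPLE)))
```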

PROBLEM: POORLY-STRUCTURED DATA

Needs custom logic to get high-quality data
https://www.ncsu.edu/directory/departmental/

ADVANCED TECHNIQUES

Use AI tools to automatically scrape important content from pages
Data mining
Find patterns of repeating structure

What differs between them
What is boilerplate
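The idea of separating boilerplate from what differs can be shown with a very crude heuristic: lines that appear on every page are treated as boilerplate, and the rest as content. This is only a toy stand-in for the AI and data-mining tools mentioned above. The sample pages and the function name `split_boilerplate` are invented for the example.

```python
def split_boilerplate(pages):
    """Split page lines into those shared by every page and those unique per page."""
    cleaned = [
        [line.strip() for line in page.splitlines() if line.strip()]
        for page in pages
    ]
    # A line is boilerplate if it occurs on every page.
    boilerplate = set.intersection(*(set(lines) for lines in cleaned))
    content = [
        [line for line in lines if line not in boilerplate]
        for lines in cleaned
    ]
    return boilerplate, content


# Two hypothetical detail pages rendered from the same template.
PAGE_A = """<header>Acme Directory</header>
<h1>Alice</h1>
<p>Room 101</p>
<footer>Contact us</footer>"""

PAGE_B = """<header>Acme Directory</header>
<h1>Bob</h1>
<p>Room 202</p>
<footer>Contact us</footer>"""

boilerplate, content = split_boilerplate([PAGE_A, PAGE_B])
print(sorted(boilerplate))
print(content)
```

Real pages rarely line up this neatly, so production tools compare DOM structure rather than raw lines.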

SEMANTIC WEB (WEB 3.0)

POPULAR TOOLS/LIBRARIES

Scrapy

Python lib
Runs requests in parallel
Automatically follows all links
Basically adds the parallelization framework on top of plain Python

abot
.NET scraper
Commercial version

Mozenda (Windows desktop app)
Many more
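To make "runs requests in parallel" concrete, here is a sketch of concurrent fetching in plain Python with a thread pool. This is not Scrapy's API (Scrapy uses its own asynchronous engine); it only shows the kind of parallelization such a framework provides. `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP GET so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real HTTP request.
    return f"<html>body of {url}</html>"


urls = [f"https://example.com/page/{n}" for n in range(5)]
pages = fetch_all(urls, fake_fetch)
print(len(pages))
```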

GUI TOOLS

Portia

GUI that lets you select elements to scrape
Select sample pages to give it a pattern
Predefined or regex extractors
Seems to work for simple, very well-structured data

PLAIN OLD PYTHON

This seems to be very common
requests, lxml.html.soupparser

CRAWLING AS A SERVICE

Many companies

Some provide the crawling infrastructure and let you submit your own code (ScrapingHub)
Describe what you want and they'll write/run scraper
~$300 for setup and $40/month for data updates (DataHut, others)
Handle any breaking changes

Custom scraper is going to give best data quality

ANTI-CRAWLING TECHNIQUES

ROBOTS.TXT

Ask politely

User-agent: *
Disallow: /
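For reference, this is how a well-behaved crawler could interpret those two directives, sketched with the standard library's `urllib.robotparser`. The bot name `ExampleBot`, the URLs, and the second rule set `PARTIAL` are assumptions added for contrast.

```python
from urllib.robotparser import RobotFileParser

# The rules shown above: every user agent is asked to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Hypothetical, less restrictive rules for comparison.
PARTIAL = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/news/1"))
print(allowed(PARTIAL, "ExampleBot", "https://example.com/news/1"))
```

Nothing enforces the answer: the check is advisory, and the crawler decides whether to honor it.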

ROBOTS.TXT: COUNTERMEASURES

Ignore it

Isn't legally binding

HONEYPOTS

Create pages that normal users would never access

e.g. CSS-hidden links
Block IP address of host accessing the honeypot page

HONEYPOTS: COUNTERMEASURES

Use real rendering engine and test for visibility before following links

Much slower than processing static HTML
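A cheap static approximation of the visibility test is sketched below. It only catches links hidden by an inline `style` or the `hidden` attribute on the link itself; links hidden through a stylesheet class or a hidden ancestor would still need a real rendering engine, as noted above. The sample markup and the name `VisibleLinks` are invented for illustration.

```python
import re
from html.parser import HTMLParser

# Inline styles that hide an element. Stylesheet rules are NOT detected.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class VisibleLinks(HTMLParser):
    """Collect hrefs of <a> elements that are not obviously hidden."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        hidden = "hidden" in attrs or bool(HIDDEN_STYLE.search(attrs.get("style") or ""))
        if attrs.get("href") and not hidden:
            self.links.append(attrs["href"])


# Hypothetical page with one real link and two honeypot links.
PAGE = (
    '<a href="/real">Real</a>'
    '<a href="/trap" style="display: none">x</a>'
    '<a href="/trap2" hidden>y</a>'
)

parser = VisibleLinks()
parser.feed(PAGE)
print(parser.links)
```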

TEXT OBFUSCATION

Images/sprites to render text

Very bad for SEO, accessibility, and performance

TEXT OBFUSCATION: COUNTERMEASURES

OCR of screenshots of page
Would make scraping very difficult

But has so many disadvantages to the site owner

BEHAVIOR ANALYSIS

Look for methodical/unnatural HTTP request patterns

Bots crawl sites as if they were traversing a tree (breadth or depth-first)
Bots tend to access pages in rapid succession
Missing requests for things that normal browsers would download
Machine learning useful here
Captchas to mitigate false positives (Google does this)
Large sites may have many legitimate users on only a few IP addresses
Ban IP if suspicious

Somehow Bing avoided this
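The "rapid succession" signal can be sketched as a sliding-window counter per IP address. The thresholds (more than 5 requests within 1 second), the sample log, and the function name are arbitrary assumptions. A real system would combine several signals, and a hard threshold like this would misfire on the shared-IP case mentioned above.

```python
from collections import defaultdict, deque


def flag_rapid_clients(requests, max_requests=5, window_seconds=1.0):
    """Return the IPs that made more than max_requests within any time window.

    requests: iterable of (timestamp_seconds, ip) pairs sorted by timestamp.
    """
    recent = defaultdict(deque)  # ip -> timestamps inside the current window
    flagged = set()
    for ts, ip in requests:
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the window.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(ip)
    return flagged


# Hypothetical access log: one client fires 10 requests in under a second,
# the other makes 3 requests spread over 9 seconds.
log = sorted(
    [(i * 0.1, "10.0.0.1") for i in range(10)]
    + [(0.0, "10.0.0.2"), (4.0, "10.0.0.2"), (9.0, "10.0.0.2")]
)

print(flag_rapid_clients(log))
```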

BEHAVIOR ANALYSIS: COUNTERMEASURES

Add randomness to crawl order
Don't crawl too fast from single IP address
Use multiple IP addresses (see below)
Fetch cluster

Load one page with one host and move on to another

HTTP proxy service
Thousands of IP addresses
Multiple countries
Ban detection and auto retry
Basically a (presumably) legal botnet

Crawlera
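A sketch of three of these countermeasures in plain Python: shuffled crawl order, a jittered delay, and proxy rotation with retry on a ban. All names here are hypothetical, as are the delay values and the choice to treat HTTP 403 and 429 as ban signals. This is not Crawlera's API, and `fake_fetch` replaces a real proxied request so the code runs offline.

```python
import random

BAN_STATUSES = {403, 429}  # assumption: these status codes indicate a ban


def randomized_order(urls, rng=random):
    """Return the URLs in random order instead of tree-traversal order."""
    shuffled = list(urls)
    rng.shuffle(shuffled)
    return shuffled


def jittered_delay(base_seconds=2.0, jitter_seconds=1.5, rng=random):
    """Seconds to wait before the next request, so timing is not perfectly regular."""
    return base_seconds + rng.uniform(0, jitter_seconds)


def fetch_with_rotation(url, proxies, fetch, max_attempts=3):
    """Try the URL through successive proxies until one is not banned."""
    for attempt in range(max_attempts):
        proxy = proxies[attempt % len(proxies)]
        status, body = fetch(url, proxy)
        if status not in BAN_STATUSES:
            return proxy, body
    raise RuntimeError(f"all {max_attempts} attempts were banned for {url}")


def fake_fetch(url, proxy):
    # Stand-in for a proxied HTTP request: pretend proxy-a has been banned.
    return (403, "") if proxy == "proxy-a" else (200, f"ok via {proxy}")


print(fetch_with_rotation("https://example.com/", ["proxy-a", "proxy-b"], fake_fetch))
```

A crawl loop would call `time.sleep(jittered_delay())` between requests taken from `randomized_order(...)`.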

REGISTRATION/CAPTCHA

Require user registration/Captcha to even view data

Easy to trace usage for individual accounts and ban
Use strong Captcha to prevent bot registrations

Require phone/credit card verification (e.g. Facebook)
Very poor user experience

Lost revenue

REGISTRATION/CAPTCHA: COUNTERMEASURES

Captcha solving services

Hire the poor in Southeast Asia
Prices seem to average ~$1 per 1000 captchas in bulk
~10 sec response time

For phone: disposable SIM cards/disposable debit cards
Very burdensome

CHANGE PAGE STRUCTURE

Periodically change the markup structure of your pages

Change CSS class names
Change DOM IDs
Change hierarchy
Somewhat burdensome on developer resources
Supposedly one of the most effective means

CHANGE PAGE STRUCTURE: COUNTERMEASURES

Change scraping logic

Increases maintenance cost of scraping
Use AI techniques to scrape data independent of markup

IS IT LEGAL?

Many overlapping legal issues

Copyright violations
HTML may be copyrightable (Media.net v. Netseer)

Sufficiently creative
Facebook v. Power.com

Ephemeral, unauthorized downloads of copyrighted material
Boilerplate text, videos, photos, etc.

Had to download whole page that had copyrighted material
Rejected Connect API over license

DMCA

Breach of Contract (Terms of Use)
Browsewrap vs clickwrap
Browsewrap is legally very weak (Cvent v. Eventbrite)
Clickwrap impractical for public sites

Computer Fraud and Abuse Act (CFAA; federal law)
Passed in 1986 as criminal law
Amendment in 1994 allowed civil actions
Exceeding authorization

A site owner telling you to stop accessing the site revokes authorization
Terms of Use (Browsewrap)

Must cause damage or loss directly related to computer access
e.g. by excess server load that causes lost traffic or server crash
Must be intentional
Does not include loss from use of stolen data

Common to see multiple claims in one lawsuit to maximize chances of success
"Proxy defenses" usually don't work
Other charges include:

Trespass to Chattels (not used as much anymore)
Interfering with another's possession of property (e.g. servers)

Unjust enrichment

EBAY V. BIDDER'S EDGE

First major scraping case, from 2000
BE was an auction site aggregator
In 1999, eBay allowed BE to crawl site for 90 days
Failed to formalize license agreement

eBay wanted on-demand crawling
BE wanted periodic crawling

At end of 90 days, BE continued crawling despite no agreement

EBAY V. BIDDER'S EDGE (CONT.)

IPs were blocked but BE used proxies
eBay sought preliminary injunction to block BE from accessing site

Claimed 8 different causes
eBay was granted injunction based on trespass to chattels claim

Based on hypothetical harm to servers if anyone could freely crawl
This reasoning later rejected in separate case

Settled out of court

AA V. FARECHASE

From 2003
Farechase sold software that specifically scraped AA.com
AA sent cease & desist
Scraping violated terms of use
AA got injunction barring software sales

Based primarily on trespass to chattels

CRAIGSLIST V. 3TAPS

Federal court in California in 2012
Scraped CL house listings to resell
CL made all posts copyrighted for three weeks
CL sent cease-and-desist letter to 3Taps and blocked IPs

3Taps ignored it and used proxies
Court ruled copyright claims were enforceable and that CFAA was violated
$1M settlement and permanent injunction

The court likened its decision to allowing a store to open itself to the public but also to ban a disruptive person if it needed to.

QVC V. RESULTLY

Resultly: web retail aggregator
Had no prior relationship to QVC
Overloaded QVC's servers for two days

Claimed $2M in lost revenue
Peaked at ~400 reqs/sec

QVC sought preliminary injunction to freeze Resultly's assets based on CFAA claim
Denied due to lack of intentionality
Resultly couldn't have known that QVC's servers would crash

CONCLUSION

If the site tells you to stop and you keep doing it, you're probably in a bad position in court

Scraping is ubiquitous but of dubious legality

THE END
