When RSS Fails: Web Scraping with HTTP
DESCRIPTION
A brief introduction to the HTTP protocol for use in web scraping, best practices, and availability of PHP-based HTTP client libraries.
TRANSCRIPT
![Page 1: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/1.jpg)
When RSS Fails: Web Scraping with HTTP
Matthew Turland
Senior Consultant
Blue Parabola LLC
February 27, 2009
![Page 2: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/2.jpg)
What is Web Scraping?
A 2 Step Process
![Page 3: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/3.jpg)
Its Goal: Data
![Page 4: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/4.jpg)
Obtain It
![Page 5: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/5.jpg)
Transform It
![Page 6: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/6.jpg)
Automate It
![Page 7: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/7.jpg)
Step 1: Retrieval
![Page 8: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/8.jpg)
The Client
![Page 9: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/9.jpg)
The Server
![Page 10: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/10.jpg)
The Request
![Page 11: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/11.jpg)
The Response
![Page 12: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/12.jpg)
Or In Your Case
![Page 13: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/13.jpg)
Step 2: Analysis
![Page 14: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/14.jpg)
Locate Desired Data
![Page 15: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/15.jpg)
Extract It
![Page 16: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/16.jpg)
Use It
![Page 17: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/17.jpg)
2 Step Process
Step 1: Retrieval
GET /some/resource
...
HTTP/1.1 200 OK
...
Resource with data you want
Step 2: Analysis
Raw resource
Usable data
So To Recap
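The two steps can be sketched in PHP roughly as follows. This is a minimal illustration, not the speaker's code: `retrieve()`, `extractTitle()`, and the canned fetcher are hypothetical names, and the URL is a placeholder. The fetcher is injected so the sketch runs without a network, and the regular expression stands in for a proper HTML parser.

```php
<?php
// Step 1 (retrieval): obtain the raw resource. The fetcher is injected so the
// sketch can run offline; in practice it might wrap file_get_contents() or an
// HTTP client library.
function retrieve(callable $fetcher, string $url): string
{
    return $fetcher($url);
}

// Step 2 (analysis): turn the raw resource into usable data. A regular
// expression is enough for this toy case; real markup usually calls for a
// proper HTML parser.
function extractTitle(string $html): ?string
{
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $matches) === 1) {
        return trim(html_entity_decode($matches[1], ENT_QUOTES, 'UTF-8'));
    }
    return null;
}

// Hypothetical fetcher returning canned markup instead of a live response.
$fakeFetcher = function (string $url): string {
    return '<html><head><title>Main Page &amp; News</title></head><body></body></html>';
};

$raw = retrieve($fakeFetcher, 'http://example.com/some/resource');
echo extractTitle($raw), "\n";
```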
![Page 18: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/18.jpg)
Data Mining
Focus in data mining
Focus in web scraping
Consuming Web Services
Web service data formats
Web scraping data formats
How Is It Different?
![Page 19: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/19.jpg)
System integration
Crawlers and indexers
Integration testing
What Is It Used For?
![Page 20: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/20.jpg)
Disadvantages
![Page 21: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/21.jpg)
One small change to markup...
![Page 22: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/22.jpg)
... may break your application.
![Page 23: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/23.jpg)
Or in modern terms...
![Page 24: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/24.jpg)
Reverse Engineering Required
![Page 25: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/25.jpg)
Multiple Requests
![Page 26: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/26.jpg)
No Nice Neat Data Package
![Page 27: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/27.jpg)
Quite the Opposite, In Fact
![Page 28: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/28.jpg)
Use one like this:
To do this:
Know enough HTTP to...
![Page 29: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/29.jpg)
PEAR::HTTP_Client pecl_http Zend_Http_Client
Learn to use and troubleshoot one like this:
Or roll your own!
cURL
Filesystem + Streams
Know enough HTTP to...
![Page 30: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/30.jpg)
GET /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
method or operation
URI address for the desired resource
protocol version in use by the client
header name header value
request line
header
more headers follow...
Let's GET Started
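To make the anatomy concrete, here is a small sketch that assembles the request above from its parts. `buildRequest()` is a hypothetical helper written for this illustration; a real client library would do this work for you.

```php
<?php
// Assemble a raw HTTP/1.1 request from its parts: the request line, the
// headers, and the blank line that ends the header section.
function buildRequest(string $method, string $uri, array $headers, string $body = ''): string
{
    // Request line: method, URI, protocol version.
    $lines = ["$method $uri HTTP/1.1"];

    // Each header is a name and a value separated by a colon.
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }

    // Lines end in CRLF; an empty line separates headers from the body.
    return implode("\r\n", $lines) . "\r\n\r\n" . $body;
}

echo buildRequest('GET', '/wiki/Main_Page', ['Host' => 'en.wikipedia.org']);
```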
![Page 31: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/31.jpg)
1. Uniquely identifies a resource
2. Indicates how to locate a resource
3. Does both and is thus human-usable.
URI
URL
More info in RFC 3986 Sections 1.1.3 and 1.2.2
URI vs URL
![Page 32: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/32.jpg)
In principle: "Let's do this by the book."
GET
In reality: "'Safe operation'? Whatever."
GET
Warning about GET
![Page 33: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/33.jpg)
http://en.wikipedia.org/w/index.php?title=Query_string&action=edit
URL
Query String
Question mark to separate the resource address and query string
Equal signs to separate parameter names and respective values
Ampersands to separate parameter name-value pairs
Parameter
Value
Query Strings
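In PHP, the URL above can be taken apart with two built-in functions. This is a short sketch; the variable names are only illustrative.

```php
<?php
$url = 'http://en.wikipedia.org/w/index.php?title=Query_string&action=edit';

// parse_url() splits the URL into components; the query component is
// everything after the question mark.
$query = parse_url($url, PHP_URL_QUERY);

// parse_str() splits the query string on "&" and "=" and URL-decodes the
// parameter names and values into an array.
parse_str($query, $params);

print_r($params);
```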
![Page 34: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/34.jpg)
Parameter    Value
first        this is a field
second       is it clear enough (already)?
Query String
first=this+is+a+field&second=is+it+clear+enough+%28already%29%3F
Also called percent encoding.
parse_str, urlencode, urldecode: Handy PHP URL functions
$_SERVER['QUERY_STRING'] / http_build_query($_GET)
More info on URL encoding in RFC 3986 Section 2.1
URL Encoding
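The functions named above fit together roughly like this. The sketch reuses the two parameter values from the slide; passing the `&` separator explicitly is a choice made here so the result does not depend on ini settings.

```php
<?php
$fields = [
    'first'  => 'this is a field',
    'second' => 'is it clear enough (already)?',
];

// http_build_query() URL-encodes each name and value and joins the pairs.
$queryString = http_build_query($fields, '', '&');
echo $queryString, "\n";

// urlencode() and urldecode() handle a single value at a time.
$encoded = urlencode($fields['second']);
echo $encoded, "\n";
echo urldecode($encoded), "\n";

// parse_str() reverses http_build_query().
parse_str($queryString, $decoded);
var_dump($decoded === $fields);
```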
![Page 35: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/35.jpg)
Most Common HTTP Operations
1. GET
2. POST
...
/w/index.php
POST
/new/resource
-or-
/updated/resource
GET /some/resource HTTP/1.1
Header: Value
...
Request body: none
POST /some/resource HTTP/1.1
Header: Value

request body
POST Requests
![Page 36: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/36.jpg)
POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
Content-Type: application/x-www-form-urlencoded

wpStarttime=20080719022313&wpEdittime=20080719022100...
Blank line separates request headers and body
Content type for data submitted via HTML form (multipart/form-data for file uploads)
Request body... look familiar?
Note: Most browsers have a query string length limit. Lowest known common denominator: IE7, where strlen(entire URL) <= 2,048 bytes. This limit is not standardized. It applies to query strings, but not request bodies.
POST Request Example
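One way to issue such a request with PHP's stream wrappers is sketched below. `buildPostOptions()` is a hypothetical helper, the target URL is a placeholder, and the actual send is commented out so the sketch runs offline.

```php
<?php
// Build stream context options for a form-style POST request.
function buildPostOptions(array $fields): array
{
    // The request body uses the same encoding as a query string.
    $body = http_build_query($fields, '', '&');

    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . 'Content-Length: ' . strlen($body) . "\r\n",
            'content' => $body,
        ],
    ];
}

$options = buildPostOptions([
    'wpStarttime' => '20080719022313',
    'wpEdittime'  => '20080719022100',
]);
$context = stream_context_create($options);

// Sending is left commented out; the URL is a placeholder.
// $response = file_get_contents('http://example.com/w/index.php', false, $context);

echo $options['http']['content'], "\n";
```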
![Page 37: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/37.jpg)
HEAD /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
Same as GET with two exceptions:
1. HEAD vs GET
HTTP/1.1 200 OK
Header: Value
2. No response body
Headers
Body
Sometimes headers are all you want
?
HEAD Request
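With PHP streams, a HEAD request might be set up as in the sketch below. `fetchHeaders()` is a hypothetical wrapper, and the network call is commented out so the example runs offline; it assumes PHP 7.1 or later, where `get_headers()` accepts a stream context.

```php
<?php
// Stream context that switches the request method to HEAD, so the server
// returns the status line and headers without a body.
$context = stream_context_create([
    'http' => ['method' => 'HEAD'],
]);

// Hypothetical helper: returns the response headers for $url as an
// associative array, or false on failure.
function fetchHeaders(string $url, $context)
{
    return get_headers($url, true, $context);
}

// Example call, commented out so the sketch runs without network access:
// print_r(fetchHeaders('http://en.wikipedia.org/wiki/Main_Page', $context));

print_r(stream_context_get_options($context));
```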
![Page 38: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/38.jpg)
HTTP/1.0 200 OK
Server: Apache
X-Powered-By: PHP/5.2.5
...
[body]
Lowest protocol version required to process the response
Response status code
Response status description
Status line
Same header format as requests, but different headers are used (see RFC 2616 Section 14)
Responses
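The response layout can be taken apart with a few string functions. This is a simplified sketch: `parseResponse()` is a hypothetical helper, and it keeps only the last value when a header repeats (as Set-Cookie may).

```php
<?php
// Split a raw HTTP response into status line parts, headers, and body.
function parseResponse(string $raw): array
{
    // The first blank line separates the header section from the body.
    $parts = explode("\r\n\r\n", $raw, 2);
    $lines = explode("\r\n", $parts[0]);
    $body = $parts[1] ?? '';

    // Status line: protocol version, status code, status description.
    $status = explode(' ', array_shift($lines), 3);

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip anything that is not "Name: value"
        }
        list($name, $value) = explode(':', $line, 2);
        // Header names are case-insensitive, so normalise them to lower case.
        $headers[strtolower(trim($name))] = trim($value);
    }

    return [
        'version' => $status[0],
        'code'    => (int) ($status[1] ?? 0),
        'reason'  => $status[2] ?? '',
        'headers' => $headers,
        'body'    => $body,
    ];
}

$raw = "HTTP/1.0 200 OK\r\nServer: Apache\r\nX-Powered-By: PHP/5.2.5\r\n\r\n[body]";
$response = parseResponse($raw);
echo $response['code'], ' ', $response['reason'], "\n";
echo $response['headers']['x-powered-by'], "\n";
```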
![Page 39: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/39.jpg)
1xx Informational: Request received, continuing process.
2xx Success: Request received, understood, and accepted.
3xx Redirection: Client must take additional action to complete the request.
4xx Client Error: Request is malformed or could not be fulfilled.
5xx Server Error: Request was valid, but the server failed to process it.
See RFC 2616 Section 10 for more info.
Response Status Codes
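Because the class is carried by the first digit, a scraper can branch on it without listing every code. A minimal sketch, with `statusClass()` as a hypothetical helper name:

```php
<?php
// Map a status code to its class using the first digit.
function statusClass(int $code): string
{
    $classes = [
        1 => 'Informational',
        2 => 'Success',
        3 => 'Redirection',
        4 => 'Client Error',
        5 => 'Server Error',
    ];

    // intdiv(404, 100) is 4, so 404 falls in the Client Error class.
    return $classes[intdiv($code, 100)] ?? 'Unknown';
}

foreach ([200, 304, 404, 503] as $code) {
    echo $code, ' ', statusClass($code), "\n";
}
```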
![Page 40: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/40.jpg)
Set-Cookie
Cookie
See RFC 2109 or RFC 2965 for more info.
Location: Watch out for infinite loops!
Last-Modified
If-Modified-Since
OR
ETag
If-None-Match
304 Not Modified
Headers
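The caching headers pair up as request and response halves of a conditional request. The sketch below shows one possible way to use them; `conditionalHeaders()` and `chooseBody()` are hypothetical helpers, and the validator values are made up for illustration.

```php
<?php
// Build conditional request headers from validators saved from an earlier
// response (its ETag and Last-Modified header values).
function conditionalHeaders(?string $etag, ?string $lastModified): array
{
    $headers = [];
    if ($etag !== null) {
        $headers[] = 'If-None-Match: ' . $etag;
    }
    if ($lastModified !== null) {
        $headers[] = 'If-Modified-Since: ' . $lastModified;
    }
    return $headers;
}

// A 304 Not Modified response has no body: the cached copy is still current.
function chooseBody(int $statusCode, string $responseBody, string $cachedBody): string
{
    return $statusCode === 304 ? $cachedBody : $responseBody;
}

// Made-up validator values for illustration.
print_r(conditionalHeaders('"abc123"', 'Fri, 27 Feb 2009 12:00:00 GMT'));
echo chooseBody(304, '', 'cached copy'), "\n";
```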
![Page 41: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/41.jpg)
WWW-Authenticate
Authorization
User-Agent
200 OK / 403 Forbidden
See RFC 2617 for more info.
User-Agent:
Some servers perform user agent sniffing
Some clients perform user agent spoofing
More Headers
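Both headers are plain strings a client can set itself. The sketch below covers only the Basic scheme from RFC 2617; `basicAuthHeader()` is a hypothetical helper, the credentials are the sample pair used in that RFC, and the user agent string is invented.

```php
<?php
// Basic authentication: "user:password", base64-encoded, after the scheme name.
function basicAuthHeader(string $user, string $password): string
{
    return 'Authorization: Basic ' . base64_encode($user . ':' . $password);
}

$headers = [
    basicAuthHeader('Aladdin', 'open sesame'),
    // Invented user agent string; servers that sniff it may respond differently.
    'User-Agent: ExampleScraper/1.0 (+http://example.com/contact)',
];

echo implode("\r\n", $headers), "\r\n";
```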
![Page 42: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/42.jpg)
Best Practices
![Page 43: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/43.jpg)
Simulate User Behavior
![Page 44: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/44.jpg)
Minimize Requests
![Page 45: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/45.jpg)
Batch Jobs, Non-Peak Hours
![Page 46: When RSS Fails: Web Scraping with HTTP](https://reader033.vdocuments.net/reader033/viewer/2022042514/5579ab8ed8b42ac1148b4dfe/html5/thumbnails/46.jpg)
Questions?
No heckling... OK, maybe just a little.
I generally blog about my experiences with web scraping
and PHP at http://ishouldbecoding.com.
</shameless_plug>
Thanks for coming!