Practical Web Scraping with Web::Scraper

Tatsuhiko Miyagawa

Six Apart, Ltd. / Shibuya Perl Mongers

YAPC::Europe 2007 Vienna, 2007/08/28

Tatsuhiko Miyagawa


CPAN: MIYAGAWA


abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords

Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny

Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton

Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC

Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager

Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era

Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode

File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent

HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote

Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp

Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find

PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable

Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript

Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon

Text::Emoticon::GoogleTalk Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape

WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery

WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog

XML::Atom::Stream XML::Liberal


http://code.sixapart.com/





Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.

http://en.wikipedia.org/wiki/Screen_scraping


"Screen-scraping is so 1999!"


RSS is metadata, not a complete HTML replacement





What's wrong with LWP & Regexp?

<td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46


It works!


WWW::MySpace 0.70


WWW::Search::Ebay 2.231


WWW::Mixi 0.50


It works …


There are 3 problems (at least)


(1) Fragile

Easy to break even with slight HTML changes (like newlines, order of attributes, etc.)
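A minimal sketch of that fragility, assuming the timestamp sits in a strong element with id "ctu" (as the XPath used later in this talk suggests). The "redesigned" markup is a hypothetical change, not taken from the real site: one added attribute and one newline are enough to break the regexp.

```perl
use strict;
use warnings;

# Regexp written against the markup as it looked on the day of scraping.
my $re = qr{<strong id="ctu">(.*?)</strong>};

my $original = '<td><strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong></td>';

# Hypothetical redesign: same content, but a class attribute now comes
# first and a newline follows the opening tag.
my $redesigned = qq{<td><strong class="big" id="ctu">\nMonday, August 27, 2007 at 12:49:46</strong></td>};

# A failed match in list context leaves the variable undefined.
my ($before) = $original   =~ $re;
my ($after)  = $redesigned =~ $re;

print "original:   ", (defined $before ? $before : "(no match)"), "\n";
print "redesigned: ", (defined $after  ? $after  : "(no match)"), "\n";
```

A selector such as strong#ctu would still find the element in both versions, which is the motivation for the parser-based approach.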


(2) Hard to maintain

Regular expression based scrapers are good only when they're used in write-only scripts


(3) Improper HTML & encoding handling

I &hearts; Vienna

> perl -e '$c =~ m@(.*?)@ and print $1'
I &hearts; Vienna

I &hearts; Vienna

> perl -MHTML::Entities -e '$c =~ m@(.*?)@ and print decode_entities($1)'
I ♥ Vienna

ウィーンが大好き！

> perl -MHTML::Entities -MEncode -e '$c =~ m@(.*?)@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
ウィーンが大好き！
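One way to avoid the "Wide character" warning, sketched with the core Encode module only: decode bytes into characters once on input, and put an encoding layer on the output handle. The byte string below is a stand-in for a fetched page body and is an assumption for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Raw UTF-8 bytes, as they would arrive over HTTP (the word "ウィーン").
my $bytes = "\xE3\x82\xA6\xE3\x82\xA3\xE3\x83\xBC\xE3\x83\xB3";

# Input boundary: bytes -> characters.
my $chars = decode_utf8($bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Output boundary: characters -> bytes, handled by the I/O layer,
# so print never sees an unencoded wide character.
binmode STDOUT, ':encoding(UTF-8)';
print $chars, "\n";
```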


The "right" way of screen-scraping


(1), (2) Maintainable, less fragile


Use XPath and CSS Selectors


XPath

HTML::TreeBuilder::XPath
XML::LibXML


XPath


use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;

# Monday, August 27, 2007 at 12:49:46


CSS Selectors

"XPath for HTML coders"
"XPath for people who hate XML"


CSS Selectors

body { font-size: 12px; }

div.article { padding: 1em }

span#count { color: #fff }


XPath: //strong[@id="ctu"]

CSS Selector: strong#ctu
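A small sketch of the same mapping done programmatically with selector_to_xpath from HTML::Selector::XPath. The selectors are the ones that appear in this talk; the exact XPath strings it prints may differ in form from the hand-written one above.

```perl
use strict;
use warnings;
use HTML::Selector::XPath qw(selector_to_xpath);

# CSS selectors taken from the slides.
my @selectors = ('strong#ctu', 'div.article', 'span#count', 'ul.sites > li > a');

# Translate each selector into an equivalent XPath expression.
my %xpath_for = map { $_ => selector_to_xpath($_) } @selectors;

printf "%-20s => %s\n", $_, $xpath_for{$_} for @selectors;
```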


CSS Selectors


use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath "strong#ctu";
print $tree->findnodes($xpath)->shift->as_text;

# Monday, August 27, 2007 at 12:49:46


Complete Script

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;


Robust, maintainable, and sane character handling

Example (before)


> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46


Example (after)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);




but … long and boring



Web::Scraper


Web scraping toolkit inspired by scrapi.rb

DSL-ish


Example (before)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);




Example (after)

#!/usr/bin/perl

use strict;

use warnings;

use Web::Scraper;

use URI;

my $s = scraper {

process "strong#ctu", time => 'TEXT';

result 'time';

};

my $uri = URI->new("http://timeanddate.com/worldclock/");

print $s->scrape($uri);


Basics

use Web::Scraper;

my $s = scraper {

# DSL goes here

};

my $res = $s->scrape($uri);


process

process $selector,

$key => $what,

…;


$selector: CSS Selector or XPath (start with /)


$key: key for the result hash; append "[]" for looping


$what:
'@attr'
'TEXT'
Web::Scraper
sub { … }
Hash reference

<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>


process "ul.sites > li > a",

'urls[]' => '@href';

# { urls => [ … ] }



process '//ul[@class="sites"]/li/a',

'names[]' => 'TEXT';

# { names => [ 'OpenGuides', … ] }



process "ul.sites > li",

'sites[]' => scraper {

process 'a',

link => '@href', name => 'TEXT';

};

# { sites => [ { link => …, name => … },

# { link => …, name => … } ] };




process "ul.sites > li > a",

'sites[]' => sub {

# $_ is HTML::Element

+{ link => $_->attr('href'), name => $_->as_text };

};

# { sites => [ { link => …, name => … },

# { link => …, name => … } ] };




process "ul.sites > li > a",

'sites[]' => {

link => '@href', name => 'TEXT',

};

# { sites => [ { link => …, name => … },

# { link => …, name => … } ] };
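The forms above can be tried offline. This is a sketch that passes a reference to an HTML string to scrape() instead of a URI, so nothing is fetched from the network; the markup is the ul.sites list shown earlier, and the variable names are only for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;
use Web::Scraper;

# The sample list from the slides, inlined so the script needs no network.
my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # CSS selector + attribute, collected into an array ("[]")
    process 'ul.sites > li > a', 'urls[]' => '@href';

    # XPath (starts with /) + text content
    process '//ul[@class="sites"]/li/a', 'names[]' => 'TEXT';

    # Nested scraper: one hash per matched <li>
    process 'ul.sites > li', 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
};

# No "result" call, so the whole stash comes back as a hash reference.
my $res = $s->scrape(\$html);
print Dumper($res);
```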



result

result; # get stash as hashref (default)
result @keys; # get stash as hashref containing @keys
result $key; # get value of stash $key

my $s = scraper {
    process …;
    process …;
    result 'foo', 'bar';
};
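A short sketch comparing the forms of result on the same input. The HTML snippet and the key names (heading, year) are made up for illustration; result is kept as the last statement of each block.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical page fragment used only for this comparison.
my $html = '<html><body><h1>YAPC::Europe</h1><p class="year">2007</p><p class="city">Vienna</p></body></html>';

# Several keys: a hashref restricted to those keys.
my $some = scraper {
    process 'h1',     heading => 'TEXT';
    process 'p.year', year    => 'TEXT';
    process 'p.city', city    => 'TEXT';
    result 'heading', 'year';
};

# A single key: the bare value, not a hashref.
my $one = scraper {
    process 'h1',     heading => 'TEXT';
    process 'p.year', year    => 'TEXT';
    result 'heading';
};

my $subset = $some->scrape(\$html);
my $value  = $one->scrape(\$html);

print join(', ', map { "$_=$subset->{$_}" } sort keys %$subset), "\n";
print "$value\n";
```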


More Examples


Thumbnail URLs on Flickr set

#!/usr/bin/perl

use strict;

use Data::Dumper;

use Web::Scraper;

use URI;

my $url = "http://flickr.com/photos/bulknews/sets/72157601700510359/";

my $s = scraper {

process "a.image_link img", "thumbs[]" => '@src';

};

warn Dumper $s->scrape( URI->new($url) );

<a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson"> <img alt="Cal Henderson" class="photo fn" height="24" id="profile-image" src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a>

…


Twitter Friends

#!/usr/bin/perl

use strict;

use Web::Scraper;

use URI;

use Data::Dumper;

my $url = "http://twitter.com/miyagawa";

my $s = scraper {

process "span.vcard a", "people[]" => '@title';

};

warn Dumper $s->scrape( URI->new($url) ) ;


Twitter Friends (complex)

#!/usr/bin/perl

use strict;

use Web::Scraper;

use URI;

use Data::Dumper;

my $url = "http://twitter.com/miyagawa";

my $s = scraper {

process "span.vcard", "people[]" => scraper {

process "a", link => '@href', name => '@title';

process "img", thumb => '@src';

};

};

warn Dumper $s->scrape( URI->new($url) ) ;


Tools


> cpan Web::Scraper

comes with 'scraper' CLI


> scraper http://example.com/

scraper> process "a", "links[]" => '@href';

scraper> d

$VAR1 = {

links => [

'http://example.org/',

'http://example.net/',

],

};

scraper> y

---

links:

- http://example.org/

- http://example.net/


> scraper /path/to/foo.html

> GET http://example.com/ | scraper


TODO


Web::Scraper needs documentation


More examples to put in eg/ directory


Integrate with WWW::Mechanize and Test::WWW::Declare


XPath Auto-suggestion

off of DOM + element

DOM + XPath => Element
DOM + Element => XPath?

(Template::Extract?)


Questions?


Thank you

http://search.cpan.org/dist/Web-Scraper
http://www.slideshare.net/miyagawa/webscraper
