web::scraper

Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email protected] Six Apart, Ltd. / Shibuya Perl Mongers YAPC::Europe 2007 Vienna

Upload: tatsuhiko-miyagawa

Post on 13-May-2015

TRANSCRIPT

Page 1: Web::Scraper

Practical Web Scraping

with Web::Scraper

Tatsuhiko Miyagawa [email protected]

Six Apart, Ltd. / Shibuya Perl Mongers

YAPC::Europe 2007 Vienna

Page 2: Web::Scraper

Tatsuhiko Miyagawa 2007/08/28 YAPC::Europe 2007

Tatsuhiko Miyagawa

Page 3: Web::Scraper

CPAN: MIYAGAWA

Page 4: Web::Scraper

abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords

Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny

Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton

Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC

Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager

Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era

Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode

File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent

HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote

Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp

Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find

PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable

Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript

Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon

Text::Emoticon::GoogleTalk Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape

WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery

WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog

XML::Atom::Stream XML::Liberal

Page 5: Web::Scraper

Page 6: Web::Scraper

http://code.sixapart.com/

Page 7: Web::Scraper

Page 8: Web::Scraper

Practical Web Scraping

with Web::Scraper

Page 9: Web::Scraper

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.

http://en.wikipedia.org/wiki/Screen_scraping

Page 10: Web::Scraper

Page 11: Web::Scraper

"Screen-scraping is so 1999!"

Page 12: Web::Scraper

Page 13: Web::Scraper

Page 14: Web::Scraper

RSS is metadata, not a complete HTML replacement

Page 15: Web::Scraper

Practical Web Scraping

with Web::Scraper

Page 16: Web::Scraper

What's wrong with LWP & Regexp?

Page 17: Web::Scraper

Page 18: Web::Scraper

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

Page 19: Web::Scraper

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46

Page 20: Web::Scraper

It works!

Page 21: Web::Scraper

WWW::MySpace 0.70

Page 22: Web::Scraper

WWW::Search::Ebay 2.231

Page 23: Web::Scraper

WWW::Mixi 0.50

Page 24: Web::Scraper

It works …

Page 25: Web::Scraper

There are 3 problems (at least)

Page 26: Web::Scraper

(1) Fragile

Easy to break even with slight HTML changes (like newlines, order of attributes, etc.)
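
A small sketch of that fragility, using only core Perl. The pattern is the one from the timeanddate.com one-liner; the HTML variants and the shortened timestamp are made up for illustration and are not taken from the real page.

```perl
use strict;
use warnings;

# The pattern used by the LWP::Simple one-liner in this talk.
my $re = qr{<strong id="ctu">(.*?)</strong>};

# Hypothetical variations of the same markup (assumptions for the demo).
my %variants = (
    original   => '<strong id="ctu">12:49:46</strong>',
    extra_attr => '<strong class="t" id="ctu">12:49:46</strong>',
    quotes     => q{<strong id='ctu'>12:49:46</strong>},
    newline    => qq{<strong id="ctu">12:49:46\n</strong>},
);

# Record which variants the regexp still matches.
my %matched;
for my $name (sort keys %variants) {
    $matched{$name} = $variants{$name} =~ $re ? 1 : 0;
    printf "%-10s %s\n", $name, $matched{$name} ? 'match' : 'NO MATCH';
}
```

Only the original markup matches. An added attribute, a different quote style, or a newline inside the element is enough to break the pattern, even though a browser renders all four the same way.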

Page 27: Web::Scraper

(2) Hard to maintain

Regular-expression-based scrapers are good only when they're used in write-only scripts

Page 28: Web::Scraper

(3) Improper HTML & encoding handling

Page 29: Web::Scraper

<span class="message">I &hearts; Vienna</span>

> perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
I &hearts; Vienna

Page 30: Web::Scraper

<span class="message">I &hearts; Vienna</span>

> perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
I ♥ Vienna

Page 31: Web::Scraper

<span class="message"> ウィーンが大好き! </span>

> perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
ウィーンが大好き!
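
The "Wide character" warning appears when decoded characters are printed to a handle that has no encoding layer. A minimal sketch of one common way to handle it, using only core modules: decode bytes on the way in, encode on the way out. The byte string below stands in for an HTTP response body (an assumption for the demo), and the order of decoding differs slightly from the one-liner above.

```perl
use strict;
use warnings;
use utf8;                                  # this source contains UTF-8 literals
use Encode qw(decode_utf8 encode_utf8);

# Stand-in for the raw bytes of an HTTP response body (assumption).
my $bytes = encode_utf8('<span class="message"> ウィーンが大好き! </span>');

# Decode bytes into characters before matching.
my $html = decode_utf8($bytes);
my ($message) = $html =~ m{<span class="message">\s*(.*?)\s*</span>};

# Encode characters back to bytes on output; this avoids "Wide character".
binmode STDOUT, ':encoding(UTF-8)';
print "$message\n";
printf "%d characters, %d bytes\n",
    length($message), length(encode_utf8($message));
```

Every regexp-based scraper has to repeat this bookkeeping by hand, which is the point of problem (3).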

Page 32: Web::Scraper

The "right" way of screen-scraping

Page 33: Web::Scraper

(1), (2) Maintainable, less fragile

Page 34: Web::Scraper

Use XPath and CSS Selectors

Page 35: Web::Scraper

XPath

HTML::TreeBuilder::XPath

XML::LibXML

Page 36: Web::Scraper

XPath

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;

# Monday, August 27, 2007 at 12:49:46

Page 37: Web::Scraper

CSS Selectors

"XPath for HTML coders"

"XPath for people who hate XML"

Page 38: Web::Scraper

CSS Selectors

body { font-size: 12px; }

div.article { padding: 1em }

span#count { color: #fff }

Page 39: Web::Scraper

XPath: //strong[@id="ctu"]

CSS Selector: strong#ctu
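
The correspondence can be checked with HTML::Selector::XPath, the module this talk uses for the conversion. A short sketch, assuming the module is installed; the selectors come from these slides, and the exact XPath text it prints may vary between module versions.

```perl
use strict;
use warnings;
use HTML::Selector::XPath qw(selector_to_xpath);

# Print the XPath expression generated for a few CSS selectors.
for my $css ('strong#ctu', 'div.article', 'ul.sites > li > a') {
    printf "%-20s => %s\n", $css, selector_to_xpath($css);
}
```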

Page 40: Web::Scraper

CSS Selectors

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath "strong#ctu";
print $tree->findnodes($xpath)->shift->as_text;

# Monday, August 27, 2007 at 12:49:46

Page 41: Web::Scraper

Complete Script

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;

Page 42: Web::Scraper

Robust, maintainable, and sane character handling

Page 43: Web::Scraper

Example (before)

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46

Page 44: Web::Scraper

Example (after)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;

Page 45: Web::Scraper

but … long and boring

Page 46: Web::Scraper

Practical Web Scraping

with Web::Scraper

Page 47: Web::Scraper

Web scraping toolkit inspired by scrapi.rb

DSL-ish

Page 48: Web::Scraper

Example (before)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;

Page 49: Web::Scraper

Example (after)

#!/usr/bin/perl

use strict;

use warnings;

use Web::Scraper;

use URI;

my $s = scraper {

process "strong#ctu", time => 'TEXT';

result 'time';

};

my $uri = URI->new("http://timeanddate.com/worldclock/");

print $s->scrape($uri);

Page 50: Web::Scraper

Basics

use Web::Scraper;

my $s = scraper {

# DSL goes here

};

my $res = $s->scrape($uri);

Page 51: Web::Scraper

process

process $selector,

$key => $what,

…;

Page 52: Web::Scraper

$selector: CSS Selector or XPath (start with /)

Page 53: Web::Scraper

$key: key for the result hash

append "[]" for looping

Page 54: Web::Scraper

$what:

'@attr'
'TEXT'
Web::Scraper
sub { … }
Hash reference

Page 55: Web::Scraper

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>

Page 56: Web::Scraper

process "ul.sites > li > a",

'urls[]' => '@href';

# { urls => [ … ] }

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>

Page 57: Web::Scraper

process '//ul[@class="sites"]/li/a',

'names[]' => 'TEXT';

# { names => [ 'OpenGuides', … ] }

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>

Page 58: Web::Scraper

process "ul.sites > li",

'sites[]' => scraper {

process 'a',

link => '@href', name => 'TEXT';

};

# { sites => [ { link => …, name => … },

# { link => …, name => … } ] };

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>

Page 59: Web::Scraper

process "ul.sites > li > a",

'sites[]' => sub {

# $_ is HTML::Element

+{ link => $_->attr('href'), name => $_->as_text };

};

# { sites => [ { link => …, name => … },

# { link => …, name => … } ] };

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>

Page 60: Web::Scraper

process "ul.sites > li > a",

'sites[]' => {

link => '@href', name => 'TEXT'

};

# { sites => [ { link => …, name => … },

# { link => …, name => … } ] };

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>

Page 61: Web::Scraper

result

result;        # get stash as hashref (default)
result @keys;  # get stash as hashref containing @keys
result $key;   # get value of stash $key

my $s = scraper {
    process …;
    process …;
    result 'foo', 'bar';
};
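
Putting process and result together: a minimal self-contained sketch that scrapes the list markup from the earlier slides out of a string instead of a URL. It assumes Web::Scraper is installed and relies on scrape accepting HTML content directly; the variable names are illustrative.

```perl
use strict;
use warnings;
use Web::Scraper;

# The sample markup used on the earlier slides.
my $html = <<'HTML';
<ul class="sites">
  <li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
  <li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # One nested scraper run per <li>; "[]" collects the results in an array.
    process "ul.sites > li", 'sites[]' => scraper {
        process 'a', link => '@href', name => 'TEXT';
    };
    result 'sites';   # return only this key's value, not the whole stash
};

my $sites = $s->scrape($html);
print "$_->{name}: $_->{link}\n" for @$sites;
```

Without the result line, scrape would return the whole stash, that is a hash reference with a single sites key.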

Page 62: Web::Scraper

More Examples

Page 63: Web::Scraper

Page 64: Web::Scraper

Thumbnail URLs on Flickr set

#!/usr/bin/perl

use strict;

use Data::Dumper;

use Web::Scraper;

use URI;

my $url = "http://flickr.com/photos/bulknews/sets/72157601700510359/";

my $s = scraper {

process "a.image_link img", "thumbs[]" => '@src';

};

warn Dumper $s->scrape( URI->new($url) );

Page 65: Web::Scraper

Page 66: Web::Scraper

<span class="vcard"> <a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson"> <img alt="Cal Henderson" class="photo fn" height="24" id="profile-image" src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a></span>

<span class="vcard">…</span>

Page 67: Web::Scraper

Twitter Friends

#!/usr/bin/perl

use strict;

use Web::Scraper;

use URI;

use Data::Dumper;

my $url = "http://twitter.com/miyagawa";

my $s = scraper {

process "span.vcard a", "people[]" => '@title';

};

warn Dumper $s->scrape( URI->new($url) ) ;

Page 68: Web::Scraper

Twitter Friends (complex)

#!/usr/bin/perl

use strict;

use Web::Scraper;

use URI;

use Data::Dumper;

my $url = "http://twitter.com/miyagawa";

my $s = scraper {

process "span.vcard", "people[]" => scraper {

process "a", link => '@href', name => '@title';

process "img", thumb => '@src';

};

};

warn Dumper $s->scrape( URI->new($url) ) ;

Page 69: Web::Scraper

Tools

Page 70: Web::Scraper

> cpan Web::Scraper

comes with 'scraper' CLI

Page 71: Web::Scraper

> scraper http://example.com/

scraper> process "a", "links[]" => '@href';

scraper> d

$VAR1 = {

links => [

'http://example.org/',

'http://example.net/',

],

};

scraper> y

---

links:

- http://example.org/

- http://example.net/

Page 72: Web::Scraper

> scraper /path/to/foo.html

> GET http://example.com/ | scraper

Page 73: Web::Scraper

TODO

Page 74: Web::Scraper

Web::Scraper needs documentation

Page 75: Web::Scraper

More examples to put in eg/ directory

Page 76: Web::Scraper

Integrate with WWW::Mechanize and Test::WWW::Declare

Page 77: Web::Scraper

XPath Auto-suggestion

off of DOM + element

DOM + XPath => Element

DOM + Element => XPath?

(Template::Extract?)

Page 78: Web::Scraper

Questions?

Page 79: Web::Scraper

Thank you

http://search.cpan.org/dist/Web-Scraper

http://www.slideshare.net/miyagawa/webscraper