web scraper shibuya.pm tech talk #8

Practical Web Scraping

with Web::Scraper

Tatsuhiko Miyagawa

Six Apart, Ltd. / Shibuya Perl Mongers

Shibuya.pm Tech Talk #8, 2007/10/01




Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.

http://en.wikipedia.org/wiki/Screen_scraping


"Screen-scraping is so 1999!"


RSS is metadata, not a complete HTML replacement





What's wrong with LWP & Regexp?

<td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46


It works!


WWW::MySpace 0.70


WWW::Search::Ebay 2.231


WWW::Mixi 0.50


It works …


There are 3 problems (at least)


(1) Fragile

Easy to break even with slight HTML changes (like newlines, order of attributes, etc.)
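
The fragility can be seen without any network access. The following is a minimal sketch using only core Perl; the three HTML variants and the shortened time string are made up for illustration and are not taken from the real page.

```perl
use strict;
use warnings;

# The same element written three ways; all are equivalent to a browser.
# These variants are hypothetical, invented for this illustration.
my @variants = (
    '<strong id="ctu">12:49:46</strong>',
    '<strong class="t" id="ctu">12:49:46</strong>',   # extra attribute first
    "<strong\n  id='ctu'>12:49:46</strong>",          # newline, single quotes
);

# A pattern tuned to the first variant only.
my $re = qr{<strong id="ctu">(.*?)</strong>};

# Record which variants the pattern still matches.
my @matched = map { $_ =~ $re ? 1 : 0 } @variants;
print join(",", @matched), "\n";   # prints 1,0,0
```

Only the first spelling matches, although all three describe the same element.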


(2) Hard to maintain

Regular-expression-based scrapers are good only when they're used in write-only scripts


(3) Improper HTML & encoding handling

I &hearts; Shibuya

> perl -e '$c =~ m@(.*?)@ and print $1'
I &hearts; Shibuya

I &hearts; Shibuya

> perl -MHTML::Entities -e '$c =~ m@(.*?)@ and print decode_entities($1)'
I ♥ Shibuya

Perl が大好き！

> perl -MHTML::Entities -MEncode -e '$c =~ m@(.*?)@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
Perl が大好き！
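
The "Wide character" warning above appears because decoded characters are printed without being encoded again. A minimal sketch of the usual decode-on-input, encode-on-output pattern, using only the core Encode module, might look like this. The byte string is the UTF-8 form of the sample text, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they would arrive over HTTP (UTF-8 encoded sample text).
my $bytes = "Perl \xe3\x81\x8c\xe5\xa4\xa7\xe5\xa5\xbd\xe3\x81\x8d\xef\xbc\x81";
my $n_bytes = length $bytes;

# Decode bytes into a character string before doing any text processing.
my $chars   = decode('UTF-8', $bytes);
my $n_chars = length $chars;   # counts characters, not bytes

# Encode explicitly on the way out instead of printing wide characters.
print encode('UTF-8', "$chars ($n_chars chars, $n_bytes bytes)"), "\n";
```

Encoding at the output boundary keeps the warning away; setting an encoding layer on STDOUT with binmode would be an alternative.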


The "right" way of screen-scraping


(1), (2) Maintainable, less fragile


Use XPath and CSS Selectors


XPath

HTML::TreeBuilder::XPath
XML::LibXML


XPath


use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
# Monday, August 27, 2007 at 12:49:46


CSS Selectors

"XPath for HTML coders"
"XPath for people who hate XML"


CSS Selectors

body { font-size: 12px; }

div.article { padding: 1em }

span#count { color: #fff }


XPath: //strong[@id="ctu"]

CSS Selector: strong#ctu
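
The correspondence between the two notations can be explored with HTML::Selector::XPath, the module used later in this talk. This is a small sketch that assumes the module is installed from CPAN. The exact XPath text it emits may differ in quoting or form from the hand-written expression above, so no output is shown.

```perl
use strict;
use warnings;
use HTML::Selector::XPath qw(selector_to_xpath);

# Translate a few CSS selectors (taken from these slides) into XPath.
my @selectors = ('strong#ctu', 'div.article', 'ul.sites > li > a');
my %xpath_for = map { $_ => selector_to_xpath($_) } @selectors;

printf "%-20s => %s\n", $_, $xpath_for{$_} for @selectors;
```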


CSS Selectors


use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath "strong#ctu";
print $tree->findnodes($xpath)->shift->as_text;
# Monday, August 27, 2007 at 12:49:46


Complete Script

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;


Robust, maintainable, and sane character handling

Example (before)


> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46


Example (after)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);




but … long and boring



with Web::Scraper


Web scraping toolkit inspired by scrapi.rb

DSL-ish


Example (before)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);




Example (after)

#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use URI;

my $s = scraper {
    process "strong#ctu", time => 'TEXT';
    result 'time';
};

my $uri = URI->new("http://timeanddate.com/worldclock/");
print $s->scrape($uri);


Basics

use Web::Scraper;

my $s = scraper {
    # DSL goes here
};

my $res = $s->scrape($uri);


process

process $selector,
    $key => $what,
    …;


$selector: CSS Selector or XPath (starts with /)


$key: key for the result hash; append "[]" for looping


$what:
'@attr'
'TEXT'
'RAW'
Web::Scraper
sub { … }
Hash reference

<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>


process "ul.sites > li > a",
    'urls[]' => '@href';
# { urls => [ … ] }



process '//ul[@class="sites"]/li/a',
    'names[]' => 'TEXT';
# { names => [ 'OpenGuides', … ] }



process "ul.sites > li",
    'sites[]' => scraper {
        process 'a',
            link => '@href', name => 'TEXT';
    };
# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };




process "ul.sites > li > a",
    'sites[]' => sub {
        # $_ is HTML::Element
        +{ link => $_->attr('href'), name => $_->as_text };
    };
# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };




process "ul.sites > li > a",
    'sites[]' => {
        link => '@href', name => 'TEXT',
    };
# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };
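
The process variants above can be tried offline by passing the sample markup to scrape as a scalar reference instead of a URI. This is a minimal sketch that assumes Web::Scraper is installed; the variable names are illustrative.

```perl
use strict;
use warnings;
use Web::Scraper;

# The sample list from the slides, inlined so no network access is needed.
my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # Attribute values, collected into an array because of "[]".
    process "ul.sites > li > a", 'urls[]'  => '@href';
    # Text content of each matched element.
    process "ul.sites > li > a", 'names[]' => 'TEXT';
    # One hash per matched element, built from a hash reference.
    process "ul.sites > li > a", 'sites[]' => { link => '@href', name => 'TEXT' };
};

# scrape() also accepts a reference to an HTML string.
my $res = $s->scrape(\$html);
print "$_->{name}: $_->{link}\n" for @{ $res->{sites} };
```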



result

result;        # get stash as hashref (default)
result @keys;  # get stash as hashref containing @keys
result $key;   # get value of stash $key;

my $s = scraper {
    process …;
    process …;
    result 'foo', 'bar';
};
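
The three forms of result can be compared side by side on a small document. This is a sketch that assumes Web::Scraper is installed; the HTML snippet and the key names (title, author, year) are made up for illustration.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup, used only to exercise the three result forms.
my $html = '<h1 id="t">Hello</h1><p class="by">miyagawa</p><p class="yr">2007</p>';

# No result call: scrape returns the whole stash as a hashref.
my $all = scraper {
    process "h1#t", title  => 'TEXT';
    process "p.by", author => 'TEXT';
    process "p.yr", year   => 'TEXT';
};

# Several keys: a hashref restricted to those keys.
my $some = scraper {
    process "h1#t", title  => 'TEXT';
    process "p.by", author => 'TEXT';
    process "p.yr", year   => 'TEXT';
    result 'title', 'author';
};

# A single key: the bare value stored under that key.
my $one = scraper {
    process "h1#t", title => 'TEXT';
    result 'title';
};

my $r_all  = $all->scrape(\$html);
my $r_some = $some->scrape(\$html);
my $r_one  = $one->scrape(\$html);

print join(",", sort keys %$r_all), "\n";
print join(",", sort keys %$r_some), "\n";
print "$r_one\n";
```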


Live Demo


Tools


> cpan Web::Scraper

comes with 'scraper' CLI


> scraper http://example.com/
scraper> process "a", "links[]" => '@href';
scraper> d
$VAR1 = {
    links => [
        'http://example.org/',
        'http://example.net/',
    ],
};
scraper> y
---
links:
  - http://example.org/
  - http://example.net/


> scraper /path/to/foo.html

> GET http://example.com/ | scraper


Recent Updates


0.13: 'c' and 'c all'

WARN in scraper


0.14: automatic absolute URI for link elements (a@href, img@src)


0.14 (cont.): 'RAW' and 'HTML'


0.15: $Web::Scraper::UserAgent

$scraper->user_agent


0.19: support encoding detection w/ META tags


TODO


Web::Scraper needs documentation


More examples to put in eg/ directory


Alternative API inspired by scRUBYt!


OO Backend API if you don't like the DSL


integrate with WWW::Mechanize and Test::WWW::Declare


XPath Auto-suggestion off of DOM + element

DOM + XPath => Element
DOM + Element => XPath?

(Template::Extract?)


generic XML support (e.g. RSS/Atom feeds)

extensible text filter: date, geo, hCards (microformats)

October 1st, 2007 17:13:31 +0900

process ".entry-date", date => 'TEXT:rfc822';


Summary


Web::Scraper, inspired by scrapi


easy, fun, maintainable & less fragile


CSS selector and XPath


Questions?


Thank you

http://search.cpan.org/dist/Web-Scraper
http://www.slideshare.net/miyagawa/webscraper
