Introducing a Parallel Distributed Crawler with Dynamic HTML Scraping
Introducing a parallel distributed crawler that supports dynamic HTML scraping. Presented at Sapporo RubyKaigi 02 (札幌Ruby会議02).
![Page 1: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/1.jpg)
Introducing a Parallel Distributed Crawler with Dynamic HTML Scraping
![Page 2: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/2.jpg)
Before that,
✓ Kei Shiratsuchi (白土 慧)
✓ id: kei-s
✓ @kei_s
![Page 3: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/3.jpg)
Before that,
✓ Born in Sapporo, now living in Tokyo
✓ RubyKaigi2009 staff, with Ruby Sapporo
✓ I like Ruby & JavaScript
![Page 4: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/4.jpg)
Science and Engineering
Sponsored by
![Page 5: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/5.jpg)
I want ...
![Page 6: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/6.jpg)
A lot of web data!!
![Page 7: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/7.jpg)
How to gather it?
✓ Web API
✓ HTML scraping
How to scrape HTML?
✓ Mechanize & Nokogiri (鋸, "saw")
![Page 8: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/8.jpg)
But,
How about Dynamic HTML?
![Page 9: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/9.jpg)
![Page 10: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/10.jpg)
in HTML source
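To make the problem concrete, here is a minimal sketch, assuming a page that builds its results with JavaScript. The page source below is made up for illustration; it is not taken from the site shown in the slides.

```ruby
# Hypothetical page source: the images are created by a script,
# not written into the markup as <img> tags.
source = <<~HTML
  <html><body>
    <div id="results"></div>
    <script>
      var urls = ["a.jpg", "b.jpg"];
      for (var i = 0; i < urls.length; i++) {
        var img = document.createElement("img");
        img.src = urls[i];
        document.getElementById("results").appendChild(img);
      }
    </script>
  </body></html>
HTML

# A static scan of the source finds no <img> tags,
# although a browser that runs the script would show two images.
static_imgs = source.scan(/<img\b[^>]*>/i)
puts "img tags found in source: #{static_imgs.size}"
```

A static parser only sees the markup as delivered, so the data has to be read after a browser has evaluated the script.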
![Page 11: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/11.jpg)
The parallel distributed crawler with dynamic HTML scraping
Greasi
http://github.com/kei-s/greasi
![Page 12: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/12.jpg)
DEMO
![Page 13: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/13.jpg)
Outline of Crawler
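[Diagram: Server and Clients. The server sends a URL to each client; the client accesses the web page, reads data from its DOM, and sends the data back to the server. Several clients (・・・) work the same way.]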
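The outline above can be sketched as a toy simulation. Everything here is an assumption for illustration: `ToyServer`, `FAKE_WEB`, and `run_client` are hypothetical names, and the real client is a browser, not Ruby code.

```ruby
# Toy model of the loop: the server hands out a URL, a client loads the page,
# reads data from it, and posts the data back in exchange for the next URL.
class ToyServer
  attr_reader :stored

  def initialize(urls)
    @pending = urls.dup
    @stored = {}
  end

  # Store data for url (if given) and return the next URL, or nil when done.
  def post(url, data)
    @stored[url] = data if url
    @pending.shift
  end
end

# Stand-in for "the browser renders the page and a script reads the DOM".
FAKE_WEB = {
  'http://example.com/1' => ['a.jpg'],
  'http://example.com/2' => ['b.jpg', 'c.jpg']
}

def run_client(server)
  url = server.post(nil, nil) # assumed first request: only asks for work
  while url
    data = FAKE_WEB.fetch(url, [])
    url = server.post(url, data)
  end
end

server = ToyServer.new(FAKE_WEB.keys)
run_client(server)
puts server.stored.inspect
```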
![Page 14: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/14.jpg)
Server Side : Requirements
✓Receive and Store data
![Page 15: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/15.jpg)
My Choice
![Page 16: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/16.jpg)
Client Side : Requirements
✓ Evaluate dynamic HTML like a browser
What would you choose?
![Page 17: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/17.jpg)
My Choice
![Page 18: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/18.jpg)
How does it work?
![Page 19: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/19.jpg)
Server Side: Code Snippets
require 'rubygems'
require 'sinatra'

post '/' do
  url = params[:url]
  data = params[:data]
  store(url, data)
  next_url = process(url)
  next_url
end
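The snippet calls `store` and `process` without showing them. A minimal sketch of what they might look like, assuming an in-memory hash and a fixed list of pending URLs (both hypothetical; the slides do not say how Greasi implements them):

```ruby
require 'json'

# Hypothetical in-memory stand-ins; a real crawler would likely use a database.
STORED = {}                                                     # url => parsed data
PENDING_URLS = ['http://example.com/a', 'http://example.com/b'] # URLs still to crawl

# Save the JSON payload that a client posted for a URL.
def store(url, data)
  STORED[url] = JSON.parse(data)
end

# Return the next URL to hand back to the client.
# This sketch ignores the finished URL and returns "" when nothing is left.
def process(_url)
  PENDING_URLS.shift.to_s
end
```

Whatever `process` returns becomes the response body, which the client then navigates to.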
![Page 20: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/20.jpg)
Client Side: Code Snippets
// ==UserScript==
// @name greasi_scraper
// @namespace http://libelabo.jp/
// @include http://images.google.co.jp/*
// @require http://ajax.googleapis.com/ajax/libs/jquery/1.3.1/jquery.min.js
// ==/UserScript==

function postData(data) {
  var postData = $.param({url: location.href, data: JSON.stringify(data)});
  GM_xmlhttpRequest({
    method: "POST",
    url: "http://libelabo.jp/greasi/",
    headers: {'Content-type': 'application/x-www-form-urlencoded'},
    data: postData,
    onload: function(xhr) { location.href = xhr.responseText }
  });
}
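The client sends a form-encoded body with two fields, `url` and `data`, where `data` is a JSON string. A short sketch of that wire format as seen from Ruby, using only the standard library. The sample URL and image names are made up.

```ruby
require 'json'
require 'uri'

# Build a body shaped like the one the user script posts:
# two form fields, with the scraped data serialized as JSON.
body = URI.encode_www_form(
  url: 'http://images.google.co.jp/images?q=ruby',
  data: JSON.generate(['a.jpg', 'b.jpg'])
)

# Decode it the way a server would: form fields first, then the JSON payload.
params = URI.decode_www_form(body).to_h
images = JSON.parse(params['data'])

puts params['url']
puts images.inspect
```

Sinatra does the form decoding itself, so in the server snippet only the `JSON.parse` step would be left to the `store` helper.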
![Page 21: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/21.jpg)
How to Parallelize?
![Page 22: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/22.jpg)
How to Parallelize?
Add Tabs :)
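Why adding tabs is enough: every tab runs the same loop against the same server, so the server's URL list acts as a shared work queue. A toy sketch with threads standing in for tabs (an analogy only; the URLs and the tab count are made up):

```ruby
# Shared work queue, standing in for the server's list of URLs.
all_urls = %w[
  http://example.com/1 http://example.com/2
  http://example.com/3 http://example.com/4
]
urls = Queue.new
all_urls.each { |u| urls << u }

results = {}       # url => id of the "tab" that handled it
lock = Mutex.new

# Each thread plays the role of one browser tab running the same loop.
tabs = 3.times.map do |tab_id|
  Thread.new do
    loop do
      url = begin
        urls.pop(true) # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      lock.synchronize { results[url] = tab_id }
    end
  end
end
tabs.each(&:join)

puts "#{results.size} pages handled by #{results.values.uniq.size} tab(s)"
```

No client needs to know about the others, which is also why distributing the work only takes more browsers on more machines.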
![Page 23: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/23.jpg)
How to Distribute?
![Page 24: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/24.jpg)
How to Distribute?
Install Firefox :)
![Page 25: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/25.jpg)
Summary
✓With Nice Products,
![Page 26: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/26.jpg)
Summary
✓ Make it quickly and easily (“サクッと”)!
✓ Use it quickly and easily (“サクッと”)!
![Page 27: Introduce of the parallel distributed Crawler with scraping Dynamic HTML](https://reader036.vdocuments.net/reader036/viewer/2022081403/554f3ce7b4c905cd048b51a1/html5/thumbnails/27.jpg)
Thank you!