從零開始的爬蟲之旅 crawler from zero

RubyConf Taiwan 2016

Upload: shi-ken-don

Post on 19-Jan-2017


TRANSCRIPT

Page 1: 從零開始的爬蟲之旅 Crawler from zero

Crawler from zero, RubyConf Taiwan 2016

Page 2:

About Me

• a.k.a. Shi-Ken Don

• 2009 - 2014

• Ruby Developer since 2013

• 2014 - 2015

• Web Developer at Backer-Founder since 2015


Page 4:

Agenda

• What I did

• Why Ruby?

• Comparison

• Know-how

Page 5:

Crawler Architecture

Wikipedia:File:WebCrawlerArchitecture.svg


Page 7:

CrowdTrail

:D

Page 8:

CrowdTrail

• 14

• 2

Page 9:

DEMO

CrowdTrail :D

>>>> https://goo.gl/sWfDBc <<<<

Page 10:

• 5000

• Web

• 0

Page 11:

• 5000

• Web

• 0

Kickstarter changed its page layout twice along the way T_T

Page 12:

• JRuby

• Ruby

Page 13:

Ruby

Page 14:

Ruby

Ruby

Page 15:

Ruby

Ruby

Why, or Why not?

Page 16:

Most crawler tutorials are written in Python, so why Ruby instead of Python?

Page 17:

Ruby

Page 18:

Ruby

(Scrape web content)

Page 19:

Network flow

[Screenshot: Google Chrome Developer Tools network panel; 1.75 / 1.34, 76%]

Page 20:

Ruby vs Python

                      user       system     total      real
Ruby Thread.new       16.090000   1.010000  17.100000  ( 17.254499)
Ruby Parallel         16.900000   1.100000  18.000000  ( 18.080813)
Python threading       0.000000   0.000000  16.360000  ( 16.532583)

1,000 threads, each fetching www.facebook.com

ruby_thread_tests.rb / threading_test.py

Python comes out only marginally faster than Ruby here.
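The shape of these thread benchmarks can be sketched as below. This is an illustrative reconstruction, not the speaker's actual ruby_thread_tests.rb: `sleep` stands in for the network round trip so the sketch runs offline, and the thread count is reduced from 1,000 to 10.

```ruby
require "benchmark"

# Stand-in for one HTTP round trip (the real scripts hit
# www.facebook.com); sleep keeps the timing reproducible offline.
def simulated_fetch
  sleep 0.05
end

# Serial: requests run one after another.
serial = Benchmark.realtime { 10.times { simulated_fetch } }

# Threaded: Thread.new lets the waits overlap, which is why the
# real-time column above barely grows with the thread count.
threaded = Benchmark.realtime do
  10.times.map { Thread.new { simulated_fetch } }.each(&:join)
end

puts format("serial: %.2fs, threaded: %.2fs", serial, threaded)
```

With real sockets the gap between languages is dominated by I/O wait, which is why the Ruby and Python real-time numbers above land within a second of each other.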

Page 21:

                               user      system    total     real
Thread.new (parse and create)  5.640000  0.580000  6.220000  ( 6.073333)
Thread.new (parse only)        5.060000  0.440000  5.500000  ( 5.455837)
Thread.new (create directly)   3.340000  0.520000  3.860000  ( 5.169519)

ruby_thread_sidekiq_test.rb

100 threads, each fetching www.facebook.com

parse only

Page 22:

• ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds (waited 5.009 seconds)

Page 23:

database.yml

default: &default
  adapter: postgresql
  encoding: unicode
  pool: 25 # Increase this

Page 24:

Thread

begin
  # …
  ProjectLog.create!(title: title)
ensure
  ActiveRecord::Base.clear_active_connections!
end

Page 25:

Sidekiq

• 200 8

• Ruby Thread

• Auto scaling

Page 26:

Heroku Auto Scaling

heroku.rake

task :auto_scaling, [:WORKER_NAME] => :environment do |_t, args|
  args.with_defaults(WORKER_NAME: "worker")
  APP_NAME = ENV["HEROKU_APP_NAME"]
  WORKER_NAME = args[:WORKER_NAME]

  heroku = Heroku::API.new
  queues = Sidekiq::Queue.all
  queues_size = queues.map { |queue| Sidekiq::Queue.new(queue.name).size }
                      .inject(0, :+)

  # 2X dyno: 600 jobs
  # 50 parse project_log
  # jobs: 45
  now_minutes = Time.now.strftime("%M").to_i
  left_minutes = now_minutes.between?(0, 45) ? 45 - now_minutes : 0
  workers_size = queues_size / 500 / [left_minutes, 1].max
  workers_size = 1 if workers_size < 1
  workers_size = 10 if workers_size > 10 # cap at 10 workers
  puts "Scaling #{WORKER_NAME} dyno count to #{workers_size}"
  heroku.post_ps_scale(APP_NAME, WORKER_NAME, workers_size)
end

Page 27:

Sidekiq

• PostgreSQL > Redis > Sidekiq connections

Each Sidekiq worker thread holds its own Redis connection, so every extra thread costs a connection at each layer.
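A back-of-envelope sizing check for that rule. The helper name and the headroom constants are my own illustration, not from the talk:

```ruby
# Derive minimum pool sizes from Sidekiq's thread count so that
# PostgreSQL connections > Redis connections > Sidekiq threads.
def pool_sizes(sidekiq_concurrency:, processes: 1)
  redis_pool = sidekiq_concurrency + 5   # headroom for Sidekiq's own clients
  pg_pool    = sidekiq_concurrency + 2   # per-process ActiveRecord pool
  {
    redis:          redis_pool,
    pg_per_process: pg_pool,
    pg_total:       pg_pool * processes  # must stay under the plan's cap
  }
end

sizes = pool_sizes(sidekiq_concurrency: 25, processes: 8)
# 8 worker processes at concurrency 25 already need 216 PostgreSQL
# connections, over the 200-connection cap mentioned on a later slide.
```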

Page 28:

Heroku PostgreSQL Pricing

Page 29:

PostgreSQL Standard 2

• Rails App DB

• Rake Task DB

• Restarting

• Sidekiq MAX connections = 200

Page 30:

• File descriptor

✦ MacOS: 256 (default)

✦ Linux: 1024 (default)

✦ Windows: who cares

On Linux, a crawler hits the 1024 file-descriptor limit long before CPU or RAM become the bottleneck.

Page 31:

Linux

• Check the system-wide limit:

• cat /proc/sys/fs/file-max

• Raise it:

• sysctl -w fs.file-max=100000

Page 32:

Know-how

Page 33:

• A Heroku dyno can only run a limited number of threads (around 500)

• See the Heroku Dev Center: dynos#process-thread-limits

Page 34:

Queue (Weight): Sidekiq picks the next job from its queues in proportion to their weights.
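Queue weights are declared in Sidekiq's config file. A minimal sketch, assuming queue names per crowdfunding platform; the names and weights here are illustrative, not the project's actual configuration:

```yaml
# config/sidekiq.yml -- a weight is a relative polling frequency:
# a weight-5 queue is checked five times as often as a weight-1 queue,
# so its jobs are picked up sooner when several queues have work.
:concurrency: 25
:queues:
  - [kickstarter, 5]
  - [indiegogo, 3]
  - [default, 1]
```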

Page 35:

Queue

Project.find_each do |project|
  # Route each snapshot job to the Sidekiq queue of its platform
  name = project.platform.name
  SnapshotWorker.sidekiq_options_hash["queue"] = name
  SnapshotWorker.perform_async(…)
end

Page 36:

Retry

sidekiq_options_hash["retry"] += 1

self.class.sidekiq_retry_in do |count|
  Random.rand(retry_after + count)
end

Page 37:

• User-Agent

• Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)

• Google Bot

• Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Page 38:

• Google Bot

• Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

• For SEO's sake, sites rarely block Googlebot ψ( ∇´)ψ

Page 39:

User-Agent:

DEFAULT_USER_AGENT =
  "Mozilla/5.0 (compatible; CrowdTrail/1.0; +https://crowdwatch.tw/)"

def random_user_agent_string
  format(
    "%s Random/0.%d.%d",
    DEFAULT_USER_AGENT, Random.rand(100), Random.rand(100)
  )
end

HTTParty.get("https://www.facebook.com",
             headers: { "User-Agent" => random_user_agent_string })

Page 40:

• Some servers read the client IP from a request header, which can be spoofed:

• X-Forwarded-For

• X-Real-IP

• CF-Connecting-IP

Page 41:

X-Forwarded-For

def random_x_forwarded_for
  format(
    "140.118.%d.%d",
    Random.rand(255), Random.rand(255)
  )
end

HTTParty.get("https://www.facebook.com",
             headers: { "X-Forwarded-For" => random_x_forwarded_for })

Page 42:

Proxy

• Route requests through a proxy server

• Free proxy lists are easy to find, e.g.:

• http://txt.proxyspy.net/proxy.txt
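A sketch of how such a list could be consumed. The helper name is my own, and the list format is assumed to be one `host:port ...` entry per line, possibly with comment lines:

```ruby
# Pick a random "host:port" entry from a plain-text proxy list such as
# the one linked above; non-entry lines are skipped.
def random_proxy(list_text)
  entry = list_text.lines.map(&:strip)
                   .grep(/\A\d{1,3}(?:\.\d{1,3}){3}:\d+/)
                   .sample
  host, port = entry.split(/\s+/).first.split(":")
  [host, Integer(port)]
end

host, port = random_proxy("# comment line\n1.2.3.4:8080 US-A\n5.6.7.8:3128 DE-N\n")

# HTTParty takes the proxy per request; uncomment to route through it:
# HTTParty.get("https://www.example.com",
#              http_proxyaddr: host, http_proxyport: port)
```

Free proxies fail often, so in practice the request is wrapped in a retry that draws a fresh proxy on each attempt.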

Page 43:

CAPTCHA

• Simple CAPTCHAs can be OCR'd from Ruby

e.g. ruby-tesseract-ocr

• Harder ones can be outsourced to a solving service such as antigate.com

(even reCAPTCHA can be passed through antigate)

Page 44:

Parser Know-how

Page 45:

Don't rely on positional indexes

# Bad
doc.css('.tab')[2].text

# Good
doc.css('.tab').text[/ (\d+) /, 1]

Combine the DOM parser with regular expressions instead of depending on element order.

Page 46:

Use Integer() / Float() instead of to_i

# Bad
doc.css('.tab .pledged').text.to_i

# Good
Integer(doc.css('.tab .pledged').text)

to_i silently turns nil or garbage into 0; Integer() raises, so a broken scrape is noticed.
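A quick irb-style check of the difference:

```ruby
# to_i swallows garbage: a failed scrape quietly becomes the number 0.
p "".to_i          # 0
p "oops".to_i      # 0

# Integer() raises instead, so the broken selector is noticed immediately.
p Integer("1234")  # 1234
begin
  Integer("oops")
rescue ArgumentError => e
  p e.class        # ArgumentError
end
```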

Page 47:

THANK YOU

Follow me on:

https://github.com/shikendon
https://medium.com/@shikendon
https://www.facebook.com/zxuandon