Crawler from Zero: A Crawling Journey Starting from Scratch (從零開始的爬蟲之旅)

Crawler from [email protected] Taiwan 2016

About Me

• a.k.a. Shi-Ken Don

• 2009 - 2014

• Ruby Developer since 2013

• 2014 - 2015

• Web Developer at Backer-Founder since 2015

200 20

Agenda

• What I did

• Why Ruby? Ruby

• Comparison

• Know-how

Crawler Architecture

Wikipedia:File:WebCrawlerArchitecture.svg

CrowdTrail

:D

CrowdTrail

• 14

• 2

•

DEMO

CrowdTrail :D

>>>> https://goo.gl/sWfDBc <<<<

• 5000

• Web

• 0

Kickstarter 2 T T

•

• JRuby

• Ruby

Ruby

Why, or Why not?

(Crawler) Python Python Ruby

Ruby

(Scrape web content)

Network flow

Google Chrome Developer Tools screenshot

1.75 1.34 76%

Ruby Python

                       user     system      total        real
Ruby Thread.new   16.090000   1.010000  17.100000 ( 17.254499)
Ruby Parallel     16.900000   1.100000  18.000000 ( 18.080813)
Python threading   0.000000   0.000000  16.360000 ( 16.532583)

1000 threads requesting www.facebook.com

ruby_thread_tests.rb

threading_test.py

Python Ruby

https://gist.github.com/shikendon/b9c0e0041c8cec11e3f6bbe9fa987c0b

https://gist.github.com/shikendon/e47dcc03cb8f31048049c55bad533b61
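A minimal sketch of the kind of thread benchmark compared above. It does not reproduce the linked gists: `fake_request` and `run_threads` are hypothetical names, the request is simulated with `sleep` instead of hitting www.facebook.com, and the thread count is reduced so the code runs offline.

```ruby
require "benchmark"

# Hypothetical stand-in for an HTTP request: sleeping releases the GVL
# in the same way blocking network I/O does.
def fake_request(latency = 0.05)
  sleep(latency)
  :ok
end

# Run `count` simulated requests, one thread each, and collect the results.
def run_threads(count)
  threads = Array.new(count) { Thread.new { fake_request } }
  threads.map(&:value)
end

elapsed = Benchmark.realtime { run_threads(100) }
puts format("100 threads finished in %.3fs", elapsed)
```

Because the threads spend their time waiting rather than computing, the wall-clock time stays close to one request's latency instead of growing with the number of requests.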

                                    user     system      total        real
Thread.new (parse and create)   5.640000   0.580000   6.220000 (  6.073333)
Thread.new (parse only)         5.060000   0.440000   5.500000 (  5.455837)
Thread.new (create directly)    3.340000   0.520000   3.860000 (  5.169519)

ruby_thread_sidekiq_test.rb

100 threads requesting www.facebook.com

parse only

https://gist.github.com/shikendon/8ea80f1132171e6fa14abc95725f55f3

• ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds (waited 5.009 seconds)

database.yml

default: &default
  adapter: postgresql
  encoding: unicode
  pool: 25 # Increase this

Thread

begin
  # …
  ProjectLog.create!(title: title)
ensure
  ActiveRecord::Base.clear_active_connections!
end

Sidekiq

• 200 8

• Ruby Thread

• Auto scaling

Heroku Auto Scaling

heroku.rake

task :auto_scaling, [:WORKER_NAME] => :environment do |_t, args|
  args.with_defaults(WORKER_NAME: "worker")
  APP_NAME = ENV["HEROKU_APP_NAME"]
  WORKER_NAME = args[:WORKER_NAME]

  heroku = Heroku::API.new
  queues = Sidekiq::Queue.all
  # Total number of jobs waiting across all Sidekiq queues
  queues_size = queues.map { |queue| Sidekiq::Queue.new(queue.name).size }.inject(0, :+)

  # 2X dyno 600 jobs
  # 50 parse project_log
  # Minutes left until :45 of the current hour (0 once past :45)
  now_minutes = Time.now.strftime("%M").to_i
  left_minutes = now_minutes.between?(0, 45) ? 45 - now_minutes : 0
  # workers = queued jobs / 500 / minutes left
  workers_size = queues_size / 500 / [left_minutes, 1].max
  workers_size = 1 if workers_size < 1
  workers_size = 10 if workers_size > 10 # Cap at 10 workers
  puts "Scaling #{WORKER_NAME} dyno count to #{workers_size}"
  heroku.post_ps_scale(APP_NAME, WORKER_NAME, workers_size)
end

https://gist.github.com/shikendon/22d4dc9a3821a924898fd7de2b7fa56a
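The scaling arithmetic in the rake task can be checked in isolation. The sketch below pulls it into a pure function; `workers_needed` is a hypothetical name, and the queue sizes are made-up inputs, not measurements from the talk.

```ruby
# Returns how many worker dynos to request, following the arithmetic in the
# rake task: queue size / 500 / minutes left until :45, clamped to 1..10.
def workers_needed(queue_size, minute_of_hour)
  left_minutes = minute_of_hour.between?(0, 45) ? 45 - minute_of_hour : 0
  (queue_size / 500 / [left_minutes, 1].max).clamp(1, 10)
end

puts workers_needed(30_000, 30)  # 30000 / 500 / 15 => 4
puts workers_needed(1_000, 50)   # past :45, so 1000 / 500 / 1 => 2
puts workers_needed(100, 0)      # integer division gives 0, raised to 1
puts workers_needed(900_000, 44) # 1800 / 1 => 1800, capped at 10
```

Note that every division is integer division, so small queues always fall through to the minimum of one worker.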

Sidekiq

• PostgreSQL > Redis > Sidekiq connections

Sidekiq Redis Thread Redis Thread

Heroku PostgreSQL Pricing

PostgreSQL Standard 2

• Rails App DB

• Rake Task DB

• Restarting

• Sidekiq MAX connections = 200

• File descriptor

✦ MacOS: 256 (default)

✦ Linux: 1024 (default)

✦ Windows: who cares

Linux File descriptor 1024

CPU RAM

Linux

• #

• cat /proc/sys/fs/file-max

• #

• sysctl -w fs.file-max=100000
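`fs.file-max` is the system-wide ceiling; the 1024 default is the per-process soft limit that `ulimit -n` reports. A short sketch of reading that per-process limit from inside Ruby follows. `usable_sockets` and its `reserved` default of 64 are hypothetical, chosen only for illustration.

```ruby
# Soft and hard limits on open file descriptors for the current process.
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "open files: soft=#{soft} hard=#{hard}"

# Hypothetical helper: how many sockets a crawler could hold open if it keeps
# `reserved` descriptors free for logs, the database, Redis, and so on.
def usable_sockets(soft_limit, reserved = 64)
  [soft_limit - reserved, 0].max
end

puts usable_sockets(soft)
```

Checking the limit at boot makes it easier to pick a thread count that will not fail with "Too many open files" under load.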

Know-how

•

• Heroku dyno 500 Thread

• dynos#process-thread-limits

https://devcenter.heroku.com/articles/dynos#process-thread-limits

Queue (Weight)Sidekiq Queue Job

Queue

Project.find_each do |project|
  # Sidekiq queue
  name = project.platform.name
  SnapshotWorker.sidekiq_options_hash["queue"] = name
  SnapshotWorker.perform_async(…)
end

Retry

sidekiq_options_hash["retry"] += 1
self.class.sidekiq_retry_in do |count|
  Random.rand(retry_after + count)
end
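The retry block picks a random delay so that failed jobs do not all come back at the same moment. The sketch below shows that calculation on its own, without Sidekiq; `retry_delay` and the value 60 for `retry_after` are assumptions made for illustration.

```ruby
# Hypothetical helper mirroring the block above: pick a random delay, in
# seconds, below `retry_after + count` so retries do not fire in lockstep.
def retry_delay(retry_after, count, rng = Random.new)
  rng.rand(retry_after + count)
end

# Delays for the first five retries of a job told to retry after 60 seconds.
delays = (0...5).map { |count| retry_delay(60, count) }
puts delays.inspect
```

Passing a seeded `Random` as the third argument makes the delays reproducible in tests.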

• User-Agent

• Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)

• Google Bot

• Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

SEO Googlebot ψ( ∇´)ψ

User-Agent

DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; CrowdTrail/1.0; +https://crowdwatch.tw/)"

def random_user_agent_string
  format(
    "%s Random/0.%d.%d",
    DEFAULT_USER_AGENT, Random.rand(100), Random.rand(100)
  )
end

HTTParty.get("https://www.facebook.com", headers: { "User-Agent" => random_user_agent_string })

• header IP

• X-Forwarded-For

• X-Real-IP

• CF-Connecting-IP

X-Forwarded-For

def random_x_forward_for
  format(
    "140.118.%d.%d",
    Random.rand(255), Random.rand(255)
  )
end

HTTParty.get("https://www.facebook.com", headers: { "X-Forwarded-For" => random_x_forward_for })

Proxy

• Proxy Server

• Proxy

• http://txt.proxyspy.net/proxy.txt

CAPTCHA

• Ruby OCR

e.g. ruby-tesseract-ocr

• antigate.com

reCAPTCHA antigate

https://github.com/meh/ruby-tesseract-ocr

http://antigate.com

Parser Know-how

index

# Bad
doc.css('.tab')[2].text

# Good
doc.css('.tab').text[/ (\d+) /, 1]

DOM Parser Regular Expression
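A small sketch of why matching on content survives layout changes better than a positional index. It works on plain strings rather than a parsed document; the tab labels, numbers, and the `count_after` helper are all made up for illustration.

```ruby
# Flattened text of a hypothetical tab bar, before and after the site adds a tab.
tabs_before = "Story Updates 12 Comments 340"
tabs_after  = "Story FAQ Updates 12 Comments 340"

# Hypothetical helper: pull the number that follows a label, wherever it sits.
def count_after(label, text)
  text[/#{Regexp.escape(label)}\s+(\d+)/, 1]
end

puts count_after("Comments", tabs_before)       # "340"
puts count_after("Comments", tabs_after)        # "340", even though a tab was added
puts count_after("Backers", tabs_after).inspect # nil when the label is absent
```

An index such as `[2]` would have pointed at a different tab after the change, while the pattern either finds the right number or returns `nil`.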

Integer Float to_i

# Bad
doc.css('.tab .pledged').text.to_i

# Good
Integer(doc.css('.tab .pledged').text)

to_i silently turns nil and non-numeric text into 0
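The difference is easy to see on a few sample values. In the sketch below, `parse_amount` is a hypothetical helper and the input strings are invented; stripping the thousands separator is an assumption about how a pledged amount might be formatted.

```ruby
# to_i never fails; it quietly returns 0 or a truncated number.
puts nil.to_i         # 0
puts "".to_i          # 0
puts "sold out".to_i  # 0
puts "1,234".to_i     # 1 (stops at the comma)

# Hypothetical helper: strip thousands separators, then convert strictly.
def parse_amount(text)
  Integer(text.to_s.delete(","), 10)
end

puts parse_amount("1,234") # 1234

begin
  parse_amount("sold out")
rescue ArgumentError => e
  puts "parse failed: #{e.message}"
end
```

A raised `ArgumentError` shows up in the job's failure log, whereas a silent 0 ends up stored as if it were real data.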

THANK YOU

Follow me on

https://github.com/shikendon

https://medium.com/@shikendon

https://www.facebook.com/zxuandon
