Crawler from Zero: A Web Crawler's Journey from Scratch
Crawler from [email protected] Taiwan 2016
About Me
• a.k.a. Shi-Ken Don
• 2009 - 2014
• Ruby Developer since 2013
• 2014 - 2015
• Web Developer at Backer-Founder since 2015
200 20
Agenda
• What I did
• Why Ruby?
• Comparison
• Know-how
Crawler Architecture
Wikipedia:File:WebCrawlerArchitecture.svg
CrowdTrail
:D
CrowdTrail
• 14
• 2
•
DEMO
CrowdTrail :D
>>>> https://goo.gl/sWfDBc <<<<
• 5000
• Web
• 0
Kickstarter 2 T T
•
• JRuby
• Ruby
Ruby
Why, or Why not?
(Crawler) Python Python Ruby
Ruby
Ruby
(Scrape web content)
Network flow
Google Chrome Developer Tools screenshot
1.75 1.34 76%
Ruby Python
                        user     system      total        real
Ruby Thread.new    16.090000   1.010000  17.100000 ( 17.254499)
Ruby Parallel      16.900000   1.100000  18.000000 ( 18.080813)
Python threading    0.000000   0.000000  16.360000 ( 16.532583)
1000 threads requesting www.facebook.com
ruby_thread_tests.rb, threading_test.py
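The benchmark scripts named above are not reproduced in the slides. As a rough sketch of the Thread.new approach being measured, the following spawns one thread per URL and times the whole batch with the standard Benchmark module. `fetch_all`, `fake_fetch`, and the example.com URLs are hypothetical stand-ins so the sketch runs offline; the real test issued HTTP requests.

```ruby
require "benchmark"

# Runs the given block once per URL, each in its own thread, and
# returns the block results in the same order as the input.
def fetch_all(urls)
  urls.map { |url| Thread.new { yield url } }.map(&:value)
end

# Hypothetical stand-in for an HTTP GET (assumption: not the talk's code).
fake_fetch = ->(url) { sleep 0.01; "body of #{url}" }

urls = Array.new(100) { |i| "https://example.com/#{i}" }
bodies = nil
report = Benchmark.measure { bodies = fetch_all(urls, &fake_fetch) }

# Prints user / system / total / real, the same columns as the table above.
puts report
```

Because the threads spend their time waiting rather than computing, the wall-clock ("real") time stays far below the sum of the individual waits, which is why threads suit I/O-bound crawling.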
Python Ruby
                                   user     system      total        real
Thread.new (parse and create)  5.640000   0.580000   6.220000 (  6.073333)
Thread.new (parse only)        5.060000   0.440000   5.500000 (  5.455837)
Thread.new (create directly)   3.340000   0.520000   3.860000 (  5.169519)
ruby_thread_sidekiq_test.rb
100 threads requesting www.facebook.com
parse only
• ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds (waited 5.009 seconds)
database.yml
default: &default
  adapter: postgresql
  encoding: unicode
  pool: 25 # Increase this
Thread
begin
  # …
  ProjectLog.create!(title: title)
ensure
  ActiveRecord::Base.clear_active_connections!
end
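To illustrate why the release belongs in `ensure`, here is a minimal sketch using only the standard library. `TinyPool` is a hypothetical toy, not ActiveRecord's connection pool; it only mimics the idea that a fixed number of connections must be handed back even when a job raises.

```ruby
# Hypothetical fixed-size pool: checkout blocks until a connection is free.
class TinyPool
  def initialize(size)
    @free = Queue.new
    size.times { |i| @free << "conn-#{i}" }
  end

  # Yields a connection and always returns it to the pool afterwards.
  def with_connection
    conn = @free.pop
    yield conn
  ensure
    @free << conn if conn
  end

  def available
    @free.size
  end
end

pool = TinyPool.new(2)

# Ten threads share two connections; job 3 fails on purpose.
results = Array.new(10) do |i|
  Thread.new do
    pool.with_connection do |conn|
      raise "boom" if i == 3
      "job #{i} used #{conn}"
    end
  rescue RuntimeError => e
    "job #{i} failed: #{e.message}"
  end
end.map(&:value)

puts results
puts "connections back in pool: #{pool.available}"
```

Without the `ensure`, the failing job would keep its connection forever, and later threads would eventually block until they time out, much like the ConnectionTimeoutError shown earlier.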
Sidekiq
• 200 8
• Ruby Thread
• Auto scaling
Heroku Auto Scaling
heroku.rake
task :auto_scaling, [:WORKER_NAME] => :environment do |_t, args|
  args.with_defaults(WORKER_NAME: "worker")
  APP_NAME = ENV["HEROKU_APP_NAME"]
  WORKER_NAME = args[:WORKER_NAME]

  heroku = Heroku::API.new
  queues = Sidekiq::Queue.all
  queues_size = queues.map { |queue| Sidekiq::Queue.new(queue.name).size }.inject(0, :+)

  # 2X dyno 600 jobs
  # 50 parse project_log
  # Aim to finish the queued jobs by minute 45 of the hour
  now_minutes = Time.now.strftime("%M").to_i
  # queued jobs / 500 / minutes left
  left_minutes = now_minutes.between?(0, 45) ? 45 - now_minutes : 0
  workers_size = queues_size / 500 / [left_minutes, 1].max
  workers_size = 1 if workers_size < 1
  workers_size = 10 if workers_size > 10 # at most 10 workers
  puts "Scaling #{WORKER_NAME} dyno count to #{workers_size}"
  heroku.post_ps_scale(APP_NAME, WORKER_NAME, workers_size)
end
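The sizing arithmetic in the task can be pulled out into a pure function so it can be checked without touching Heroku or Sidekiq. This is only a sketch: `workers_needed` and its keyword names are hypothetical, and reading the constant 500 as "jobs one worker clears per minute" is an assumption drawn from the formula, not something the slides state.

```ruby
# Same integer arithmetic as the rake task: queued jobs / 500 / minutes left,
# clamped to the range 1..max_workers.
def workers_needed(queue_size, now_minute,
                   jobs_per_worker_minute: 500, # assumption: meaning of "500"
                   deadline_minute: 45,
                   max_workers: 10)
  left_minutes = now_minute.between?(0, deadline_minute) ? deadline_minute - now_minute : 0
  wanted = queue_size / jobs_per_worker_minute / [left_minutes, 1].max
  wanted.clamp(1, max_workers)
end

puts workers_needed(100_000, 5)   # 40 minutes left
puts workers_needed(100_000, 25)  # 20 minutes left
puts workers_needed(0, 10)        # empty queue still keeps one worker
```

Past minute 45 the divisor falls back to 1, so any sizeable backlog immediately requests the maximum number of workers.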
Sidekiq
• PostgreSQL > Redis > Sidekiq connections
Sidekiq Redis Thread Redis Thread
Heroku PostgreSQL Pricing
PostgreSQL Standard 2
• Rails App DB
• Rake Task DB
• Restarting
• Sidekiq MAX connections = 200
• File descriptor
✦ MacOS: 256 (default)
✦ Linux: 1024 (default)
✦ Windows: who cares
Linux File descriptor 1024
CPU RAM
Linux
• # Show the system-wide limit on open file handles
• cat /proc/sys/fs/file-max
• # Raise the limit (requires root)
• sysctl -w fs.file-max=100000
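Note that `fs.file-max` is the system-wide ceiling, while the 256 / 1024 defaults listed earlier are per-process limits. A small sketch for checking the per-process limit from inside Ruby follows; `fits_fd_budget?` and its `reserved` margin are hypothetical helpers for illustration, and `Process.getrlimit` is only available on Unix-like systems.

```ruby
# Per-process open-file limit as seen by this Ruby process.
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "open files: soft=#{soft} hard=#{hard}"

# Rough check: one socket per thread plus a margin for logs, DB, Redis, etc.
# The margin of 64 is an arbitrary assumption.
def fits_fd_budget?(threads, soft_limit, reserved: 64)
  threads + reserved <= soft_limit
end

puts fits_fd_budget?(1000, 1024) # 1000 sockets do not fit under 1024
puts fits_fd_budget?(200, 1024)  # 200 sockets leave plenty of room
```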
Know-how
•
• Heroku dyno 500 Thread
• dynos#process-thread-limits
Queue (Weight)Sidekiq Queue Job
Queue
Project.find_each do |project|
  # Use the platform name as the Sidekiq queue
  name = project.platform.name
  SnapshotWorker.sidekiq_options_hash["queue"] = name
  SnapshotWorker.perform_async(…)
end
Retry
sidekiq_options_hash["retry"] += 1
self.class.sidekiq_retry_in do |count|
  Random.rand(retry_after + count)
end
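The delay formula inside the `sidekiq_retry_in` block can be tried on its own. The sketch below isolates it as a plain function; `jittered_delay` is a hypothetical name, and treating `retry_after` as a positive number of seconds is an assumption, since the slide does not show where it comes from.

```ruby
# Random delay in seconds, drawn from 0...(retry_after + count).
# The randomness spreads retries out so failed jobs do not all hit the
# target site again at the same moment.
def jittered_delay(retry_after, count, rng: Random.new)
  rng.rand(retry_after + count)
end

rng = Random.new(42) # fixed seed only to make the sketch repeatable
delays = (0..4).map { |count| jittered_delay(30, count, rng: rng) }
puts delays.inspect
```

The upper bound grows by one second per attempt, so later retries are allowed to wait slightly longer than earlier ones.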
• User-Agent
• Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)
• Google Bot
• Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
SEO Googlebot ψ( ∇´)ψ
User-Agent
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; CrowdTrail/1.0; +https://crowdwatch.tw/)"

def random_user_agent_string
  format(
    "%s Random/0.%d.%d",
    DEFAULT_USER_AGENT,
    Random.rand(100),
    Random.rand(100)
  )
end

HTTParty.get("https://www.facebook.com", headers: { "User-Agent" => random_user_agent_string })
• header IP
• X-Forwarded-For
• X-Real-IP
• CF-Connecting-IP
X-Forwarded-For
def random_x_forward_for
  format(
    "140.118.%d.%d",
    Random.rand(255),
    Random.rand(255)
  )
end

# The standard header name is X-Forwarded-For
HTTParty.get("https://www.facebook.com", headers: { "X-Forwarded-For" => random_x_forward_for })
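For readers without HTTParty installed, the same two headers can be inspected offline with the standard library. This is a sketch only: `random_headers` is a hypothetical helper, and the request is built but never sent, so nothing here shows how a given server treats these headers.

```ruby
require "net/http"
require "uri"

DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; CrowdTrail/1.0; +https://crowdwatch.tw/)"

# Builds a randomized User-Agent suffix and a random address in 140.118.0.0/16.
# rand(256) covers 0..255; the slide's rand(255) stops at 254.
def random_headers(rng: Random.new)
  {
    "User-Agent" => format("%s Random/0.%d.%d", DEFAULT_USER_AGENT, rng.rand(100), rng.rand(100)),
    "X-Forwarded-For" => format("140.118.%d.%d", rng.rand(256), rng.rand(256))
  }
end

# Construct the request without sending it.
uri = URI("https://www.facebook.com/")
request = Net::HTTP::Get.new(uri, random_headers)

puts request["User-Agent"]
puts request["X-Forwarded-For"]
```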
CAPTCHA
• Ruby OCR
e.g. ruby-tesseract-ocr
• antigate.com
reCAPTCHA antigate
Parser Know-how
index
# Bad
doc.css('.tab')[2].text
# Good
doc.css('.tab').text[/ (\d+) /, 1]
DOM Parser Regular Expression
Prefer Integer() / Float() over to_i
# Bad
doc.css('.tab .pledged').text.to_i
# Good
Integer(doc.css('.tab .pledged').text)
to_i silently turns nil into 0
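Both parser tips can be tried on plain strings without a DOM library. In this sketch `extract_count` and the sample `page_text` are hypothetical; the labels "Backers" and "Pledged" are made up for illustration and do not come from any real page.

```ruby
# Finds the number that follows a label, then converts it strictly.
# Passing base 10 keeps a leading zero from being read as octal.
def extract_count(text, label)
  raw = text[/#{Regexp.escape(label)}\s*([\d,]+)/, 1]
  raise ArgumentError, "label #{label.inspect} not found" if raw.nil?

  Integer(raw.delete(","), 10)
end

page_text = "Backers 1,234 Pledged 56789" # hypothetical flattened page text

puts extract_count(page_text, "Backers")
puts extract_count(page_text, "Pledged")

# to_i hides a broken selector: both of these quietly become 0.
puts nil.to_i
puts "abc".to_i

# Integer() fails loudly instead, so a layout change is noticed at once.
begin
  Integer("abc")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Matching on the label rather than on an element's position means the parser keeps working if the site reorders its tabs, and it raises if the label disappears instead of recording a zero.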
THANK YOU
Follow me on
https://github.com/shikendon
https://medium.com/@shikendon
https://www.facebook.com/zxuandon