Scrapoxy Documentation
Release 3.1.1
Fabien Vauchelles
Aug 17, 2018

Contents

1 What is Scrapoxy ?
2 Documentation
3 Prerequisite
4 Contribute
5 License

1 What is Scrapoxy ?
Scrapoxy starts a pool of proxies to send your requests through.
Now, you can crawl without thinking about blacklisting!
It is written in Javascript (ES6) with Node.js & AngularJS and it is open source!
1.1 How does Scrapoxy work ?
1. When Scrapoxy starts, it creates and manages a pool of proxies.
2. Your scraper uses Scrapoxy as a normal proxy.
3. Scrapoxy routes all requests through a pool of proxies.
1.2 What does Scrapoxy do ?
• Create your own proxies
• Rotate IP addresses
• Impersonate known browsers
• Exclude blacklisted instances
• Monitor the requests
1.3 Why doesn't Scrapoxy support anti-blacklisting ?
Anti-blacklisting is a job for the scraper.
When the scraper detects blacklisting, it asks Scrapoxy to remove the proxy from the proxies pool (through a REST API).
1.4 What is the best scraper framework to use with Scrapoxy ?
You could use the open source Scrapy framework (Python).
1.5 Does Scrapoxy have a SaaS mode or a support plan ?
Scrapoxy is an open source tool. The source code is actively maintained. You are very welcome to open an issue for features or bugs.
If you are looking for a commercial product in SaaS mode or with a support plan, we recommend checking the ScrapingHub products (ScrapingHub is the company that maintains the Scrapy framework).
2 Documentation

You can begin with the Quick Start or look at the Changelog.

Then continue with the Standard usage and become an expert with the Advanced usage.

Finally, complete your knowledge with the Tutorials.
2.1 Quick Start

This tutorial works on AWS / EC2, in the region eu-west-1.
See the AWS / EC2 - Copy an AMI from a region to another if you want to change region.
2.1.1 Step 1: Get AWS credentials
See Get AWS credentials.
2.1.2 Step 2: Create a security group

See Create a security group.
2.1.3 Step 3A: Run Scrapoxy with Docker
Run the container:
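For example, a sketch assuming the public fabienvauchelles/scrapoxy image (you still need to provide your configuration, e.g. through a mounted conf.json or environment variables, as described in the image documentation):

sudo docker run -d -p 8888:8888 -p 8889:8889 fabienvauchelles/scrapoxy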
2.1.4 Step 3B: Run Scrapoxy without Docker

Install Node.js
The minimum required version is 4.2.1.
Install Scrapoxy from NPM
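Install the package globally to get the scrapoxy command:

sudo npm install -g scrapoxy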
Generate configuration
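Create the configuration file with:

scrapoxy init conf.json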
1. Edit conf.json
2. In the commander section, replace password by a password of your choice
3. In the providers/awsec2 section, replace accessKeyId, secretAccessKey and region by your AWS credentials and parameters.
Start Scrapoxy
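Start Scrapoxy with the configuration file:

scrapoxy start conf.json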
2.1.6 Step 5: Connect Scrapoxy to your scraper
Scrapoxy is reachable at http://localhost:8888
1. Wait 3 minutes for the instances to start.

2. Test Scrapoxy:

scrapoxy test http://localhost:8888
2.2 Changelog
2.2.1 3.1.1
Bug fixes
• master: use correctly writeEnd in socket and request (thanks to Ben Lavalley)
2.2.2 3.1.0
Features

• mitm: decrypt & encrypt SSL requests to add headers (like x-cache-proxyname). Compatible with HTTPS requests in PhantomJS.
• domains: manage whitelist or blacklist for urls (idea from Jonathan Wiklund)
• docs: add ami-485fbba5 with type t2.micro
Bug fixes
• docs: correct documentation
• ssl: add servername in the TLS connect (bug with HELLO)
• pinger: use reject instead of throw error (crash program). Thanks to Anis Gandoura !!!
2.2.3 3.0.1
Features
• digitalocean: support Digital Ocean tags on Droplets. Thanks to Ben Lavalley !!!
Bug fixes
2.2.4 3.0.0

Features

• providers: use multiple providers at a time
• awsec2: provider removes instances in batch every second (and no longer makes thousands of queries)
• ovhcloud: provider creates instances in batch (new API route used)
Bug fixes
2.2.5 2.4.3
Bug fixes
• dependencies: upgrade dependencies to latest version
2.2.6 2.4.2
Bug fixes
• instance: force crashed instance to be removed
2.2.7 2.4.1
Bug fixes
• instance: correctly remove instance when instance is removed. Thanks to Étienne Corbillé!!!
2.2.8 2.4.0
Bug fixes
• proxy: use a valid startup script for init.d. Thanks to Hotrush!!!
• useragent: change useragents with a fresh list for 2017
2.2.9 2.3.10

Bug fixes
• instance: remove listeners on instance alive status on instance removal. Thanks to Étienne Corbillé!!!
2.2.10 2.3.9
• digitalocean: view only instances from selected region
• instances: remove random instances instead of the last ones
• pm2: add kill_timeout option for PM2 (thanks to cp2587)
Bug fixes
• digitalocean: limit the number of created instances at each API request
• digitalocean: don’t remove locked instances
2.2.11 2.3.8
Bug fixes
2.2.12 2.3.7
Features
• connect: scrapoxy now accepts the full HTTPS CONNECT method. It is useful for browsers like PhantomJS. Thanks to Anis Gandoura!!!
2.2.13 2.3.6
Bug fixes
• ping: use an HTTP ping instead of a TCP ping.
Please rebuild instance image.
• stats: add 3 more scales: 5m, 10m and 1h
• logs: normalize logs and add more information
• scaling: pop a message when maximum number of instances is reached in a provider
• scaling: add quick scaling buttons
• docs: explain why Scrapoxy doesn’t accept CONNECT mode
• docs: explain how User Agent is overwritten
Bug fixes
• commander: handle duplicate instance removal requests
2.2.16 2.3.3
Bug fixes
• proxy: catch all socket errors in the proxy instance
2.2.17 2.3.2
Bug fixes
• docs: fallback to markdown for README (because npmjs doesn’t like retext)
Bug fixes
2.2.19 2.3.0
2.2.20 2.2.1
• doc: link to the new website Scrapoxy.io
2.2.21 2.2.0
Breaking changes
• node: node minimum version is now 4.2.1, to support JS class
Features
• all: migrate core and gui to ES6, with all best practices
• api: replace Express by Koa
Bug fixes
2.2.22 2.1.2
Bug fixes
• main: add message when all instances are stopped (at end)
• doc: correct misc stuff in doc
2.2.24 2.1.0
• security: add basic auth to Scrapoxy (RFC2617)
• stats: add flow stats
• stats: store stats on server
• stats: add globals stats
• doc: split of the documentation in 3 parts: quick start, standard usage and advanced usage
• doc: add tutorials for AWS / EC2
• gui: add a scaling popup instead of direct edit (with integrity check)
• gui: add update popup when the status of an instance changes.
• gui: add error popup when GUI cannot retrieve data
• logs: write logs to disk
• instance: add cloud name
• instance: show instance IP
• instance: always terminate an instance when stopping (prefer terminate instead of stop/start)
• test: allow more than 8 requests (max 1000)
• ec2: force to terminate/recreate instance instead of stop/restart
Bug fixes
• gui: emit event when scaling is changed by engine (before, event was triggered by GUI)
• stability: correct a lot of behavior to prevent instance cycling
• ec2: use status name instead of status code
2.2.25 2.0.1
• test: specify the count of requests with the test command
• test: count the requests by IP in the test command
2.2.26 2.0.0
Breaking changes
Features
• gui: add GUI to control Scrapoxy
• gui: add statistics to the GUI (count of requests / minute, average delay of requests / minute)
• doc: add doc about HTTP headers
2.2.27 1.1.0
• commander: stopping an instance returns the new count of instances
• commander: password is hashed with base64
• commander: read/write config with command (and live update of the scaling)
Misc
2.2.28 1.0.2
Bug fixes
2.2.30 1.0.0
• init: start of the project
2.3 The MIT License (MIT)
Copyright (c) 2016 Fabien Vauchelles
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
2.4 Configure Scrapoxy
2.4.1 Create configuration
scrapoxy init conf.json
{ "commander": {
} }, "providers": [
"InstanceType": "t1.micro", "ImageId": "ami-c74d0db4", "SecurityGroups": [
"forward-proxy" ]
} }, {
}, {
}, {
} ]
}
2.4.3 Options: commander
• port (default: 8889): TCP port of the REST API
• password (default: none): Password to access the commander
2.4.4 Options: instance
• port (default: none): TCP port of your instance (example: 3128)
• username (default: none): Credentials if your proxy instance needs them (optional)
• password (default: none): Credentials if your proxy instance needs them (optional)
• scaling (default: none): see instance / scaling
• checkDelay (default: 10000 ms): Scrapoxy asks the provider for the status of the instances every X ms
• checkAliveDelay (default: 20000 ms): Scrapoxy pings the instances every X ms
• stopIfCrashedDelay (default: 120000 ms): Scrapoxy restarts an instance if it has been dead for X ms
• autorestart (default: none): see instance / autorestart
2.4.5 Options: instance / autorestart
The delay is between minDelay and maxDelay.
• minDelay (default: 3600000 ms): Minimum delay
• maxDelay (default: 43200000 ms): Maximum delay
2.4.6 Options: instance / scaling
• min (default: none): The desired count of instances when Scrapoxy is asleep
• max (default: none): The desired count of instances when Scrapoxy is awake
• required (default: none): The count of actual instances
• downscaleDelay (default: 600000 ms): Time to wait before removing unused instances when Scrapoxy is not in use
2.4.7 Options: logs
• path (default: none): If specified, writes all logs to a dated file
2.4.8 Options: providers
providers is an array of provider configurations. It can contain multiple providers:
• AWS EC2: see AWS EC2 - Configuration
• OVH Cloud: see OVH Cloud - Configuration
• DigitalOcean: see DigitalOcean - Configuration
• Vscale: see Vscale - Configuration
2.4.9 Options: proxy

• port (default: 8888): TCP port of Scrapoxy
• auth (default: none): see proxy / auth (optional)
• domains_allowed (default: []): Whitelisted domains; only URLs on these domains are allowed (ignored if empty)
• domains_forbidden (default: []): Blacklisted domains; URLs on these domains are rejected (ignored if empty)
• mitm (default: False): see man in the middle (optional)
2.4.10 Options: proxy / auth
• username (default: none): Credentials if your Scrapoxy needs them
• password (default: none): Credentials if your Scrapoxy needs them
2.4.11 Options: proxy / mitm
• cert_filename (default: none): Public key filename for the MITM certificate (Scrapoxy has a default one)
• key_filename (default: none): Private key filename for the MITM certificate (Scrapoxy has a default one)
2.4.12 Options: stats
• retention (default: 86400000 ms): Duration of statistics retention
• samplingDelay (default: 1000 ms): Collect stats every X ms
2.5 AWS / EC2

2.5.1 Get started

Step 1: Get AWS credentials

See Get AWS credentials.
Step 2: Connect to your region
Step 3: Create a security group
See Create a security group.
Step 4: Choose an AMI
Public AMIs are available for these regions:
• eu-west-1 / t1.micro: ami-c74d0db4
• eu-west-1 / t2.micro: ami-485fbba5
• eu-west-1 / t2.nano: ami-06220275
If you cannot find your region, you can Copy an AMI from a region to another.
Step 5: Update configuration
Open conf.json and add or update the awsec2 provider in the providers array, as shown below.
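A minimal awsec2 entry looks like this (replace the credentials with your own; the AMI and instance type shown are the public eu-west-1 ones listed above):

{
    "type": "awsec2",
    "accessKeyId": "YOUR ACCESS KEY ID",
    "secretAccessKey": "YOUR SECRET ACCESS KEY",
    "region": "eu-west-1",
    "tag": "Proxy",
    "instance": {
        "InstanceType": "t2.micro",
        "ImageId": "ami-485fbba5",
        "SecurityGroups": ["forward-proxy"]
    }
}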
2.5.2 Configure Scrapoxy
You can either:

1. Add the credentials in the configuration file;

2. Or use your own AWS credentials (from a profile; see the AWS documentation).
Options: awsec2

• type (default: none): Must be awsec2
• accessKeyId (default: none): Credentials for AWS (optional)
• secretAccessKey (default: none): Credentials for AWS (optional)
• region (default: none): AWS region (example: eu-west-1)
• tag (default: Proxy): Name of the AWS / EC2 instance
• instance (default: none): see awsec2 / instance
• max (default: none): Maximum number of instances for this provider. If empty, there is no maximum.
Options: awsec2 / instance

Scrapoxy uses the runInstances method to create new instances.

Standard options are InstanceType, ImageId, KeyName, and SecurityGroups.
2.5.3 Tutorials
Tutorial: AWS / EC2 - Get AWS credentials

Step 1: Connect to your AWS console
Go to AWS console.
Step 3: Create a new key
1. Click on Access Key
2. Click on Create New Access Key
Step 4: Get the credentials

1. Click on Show Access Key

2. Get the values of Access Key ID and Secret Access Key
Tutorial: AWS / EC2 - Create a security group
Security groups (and AMI) are restricted to a region.
A security group in eu-west-1 is not available in eu-central-1.
Warning: You must create a security group by region.
Step 1: Connect to your AWS console
Go to AWS console.
1. Click on Security Groups
2. Click on Create Security Group
1. Fill the name and the description with forward-proxy

2. Fill the Inbound rule with:

• Type: Custom TCP Rule
• Port Range: 3128 (the port used by the proxy instances)

3. Click on Create
Tutorial: AWS / EC2 - Copy an AMI from a region to another
AMI (and security groups) are restricted to a region.
An AMI in eu-west-1 is not available in eu-central-1.
Warning: You must create an AMI by region.
Step 1: Connect to your AWS console
Go to AWS console.
1. Click on AMIs
1. Right click on instance
2. Click on Copy AMI
Step 6: Start AMI copy
1. Choose the new destination region
2. Click on Copy AMI
The new AMI ID is in the column AMI ID.
2.6 DigitalOcean

2.6.1 Get started

Step 1: Get DigitalOcean credentials

See Get DigitalOcean credentials.
Step 2: Create a SSH key

See Create a SSH key.
Remember your SSH key name (mykey).
Step 3: Create an image
See Create an image.
Step 4: Update configuration
Open conf.json and add the digitalocean provider to the providers array, as shown below.
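A minimal digitalocean entry looks like this (replace the placeholder values with your own token, droplet size and image name):

{
    "type": "digitalocean",
    "token": "YOUR_TOKEN",
    "region": "lon1",
    "size": "YOUR_DROPLET_SIZE",
    "sshKeyName": "mykey",
    "imageName": "YOUR_IMAGE_NAME",
    "name": "Proxy"
}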
2.6.2 Configure Scrapoxy

Options: digitalocean
• type (default: none): Must be digitalocean
• token (default: none): Credentials for DigitalOcean
• region (default: none): DigitalOcean region (example: lon1)
• sshKeyName (default: none): Name of the SSH key
• size (default: none): Type of droplet
• name (default: Proxy): Name of the droplet
• imageName (default: none): Name of the image (for the proxy droplet)
• tags (default: none): Tags separated by a comma (example: proxy,instance)
• max (default: none): Maximum number of instances for this provider. If empty, there is no maximum.
2.6.3 Tutorials
Tutorial: DigitalOcean - Get credentials

Step 1: Connect to your DigitalOcean console

Go to DigitalOcean console.
1. Click on API
2. Click on Generate Token
Step 4: Get the credentials
Remember the token
Tutorial: DigitalOcean - Create a SSH key

Step 1: Connect to your DigitalOcean console
Go to DigitalOcean console.
1. Click on the top right icon
2. Click on Settings
1. Click on Security
1. Paste your SSH key
2. Enter mykey for the name
3. Click on Add SSH Key
You can generate your key with this tutorial on Github.
And remember the name of the key!
Tutorial: DigitalOcean - Create an image
Step 1: Connect to your DigitalOcean console
Go to DigitalOcean console.
Step 2: Create a droplet

Click on Create, then Droplets.
Step 3: Change the configuration of droplet
Choose an image Ubuntu 16.04.3 x64:
Choose the smallest size on Standard Droplets:
Use the SSH key named mykey:
Step 4: Start the droplet
Click on Create
Step 5: Connect to the droplet

Get the IP:

ssh root@<replace by IP>
Step 6: Install the proxy
Install proxy with:
and:
and:
and:
1. Stop the last command (CTRL-C)
2. Power off the droplet:
sudo poweroff
Step 8: Create a backup
1. Click on Images
2. Select your droplet
4. Click on Take Snapshot
2.7 OVH Cloud
2.7.1 Get started
Step 1: Get OVH credentials

See Get OVH credentials.
Step 2: Create a project
See Create a project.
See Create a SSH key.
Remember your SSH key name (mykey).
Step 4: Create a proxy image
See Create a proxy image.
Remember your image name (forward-proxy).
Step 5: Update configuration
Open conf.json and add the ovhcloud provider to the providers array, as shown below.
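A minimal ovhcloud entry looks like this (the flavor name is only an example; use the credentials, SSH key and snapshot created in the tutorials below):

{
    "type": "ovhcloud",
    "endpoint": "ovh-eu",
    "appKey": "YOUR_APP_KEY",
    "appSecret": "YOUR_APP_SECRET",
    "consumerKey": "YOUR_CONSUMER_KEY",
    "serviceId": "YOUR_PROJECT_ID",
    "region": "GRA1",
    "sshKeyName": "mykey",
    "flavorName": "vps-ssd-1",
    "snapshotName": "forward-proxy",
    "name": "Proxy"
}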
2.7.2 Configure Scrapoxy
Options: ovhcloud

• type (default: none): Must be ovhcloud
• endpoint (default: none): OVH subdivision (ovh-eu or ovh-ca)
• appKey (default: none): Credentials for OVH
• appSecret (default: none): Credentials for OVH
• consumerKey (default: none): Credentials for OVH
• serviceId (default: none): Project ID
• region (default: none): OVH region (example: GRA1)
• sshKeyName (default: none): Name of the SSH key
• flavorName (default: none): Type of instance
• name (default: Proxy): Name of the instance
• snapshotName (default: none): Name of the backup image (for the proxy instance)
• max (default: none): Maximum number of instances for this provider. If empty, there is no maximum.
2.7.3 Tutorials

Tutorial: OVH Cloud - Get credentials

Step 1: Create an API Application
1. Go to https://eu.api.ovh.com/createApp/

3. Add a name (e.g.: scrapoxy-12)
4. Add a description (e.g.: scrapoxy)
5. Click on Create keys
Step 2: Save application credentials
Remember Application Key and Application Secret:
Step 3: Get the consumerKey

Use Scrapoxy to get your key:

scrapoxy ovh-consumerkey <endpoint> <Application Key> <Application Secret>

The endpoints are ovh-eu or ovh-ca.

Remember the consumerKey and click on the validation URL to validate the key.
Step 4: Add permission
2. Choose Unlimited validity
Tutorial: OVH Cloud - Create a project

Step 1: Connect to your OVH dashboard
Go to OVH Cloud.
1. Click on Create a new project
2. Fill project name
Remember the ID in the URL: it is the project serviceId.
Tutorial: OVH Cloud - Create a SSH key
Step 1: Connect to your new project
Go to OVH dashboard and select your new project.
Step 2: Create a new key
1. Click on the tab SSH keys
2. Click on Add a key
1. Set the name of the key to mykey
2. Copy your key
You can generate your key with this tutorial on Github.
And remember the name of the key!
Tutorial: OVH Cloud - Create a proxy image
Step 1: Connect to your new project
Go to OVH dashboard and select your new project.
Step 2: Create a new server
1. Click on Infrastructure
2. Click on Add
1. Change the type of server to VPS-SSD-1 (cheapest)
2. Change the distribution to Ubuntu
3. Change region to GRA1 (or another if you want)
4. Change the SSH key to mykey
5. Click on Launch now
1. Click on the v on the top right corner
2. Click on Login information
Step 5: Connect to the instance
Remember the SSH command.
Connect to the instance and install proxy:
sudo apt-get install curl
and:
and:
and:
Go back to the OVH project dashboard:
1. Click on the v on the top right corner
2. Click on Create a backup
1. Enter the snapshot name forward-proxy
2. Click on Take a snapshot
You need to wait between 10 minutes and 1 hour.
2.8 Vscale

Vscale is a Russian cloud platform, like DigitalOcean.
Note: IP addresses are updated every hour. If the instances are restarted too quickly, they will have the same IP address.
2.8.1 Get started
Step 1: Get Vscale credentials

See Get Vscale credentials.
Step 2: Create a SSH key

See Create a SSH key.

Remember your SSH key name (mykey).
Step 3: Create an image
See Create an image.
Step 4: Update configuration
Open conf.json and add the vscale provider to the providers array, as shown below.
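A minimal vscale entry looks like this (replace the token and image name with your own values):

{
    "type": "vscale",
    "token": "YOUR_TOKEN",
    "region": "msk0",
    "sshKeyName": "mykey",
    "plan": "small",
    "imageName": "YOUR_IMAGE_NAME",
    "name": "Proxy"
}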
2.8.2 Configure Scrapoxy

Options: vscale
• type (default: none): Must be vscale
• token (default: none): Credentials for Vscale
• region (default: none): Vscale region (example: msk0, spb0)
• sshKeyName (default: none): Name of the SSH key
• plan (default: none): Type of plan (example: small)
• name (default: Proxy): Name of the scalet
• imageName (default: none): Name of the image (for the proxy scalet)
• max (default: none): Maximum number of instances for this provider. If empty, there is no maximum.
2.8.3 Tutorials
Tutorial: Vscale - Get credentials

Step 1: Connect to your Vscale console

Go to Vscale console.
1. Click on SETTINGS
2. Click on GENERATE TOKEN
Step 4: Get the credentials
1. Click on COPY TO CLIPBOARD
2. Click on SAVE
Tutorial: Vscale - Create a SSH key

Step 1: Connect to your Vscale console
Go to Vscale console.
1. Click on SETTINGS
3. Click on ADD SSH KEY
2. Paste your SSH key
3. Click on ADD
You can generate your key with this tutorial on Github.
And remember the name of the key!
Tutorial: Vscale - Create an image
Step 1: Connect to your Vscale console
Go to Vscale console.
1. Click on SERVERS
Step 3: Change the configuration of scalet
Choose an image Ubuntu 16.04 64bit:
Step 4: Create the scalet
Click on CREATE SERVER
Step 5: Connect to the scalet

Get the IP:

ssh root@<replace by IP>
Step 6: Install the proxy
Install proxy with:
sudo apt-get update
and:
and:
and:
and:
1. Stop the last command (CTRL-C)
2. Power off the scalet:
sudo poweroff
Step 8: Open the scalet
Click on the scalet:
2.9 Manage Scrapoxy with a GUI
2.9.1 Connect

Open the commander in your browser: http://localhost:8889

2.9.2 Login

The password is defined in the configuration file, key commander.password.
2.9.3 Dashboard
Scrapoxy GUI has many pages:
• Instances. This page contains the list of instances managed by Scrapoxy;
• Stats. This page contains statistics on the use of Scrapoxy.
The login page redirects to the Instances page.
2.9.4 Page: Instances

Scrapoxy has 3 scaling settings:
• Min. The desired count of instances when Scrapoxy is asleep;
• Max. The desired count of instances when Scrapoxy is awake;
• Required. The count of actual instances.
To add or remove an instance, click on the Scaling button and change the Required setting:
Status of an instance
This panel contains the following information:
• Name of the instance;
• IP of the instance;
• Instance status in Scrapoxy.
Scrapoxy relays requests to instances which are started and alive ( + ).
The instance stops and is replaced by another.
2.9.5 Page: Stats
• Global stats. This panel contains global stats;
• Requests. This panel contains the count of requests;
• Flow. This panel contains the data flow.
• the total count of requests to monitor performance;
• the total count of received and sent data to control the volume of data;
• the total count of stop instance orders, to monitor anti-blacklisting;
• the count of requests received by an instance (minimum, average, maximum), to check anti-blacklisting performance.
Requests
It measures:
• the count of requests per minute;
• the average execution time of a request (round trip), per minute.
It measures the amount of data received and sent, per minute.

How to increase the number of requests per minute ?

Add new instances (or new scrapers). Does the number of requests per minute increase ?

• Yes: Perfect!

Add new instances (or new scrapers) again. Does the response time increase ?

• Yes: The target website is overloaded.
• No: Perfect!
2.10 Understand Scrapoxy

2.10.1 Architecture

Scrapoxy is composed of several parts:

• the proxy, which relays requests from the scraper to the instances;
• the manager, which starts and stops the instances on the cloud;
• the commander, which provides a REST API to receive orders;
• the gui, which connects to the REST API.
When Scrapoxy starts, the manager starts a new instance (if necessary) on the cloud.

When the scraper sends an HTTP request, the manager starts all the other instances.
2.10.2 Requests
Mode A: HTTPS CONNECT with MITM (Man-In-The-Middle)
This mode is for unsecure browsers like PhantomJS. It allows Scrapoxy to decrypt SSL and override HTTP headers (like User-Agent).
This solution can trigger some SSL alerts.
Mode B: HTTPS CONNECT without MITM
This mode is for secure browsers. It doesn't allow Scrapoxy to override HTTP headers (like User-Agent): you must set the User-Agent manually.
The best solution is to use only 1 User-Agent (it would be strange to have multiple User-Agents coming from 1 IP, isn’t it?).
Mode C: HTTPS over HTTP (or no tunnel mode)

This mode is for scrapers. It allows Scrapoxy to override HTTP headers (like User-Agent).

The scraper must send an HTTP request with the HTTPS URL in the Location header.
Example:
GET /index.html
Host: localhost:8888
Location: https://www.google.com/index.html
Accept: text/html
Most scrapers can send a GET (or POST) request to the proxy instead of CONNECT.

With Scrapy (Python), add /?noconnect to the proxy URL:

PROXY = 'http://localhost:8888/?noconnect'
With Request (Node.js), add tunnel: false to the options:

request({
    method: 'GET',
    url: 'https://api.ipify.org/',
    tunnel: false,
    proxy: 'http://localhost:8888',
}, (err, response, body) => { ... });
Scrapoxy adds an HTTP header x-cache-proxyname to the response.
This header contains the name of the proxy.
If you are using HTTPS in HTTPS CONNECT without MITM, Scrapoxy is unable to add this header since the traffic is encrypted.
Can the scraper force the request to go through a specific proxy?
Yes. The scraper adds the proxy name in the header x-cache-proxyname.
When the scraper receives a response, this header is extracted. The scraper adds this header to the next request.
Does Scrapoxy override User Agent ?
Yes. When an instance starts (or restarts), it gets a random User Agent (from the User Agent list).
When the instance receives a request, it overrides the User Agent.
2.10.3 Blacklisting
How can you manage blacklisted responses ?

Remember, Scrapoxy cannot detect a blacklisted response because detection is too specific to each scraping use case. It can be a 503 HTTP response, a captcha, a longer response time, etc.
Anti-blacklisting is a job for the scraper:
1. The scraper must detect a blacklisted response;
2. The scraper extracts the name of the instance from the HTTP response header (see here);
3. The scraper asks Scrapoxy to remove the instance through the REST API (see here).
When the blacklisted response is detected, Scrapoxy will replace the instance with a valid one (new IP address).
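For example, with the Request library in Node.js, the scraper can read the instance name from the x-cache-proxyname header of the blacklisted response and ask the commander to stop that instance (password is the commander password from conf.json):

const request = require('request');

// Ask Scrapoxy to stop (and replace) the instance that returned a blacklisted response.
// 'proxyname' is the value of the x-cache-proxyname header of that response.
function removeInstance(proxyname, password, callback) {
    request({
        method: 'POST',
        url: 'http://localhost:8889/api/instances/stop',
        json: { name: proxyname },
        headers: {
            'Authorization': new Buffer(password).toString('base64'),
        },
    }, callback);
}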
There is a tutorial: Manage blacklisted request with Scrapy.
2.10.4 Instances management
How does multi-providers work ?
In the configuration file, you can specify multiple providers (the providers field is an array).
You can also specify the maximum number of instances by provider, with the max parameter (for example: 2 instances maximum for AWSEC2 and unlimited for DigitalOcean).
When several instances are requested, the algorithm randomly distributes the instance requests across the providers, within the specified limits.
How does the monitoring mechanism work ?
1. the manager asks the cloud how many instances are alive. It is the initial state;
2. the manager creates a target state, with the new count of instance;
3. the manager generates the commands to reach target state from the initial state;
4. the manager sends the commands to the cloud.
These steps are very important because you cannot guess the initial state: an instance may have died at any time!
Scrapoxy can restart an instance if:
• the instance is dead (stop status or no ping);
• the living limit is reached: Scrapoxy regularly restarts instances to change their IP address.
Do you need to create a VM image ?
By default, a public AMI with a pre-installed proxy is provided for AWS / EC2. It is a CONNECT proxy listening on TCP port 3128.

But you can use any software that accepts the CONNECT method (Squid, Tinyproxy, etc.).
Can you leave Scrapoxy started ?
Yes. Scrapoxy has 2 modes: an awake mode and an asleep mode.
When Scrapoxy receives no requests for a while, it falls asleep and sets the count of instances to the minimum (instance.scaling.min).

When Scrapoxy receives a request, it wakes up and sets the count of instances to the maximum (instance.scaling.max).
Scrapoxy needs at least 1 instance to receive the awake request.
2.11 Control Scrapoxy with a REST API
2.11.1 Endpoint

The commander API is reachable at http://localhost:8889/api.
2.11.2 Authentication
Every request must have an Authorization header.

The value is the base64 encoding of the password set in the configuration (commander/password).
Example (the route shown is only an illustration; replace it with the API call you need):

const opts = {
    url: 'http://localhost:8889/api/instances',
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    console.log('Status: %d\n\n', res.statusCode);

    const bodyParsed = JSON.parse(body);
});
The instance exists: Scrapoxy stops it, and the instance is restarted with a new IP address.

The body contains the remaining count of alive instances:

{
    "alive": <count>
}
Example:
const opts = {
    method: 'POST',
    url: 'http://localhost:8889/api/instances/stop',
    json: {
        name: instanceName,
    },
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    console.log('Status: %d\n\n', res.statusCode);

    console.log(body);
});
Example (assuming the scaling route of the commander API):

const opts = {
    method: 'GET',
    url: 'http://localhost:8889/api/scaling',
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    console.log('Status: %d\n\n', res.statusCode);

    const bodyParsed = JSON.parse(body);
});
Example (assuming the scaling route of the commander API):

const opts = {
    method: 'PATCH',
    url: 'http://localhost:8889/api/scaling',
    json: {
        min: 1,
        required: 5,
        max: 10,
    },
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    console.log('Status: %d\n\n', res.statusCode);

    console.log(body);
});
The body contains all the configuration of Scrapoxy (including scaling).

Example (assuming the config route of the commander API):

const opts = {
    method: 'GET',
    url: 'http://localhost:8889/api/config',
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    console.log('Status: %d\n\n', res.statusCode);

    const bodyParsed = JSON.parse(body);
});
Example (assuming the config route of the commander API):

const opts = {
    method: 'PATCH',
    url: 'http://localhost:8889/api/config',
    json: {
        instance: {
            scaling: {
                max: 300,
            },
        },
    },
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    console.log('Status: %d\n\n', res.statusCode);

    console.log(body);
});
2.12 Secure Scrapoxy

2.12.1 Secure Scrapoxy with Basic auth

Scrapoxy supports standard HTTP Basic auth (RFC 2617).
Step 1: Add username and password in configuration
Open conf.json and add an auth section in the proxy section (see Configure Scrapoxy):

{
    "proxy": {
        "auth": {
            "username": "myuser",
            "password": "mypassword"
        }
    }
}
Step 2: Configure your scraper

Configure your scraper to use the username and password. The URL is:

http://myuser:mypassword@localhost:8888
Step 3: Test credentials
scrapoxy test http://localhost:8888
scrapoxy test http://myuser:mypassword@localhost:8888
2.12.2 Secure Scrapoxy with a firewall (UFW)

UFW simplifies iptables on Ubuntu (>14.04).
Step 1: Allow SSH
sudo ufw allow ssh
Step 2: Allow Scrapoxy
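For example, open the ports used by Scrapoxy (8888 for the proxy and 8889 for the commander; adjust these if you use different ports or do not want to expose the commander):

sudo ufw allow 8888/tcp
sudo ufw allow 8889/tcp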
Step 3: Enable UFW
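Enable the firewall and check its status:

sudo ufw enable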
sudo ufw status
2.13 Launch Scrapoxy at startup

2.13.1 Prerequisite
Scrapoxy is installed with a valid configuration (see Quick Start).
2.13.2 Step 1: Install PM2
sudo npm install -g pm2
2.13.3 Step 2: Launch PM2 at instance startup
sudo pm2 startup ubuntu -u <YOUR USERNAME>
1. Replace ubuntu by your distribution name (see PM2 documentation).
2. Replace YOUR USERNAME by your Linux username
Create a PM2 configuration file scrapoxy.json5 for Scrapoxy, for example (adapt the script path and the configuration file location to your installation):

{
  apps: [
    {
      name: "scrapoxy",
      script: "/usr/local/bin/scrapoxy",
      args: "start /path/to/conf.json",
    },
  ],
}
pm2 start scrapoxy.json5
pm2 save
If you need to stop Scrapoxy in PM2:
pm2 stop scrapoxy.json5
2.14 Integrate Scrapoxy to Scrapy

2.14.1 Goal
Is it easy to find a good Python developer in Paris ? No!
So, it’s time to build a scraper with Scrapy to find our perfect profile.
The site Scraping Challenge indexes a lot of profiles (fake, for demo purposes). We want to grab them and create a CSV file.
However, the site is protected against scraping! We must use Scrapoxy to bypass the protection.
2.14.2 Step 1: Install Scrapy
Install Python 2.7
On Windows (with Babun):
Install Scrapy and Scrapoxy connector
pip install scrapy scrapoxy
2.14.3 Step 2: Scrape the website

Create a new project
scrapy startproject myscraper cd myscraper
Add a new spider
# -*- coding: utf-8 -*-
from scrapy import Request, Spider


class Scraper(Spider):
    name = u'scraper'

    def start_requests(self):
        """This is our first request to grab all the urls of the profiles."""
        yield Request(
            url=u'http://scraping-challenge-2.herokuapp.com',
            callback=self.parse,
        )

    def parse(self, response):
        """We have all the urls of the profiles. Let's make a request for each profile."""
        urls = response.xpath(u'//a/@href').extract()
        for url in urls:
            yield Request(
                url=response.urljoin(url),
                callback=self.parse_profile,
            )

    def parse_profile(self, response):
        """We have a profile. Let's extract the name."""
        name_el = response.css(u'.profile-info-name::text').extract()
        if len(name_el) > 0:
            yield {
                'name': name_el[0],
            }
If you want to learn more about Scrapy, check on this Tutorial.
Run the spider
Run this command:
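For example, crawl with the scraper spider defined above and export the items to a CSV file:

scrapy crawl scraper -o profiles.csv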
Scrapy scrapes the site and extracts the profiles to profiles.csv.

However, Scraping Challenge is protected! profiles.csv is empty...
We will integrate Scrapoxy to bypass the protection.
2.14.4 Step 3: Integrate Scrapoxy to Scrapy
Install Scrapoxy
Start Scrapoxy
Set the maximum of instances to 6, and start Scrapoxy (see Change scaling with GUI).
Warning: Don’t forget to set the maximum of instances!
Edit settings of the Scraper
Add this content to myscraper/settings.py:

CONCURRENT_REQUESTS_PER_DOMAIN = 1
RETRY_TIMES = 0
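The settings must also point Scrapy to Scrapoxy and register the connector middlewares. A sketch is shown below; the module paths, setting names and priorities are assumptions based on the scrapoxy connector, so check the connector documentation for the exact values:

# Proxy mode (HTTPS over HTTP)
PROXY = 'http://localhost:8888/?noconnect'

# Commander API and password (assumed setting names)
API_SCRAPOXY = 'http://localhost:8889/api'
API_SCRAPOXY_PASSWORD = 'CHANGE_THIS_PASSWORD'

DOWNLOADER_MIDDLEWARES = {
    # Assumed module paths for the middlewares described below
    'scrapoxy.downloadmiddlewares.proxy.ProxyMiddleware': 100,
    'scrapoxy.downloadmiddlewares.wait.WaitMiddleware': 101,
    'scrapoxy.downloadmiddlewares.scale.ScaleMiddleware': 101,
}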
What are these middlewares ?
• ProxyMiddleware relays requests to Scrapoxy. It is a helper to set the PROXY parameter.
• WaitMiddleware stops the scraper and waits for Scrapoxy to be ready.
• ScaleMiddleware asks Scrapoxy to maximize the number of instances at the beginning, and to stop them at the end.
Note: ScaleMiddleware stops the scraper like WaitMiddleware. After 2 minutes, all instances are ready and the scraper continues to scrape.
Warning: Don’t forget to change the password!
Run the spider
Run this command:
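Run the scraper spider again and export the items:

scrapy crawl scraper -o profiles.csv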
Now, all profiles are saved to profiles.csv!
2.15 Integrate Scrapoxy to Node.js
2.15.1 Goal
Is it easy to find a good Javascript developer in Paris ? No!
So, it’s time to build a scraper with Node.js, Request and Cheerio to find our perfect profile.
The site Scraping Challenge indexes a lot of profiles (fake, for demo purposes). We want to list them.
However, the site is protected against scraping! We must use Scrapoxy to bypass the protection.
2.15.2 Step 1: Create a Node.js project
Install Node.js
mkdir nodejs-request
cd nodejs-request
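Initialize the project and install the dependencies (cheerio and lodash are also used by the script below):

npm init --yes
npm install --save request cheerio lodash winston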
What are these dependencies ?

• request makes HTTP requests;
• cheerio parses HTML (with a jQuery-like API);
• lodash is a utility library;
• winston is a logger.
The main part of the script grabs all the profile URLs, then requests each profile:

'use strict';

const _ = require('lodash'),
    cheerio = require('cheerio'),
    nodeUrl = require('url'),
    request = require('request'),
    winston = require('winston');

winston.level = 'debug';

const config = {
    // Default options for every HTTP request (the Scrapoxy proxy is added here later)
    opts: {},
};

getProfilesUrls('http://scraping-challenge-2.herokuapp.com', config.opts)
    .then((urls) => {
        // When Scrapoxy is used, this is where the script waits for the instances to start
        winston.info('Wait 120 seconds to scale instances');
        return urls;
    })
    .then((urls) => Promise.all(urls.map(
        (url) => getProfile(url, config.opts)
            .then((profile) => {
                winston.info('Found profile: ' + profile.name);
            })
    )))
    .catch((err) => winston.error('Error: ', err));
/**
 * Get all the URLs of the profiles
 * @param url Main URL
 * @param defaultOpts options for the HTTP request
 * @returns {promise}
 */
function getProfilesUrls(url, defaultOpts) {
    return new Promise((resolve, reject) => {
        // Create options for the HTTP request
        // Add the URL to the default options
        const opts = _.merge({}, defaultOpts, {url});

        request(opts, (err, res, body) => {
            if (err) {
                return reject(err);
            }

            const $ = cheerio.load(body);

            // Extract all urls
            const urls = $('.profile a')
                .map((i, el) => nodeUrl.resolve(url, $(el).attr('href')))
                .get();

            resolve(urls);
        });
    });
}

/**
 * Get the name of a profile
 * @param url URL of the profile
 * @param defaultOpts options for http request
 * @returns {promise}
 */
function getProfile(url, defaultOpts) {
    return new Promise((resolve, reject) => {
        // Create options for the HTTP request
        // Add the URL to the default options
        const opts = _.merge({}, defaultOpts, {url});

        request(opts, (err, res, body) => {
            if (err) {
                return reject(err);
            }

            const $ = cheerio.load(body);

            // Extract the names
            const name = $('.profile-info-name').text();

            resolve({name});
        });
    });
}
Run this command:
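Assuming the script above is saved as index.js:

node index.js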
Install Scrapoxy
Start Scrapoxy
Set the maximum of instances to 6, and start Scrapoxy (see Change scaling with GUI).
Warning: Don’t forget to set the maximum of instances!
Edit the script

Add the proxy to the default request options:

const config = {
    opts: {
        // URL of Scrapoxy
        proxy: 'http://localhost:8888',

        // HTTPS over HTTP
        tunnel: false,
    },
};
2.16 Manage blacklisted request with Scrapy
2.16.1 Goal
A scraper is downloading pages of a website.
However, the website has a rate limit per IP. When the scraper downloads 10 pages, the website returns only an empty page with an HTTP 429 status.
Must the scraper wait when the limit is reached ? No!
The scraper has to ask Scrapoxy to replace the instance.
See Integrate Scrapoxy to Scrapy to create the scraper.
Edit settings of Scrapoxy
CONCURRENT_REQUESTS_PER_DOMAIN = 1
RETRY_TIMES = 0
Edit settings of the Scraper
Change the password of the commander in my-config.json:

"commander": {
    "password": "CHANGE_THIS_PASSWORD"
}
Contribute
You can open an issue on this repository for any feedback (bug, question, request, pull request, etc.).
See the License.
And don’t forget to be POLITE when you write your scrapers!