Scrapoxy Documentation, Release 3.1.1
Get Started
1 What is Scrapoxy?
2 Documentation
3 Prerequisite
4 Contribute
5 License
CHAPTER 1
What is Scrapoxy?
http://scrapoxy.io
Scrapoxy hides your scraper behind a cloud.
It starts a pool of proxies to send your requests.
Now, you can crawl without thinking about blacklisting!
It is written in JavaScript (ES6) with Node.js & AngularJS, and it is open source!
1.1 How does Scrapoxy work?
1. When Scrapoxy starts, it creates and manages a pool of proxies.
2. Your scraper uses Scrapoxy as a normal proxy.
3. Scrapoxy routes all requests through a pool of proxies.
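The three steps above can be sketched from the scraper's side. This is a minimal illustration, assuming Scrapoxy listens on its default port 8888 on localhost:

```python
import urllib.request

# Scrapoxy's default proxy endpoint (assumption: local install, default port).
SCRAPOXY_PROXIES = {"http": "http://127.0.0.1:8888"}

def open_through_scrapoxy(url, proxies=SCRAPOXY_PROXIES):
    """Open a URL through Scrapoxy, exactly as through any normal HTTP proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    return opener.open(url, timeout=10)

if __name__ == "__main__":
    # Each request may exit through a different instance of the pool.
    print(open_through_scrapoxy("http://api.ipify.org").read().decode())
```

Any HTTP client that honours a proxy setting works the same way; no Scrapoxy-specific code is needed in the scraper.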
1.2 What does Scrapoxy do?
• Create your own proxies
• Use multiple cloud providers (AWS, DigitalOcean, OVH, Vscale)
• Rotate IP addresses
• Impersonate known browsers
• Exclude blacklisted instances
• Monitor the requests
• Detect bottlenecks
• Optimize the scraping
1.3 Why doesn’t Scrapoxy support anti-blacklisting?
Anti-blacklisting is a job for the scraper.
When the scraper detects blacklisting, it asks Scrapoxy to remove the proxy from the proxy pool (through a REST API).
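As a sketch of that flow, the scraper can call the commander's REST API when it detects blacklisting. The /api prefix and the base64-encoded password come from this documentation's changelog; the exact route and JSON body below (/api/instances/stop with an ip field) are assumptions to check against your Scrapoxy version:

```python
import base64
import json
import urllib.request

def build_stop_request(ip, commander="http://127.0.0.1:8889",
                       password="CHANGE_THIS_PASSWORD"):
    """Build the POST request asking the commander to drop one proxy.

    The route and body shape are assumptions, not a documented contract."""
    return urllib.request.Request(
        f"{commander}/api/instances/stop",
        data=json.dumps({"ip": ip}).encode(),
        headers={
            "Content-Type": "application/json",
            # The changelog states the commander password is base64-encoded.
            "Authorization": base64.b64encode(password.encode()).decode(),
        },
    )

if __name__ == "__main__":
    req = build_stop_request("54.154.23.8")  # hypothetical blacklisted IP
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, resp.read().decode())
```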
1.4 What is the best scraper framework to use with Scrapoxy?
You could use the open source Scrapy framework (Python).
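For example, a Scrapy downloader middleware can point every request at Scrapoxy through the standard proxy meta key. The class and constant names below are hypothetical (this is not the official plugin); only Scrapy's process_request interface and the proxy meta key are standard:

```python
# Hypothetical middleware sketch; enable it in DOWNLOADER_MIDDLEWARES.
SCRAPOXY_URL = "http://127.0.0.1:8888"  # assumption: default local Scrapoxy

class ScrapoxyProxyMiddleware:
    """Route every Scrapy request through Scrapoxy."""

    def __init__(self, proxy_url=SCRAPOXY_URL):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours this meta key.
        request.meta["proxy"] = self.proxy_url
```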
1.5 Does Scrapoxy have a SaaS mode or a support plan?
Scrapoxy is an open source tool. The source code is actively maintained. You are very welcome to open an issue for features or bugs.
If you are looking for a commercial product in SaaS mode or with a support plan, we recommend checking the ScrapingHub products (ScrapingHub is the company that maintains the Scrapy framework).
CHAPTER 2
Documentation
Begin with the Quick Start or look at the Changelog.
Then continue with the Standard section, and become an expert with the Advanced section.
And complete your knowledge with the Tutorials.
2.1 Quick Start
This tutorial works on AWS / EC2, in the eu-west-1 region.
See AWS / EC2 - Copy an AMI from a region to another if you want to use a different region.
2.1.1 Step 1: Get AWS credentials
See Get AWS credentials.
2.1.2 Step 2: Create a security group
See Create a security group.
2.1.3 Step 3A: Run Scrapoxy with Docker
Run the container:
sudo docker run -e COMMANDER_PASSWORD='CHANGE_THIS_PASSWORD' \
    -e PROVIDERS_AWSEC2_ACCESSKEYID='YOUR ACCESS KEY ID' \
    -e PROVIDERS_AWSEC2_SECRETACCESSKEY='YOUR SECRET ACCESS KEY' \
    -it -p 8888:8888 -p 8889:8889 fabienvauchelles/scrapoxy
Warning: Replace PROVIDERS_AWSEC2_ACCESSKEYID and PROVIDERS_AWSEC2_SECRETACCESSKEY with your own AWS credentials and parameters.
2.1.4 Step 3B: Run Scrapoxy without Docker
Install Node.js
See the Node Installation Manual.
The minimum required version is 4.2.1.
Install Scrapoxy from NPM
Install make:
sudo apt-get install build-essential
And Scrapoxy:
sudo npm install -g scrapoxy
Generate configuration
scrapoxy init conf.json
Edit configuration
1. Edit conf.json
2. In the commander section, replace password with a password of your choice
3. In the providers/awsec2 section, replace accessKeyId, secretAccessKey and region with your AWS credentials and parameters.
Start Scrapoxy
scrapoxy start conf.json -d
2.1.5 Step 4: Open Scrapoxy GUI
Scrapoxy GUI is reachable at http://localhost:8889
2.1.6 Step 5: Connect Scrapoxy to your scraper
Scrapoxy is reachable at http://localhost:8888
2.1.7 Step 6: Test Scrapoxy
1. Wait 3 minutes
2. Test Scrapoxy in a new terminal with:
scrapoxy test http://localhost:8888
3. Or with cURL:
curl --proxy http://127.0.0.1:8888 http://api.ipify.org
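To see rotation in action, a small script can fetch your public IP repeatedly through Scrapoxy and tally the distinct exit IPs. This is a sketch; the proxy address and request count are assumptions:

```python
import urllib.request
from collections import Counter

def tally(ips):
    """Count how many requests each exit IP served."""
    return Counter(ips)

def fetch_ip(proxy="http://127.0.0.1:8888"):
    """Ask api.ipify.org for our apparent IP, going through Scrapoxy."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}))
    return opener.open("http://api.ipify.org", timeout=10).read().decode()

if __name__ == "__main__":
    counts = tally(fetch_ip() for _ in range(20))
    for ip, hits in counts.most_common():
        print(f"{ip}: {hits} requests")
```

With several instances running, the requests should spread across several IPs.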
2.2 Changelog
2.2.1 3.1.1
Bug fixes
• master: use writeEnd correctly in socket and request (thanks to Ben Lavalley)
2.2.2 3.1.0
Features
• mitm: decrypt & encrypt SSL requests to add headers (like x-cache-proxyname). Compatible with HTTPS requests in PhantomJS.
• domains: manage a whitelist or blacklist for URLs (idea from Jonathan Wiklund)
• docs: add ami-485fbba5 with type t2.micro
Bug fixes
• logs: correct export path of logs
• docs: correct documentation
• ssl: add servername in the TLS connect (bug with HELLO)
• pinger: use reject instead of throwing an error (which crashed the program). Thanks to Anis Gandoura!!!
2.2.3 3.0.1
Features
• digitalocean: support Digital Ocean tags on Droplets. Thanks to Ben Lavalley !!!
Bug fixes
• digitalocean: use new image size (s-1vcpu-1gb instead of 512mb)
2.2.4 3.0.0
Features
Warning: BREAKING CHANGE! The configuration of providers changes. See documentation here.
• providers: uses multiple providers at a time
• awsec2: provider removes instances in batch every second (and no longer makes thousands of queries)
• ovhcloud: provider creates instances in batch (new API route used)
Bug fixes
• maxRunningInstances: remove blocking parameters maxRunningInstances
2.2.5 2.4.3
Bug fixes
• node: change minimum version of Node.js to 8
• dependencies: upgrade dependencies to latest version
2.2.6 2.4.2
Bug fixes
• useragent: set useragent at instance creation, not at startup
• instance: force crashed instance to be removed
2.2.7 2.4.1
Bug fixes
• instance: correctly remove an instance when it is removed. Thanks to Étienne Corbillé!!!
2.2.8 2.4.0
Features
• provider: add VScale.io provider. Thanks to Hotrush!!!
Bug fixes
• proxy: use a valid startup script for init.d. Thanks to Hotrush!!!
• useragent: change useragents with a fresh list for 2017
2.2.9 2.3.10
Features
• docs: add ami-06220275 with type t2.nano
Bug fixes
• instance: remove listeners on instance alive status on instance removal. Thanks to Étienne Corbillé!!!
2.2.10 2.3.9
Features
• digitalocean: update Digital Ocean documentation
• digitalocean: view only instances from selected region
• instances: remove random instances instead of the last ones
• pm2: add kill_timeout option for PM2 (thanks to cp2587)
Bug fixes
• digitalocean: limit the number of created instances at each API request
• digitalocean: don’t remove locked instances
2.2.11 2.3.8
Features
• docker: create the Docker image fabienvauchelles/scrapoxy
Bug fixes
• template: limit max instances to 2
2.2.12 2.3.7
Features
• connect: Scrapoxy now accepts the full HTTPS CONNECT method. It is useful for browsers like PhantomJS. Thanks to Anis Gandoura!!!
2.2.13 2.3.6
Bug fixes
• template: replace old AWS AMI by ami-c74d0db4
2.2.14 2.3.5
Features
• instance: change Node.js version to 6.x
• ping: use an HTTP ping instead of a TCP ping.
Please rebuild the instance image.
2.2.15 2.3.4
Features
• stats: monitor stop count history
• stats: add 3 more scales: 5m, 10m and 1h
• logs: normalize logs and add more information
• scaling: pop a message when maximum number of instances is reached in a provider
• scaling: add quick scaling buttons
• docs: explain why Scrapoxy doesn’t accept CONNECT mode
• docs: explain how User Agent is overwritten
Bug fixes
• dependencies: upgrade dependencies
• ovh: monitor DELETED status
• docs: add example to test scrapoxy with credentials
• commander: handle removing the same instance twice
2.2.16 2.3.3
Bug fixes
• master: sanitize bad request headers
• proxy: catch all socket errors in the proxy instance
2.2.17 2.3.2
Bug fixes
• docs: fallback to markdown for README (because npmjs doesn’t like retext)
2.2.18 2.3.1
Features
• docs: add tutorials for Scrapy and Node.js
Bug fixes
• digitalocean: convert Droplet id to string
2.2.19 2.3.0
Features
• digitalocean: add support for DigitalOcean provider
2.2.20 2.2.1
Misc
• config: rename my-config.json to conf.json
• doc: migrate documentation to ReadTheDocs
• doc: link to the new website Scrapoxy.io
2.2.21 2.2.0
Breaking changes
• node: node minimum version is now 4.2.1, to support JS classes
Features
• all: migrate core and gui to ES6, with all best practices
• api: replace Express by Koa
Bug fixes
• test: correct core e2e test
2.2.22 2.1.2
Bug fixes
• gui: correct token encoding for GUI
2.2.23 2.1.1
Bug fixes
• main: add message when all instances are stopped (at end)
• doc: correct misc stuff in doc
2.2.24 2.1.0
Features
• ovh: add OVH provider with documentation
• security: add basic auth to Scrapoxy (RFC2617)
• stats: add flow stats
• stats: add scale for stats (1m/1h/1d)
• stats: store stats on server
• stats: add global stats
• doc: split the documentation into 3 parts: quick start, standard usage and advanced usage
• doc: add tutorials for AWS / EC2
• gui: add a scaling popup instead of direct edit (with integrity check)
• gui: add update popup when the status of an instance changes.
• gui: add error popup when GUI cannot retrieve data
• logs: write logs to disk
• instance: add cloud name
• instance: show instance IP
• instance: always terminate an instance when stopping (prefer terminate instead of stop/start)
• test: allow more than 8 requests (max 1000)
• ec2: force to terminate/recreate instance instead of stop/restart
Bug fixes
• gui: emit event when scaling is changed by engine (before, event was triggered by GUI)
• stability: correct a lot of behavior to prevent instance cycling
• ec2: use status name instead of status code
2.2.25 2.0.1
Features
• test: specify the count of requests with the test command
• test: count the requests by IP in the test command
• doc: add GUI documentation
• doc: add API documentation
• doc: explain awake/asleep mode in user manual
• log: add human readable message at startup
2.2.26 2.0.0
Breaking changes
• commander: API routes are prefixed with /api
Features
• gui: add GUI to control Scrapoxy
• gui: add statistics to the GUI (count of requests / minute, average delay of requests / minute)
• doc: add doc about HTTP headers
2.2.27 1.1.0
Features
• commander: stopping an instance returns the new count of instances
• commander: password is hashed with base64
• commander: read/write config with command (and live update of the scaling)
Misc
• chore: force global install with NPM
2.2.28 1.0.2
Features
• doc: add 2 AWS / EC2 tutorials
Bug fixes
• template: correct template mechanism
• config: correct absolute path for configuration
2.2.29 1.0.1
Misc
• doc: change author and misc information
2.2.30 1.0.0
Features
• init: start of the project
2.3 The MIT License (MIT)
Copyright (c) 2016 Fabien Vauchelles
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
2.4 Configure Scrapoxy
2.4.1 Create configuration
To create a new configuration, use:
scrapoxy init conf.json
2.4.2 Multi-Providers Example configuration
This command creates a configuration example, with 4 providers:
{
    "commander": {
        "password": "CHANGE_THIS_PASSWORD"
    },
    "instance": {
        "port": 3128,
        "scaling": {
            "min": 1,
            "max": 2
        }
    },
    "providers": [
        {
            "type": "awsec2",
            "accessKeyId": "YOUR ACCESS KEY ID",
            "secretAccessKey": "YOUR SECRET ACCESS KEY",
            "region": "YOUR REGION (could be: eu-west-1)",
            "instance": {
                "InstanceType": "t1.micro",
                "ImageId": "ami-c74d0db4",
                "SecurityGroups": [
                    "forward-proxy"
                ]
            }
        },
        {
            "type": "ovhcloud",
            "endpoint": "YOUR ENDPOINT (could be: ovh-eu)",
            "appKey": "YOUR APP KEY",
            "appSecret": "YOUR APP SECRET",
            "consumerKey": "YOUR CONSUMER KEY",
            "serviceId": "YOUR SERVICE ID",
            "region": "YOUR REGION (could be: BHS1, GRA1 or SBG1)",
            "sshKeyName": "YOUR SSH KEY (could be: mykey)",
            "flavorName": "vps-ssd-1",
            "snapshotName": "YOUR SNAPSHOT NAME (could be: forward-proxy)"
        },
        {
            "type": "digitalocean",
            "token": "YOUR PERSONAL TOKEN",
            "region": "YOUR REGION (could be: lon1)",
            "size": "s-1vcpu-1gb (previous: 512mb)",
            "sshKeyName": "YOUR SSH KEY (could be: mykey)",
            "imageName": "YOUR SNAPSHOT NAME (could be: forward-proxy)",
            "tags": "YOUR TAGS SEPARATED BY A COMMA (could be: proxy,instance)"
        },
        {
            "type": "vscale",
            "token": "YOUR PERSONAL TOKEN",
            "region": "YOUR REGION (could be: msk0, spb0)",
            "imageName": "YOUR SNAPSHOT NAME (could be: forward-proxy)",
            "sshKeyName": "YOUR SSH KEY (could be: mykey)",
            "plan": "YOUR PLAN (could be: small)"
        }
    ]
}
2.4.3 Options: commander
• port (default: 8889): TCP port of the REST API
• password (default: none): password to access the commander
2.4.4 Options: instance
• port (default: none): TCP port of your instance (example: 3128)
• username (default: none): credentials if your proxy instance needs them (optional)
• password (default: none): credentials if your proxy instance needs them (optional)
• scaling (default: none): see instance / scaling
• checkDelay (default: 10000 ms): Scrapoxy requests the status of instances from the provider every X ms
• checkAliveDelay (default: 20000 ms): Scrapoxy pings instances every X ms
• stopIfCrashedDelay (default: 120000 ms): Scrapoxy restarts an instance if it has been dead for X ms
• autorestart (default: none): see instance / autorestart
2.4.5 Options: instance / autorestart
Scrapoxy randomly restarts instances to change their IP addresses.
The delay is between minDelay and maxDelay.
• minDelay (default: 3600000 ms): minimum delay
• maxDelay (default: 43200000 ms): maximum delay
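Conceptually, autorestart picks a random lifetime per instance between these two bounds. A sketch with the default values (the function name is illustrative, not Scrapoxy's actual code):

```python
import random

# Defaults from the autorestart options, in milliseconds.
MIN_DELAY_MS = 3_600_000    # 1 hour
MAX_DELAY_MS = 43_200_000   # 12 hours

def next_restart_delay_ms(min_delay=MIN_DELAY_MS, max_delay=MAX_DELAY_MS):
    """Pick a random lifetime between minDelay and maxDelay, inclusive."""
    return random.randint(min_delay, max_delay)
```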
2.4.6 Options: instance / scaling
• min (default: none): the desired count of instances when Scrapoxy is asleep
• max (default: none): the desired count of instances when Scrapoxy is awake
• required (default: none): the count of actual instances
• downscaleDelay (default: 600000 ms): time to wait before removing unused instances when Scrapoxy is not in use
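As a sketch, the scaling options would appear in conf.json like this (the values are illustrative, not recommendations):

{
    "instance": {
        "scaling": {
            "min": 1,
            "max": 5,
            "downscaleDelay": 600000
        }
    }
}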
2.4.7 Options: logs
• path (default: none): if specified, writes all logs to a dated file
2.4.8 Options: providers
providers is an array. It can contain multiple providers:
• AWS EC2: see AWS EC2 - Configuration
• OVH Cloud: see OVH Cloud - Configuration
• DigitalOcean: see DigitalOcean - Configuration
• Vscale: see Vscale - Configuration
2.4.9 Options: proxy
• port (default: 8888): TCP port of Scrapoxy
• auth (default: none): see proxy / auth (optional)
• domains_allowed (default: []): whitelisted domains; only URLs with these domains are allowed (ignored if empty)
• domains_forbidden (default: []): blacklisted domains; URLs with these domains are rejected (ignored if empty)
• mitm (default: False): see man in the middle (optional)
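For illustration, a proxy section combining these options might look like the following; the exact shape of auth and the domain lists is inferred from the option tables in this chapter, so check it against your version:

{
    "proxy": {
        "port": 8888,
        "auth": {
            "username": "myuser",
            "password": "mypassword"
        },
        "domains_allowed": ["api.ipify.org"],
        "domains_forbidden": [],
        "mitm": false
    }
}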
2.4.10 Options: proxy / auth
• username (default: none): credentials if your Scrapoxy needs them
• password (default: none): credentials if your Scrapoxy needs them
2.4.11 Options: proxy / mitm
• cert_filename (default: none): public key filename for the MITM certificate (Scrapoxy has a default one)
• key_filename (default: none): private key filename for the MITM certificate (Scrapoxy has a default one)
2.4.12 Options: stats
• retention (default: 86400000 ms): duration of statistics retention
• samplingDelay (default: 1000 ms): get stats every X ms
2.5 AWS / EC2
2.5.1 Get started
Step 1: Get your AWS credentials
See Get credentials.
Remember your Access Key and Secret Access Key.
Step 2: Connect to your region
Step 3: Create a security group
See Create a security group.
Step 4: Choose an AMI
Public AMIs are available for these regions:
• eu-west-1 / t1.micro: ami-c74d0db4
• eu-west-1 / t2.micro: ami-485fbba5
• eu-west-1 / t2.nano: ami-06220275
If you cannot find your region, you can Copy an AMI from a region to another.
Step 5: Update configuration
Open conf.json:
{
    "providers": [
        {
            "type": "awsec2",
            "region": "eu-west-1",
            "instance": {
                "InstanceType": "t1.micro",
                "ImageId": "ami-c74d0db4",
                "SecurityGroups": [ "forward-proxy" ]
            }
        }
    ]
}
And update region and ImageId with your parameters.
2.5.2 Configure Scrapoxy
Options: awsec2
For credentials, there are two choices:
1. Add credentials in the configuration file;
2. Or use your own credentials (from a profile; see the AWS documentation).
• type: must be awsec2
• accessKeyId (default: none): credentials for AWS (optional)
• secretAccessKey (default: none): credentials for AWS (optional)
• region (default: none): AWS region (example: eu-west-1)
• tag (default: Proxy): name of the AWS / EC2 instance
• instance (default: none): see awsec2 / instance
• max (default: none): maximum number of instances for this provider; if empty, there is no maximum
Options: awsec2 / instance
Options are specific to AWS / EC2.
Scrapoxy uses the runInstances method to create new instances.
Standard options are InstanceType, ImageId, KeyName, and SecurityGroups.
2.5.3 Tutorials
Tutorial: AWS / EC2 - Get credentials
Step 1: Connect to your AWS console
Go to AWS console.
Step 2: Go to credentials
Step 3: Create a new key
1. Click on Access Key
2. Click on Create New Access Key
Step 4: Get the credentials
1. Click on Show Access Key
2. Get the values of Access Key ID and Secret Access Key
Tutorial: AWS / EC2 - Create a security group
Security groups (and AMI) are restricted to a region.
A security group in eu-west-1 is not available in eu-central-1.
Warning: You must create a security group in each region.
Step 1: Connect to your AWS console
Go to AWS console.
Step 2: Go to EC2 dashboard
Step 3: Create a Security Group
1. Click on Security Groups
2. Click on Create Security Group
Step 4: Fill in the Security Group
1. Fill in the name and the description with forward-proxy
2. Fill in the Inbound rule with:
• Type: Custom TCP Rule
• Port Range: 3128
• Source: Anywhere
3. Click on Create
Tutorial: AWS / EC2 - Copy an AMI from a region to another
AMIs (and security groups) are restricted to a region.
An AMI in eu-west-1 is not available in eu-central-1.
Warning: You must create an AMI in each region.
Step 1: Connect to your AWS console
Go to AWS console.
Step 2: Connect to Ireland region
Step 3: Go to EC2 dashboard
Step 4: Find the public AMI
1. Click on AMIs
2. Search ami-c74d0db4
Step 5: Open copy AMI wizard
1. Right-click on the AMI
2. Click on Copy AMI
Step 6: Start AMI copy
1. Choose the new destination region
2. Click on Copy AMI
Step 7: Connect to the new region
Step 8: Find the new AMI ID
The new AMI ID is in the column AMI ID.
2.6 DigitalOcean
2.6.1 Get started
Step 1: Get your DigitalOcean credentials
See Get DigitalOcean credentials.
Remember your token.
Step 2: Create an SSH key for the project
See Create an SSH key.
Remember your SSH key name (mykey).
Step 3: Create an image
See Create an image.
Remember your image name (forward-proxy).
Step 4: Update configuration
Open conf.json:
{
    "providers": [
        {
            "type": "digitalocean",
            "token": "YOUR PERSONAL TOKEN",
            "region": "YOUR REGION (could be: lon1)",
            "size": "s-1vcpu-1gb (previous: 512mb)",
            "sshKeyName": "YOUR SSH KEY (could be: mykey)",
            "imageName": "YOUR SNAPSHOT NAME (could be: forward-proxy)",
            "tags": "YOUR TAGS SEPARATED BY A COMMA (could be: proxy,instance)"
        }
    ]
}
And update config with your parameters.
2.6.2 Configure Scrapoxy
Options: digitalocean
• type: must be digitalocean
• token (default: none): credentials for DigitalOcean
• region (default: none): DigitalOcean region (example: lon1)
• sshKeyName (default: none): name of the SSH key
• size (default: none): type of droplet
• name (default: Proxy): name of the droplet
• imageName (default: none): name of the image (for the proxy droplet)
• tags (default: none): tags separated by a comma (example: proxy,instance)
• max (default: none): maximum number of instances for this provider; if empty, there is no maximum
2.6.3 Tutorials
Tutorial: DigitalOcean - Get credentials
Step 1: Connect to your DigitalOcean console
Go to DigitalOcean console.
Step 2: Generate a new token
1. Click on API
2. Click on Generate New Token
Step 3: Create a new token
1. Enter scrapoxy in Token Name
2. Click on Generate Token
Step 4: Get the credentials
Remember the token
Tutorial: DigitalOcean - Create an SSH key
Step 1: Connect to your DigitalOcean console
Go to DigitalOcean console.
Step 2: Go to settings
1. Click on the top right icon
2. Click on Settings
Step 3: Add a new key
1. Click on Security
2. Click on Add SSH Key
Step 4: Create a new key
1. Paste your SSH key
2. Enter mykey for the name
3. Click on Add SSH Key
You can generate your key with this tutorial on GitHub.
And remember the name of the key!
Tutorial: DigitalOcean - Create an image
Step 1: Connect to your DigitalOcean console
Go to DigitalOcean console.
Step 2: Create a new droplet
Click on Create, then Droplets
Step 3: Change the configuration of droplet
Choose an image Ubuntu 16.04.3 x64:
Choose the smallest size on Standard Droplets:
Choose a datacenter (e.g.: London):
Use the SSH key named mykey:
Step 4: Start the droplet
Click on Create
Step 5: Connect to the instance
Get the IP:
And enter this command in a terminal:
ssh root@<replace by IP>
Step 6: Install the proxy
Install proxy with:
curl --silent --location https://deb.nodesource.com/setup_8.x | sudo bash -
and:
sudo apt-get install --yes nodejs
and:
curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxy.js | sudo tee /root/proxy.js > /dev/null
and:
curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxyup.sh | sudo tee /etc/init.d/proxyup.sh > /dev/null
and:
sudo chmod a+x /etc/init.d/proxyup.sh
and:
sudo update-rc.d proxyup.sh defaults
and:
sudo /etc/init.d/proxyup.sh start
Step 7: Poweroff the droplet
1. Stop the last command (CTRL-C)
2. Power off the droplet:
sudo poweroff
The green icon disappears when the droplet is off:
Step 8: Create a backup
1. Click on Images
2. Select your droplet
3. Enter forward-proxy in Image Name
4. Click on Take Snapshot
Wait a few minutes and the new image appears:
2.7 OVH Cloud
2.7.1 Get started
Step 1: Get your OVH credentials
See Get OVH credentials.
Remember your Application Key, Application Secret and consumerKey.
Step 2: Create a project
See Create a project.
Remember your serviceId.
Step 3: Create an SSH key for the project
See Create an SSH key.
Remember your SSH key name (mykey).
Step 4: Create a proxy image
See Create a proxy image.
Remember your image name (forward-proxy).
Step 5: Update configuration
Open conf.json:
{
    "providers": [
        {
            "type": "ovhcloud",
            "endpoint": "YOUR ENDPOINT (could be: ovh-eu)",
            "appKey": "YOUR APP KEY",
            "appSecret": "YOUR APP SECRET",
            "consumerKey": "YOUR CONSUMER KEY",
            "serviceId": "YOUR SERVICE ID",
            "region": "YOUR REGION (could be: BHS1, GRA1 or SBG1)",
            "sshKeyName": "YOUR SSH KEY (could be: mykey)",
            "flavorName": "vps-ssd-1",
            "snapshotName": "YOUR SNAPSHOT NAME (could be: forward-proxy)"
        }
    ]
}
And update config with your parameters.
2.7.2 Configure Scrapoxy
Options: ovhcloud
• type: must be ovhcloud
• endpoint (default: none): OVH subdivision (ovh-eu or ovh-ca)
• appKey (default: none): credentials for OVH
• appSecret (default: none): credentials for OVH
• consumerKey (default: none): credentials for OVH
• serviceId (default: none): project ID
• region (default: none): OVH region (example: GRA1)
• sshKeyName (default: none): name of the SSH key
• flavorName (default: none): type of instance
• name (default: Proxy): name of the instance
• snapshotName (default: none): name of the backup image (for the proxy instance)
• max (default: none): maximum number of instances for this provider; if empty, there is no maximum
2.7.3 Tutorials
Tutorial: OVH Cloud - Get credentials
Step 1: Create an API Application
1. Go to https://eu.api.ovh.com/createApp/
2. Enter ID and password
3. Add a name (e.g.: scrapoxy-12)
4. Add a description (e.g.: scrapoxy)
5. Click on Create keys
Step 2: Save application credentials
Remember Application Key and Application Secret:
Step 3: Get the consumerKey
Use Scrapoxy to get your key:
scrapoxy ovh-consumerkey <endpoint> <Application Key> <Application Secret>
Endpoints are:
• ovh-eu for OVH Europe;
• ovh-ca for OVH North-America.
Remember the consumerKey and click on the validation URL to validate the key.
Step 4: Add permission
When you open the validation URL:
1. Enter ID and password
2. Choose Unlimited validity
3. Click on Authorize Access
Tutorial: OVH Cloud - Create a project
Step 1: Connect to your OVH dashboard
Go to OVH Cloud.
Step 2: Create a new project
1. Click on Create a new project
2. Fill project name
3. Click on Create the project
Step 3: Save serviceID
Remember the ID in the URL: it is the project serviceId.
Tutorial: OVH Cloud - Create an SSH key
Step 1: Connect to your new project
Go to OVH dashboard and select your new project.
Step 2: Create a new key
1. Click on the tab SSH keys
2. Click on Add a key
Step 3: Enter the key
1. Set the name of the key to mykey
2. Copy your key
3. Click on Add the key
You can generate your key with this tutorial on GitHub.
And remember the name of the key!
Tutorial: OVH Cloud - Create a proxy image
Step 1: Connect to your new project
Go to OVH dashboard and select your new project.
Step 2: Create a new server
1. Click on Infrastructure
2. Click on Add
3. Click on Add a server
Step 3: Change the configuration of server
1. Change the type of server to VPS-SSD-1 (cheapest)
2. Change the distribution to Ubuntu
3. Change region to GRA1 (or another if you want)
4. Change the SSH key to mykey
5. Click on Launch now
Step 4: Get login information
When the instance is ready:
1. Click on the v on the top right corner
2. Click on Login information
Step 5: Connect to the instance
Remember the SSH command.
Step 6: Install the proxy
Connect to the instance and install the proxy:
sudo apt-get install curl
and:
curl --silent --location https://deb.nodesource.com/setup_8.x | sudo bash -
and:
sudo apt-get install --yes nodejs
and:
curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxy.js | sudo tee /root/proxy.js > /dev/null
and:
curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxyup.sh | sudo tee /etc/init.d/proxyup.sh > /dev/null
and:
sudo chmod a+x /etc/init.d/proxyup.sh
and:
sudo update-rc.d proxyup.sh defaults
and:
sudo /etc/init.d/proxyup.sh start
Step 7: Create a backup
Go back on the OVH project dashboard:
1. Click on the v on the top right corner
2. Click on Create a backup
Step 8: Start the backup
1. Enter the snapshot name forward-proxy
2. Click on Take a snapshot
You may need to wait between 10 minutes and 1 hour.
2.8 Vscale
Vscale is a Russian cloud platform, like DigitalOcean.
Note: IP addresses are updated every hour. If the instances are restarted too quickly, they will have the same IP address.
2.8.1 Get started
Step 1: Get your Vscale credentials
See Get Vscale credentials.
Remember your token.
Step 2: Create an SSH key for the project
See Create an SSH key.
Remember your SSH key name (mykey).
Step 3: Create an image
See Create an image.
Remember your image name (forward-proxy).
Step 4: Update configuration
Open conf.json:
{
    "providers": [
        {
            "type": "vscale",
            "token": "YOUR PERSONAL TOKEN",
            "region": "YOUR REGION (could be: msk0, spb0)",
            "name": "YOUR SERVER NAME",
            "sshKeyName": "YOUR SSH KEY (could be: mykey)",
            "plan": "YOUR PLAN (could be: small)"
        }
    ]
}
And update config with your parameters.
2.8.2 Configure Scrapoxy
Options: vscale
• type: must be vscale
• token (default: none): credentials for Vscale
• region (default: none): Vscale region (example: msk0, spb0)
• sshKeyName (default: none): name of the SSH key
• plan (default: none): type of plan (example: small)
• name (default: Proxy): name of the scalet
• imageName (default: none): name of the image (for the proxy scalet)
• max (default: none): maximum number of instances for this provider; if empty, there is no maximum
2.8.3 Tutorials
Tutorial: Vscale - Get credentials
Step 1: Connect to your Vscale console
Go to Vscale console.
Step 2: Generate a new token
1. Click on SETTINGS
2. Click on API tokens
3. Click on CREATE TOKEN
Step 3: Create a new token
1. Enter scrapoxy in Token description
2. Click on GENERATE TOKEN
Step 4: Get the credentials
1. Click on COPY TO CLIPBOARD
2. Click on SAVE
Remember the token
Tutorial: Vscale - Create an SSH key
Step 1: Connect to your Vscale console
Go to Vscale console.
Step 2: Add a new key
1. Click on SETTINGS
2. Click on SSH keys
3. Click on ADD SSH KEY
Step 3: Create a new key
1. Enter mykey for the name
2. Paste your SSH key
3. Click on ADD
You can generate your key with this tutorial on GitHub.
And remember the name of the key!
Tutorial: Vscale - Create an image
Step 1: Connect to your Vscale console
Go to Vscale console.
Step 2: Create a new scalet
1. Click on SERVERS
2. Click on CREATE SERVER
Step 3: Change the configuration of scalet
Choose an image Ubuntu 16.04 64bit:
Choose the smallest configuration:
Choose a datacenter (e.g.: Moscow):
Use the SSH key named mykey:
Step 4: Create the scalet
Click on CREATE SERVER
Step 5: Connect to the instance
Get the IP:
And enter this command in a terminal:
ssh root@<replace by IP>
Step 6: Install the proxy
Install proxy with:
sudo apt-get update
and:
sudo apt-get install --yes curl
and:
curl --silent --location https://deb.nodesource.com/setup_8.x | sudo bash -
and:
sudo apt-get install --yes nodejs
and:
curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxy.js | sudo tee /root/proxy.js > /dev/null
and:
curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxyup.sh | sudo tee /etc/init.d/proxyup.sh > /dev/null
and:
sudo chmod a+x /etc/init.d/proxyup.sh
and:
sudo update-rc.d proxyup.sh defaults
and:
sudo /etc/init.d/proxyup.sh start
Step 7: Power off the scalet
1. Stop the last command (CTRL-C)
2. Power off the scalet:
sudo poweroff
The status is Offline when the scalet is off:
Step 8: Open the scalet
Click on the scalet:
Step 9: Prepare backup
Click on the scalet:
Click on Create backup
Step 10: Create backup
1. Enter forward-proxy in Name
2. Click on CREATE BACKUP
Wait a few minutes and the new image appears:
2.9 Manage Scrapoxy with a GUI
2.9.1 Connect
You can access the GUI at http://localhost:8889/
2.9.2 Login
Enter your password.
The password is defined in the configuration file, key commander.password.
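For example, the corresponding fragment of conf.json might look like this (other keys omitted):

```json
{
    "commander": {
        "password": "CHANGE_THIS_PASSWORD"
    }
}
```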
2.9.3 Dashboard
Scrapoxy GUI has many pages:
• Instances. This page contains the list of instances managed by Scrapoxy;
• Stats. This page contains statistics on the use of Scrapoxy.
The login page redirects to the Instances page.
2.9.4 Page: Instances
Scaling
This panel shows the number of instances.
Scrapoxy has 3 settings:
• Min. The desired count of instances when Scrapoxy is asleep;
• Max. The desired count of instances when Scrapoxy is awake;
• Required. The current target count of instances.
To add or remove an instance, click on the Scaling button and change the Required setting:
Status of an instance
Each instance is described in a panel.
This panel shows several pieces of information:
• Name of the instance;
• IP of the instance;
• Provider type;
• Instance status on the provider;
• Instance status in Scrapoxy.
Scrapoxy only relays requests to instances which are started and alive ( + ).
Type of provider
AWS / EC2
OVH Cloud
DigitalOcean
VScale
Status in the provider
Starting
Started
Stopping
Stopped
Status in Scrapoxy
Alive
Dead
Remove an instance
Click on the instance to delete it.
The instance stops and is replaced by another.
2.9.5 Page: Stats
There are 3 panels in stats:
• Global stats. This panel contains global stats;
• Requests. This panel contains the count of requests;
• Flow. This panel contains the flow requests.
Global
This panel has 4 indicators:
• the total count of requests to monitor performance;
• the total count of received and sent data to control the volume of data;
• the total count of stop-instance orders, to monitor anti-blacklisting;
• the count of requests received by each instance (minimum, average, maximum), to check anti-blacklisting performance.
Requests
This panel combines 2 statistics on 1 chart.
It measures:
• the count of requests per minute;
• the average execution time of a request (round trip), per minute.
Flow
This panel combines 2 statistics on 1 chart.
It measures:
• the flow received by Scrapoxy;
• the flow sent by Scrapoxy.
How to increase the number of requests per minute ?
Add new instances (or new scrapers).
Does the number of requests per minute increase ?
• Yes: Perfect!
• No: You are paying for instances that bring nothing.
Am I overloading the target website ?
Add new instances (or new scrapers).
Did the response time increase ?
• Yes: The target website is overloaded.
• No: Perfect!
2.10 Understand Scrapoxy
2.10.1 Architecture
Scrapoxy consists of 4 parts:
• the master, which routes requests to proxies;
• the manager, which starts and stops proxies;
• the commander, which provides a REST API to receive orders;
• the gui, which connects to the REST API.
When Scrapoxy starts, the manager starts a new instance (if necessary) on the cloud.
When the scraper sends an HTTP request, the manager starts all the other proxies.
2.10.2 Requests
Can Scrapoxy relay HTTPS requests ?
Yes. There are 3 modes:
Mode A: HTTPS CONNECT with MITM (Man-In-The-Middle)
This mode is for unsecure browsers like PhantomJS. It allows Scrapoxy to decrypt SSL and override HTTP headers (like User-Agent).
This solution can trigger some SSL alerts.
Mode B: HTTPS CONNECT without MITM
This mode is for secure browsers. It doesn't allow Scrapoxy to override HTTP headers (like User-Agent). You must set the User-Agent manually.
The best solution is to use only 1 User-Agent (it would be strange to have multiple User-Agents coming from 1 IP, wouldn't it?).
Mode C: HTTPS over HTTP (or no tunnel mode)
This mode is for scrapers. It allows Scrapoxy to override HTTP headers (like User-Agent).
The scraper must send an HTTP request with the HTTPS URL in the Location header.
Example:
GET /index.html
Host: localhost:8888
Location: https://www.google.com/index.html
Accept: text/html
The scraper sends a GET (or POST) method instead of CONNECT to the proxy.
With Scrapy (Python), add /?noconnect to the proxy URL:
PROXY = 'http://localhost:8888/?noconnect'
With Request (Node.js), add tunnel:false to options:
request({
    method: 'GET',
    url: 'https://api.ipify.org/',
    tunnel: false,
    proxy: 'http://localhost:8888',
}, (err, response, body) => { ... });
Which proxy returned the response ?
Scrapoxy adds an HTTP header x-cache-proxyname to the response.
This header contains the name of the proxy.
If you use HTTPS CONNECT without MITM, Scrapoxy is unable to add this header, since the traffic is encrypted.
Can the scraper force the request to go through a specific proxy?
Yes. The scraper adds the proxy name in the header x-cache-proxyname.
When the scraper receives a response, this header is extracted. The scraper adds this header to the next request.
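This round-trip can be sketched in Node.js as follows (a minimal illustration; the helper name stickyHeaders is ours, and it assumes the response object exposes a plain headers map, as the request library does):

```javascript
// Name of the header Scrapoxy uses to identify the instance.
const PROXYNAME_HEADER = 'x-cache-proxyname';

// Extract the proxy name from the previous response, so it can be
// merged into the headers of the next request (sticky instance).
function stickyHeaders(previousResponse) {
    const name = previousResponse && previousResponse.headers[PROXYNAME_HEADER];
    return name ? {[PROXYNAME_HEADER]: name} : {};
}
```

Merge the returned object into the headers of the next request options to keep hitting the same instance.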
Does Scrapoxy override User Agent ?
Yes. When an instance starts (or restarts), it gets a random User Agent (from the User Agent list).
When the instance receives a request, it overrides the User Agent.
2.10.3 Blacklisting
How can you manage blacklisted response ?
Remember, Scrapoxy cannot detect a blacklisted response itself, because detection is too specific to each scraping use case. It can be a 503 HTTP response, a captcha, a slower response, etc.
Anti-blacklisting is a job for the scraper:
1. The scraper must detect a blacklisted response;
2. The scraper extracts the name of the instance from the HTTP response header (see here);
3. The scraper asks Scrapoxy to remove the instance via the API (see here).
When a blacklisted response is detected, Scrapoxy replaces the instance with a new one (and a new IP address).
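As a sketch, the scraper side of this flow could look like the snippet below (the helper names and the status codes are illustrative and must be adapted to your target site; the stop endpoint and Authorization header are described in the REST API section):

```javascript
// Status codes *we* treat as "blacklisted" for our target site;
// adapt this detection to your own use case (captcha, empty page, etc.).
const BLACKLISTED_STATUSES = [429, 503];

function isBlacklisted(statusCode) {
    return BLACKLISTED_STATUSES.indexOf(statusCode) >= 0;
}

// Build the options of the "stop instance" request for the commander API.
// Pass the returned object to request() to actually send it.
function buildStopRequest(proxyName, password) {
    return {
        method: 'POST',
        url: 'http://localhost:8889/api/instances/stop',
        json: {name: proxyName},
        headers: {
            'Authorization': Buffer.from(password).toString('base64'),
        },
    };
}
```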
There is a tutorial: Manage blacklisted request with Scrapy.
2.10.4 Instances management
How does multi-providers work ?
In the configuration file, you can specify multiple providers (the providers field is an array).
You can also specify the maximum number of instances per provider with the max parameter (for example: a maximum of 2 instances for AWS / EC2 and unlimited for DigitalOcean).
When several instances are requested, the algorithm randomly distributes the instance requests across the providers, within the specified capacities.
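The allocation can be sketched like this (a simplified illustration of the idea, not Scrapoxy's actual implementation):

```javascript
// Spread `requested` instances randomly across providers,
// honouring each provider's optional `max` capacity.
function allocateInstances(providers, requested) {
    const counts = providers.map(() => 0);

    for (let i = 0; i < requested; ++i) {
        // Indexes of the providers that still have capacity
        const candidates = providers
            .map((p, idx) => idx)
            .filter((idx) => providers[idx].max == null || counts[idx] < providers[idx].max);

        if (candidates.length === 0) break; // total capacity reached

        // Pick one of them at random
        const idx = candidates[Math.floor(Math.random() * candidates.length)];
        counts[idx] += 1;
    }

    return counts;
}
```

For example, allocateInstances([{type: 'awsec2', max: 2}, {type: 'digitalocean'}], 10) never assigns more than 2 instances to the first provider.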
How does the monitoring mechanism work ?
1. the manager asks the cloud how many instances are alive: this is the initial state;
2. the manager creates a target state, with the new count of instances;
3. the manager generates the commands to reach the target state from the initial state;
4. the manager sends the commands to the cloud.
These steps are very important because you cannot guess the initial state: an instance may be dead!
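Steps 2 and 3 boil down to a small reconciliation: diff the observed count against the target count and emit commands. A sketch of the idea (our own simplification, not Scrapoxy's code):

```javascript
// Compute the commands needed to move from the observed state
// (instances actually alive on the cloud) to the target count.
function reconcile(aliveCount, targetCount) {
    const commands = [];

    for (let i = aliveCount; i < targetCount; ++i) {
        commands.push('start'); // not enough instances: start new ones
    }
    for (let i = aliveCount; i > targetCount; --i) {
        commands.push('stop'); // too many instances: stop some
    }

    return commands;
}
```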
Scrapoxy can restart an instance if:
• the instance is dead (stopped status or no ping);
• the living limit is reached: Scrapoxy regularly restarts instances to change their IP addresses.
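Both conditions can be expressed as a single predicate (a sketch; the field names instance.status, instance.alive and instance.startedAtMs are ours, not Scrapoxy's):

```javascript
// Decide whether an instance should be recycled.
function shouldRestartInstance(instance, nowMs, livingLimitMs) {
    // Dead: stopped on the provider, or not answering the ping
    if (instance.status === 'stopped' || !instance.alive) {
        return true;
    }
    // Living limit reached: restart to get a fresh IP address
    return (nowMs - instance.startedAtMs) >= livingLimitMs;
}
```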
Do you need to create a VM image ?
By default, we provide an AMI proxy instance on AWS / EC2. It is a CONNECT proxy listening on TCP port 3128.
But you can use any software which accepts the CONNECT method (Squid, Tinyproxy, etc.).
Can you leave Scrapoxy running ?
Yes. Scrapoxy has 2 modes: an awake mode and an asleep mode.
When Scrapoxy receives no requests for a while, it falls asleep and sets the count of instances to the minimum (instance.scaling.min).
When Scrapoxy receives a request, it wakes up and sets the count of instances to the maximum (instance.scaling.max).
Scrapoxy needs at least 1 instance to receive the awake request.
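The awake/asleep decision can be summed up as follows (a sketch; the helper and its parameters are illustrative, and scaling mirrors instance.scaling):

```javascript
// Return the target count of instances from the time of the last request.
function targetInstanceCount(lastRequestAtMs, nowMs, idleDelayMs, scaling) {
    // Asleep when no request has been received for longer than the idle delay
    const asleep = (nowMs - lastRequestAtMs) > idleDelayMs;
    return asleep ? scaling.min : scaling.max;
}
```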
2.11 Control Scrapoxy with a REST API
2.11.1 Endpoint
You can access the commander API at http://localhost:8889/api
2.11.2 Authentication
Every request must have an Authorization header.
The value is the Base64 encoding of the password set in the configuration (commander/password).
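For example, the header value can be computed in Node.js like this:

```javascript
// The Authorization value is the Base64 encoding of the commander password.
function buildAuthorization(password) {
    return Buffer.from(password).toString('base64');
}
```

Set it on every request, e.g. headers: {'Authorization': buildAuthorization(password)}.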
2.11.3 Instances
Get all instances
Request
GET http://localhost:8889/api/instances
Response (JSON)
Status: 200
The body contains all information about the instances.
Example:
const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD';

const opts = {
    method: 'GET',
    url: 'http://localhost:8889/api/instances',
    headers: {
        'Authorization': Buffer.from(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('Error: ', err);

    console.log('Status: %d\n\n', res.statusCode);

    const bodyParsed = JSON.parse(body);
    console.log(bodyParsed);
});
Stop an instance
Request
POST http://localhost:8889/api/instances/stop
JSON payload:
{
    "name": "<name of the proxy>"
}
Response (JSON)
Status: 200
The instance exists.
Scrapoxy stops it, and the instance is restarted with a new IP address.
The body contains the remaining count of alive instances.
{
    "alive": <count>
}
Status: 404
The instance does not exist.
Example:
const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD',
    instanceName = 'YOUR INSTANCE NAME';

const opts = {
    method: 'POST',
    url: 'http://localhost:8889/api/instances/stop',
    json: {
        name: instanceName,
    },
    headers: {
        'Authorization': Buffer.from(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('Error: ', err);

    console.log('Status: %d\n\n', res.statusCode);
    console.log(body);
});
2.11.4 Scaling
Get the scaling
Request
GET http://localhost:8889/api/scaling
Response (JSON)
Status: 200
The body contains all the configuration of the scaling.
Example:
const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD';

const opts = {
    method: 'GET',
    url: 'http://localhost:8889/api/scaling',
    headers: {
        'Authorization': Buffer.from(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('Error: ', err);

    console.log('Status: %d\n\n', res.statusCode);

    const bodyParsed = JSON.parse(body);
    console.log(bodyParsed);
});
Update the scaling
Request
PATCH http://localhost:8889/api/scaling
JSON payload:
{
    "min": "min_scaling",
    "required": "required_scaling",
    "max": "max_scaling"
}
Response (JSON)
Status: 200
The scaling is updated.
Status: 204
The scaling is not updated.
Example:
const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD';

const opts = {
    method: 'PATCH',
    url: 'http://localhost:8889/api/scaling',
    json: {
        min: 1,
        required: 5,
        max: 10,
    },
    headers: {
        'Authorization': Buffer.from(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('Error: ', err);

    console.log('Status: %d\n\n', res.statusCode);
    console.log(body);
});
2.11.5 Configuration
Get the configuration
Request
GET http://localhost:8889/api/config
Response (JSON)
Status: 200
The body contains all the configuration of Scrapoxy (including scaling).
Example:
const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD';

const opts = {
    method: 'GET',
    url: 'http://localhost:8889/api/config',
    headers: {
        'Authorization': Buffer.from(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('Error: ', err);

    console.log('Status: %d\n\n', res.statusCode);

    const bodyParsed = JSON.parse(body);
    console.log(bodyParsed);
});
Update the configuration
Request
PATCH http://localhost:8889/api/config
JSON payload:
{
    "key_to_override": "<new_value>",
    "section": {
        "key2_to_override": "<new value>"
    }
}
Response (JSON)
Status: 200
The configuration is updated.
Status: 204
The configuration is not updated.
Example:
const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD';

const opts = {
    method: 'PATCH',
    url: 'http://localhost:8889/api/config',
    json: {
        instance: {
            scaling: {
                max: 300,
            },
        },
    },
    headers: {
        'Authorization': Buffer.from(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('Error: ', err);

    console.log('Status: %d\n\n', res.statusCode);
    console.log(body);
});
2.12 Secure Scrapoxy
2.12.1 Secure Scrapoxy with Basic auth
Scrapoxy supports standard HTTP Basic auth (RFC 2617).
Step 1: Add username and password in configuration
Open conf.json and add an auth section inside the proxy section (see Configure Scrapoxy):
{
    "proxy": {
        "auth": {
            "username": "myuser",
            "password": "mypassword"
        }
    }
}
Step 2: Add username and password to the scraper
Configure your scraper to use username and password:
The URL is: http://myuser:mypassword@localhost:8888
(replace myuser and mypassword with your credentials).
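If you build the URL programmatically, a small helper avoids escaping mistakes (a sketch; the helper name is ours, and myuser/mypassword are placeholders):

```javascript
// Build a proxy URL with Basic auth credentials embedded.
function buildProxyUrl(username, password, host, port) {
    return `http://${encodeURIComponent(username)}:${encodeURIComponent(password)}@${host}:${port}`;
}
```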
Step 3: Test credentials
Open a new terminal and test Scrapoxy without credentials:
scrapoxy test http://localhost:8888
It doesn’t work.
Now, test Scrapoxy with your credentials:
scrapoxy test http://myuser:mypassword@localhost:8888
(replace myuser and mypassword with your credentials).
2.12.2 Secure Scrapoxy with a firewall on Ubuntu
UFW simplifies IPTables on Ubuntu (>14.04).
Step 1: Allow SSH
sudo ufw allow ssh
Step 2: Allow Scrapoxy
sudo ufw allow 8888/tcp
sudo ufw allow 8889/tcp
Step 3: Enable UFW
sudo ufw enable
Enter y.
Step 4: Check UFW status and rules
sudo ufw status
2.13 Launch Scrapoxy at startup
2.13.1 Prerequisite
Scrapoxy is installed with a valid configuration (see Quick Start).
2.13.2 Step 1: Install PM2
sudo npm install -g pm2
2.13.3 Step 2: Launch PM2 at instance startup
sudo pm2 startup ubuntu -u <YOUR USERNAME>
1. Replace ubuntu by your distribution name (see PM2 documentation).
2. Replace YOUR USERNAME by your Linux username
2.13.4 Step 3: Create a PM2 configuration
Create a PM2 configuration file scrapoxy.json5 for Scrapoxy:
{
    apps: [
        {
            name: "scrapoxy",
            script: "/usr/bin/scrapoxy",
            args: ["start", "<ABSOLUTE PATH TO CONFIG>/conf.json", "-d"],
            kill_timeout: 30000,
        },
    ],
}
2.13.5 Step 4: Start configuration
pm2 start scrapoxy.json5
2.13.6 Step 5: Save configuration
pm2 save
2.13.7 Step 6: Stop PM2 (optional)
If you need to stop Scrapoxy in PM2:
pm2 stop scrapoxy.json5
2.14 Integrate Scrapoxy to Scrapy
2.14.1 Goal
Is it easy to find a good Python developer in Paris ? No!
So, it’s time to build a scraper with Scrapy to find our perfect profile.
The site Scraping Challenge indexes a lot of profiles (fake ones, for demo purposes). We want to grab them and create a CSV file.
However, the site is protected against scraping! We must use Scrapoxy to bypass the protection.
2.14.2 Step 1: Install Scrapy
Install Python 2.7
Scrapy works with Python 2.7
Install dependencies
On Ubuntu:
apt-get install python-dev libxml2-dev libxslt1-dev libffi-dev
On Windows (with Babun):
wget https://bootstrap.pypa.io/ez_setup.py -O - | python
easy_install pip
pact install libffi-devel libxml2-devel libxslt-devel
Install Scrapy and Scrapoxy connector
pip install scrapy scrapoxy
2.14.3 Step 2: Create the scraper myscraper
Create a new project
Bootstrap the skeleton of the project:
scrapy startproject myscraper
cd myscraper
Add a new spider
Add this content to myscraper/spiders/scraper.py:
# -*- coding: utf-8 -*-
from scrapy import Request, Spider


class Scraper(Spider):
    name = u'scraper'

    def start_requests(self):
        """This is our first request, to grab all the urls of the profiles."""
        yield Request(
            url=u'http://scraping-challenge-2.herokuapp.com',
            callback=self.parse,
        )

    def parse(self, response):
        """We have all the urls of the profiles. Let's make a request for each profile."""
        urls = response.xpath(u'//a/@href').extract()
        for url in urls:
            yield Request(
                url=response.urljoin(url),
                callback=self.parse_profile,
            )

    def parse_profile(self, response):
        """We have a profile. Let's extract the name."""
        name_el = response.css(u'.profile-info-name::text').extract()
        if len(name_el) > 0:
            yield {
                'name': name_el[0],
            }
If you want to learn more about Scrapy, check out this tutorial.
Run the spider
Let’s try our new scraper!
Run this command:
scrapy crawl scraper -o profiles.csv
Scrapy scrapes the site and extracts the profiles to profiles.csv.
However, Scraping Challenge is protected! profiles.csv is empty...
We will integrate Scrapoxy to bypass the protection.
2.14.4 Step 3: Integrate Scrapoxy to Scrapy
Install Scrapoxy
See Quick Start to install Scrapoxy.
Start Scrapoxy
Set the maximum number of instances to 6, and start Scrapoxy (see Change scaling with GUI).
Warning: Don't forget to set the maximum number of instances!
Edit settings of the Scraper
Add this content to myscraper/settings.py:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
RETRY_TIMES = 0

# PROXY
PROXY = 'http://127.0.0.1:8888/?noconnect'

# SCRAPOXY
API_SCRAPOXY = 'http://127.0.0.1:8889/api'
API_SCRAPOXY_PASSWORD = 'CHANGE_THIS_PASSWORD'

DOWNLOADER_MIDDLEWARES = {
    'scrapoxy.downloadmiddlewares.proxy.ProxyMiddleware': 100,
    'scrapoxy.downloadmiddlewares.wait.WaitMiddleware': 101,
    'scrapoxy.downloadmiddlewares.scale.ScaleMiddleware': 102,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
}
What are these middlewares ?
• ProxyMiddleware relays requests to Scrapoxy. It is a helper to set the PROXY parameter.
• WaitMiddleware stops the scraper and waits for Scrapoxy to be ready.
• ScaleMiddleware asks Scrapoxy to maximize the number of instances at the beginning, and to stop them at the end.
Note: ScaleMiddleware stops the scraper like WaitMiddleware. After 2 minutes, all instances are ready and the scraper continues to scrape.
Warning: Don’t forget to change the password!
Run the spider
Run this command:
scrapy crawl scraper -o profiles.csv
Now, all profiles are saved to profiles.csv!
2.15 Integrate Scrapoxy to Node.js
2.15.1 Goal
Is it easy to find a good Javascript developer in Paris ? No!
So, it’s time to build a scraper with Node.js, Request and Cheerio to find our perfect profile.
The site Scraping Challenge indexes a lot of profiles (fake ones, for demo purposes). We want to list them.
However, the site is protected against scraping! We must use Scrapoxy to bypass the protection.
2.15.2 Step 1: Create a Node.js project
Install Node.js
Install the latest Node.js version.
Create a new project
Create a directory for the project:
mkdir nodejs-request
cd nodejs-request
Create the package.json:
npm init --yes
Add dependencies:
npm install lodash bluebird cheerio request [email protected] --save
What are these dependencies ?
• lodash is a javascript helper,
• bluebird is a promise library,
• cheerio is a JQuery parser,
• request makes HTTP requests,
• winston is a logger.
Add a scraper
Add this content to index.js:
const _ = require('lodash'),
    Promise = require('bluebird'),
    cheerio = require('cheerio'),
    request = require('request'),
    winston = require('winston');

winston.level = 'debug';

const config = {
    // URL of the site
    source: 'http://scraping-challenge-2.herokuapp.com',
    opts: {},
};

// Get all URLs
getProfilesUrls(config.source, config.opts)
    .then((urls) => {
        winston.info('Found %d profiles', urls.length);
        winston.info('Wait 120 seconds to scale instances');
        return urls;
    })
    // Wait 2 minutes to scale instances
    .delay(2 * 60 * 1000)
    .then((urls) => {
        // Get the profiles one by one
        return Promise.map(urls,
            (url) => getProfile(url, config.opts)
                .then((profile) => {
                    winston.debug('Found %s', profile.name);
                    return profile;
                })
                .catch(() => {
                    winston.debug('Cannot retrieve %s', url);
                })
            , {concurrency: 1})
            .then((profiles) => {
                const results = _.compact(profiles);
                winston.info('Extract %d on %d profiles', results.length, urls.length);
            });
    })
    .catch((err) => winston.error('Error: ', err));

////////////

/**
 * Get all the urls of the profiles
 * @param url Main URL
 * @param defaultOpts options for http request
 * @returns {promise}
 */
function getProfilesUrls(url, defaultOpts) {
    return new Promise((resolve, reject) => {
        // Create options for the HTTP request:
        // add the URL to the default options
        const opts = _.merge({}, defaultOpts, {url});

        request(opts, (err, res, body) => {
            if (err) {
                return reject(err);
            }

            if (res.statusCode !== 200) {
                return reject(body);
            }

            // Load the content into a JQuery parser
            const $ = cheerio.load(body);

            // Extract all urls
            const urls = $('.profile a')
                .map((i, el) => $(el).attr('href'))
                .get()
                .map((url) => `${config.source}${url}`);

            resolve(urls);
        });
    });
}

/**
 * Get the profile and extract the name
 * @param url URL of the profile
 * @param defaultOpts options for http request
 * @returns {promise}
 */
function getProfile(url, defaultOpts) {
    return new Promise((resolve, reject) => {
        // Create options for the HTTP request:
        // add the URL to the default options
        const opts = _.merge({}, defaultOpts, {url});

        request(opts, (err, res, body) => {
            if (err) {
                return reject(err);
            }

            if (res.statusCode !== 200) {
                return reject(body);
            }

            // Load the content into a JQuery parser
            const $ = cheerio.load(body);

            // Extract the name
            const name = $('.profile-info-name').text();

            resolve({name});
        });
    });
}
Run the script
Let’s try our new scraper!
Run this command:
node index.js
The script scrapes the site and lists the profiles.
However, Scraping Challenge is protected! All requests fail...
We will integrate Scrapoxy to bypass the protection.
2.15.3 Step 2: Integrate Scrapoxy to the script
Install Scrapoxy
See Quick Start to install Scrapoxy.
Start Scrapoxy
Set the maximum number of instances to 6, and start Scrapoxy (see Change scaling with GUI).
Warning: Don't forget to set the maximum number of instances!
Edit the script
Open index.js and modify the config value:
const config = {
    // URL of the site
    source: 'http://scraping-challenge-2.herokuapp.com',
    opts: {
        // URL of Scrapoxy
        proxy: 'http://localhost:8888',
        // HTTPS over HTTP
        tunnel: false,
    },
};
Run the script
Run this command:
node index.js
Now, all profiles are listed!
2.16 Manage blacklisted request with Scrapy
2.16.1 Goal
A scraper is downloading pages of a website.
However, the website rate-limits by IP. When the scraper downloads 10 pages, the website returns only an empty page with an HTTP 429 status.
Must the scraper wait when the limit is reached ? No!
The scraper has to ask Scrapoxy to replace the instance.
2.16.2 Step 1: Create a Scraper
See Integrate Scrapoxy to Scrapy to create the scraper.
Edit the settings of the scraper
Add this content to myscraper/settings.py:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
RETRY_TIMES = 0

# PROXY
PROXY = 'http://127.0.0.1:8888/?noconnect'

# SCRAPOXY
API_SCRAPOXY = 'http://127.0.0.1:8889/api'
API_SCRAPOXY_PASSWORD = 'CHANGE_THIS_PASSWORD'

# BLACKLISTING
BLACKLIST_HTTP_STATUS_CODES = [429]

DOWNLOADER_MIDDLEWARES = {
    'scrapoxy.downloadmiddlewares.proxy.ProxyMiddleware': 100,
    'scrapoxy.downloadmiddlewares.wait.WaitMiddleware': 101,
    'scrapoxy.downloadmiddlewares.scale.ScaleMiddleware': 102,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapoxy.downloadmiddlewares.blacklist.BlacklistDownloaderMiddleware': 950,
}
Warning: Don’t forget to change the password!
Edit the settings of Scrapoxy
Change the password of the commander in my-config.json:
"commander": {
    "password": "CHANGE_THIS_PASSWORD"
},
Warning: Don’t forget to change the password!
CHAPTER 4
Contribute
You can open an issue on this repository for any feedback (bug, question, request, pull request, etc.).