
www.anonymizer.com

Why Websites Block Spiders, Crawlers and Bots


There are quite a few reasons to have automated systems, like spiders, crawlers and bots, visit websites. Automated visits allow search engines to index pages, researchers to look up trends, and marketing firms to conduct competitive analyses. In many of these cases, the site operator actually wants their site visited by the automated system; however, there are times when restricting access is important. Below, we look at some of the reasons websites choose to block access by automated systems.

Competitive and/or Sensitive Information

Sites that sell their goods online will often want to reach their customers, but prefer not to provide the same information to their competitors. This seems to happen frequently in competitive markets, such as the consumer retail and travel industries. In these cases, a site might block access, or even provide incorrect information, if it is accessed from networks known to be associated with competitors. Many organizations customize the information they present on their website based on information they derive from the network address. Usually this is innocuous, but sometimes it's used to mislead the competition. One example is a news site that provides local news based on where it believes you are connecting from. Another is a site that adjusts the price of goods based on a visitor's browser, OS, or system type. (Mac users get to pay more!) The only way around this is to access the site from another network address while modifying the other attributes you are providing.
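To make the last point concrete, here is a minimal sketch in Python using the requests library of how a scraper might control the browser attributes it presents. The URL and header values are placeholders chosen for illustration, not values from any real site, and changing the network address itself requires routing traffic through a different egress point, which is outside the scope of this snippet.

```python
import requests

# Hypothetical example: the URL and header values below are placeholders.
# Sites that vary content by browser, OS, or system type read these
# attributes from the request headers, so a scraper can control what it
# presents. Changing the apparent network address requires a different
# egress point (for example, a proxy), which is not shown here.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://example.com/pricing", headers=headers, timeout=30)
print(response.status_code, len(response.text))
```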

Expense

In our experience, websites also start blocking access because automated systems are driving their costs higher than expected. Many system administrators first realize that their site is being scraped when they get their monthly bandwidth bill: a site that normally receives a small number of visits and suddenly gets thousands of visits per day can see its bandwidth charges escalate quickly. If your system gets blocked shortly after the start of the month, it is likely because the operator has just seen the bill reflecting your high traffic over a very short period of time.

There is little a website can do to control this other than block the IP address(es) that are frequently visiting the site. Sites that provide services will also look to restrict access. Search engines frequently fall into this category, as they make money by advertising and need that revenue to offset the expense of running their infrastructure. Since automated systems tend not to make purchases or be swayed by advertising, they simply increase the cost of running the site. The best you can do to minimize your risk of being blocked is to spread your traffic out over a longer time period and across a large pool of IP addresses.
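As a rough illustration of spreading traffic over time, the sketch below fetches a small batch of pages with randomized pauses between requests. The URL list and delay bounds are assumptions made for the example; distributing the same requests across a pool of IP addresses additionally requires multiple egress addresses and is not shown.

```python
import random
import time

import requests

# Hypothetical example: the URL list and delay bounds are placeholders.
URLS = [f"https://example.com/page/{i}" for i in range(1, 11)]

MIN_DELAY = 5.0   # assumed polite minimum pause between requests (seconds)
MAX_DELAY = 15.0  # jitter so requests do not arrive on a fixed schedule

for url in URLS:
    response = requests.get(url, timeout=30)
    print(url, response.status_code)
    # A randomized pause spreads the load over a longer period and keeps
    # the per-hour request volume low.
    time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))
```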

Performance

Sites also start to restrict access if the load from automated systems is degrading the site's performance. Many smaller websites don't run on a farm of servers in a data center with high-speed network access. Rather, they operate on old hardware that happened to be available and share an Internet connection with the rest of the organization. This arrangement is generally adequate when the volume of traffic to the site is low, but if a powerful machine with a fast Internet connection starts aggressively scraping the site, the traffic tends to look like a Denial of Service (DoS) attack.


The best way to avoid this is to monitor the response times from the site you are crawling and increase your traffic to the site in a controlled way. It may take longer to get the information you want, but that can mean the difference between success and being permanently blocked. Note that some sites provide a robots.txt file, which you should attempt to honor. Often, the sections of the site they ask robots to ignore are computationally intensive, so requesting them aggressively is likely to cause performance problems for the entire site.
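As a simplified sketch of both suggestions, the Python snippet below checks robots.txt before each fetch and backs off when response times climb, using response time as a rough proxy for the load the crawl places on the server. The base URL, paths, user-agent string, and thresholds are placeholder assumptions for illustration only.

```python
import time
from urllib import robotparser

import requests

# Hypothetical example: the base URL, paths, user agent, and thresholds
# below are placeholders, not values from any real site.
BASE = "https://example.com"
USER_AGENT = "example-crawler/1.0"

rp = robotparser.RobotFileParser(BASE + "/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

delay = 5.0  # starting pause between requests, in seconds

for path in ["/", "/catalog", "/reports"]:
    url = BASE + path
    if not rp.can_fetch(USER_AGENT, url):
        print("robots.txt disallows", url)
        continue

    start = time.monotonic()
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    elapsed = time.monotonic() - start

    # If responses are slowing down, assume the crawl is part of the
    # problem and back off; otherwise ease the delay back down gently.
    if elapsed > 2.0:
        delay = min(delay * 2, 120.0)
    else:
        delay = max(delay * 0.9, 5.0)

    print(url, response.status_code, f"{elapsed:.2f}s; next delay {delay:.1f}s")
    time.sleep(delay)
```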

Privacy, Copyright and Legal Issues

Sometimes a site is designed to make data available to a limited audience, or in a limited way, for legal reasons. In these cases, the site may have restrictions on where it can be accessed from, how frequently it can be accessed, or even when it can be accessed. When this occurs, the site is not specifically setting out to block automated access, but the net effect is the same. Examples include sites that distribute copyright-restricted content or that provide personal details, such as account information, to individuals. Generally speaking, accessing these sites in an automated fashion will violate their terms of service and should be avoided. However, accessing the site from a permitted geographic area might remove the block without violating the site's intent.

Preventing Bad Behavior

Some sites, like Web-based email systems, provide services that are natural targets for abuse. These providers don’t want to be used to aid spammers in sending unwanted email, so they will take aggressive measures to prevent automated access. This will likely include CAPTCHAs, text message confirmation, or other approaches that attempt to directly validate that a human is accessing the site. Generally, there is little legitimate reason to access these sites in an automated fashion.

What Can You Do?

If you have a legitimate reason for accessing a site with an automated tool and are running into issues despite using reasonable restraint, please take a moment to look at our Ion attribution control solutions (www.anonymizer.com). Ion by Anonymizer integrates easily with most commercial and open source web harvesting platforms. When you route your automated scraping activities through Ion, your IP address remains anonymous and unattributable, which helps prevent blocking countermeasures.

Contact us to learn more: [email protected]

or 800.921.2414

©2019 Anonymizer. All rights reserved.