
INTELLIGENT BOT MANAGEMENT | ACCOUNT TAKEOVER PREVENTION | SEAMLESS CDN INTEGRATION

STOP SOPHISTICATED BOTS ON YOUR WEBSITE & MOBILE APPS

POWERED BY BEHAVIOURAL ANALYSIS AND MACHINE LEARNING

Netacea’s intelligent and agile approach detects and stops threats including:

• CARD FRAUD
• CHECKOUT ABUSE
• WEBSITE RECON
• APPLICATION DDOS
• FAKE ACCOUNT CREATION
• BRUTE FORCE ATTACK
• CREDENTIAL STUFFING
• AD/CLICK FRAUD
• SKEWED ANALYTICS
• SCRAPING

WHY NETACEA?

• ADAPTIVE MACHINE LEARNING
• BEHAVIOURAL ANOMALY DETECTION
• FAST & ACCURATE
• DEFEND DATA & IP AGAINST THEFT
• PROTECT AGAINST REPUTATIONAL DAMAGE

• FRICTIONLESS USER EXPERIENCE
• CHOICE OF DEPLOYMENT OPTIONS
• COMPLEMENTS EXISTING WAF & CDNs
• ACTIONABLE THREAT INTELLIGENCE
• POLICY-BASED APPROACH

A NEW APPROACH TO ACCOUNT TAKEOVER ATTACKS

Learn more at www.netacea.com

FREE TRIAL


Andy Still

Managing and Mitigating Bots

The Automated Threat Guide

Beijing · Boston · Farnham · Sebastopol · Tokyo


978-1-492-02935-9

[LSI]

Managing and Mitigating Bots
by Andy Still

Copyright © 2018 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Virginia Wilson
Production Editor: Nicholas Adams
Copyeditor: Jasmine Kwityn
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
Tech Reviewers: Daniel Huddart, Andy Lole, and Jason Hand

March 2018: First Edition

Revision History for the First Edition
2018-03-16: First Release
2019-03-05: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Managing and Mitigating Bots, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Netacea. See our statement of editorial independence.


Table of Contents

Introduction

Part I. Background

1. What Is Automated Traffic?
     Key Characteristics of Automated Traffic
     Exclusions

2. Misconceptions of Automated Traffic
     Misconception: Bots Are Just Simple Automated Scripts
     Misconception: Bots Are Just a Security Problem
     Misconception: Bot Operators Are Just Individual Hackers
     Misconception: Only the Big Boys Need to Worry About Bots
     Misconception: I Have a WAF, I Don’t Need to Worry About Bot Activity

3. Impact of Automated Traffic
     Company Interests
     Other Users
     System Security
     Infrastructure

Part II. Types of Automated Traffic

4. Malicious Bots
     Application DDoS

5. Data Harvesting
     Search Engine Spiders
     Content Theft
     Price Scraping
     Content/Price Aggregation
     Affiliates
     User Data Harvesting

6. Checkout Abuse
     Scalpers
     Spinners
     Inventory Exhaustion
     Snipers
     Discount Abuse

7. Credit Card Fraud
     Card Validation
     Card Cracking
     Card Fraud

8. User-Generated Content (UGC) Abuse
     Content Spammer

9. Account Takeover
     Credential Stuffing/Credential Cracking
     Account Creation
     Bonus Abuse

10. Ad Fraud
     Background to Internet Advertising
     Banner Fraud
     Click Fraud
     CPA Fraud
     Cookie Stuffing
     Affiliate Fraud
     Arbitrage Fraud

11. Monitors
     Availability
     Performance
     Other

12. Human-Triggered Automated Traffic

Part III. How to Effectively Handle Automated Traffic in Your Business

13. Identifying Automated Traffic
     Indications of an Automated Traffic Problem
     Challenges
     Generation 0: Genesis—robots.txt
     Generation 1: Simple Blocking—Blacklisting and Whitelisting
     Generation 2: Early Bot Identification—Symptom Monitoring
     Generation 3: Improved Bot Identification—Real User Validation
     Generation 4: Sophisticated Bot Identification—Behavioral Analysis

14. Managing Automated Traffic
     Blocking
     Validation Requests
     Alternative Servers/Caching
     Alternative Content

Conclusion


Introduction

Web traffic consists of more than just the human users who visit your site. In fact, recent reports show that human users are becoming a minority. The rest belongs to an ever-expanding group of traffic that can be grouped under the heading automated traffic.

Terminology

The terms automated traffic, bot traffic, and non-human traffic are equally common and are used interchangeably throughout this book.

As long ago as 2014, Incapsula estimated that human traffic only accounted for as little as 39.5% of all traffic they saw. This trend is predicted to continue, with Cisco estimating that automated traffic will grow by 37% year on year until 2022.

However, this is not simply a growth in the quantity of automated traffic but also in the variety and sophistication of that traffic. New paradigms for interaction with the internet, more complex business models and interdependence between sources of data, evolution of shopping methods and habits, increased sophistication of criminal activity, and the availability of cloud-based computing capacity are all converging to create an automated traffic environment that is ever more challenging for a website owner to control.


It’s Not All Good or Bad
It is simplistic to think of automated traffic as being neatly divided into goodies and baddies; the truth is much more nuanced than that. As we’ll discuss, there are clear areas of good and bad traffic, but there is a gray area in between where you will need to assess the positivity or negativity for your situation.

This growth poses a number of fundamental questions for anyone with responsibility for maintaining the efficient operation or maximum profitability of a public-facing website:

• How much automated traffic is hitting my website?
• What is this traffic up to?
• How worried should I be about it?
• What can I do about it?

The rest of this book will help you understand how you can provide answers to these questions.
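As a first rough answer to the first question, a common starting point is to classify web server log requests by their self-declared User-Agent header. The marker list and sample strings below are illustrative assumptions, not drawn from the book, and this approach only catches honest bots; sophisticated bots spoof browser strings, which is why behavioral analysis is discussed later.

```python
from collections import Counter

# An illustrative, non-exhaustive list of bot markers (assumption: a real
# deployment would use a maintained database of crawler signatures).
BOT_MARKERS = ("bot", "spider", "crawler", "curl", "python-requests")

def classify_user_agent(user_agent: str) -> str:
    """Label a request 'automated' if its User-Agent contains a known marker."""
    ua = user_agent.lower()
    return "automated" if any(m in ua for m in BOT_MARKERS) else "human"

def traffic_breakdown(user_agents):
    """Count human vs. automated requests from an iterable of User-Agent strings."""
    return Counter(classify_user_agent(ua) for ua in user_agents)

# Hypothetical sample of User-Agent strings from a log file.
sample = [
    "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "python-requests/2.31.0",
]
```

A breakdown like `traffic_breakdown(sample)` gives only a lower bound on automated traffic, but it is often enough to show whether the problem is worth investigating further.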

Terminology

The challenge of automated traffic applies to anyone who runs a public-facing web-based system, whether that is a traditional public website, complex web-based application, SaaS system, web portal, or web-based API. For simplicity I will use the generic term website when referring to any of these systems.

Likewise I will use website owner to refer to the range of people who will be responsible for identifying and managing this problem—from security and platform managers to ecommerce and marketing directors.

I will use the term bot operator to identify the individual or group that is operating the automated traffic.


PART I

Background

Before going into detail about what automated traffic is doing on your website and how this can be addressed, it is important that we have a good shared understanding of what automated traffic encompasses.

The following chapters will give a brief introduction to core elements of automated traffic and clarify some of the common misconceptions that people hold about the nature and complexity of bot traffic and the bot operators.


CHAPTER 1

What Is Automated Traffic?

There is a range of different definitions of what can be classed as automated traffic.

For example, Frost & Sullivan describe bot traffic as “computer programs that are used to perform specific actions in an automated fashion,” Akamai has defined it as “automated software programs that interact with websites,” and Wikipedia defines a bot as “a software application that runs automated tasks (scripts) over the Internet,” whereas Hubspot says “A bot is a type of automated technology that’s programmed to execute certain tasks without human intervention.”

For the purposes of this book I will use the following description for automated traffic, which I feel captures the essential details of what is meant by the term and removes some of the vagaries included in the other descriptions:

Automated traffic is any set of legitimate requests made to a website that are made by an automated process rather than triggered by a direct human action.

History of Automated Traffic
The history of the type of automated traffic I am discussing here can be traced back to 1988 with the creation of IRC bots such as the Hunt the Wumpus game and Bill Wisner’s Bartender. It wasn’t until 1994, however, that the first search engine spiders were created by WebCrawler (later purchased by AOL). GoogleBot followed in 1996.

Key Characteristics of Automated Traffic
For the purposes of this book, I will have a limited definition of automated traffic; this is not to say that other types of automated traffic are not a concern, just that they are addressed elsewhere.

Web-based Systems
The automated traffic discussed in this book is targeted at web-based systems and excludes other types of traffic, such as automated emails.

Layer 7
Automated traffic operates at layer 7 of the OSI Model—in other words, it operates at the application level, making HTTP/HTTPS requests to websites and receiving responses in the same format. Anything that interacts with servers via any other means is classed as outside the scope of this book.
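To make the layer-7 point concrete, the sketch below builds (but does not send) an HTTP request that is indistinguishable at the protocol level from one a browser would send. The URL and header values are illustrative assumptions, not from the book.

```python
from urllib.request import Request

# Construct a GET request with browser-like headers. The URL and header
# values are hypothetical; a bot framework would vary them per request.
req = Request(
    "https://example.com/products?page=1",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-GB,en;q=0.9",
    },
    method="GET",
)

# At layer 7 the server sees only these bytes. Nothing in the request
# itself marks it as script-generated rather than human-triggered;
# urlopen(req) would deliver an ordinary-looking GET.
```

This is why single-request inspection is a weak signal for bot detection: the distinguishing evidence lives in patterns across many requests, not in any one of them.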

Legitimate Requests
Automated traffic is defined as traffic that makes legitimate requests to websites (i.e., requests formulated in the same way as those made by human users). This means that the automated traffic that is identified as negative is focused on exploiting weaknesses in the business logic of systems, not exploiting security weaknesses.

Exclusions
The following types of traffic, which could be categorized as automated traffic, have been excluded from any discussion within this book. The reason for this exclusion is that they are subjects in their own right and are well catered for in other literature, with a range of well-established products and solutions in existence to mitigate the issues created.

Their exclusion from this work does not imply that they are not worthy subjects of concern for website owners. They are, in fact, very real threats that should be handled as part of any website management strategy.

DDoS (Distributed Denial of Service)
DDoS is a low-level volumetric attack, designed to overwhelm the server by the quantity of requests being made. There are a wide range of different attacks that can be made to achieve this objective, all of which aim to exploit weaknesses in networking protocols. To mitigate this, there are well-established, dedicated DDoS management tools and services that can be put in place to minimize risk from DDoS attacks.

A variation on this called application DDoS aims to make large numbers of requests for certain, known pressure points within systems, with the intention of bringing the system to its knees. This will be discussed in more depth in Chapter 4.

Security Vulnerability Exploits
These types of exploits involve attempts to make illegitimate requests to a system with the aim of exploiting weaknesses within the security of a system, allowing the operator to gain control over the server or data within the application. Common examples include SQL injection and cross-site scripting.

Hackers constantly run automated scripts across the internet looking for sites/servers where these vulnerabilities have not been mitigated. Well-managed servers and good application development can protect systems from these exploits, but it is also good practice to use a web application firewall (WAF) to identify and block illegitimate requests to further minimize risk from these or future exploits.

These automated scans and attacks are a real threat and should be taken seriously by anyone who has responsibility for the security of a website.
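As a minimal illustration of the kind of static, per-request rule a WAF applies, the sketch below blocks query strings matching a toy SQL-injection pattern. The pattern is an assumption for illustration only; production rule sets (such as those shipped with real WAF products) are far larger and regularly updated.

```python
import re

# A toy WAF-style rule: reject query strings containing common
# SQL-injection fragments. Deliberately simplistic -- a sketch, not a
# production rule set.
SQLI_PATTERN = re.compile(
    r"('|%27)\s*(or|and)\s+|union\s+select|--",
    re.IGNORECASE,
)

def waf_allows(query_string: str) -> bool:
    """Return True if the query string passes the static rule check."""
    return not SQLI_PATTERN.search(query_string)
```

Note that a rule like this inspects each request in isolation, which is exactly why (as Chapter 2 discusses) a WAF is poorly suited to spotting bots whose individual requests are perfectly legitimate.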


CHAPTER 2

Misconceptions of Automated Traffic

As we’ve already discussed, the amount of automated traffic is growing consistently, and as it rises so too does the sophistication and complexity of the bot operators. Before discussing the activities of bot traffic in detail, it is worth addressing some of the common misconceptions that website owners may have about automated traffic.

Misconception: Bots Are Just Simple Automated Scripts
While this may have been accurate 15 years ago, the sophistication of bot traffic has been increasing massively as the technology and platforms available to bot operators improve, as the defenses in place become more sophisticated, and, most importantly, as the gains to be achieved increase.

Modern bots are sophisticated systems that will manage distribution of traffic across large-scale environments or large botnets and via multiple proxies in order to hide their activity among that of human users (even executing requests as part of a human session). Bots will routinely execute requests from real browsers and execute JavaScript sent to validate users as humans. Detection mechanisms such as CAPTCHA can be bypassed, either by using artificial intelligence or brute-force systems, or by employing farms of human agents to solve them on demand and pass the solution back to the bot. Bots are intelligent enough to integrate with these human services seamlessly.
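The hand-off to a human CAPTCHA-solving service described above can be sketched as control flow. Every function name here is a hypothetical stub, not a real API; actual solver services expose paid APIs whose details vary.

```python
# Control-flow sketch of a bot outsourcing a CAPTCHA to a human farm.
# All functions are hypothetical stubs for illustration.

def solve_captcha_via_farm(captcha_image: bytes) -> str:
    """Stub for a human-solver service: image in, solution text out."""
    return "XK7Q2"  # stand-in for a human worker's answer

def submit_login(username: str, password: str, captcha_answer: str) -> bool:
    """Stub for posting the login form with the solved CAPTCHA attached."""
    return bool(captcha_answer)

def attempt_login(username: str, password: str, captcha_image: bytes) -> bool:
    # The bot never "reads" the CAPTCHA itself; it outsources the image
    # and resumes its automated flow once the answer comes back.
    answer = solve_captcha_via_farm(captcha_image)
    return submit_login(username, password, answer)
```

The point of the sketch is that the CAPTCHA step adds latency but does not break the automation: the bot's flow simply pauses on a blocking call and continues.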

Botnet

Botnets are networks of compromised computers (usually infected by viruses or other malware) that can be accessed remotely and used to execute any processes defined by the botnet operator. Often this means they are used to send requests to remote machines over the internet.

They are more commonly associated with being used for DDoS attacks but can be used for automated traffic (e.g., account takeover or card validation attempts). There is an increasing number of botnets being made available for hire.

Multiple bot activities can be coordinated into a complete system. For example, data harvesting will be undertaken to get product details from a site to identify appropriate products to target, then checkout abuse will be undertaken to create more valuable advertising subjects, and finally ad fraud will be undertaken—and all of these activities can be viewed and coordinated from a central control panel.

Similarly, ticket touts will use spinner bots to hold a ticket and then trigger another bot to automatically add this ticket to a secondary ticketing site. When the ticket is sold the original bot will complete the purchase. A central management system is in place to see the status of tickets being held/purchased and to handle distribution of tickets to end purchasers. Additional software is used to then modify the downloaded tickets to reflect the new purchaser’s details.

These are just some examples of the sophistication seen in bot activity, and this level is increasing constantly to exploit weaknesses in systems, business logic, and practices and to stay ahead of the defense mechanisms that are constantly being improved.


Misconception: Bots Are Just a Security Problem
The challenge of managing automated traffic is often just dropped at the door of an information security officer (ISO) and the security department, if the company has one. For some types of automated traffic (such as credit card fraud) this makes absolute sense because it is definitively a security issue and should be handled as such.

However, some other types of automated traffic (such as price aggregators) are actually business considerations and should be managed as such by a relevant section of the business.

There are a number of other roles that may be involved in making decisions about the varying types of and challenges raised by automated traffic. These can include roles such as Head of Platform, Head of Ecommerce, Head of Ops, and Head of Marketing.

The ideal management solution will provide sufficient information to allow people in these roles to view details of and make informed decisions about how to manage the elements of automated traffic specific to their roles without being dependent on a black box security-based system.

Misconception: Bot Operators Are Just Individual Hackers
Obviously, we all know that there are extremely large organizations that operate automated traffic networks (think Google), and below that there is a group of organizations that are scraping data for legitimate purposes (price aggregators, etc.), but beyond that there is sometimes a sense of a distributed set of lone hackers developing software to perpetrate scams or to sell to companies to spy on their competitors.

While there is no doubt that such individuals exist, it is far from the truth about all bot operators. The amount of money that can be made with some types of automated traffic means that they are, in reality, complex criminal organizations employing technical experts and backed by human endeavor at an organizational, strategic level and also at a lower level to complete manual tasks that are out of the scope of bot activity (e.g., completing CAPTCHAs).


There is also an increasing trend for third-party services that are focused on delivering automated traffic activity on demand. For example, there are a range of companies who offer price/content scraping services on a per-use basis and provide all the standard bot evasion techniques by default (and they are constantly working to improve the reliability of those techniques). This means that rather than your competitors building a price scraping bot in house or by using a freelancer, they now have access to a service that is dedicated to evading bot detection in order to maintain income. Other third parties such as ticket bots, sneaker bots, and CAPTCHA farms are all being created to further increase the sophistication of automated traffic being made available to users both legitimate and dubious (as well as end consumers, as is sometimes the case with sneaker bots).

Misconception: Only the Big Boys Need to Worry About Bots
There can sometimes be a feeling that there are two types of bots:

• Generic bots that look for common, untargeted weaknesses in large numbers of sites

• Targeted bots that focus on specific, high-profile sites

This can lead to a false sense of security for website owners of mid-sized sites—they might feel that, as long as they have some general security protection in place, then the bot operators are never going to go to the effort of targeting their site.

In reality, this is untrue: smaller sites tend to have fewer defenses, so are easier targets, and although solutions will need to be evolved to target a specific site, this is often not as much work as might be imagined. The frameworks that have been built are sophisticated enough to allow for easy expansion, and the available resources are such that a wide range of websites can be targeted.

Small and mid-sized commercial online presences have been shown to be equally targeted by automated traffic activity.


Misconception: I Have a WAF, I Don’t Need to Worry About Bot Activity
Web application firewalls (WAFs) are very useful tools that form a fundamental part of a secure system. They are similar to network firewalls, but rather than operating at a TCP/IP level, they operate at the HTTP level to process all incoming requests and match each request against a set of static rules, blocking requests that fail the checks. They are, therefore, very effective at stripping out vulnerability scanning attempts such as SQL injection attacks.

However, WAFs are not well suited for identifying bot traffic, as the challenge of spotting automated traffic is fundamentally different. Basically, WAFs scan web traffic looking for illegitimate requests designed to exploit security weaknesses in web applications, whereas bot detection systems need to scan web traffic looking for legitimate requests that are aiming to exploit weaknesses in the business logic of a web application. Typically this involves making a judgment after analyzing the series of requests made to look for patterns of behavior that differ from legitimate users (either human or good bot).
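A toy version of that behavioral judgment might look at the timing of a session's requests: humans browse in irregular bursts, while a naive bot fires at a steady machine rate. The thresholds and session data below are illustrative assumptions; production systems learn such baselines per site from observed human behavior.

```python
from statistics import pstdev

def looks_automated(timestamps, min_requests=10, max_mean_gap=1.0, min_jitter=0.05):
    """Flag a session whose request timing is both very fast and very regular.

    timestamps: sorted request times (seconds) for one session.
    Thresholds are illustrative assumptions, not tuned values.
    """
    if len(timestamps) < min_requests:
        return False  # too little evidence to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = sum(gaps) / len(gaps)
    # Fast (short mean gap) AND regular (tiny spread in gaps) => bot-like.
    return mean_gap < max_mean_gap and pstdev(gaps) < min_jitter

# Hypothetical sessions: a bot requesting every 500 ms, exactly, versus
# a human clicking around at irregular intervals.
bot_session = [i * 0.5 for i in range(20)]
human_session = [0, 3.1, 9.4, 12.0, 20.7, 26.2, 31.5, 40.1, 44.9, 52.3]
```

This single-signal sketch is easy for a sophisticated bot to defeat (by randomizing delays), which is why real behavioral systems combine many such features rather than relying on one.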


CHAPTER 3

Impact of Automated Traffic

Before deciding on how to manage the automated traffic that is hitting your system, it is important that you have effectively assessed the impact it is having, weighed against the value it is delivering to you. When considering the impact you need to be sure that you are not just considering the impact on your servers but also the business impact. In addition, sufficient investigation must be undertaken to determine the intent of the bot operator and to understand what they were actually trying to do when executing the automated attack.

It’s important to realize that non-human traffic can deliver value while also having a negative impact on your business. In this case, you must assess the relative importance of the non-human traffic to deduce whether the benefits of this traffic outweigh the negative effects.

When assessing the impact consider the impact on company interests, other users, system security, and infrastructure. Let’s now examine each of these in turn.

Company Interests
Is the automated traffic accessing your site for purposes that would not be in the interests of your company?


Examples of this include:

• Competitors who are scraping your prices so that they can adjust their pricing accordingly, putting them at a competitive advantage.

• Bots stealing your content to use on their sites, saving them the costs of creating that content or purchasing data feeds.

• Spambots utilizing areas of your site that allow user-generated content (UGC), such as comments or forums, to publish offensive content or ads for services you would not want your company associated with.

• Account takeover bots accessing people’s personal data for use elsewhere.

• Scalpers purchasing limited availability goods for resale elsewhere, creating a negative public opinion of your brand.

• Creation of fake accounts in order to take unfair advantage of special offer terms.

• Skewing of analytics and other metrics that would lead you to make invalid business decisions.

Other Users
Does the non-human traffic negatively affect the experience of human users?

This could be in terms of the quality of service that you are able to provide to them—for example, the non-availability of products due to bots removing inventory from sale, or the variation in price if dynamic pricing is being influenced by bot activity.

Alternatively, it could be impacting users due to the effects on the performance of the system, such as the deterioration of site response as a result of the higher traffic on the system.

System Security
Is the non-human traffic trying to identify or exploit security weaknesses in your site?


Is this traffic trying to bypass your system defenses in order to gain access to areas of the system that should not be publicly available, such as bypassing password-protected areas of the system to gain access to users’ personal/financial data or to steal credit associated with that account?

As previously discussed, there is a whole range of security exploits that can be identified by security software that will regularly be scanning your site. These are outside the scope of this book, but the impact of allowing them to hit your site without appropriate management in place can be catastrophic, including complete loss of control of servers and compromise of data.

Poor security can make your site a target for some of the other types of automated traffic attacks described in this book, such as carding or data theft. A robust approach to security management is essential to reduce the risk of reputational damage from a wide range of potential attacks.

Infrastructure
Does the non-human traffic affect your infrastructure?

System performance can be negatively impacted by automated traffic—for example, servers might reach capacity and therefore struggle to return content or process requests in an appropriate manner.

Alternatively, it could affect your scalability, meaning you hit limits such as disk space required for logs, cache, or database storage, or software licence limits, much sooner than expected.

All of these can further result in a negative impact on costs. This could be due to increased bandwidth usage because of the amount of data being returned to automated processes, additional storage costs, or additional infrastructure or software licences required to run the site.

If you are scaling up your infrastructure to meet high demand from automated traffic and are not in a flexible cloud environment, then you will be paying for a level of capacity far greater than that needed to meet the business needs of the platform, just to maintain user experience during bot attacks.


In many cases the savings associated with reduced infrastructure and bandwidth costs can be sufficient to justify investing in an automated traffic management solution.

This impact is intensified by the timing of the automated traffic in relation to the peak hours seen by the business. Search engines and other legitimate automated traffic that you may rely on will usually work with you to ensure that they are not conflicting with your peak trading hours.


PART II

Types of Automated Traffic

As discussed in Part I, there is a wide variety of activities that bots may be accessing your system to carry out. In some cases, these activities might be beneficial to you (or otherwise benign), but often the intent is malicious. When looking at your traffic to identify and manage automated traffic, it is essential that you understand the intent of this traffic and how that could impact you as a website owner.

In many cases, even having identified that there is automated traffic hitting your website, it can be difficult to understand what that traffic is up to. Only when you understand the intent is it possible to put in place a management strategy to handle it. Part of understanding the intent is understanding the benefit the bot operators can achieve from the actions they are undertaking.

The chapters in this part will offer a deeper understanding of the most common types of automated traffic and provide background into the intent of each type of traffic.


Alternative Categorization

A number of alternative categorizations of automated traffic have been produced. For example, OWASP's Automated Threat Handbook groups the threats into 21 different categories (these include security threats that are outside the scope of this book).

To try and further simplify things, I have categorized all types of bot traffic under nine broad headings:

• Malicious bots
• Data harvesting
• Checkout abuse
• Credit card fraud
• User-generated content (UGC) abuse
• Account takeover
• Ad fraud
• Monitoring
• Human-triggered automated traffic

Chapters 4 through 12 discuss each of these categories in more detail.


CHAPTER 4

Malicious Bots

This category includes bot activity that is designed simply to have a negative impact on a website, rather than the negative impact being a by-product of another activity designed to benefit the bot operator directly.

This bot activity has a lot of overlap with other types of security attacks, such as DDoS attacks or vulnerability exploits (as already mentioned, these are not covered at length here, as there is a wealth of literature on these topics). Attacks like this are usually orchestrated by groups wanting to hold companies to ransom in return for stopping the attacks, groups who have ideological objections to companies' activities, or occasionally, malicious competitors.

Traditionally the use of automated traffic as we define it here has not been the means of malicious attack, but this will change as defense against DDoS attacks and other security systems improve.

Good or Bad?

This type of automated traffic is always bad—the entire intent here is to be malicious.

Let’s now take a look at an example of malicious bots.


Application DDoS

The objective of a DDoS attack is to undertake a large amount of activity such that the server under attack is unable to provide the service that it is in place to provide. While this is partly done by the quantity of traffic, it is also done by forming the network requests in such a manner as to exploit weaknesses in the network protocols that make a failure more likely. As a very simple example, this could be simply opening network connections to a server and then keeping that connection open with minimal interaction until the server runs out of available connections.

Application DDoS takes a similar approach but at an application level. Rather than exploiting weaknesses in the network protocol, it looks for areas of application functionality that will struggle when the application is under load. These could be areas that involve high processor usage, integration with third-party systems, or complex database activity. Often these will be areas such as search, log in, availability checks, or real-time booking requests, but they will vary with each website. The bot traffic will then just automate repeated requests to these areas of the website until the site reaches a limit and falls over or is unable to transact normally with legitimate customers.
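One common defensive pattern against this kind of attack is per-endpoint rate limiting on the expensive routes. The sketch below is a minimal illustration of that idea; the route names, thresholds, and window size are illustrative assumptions, not values from the text.

```python
from collections import defaultdict, deque

# Illustrative assumptions: which routes are "expensive" and how many
# requests per window each client is allowed on them.
EXPENSIVE_ROUTES = {"/search": 10, "/login": 5}  # max requests per window
WINDOW_SECONDS = 60

class EndpointRateLimiter:
    def __init__(self):
        # (client_id, route) -> deque of request timestamps
        self.history = defaultdict(deque)

    def allow(self, client_id, route, now):
        limit = EXPENSIVE_ROUTES.get(route)
        if limit is None:
            return True  # only throttle the costly endpoints
        window = self.history[(client_id, route)]
        # Drop timestamps that have aged out of the sliding window
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= limit:
            return False  # likely automated hammering of this endpoint
        window.append(now)
        return True
```

A real deployment would key on more than a single client identifier, since these attacks rotate IP addresses, but the per-endpoint budget is the core idea.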

These attacks are usually well hidden, rotating IP addresses and legitimate user agents, and are often launched via botnets.


CHAPTER 5

Data Harvesting

This category captures a range of traffic types that will access the publicly available information contained within your website and capture that data for use elsewhere. Typically this will involve accessing many pages and extracting relevant data using text pattern matching.

Good or Bad?

Data harvesting covers the full range of motives, from good to bad.

Some data harvesting bots, such as search engine spiders, are usually clearly regarded as good, with many businesses depending on these as the source of their traffic. Likewise, affiliates would typically be driving traffic to your site.

Conversely, there are data harvesting bots that are clearly bad, such as those engaged in content theft or price scraping.

There are also data harvesting bots that exist in the gray area between good and bad—for example, price and content aggregators that may be legitimately driving traffic to your site, but in a way that may not be in the interests of your business.

Let’s now look at some specific examples of data harvesting bots.


Search Engine Spiders

The most common form of data harvesting, and the one without which the internet as we know it today wouldn't function, is the search engine spider. The most common of these is, of course, GoogleBot, but there are many others from a range of global as well as regional or specialist search engines. These bots will usually enter your site via the homepage or via a deep link from another site and then follow all active links until they have accessed all pages within the website. Each request leads to multiple other requests, hence the name spider.

Search engine spiders are generally well behaved: legitimately identifying themselves in a user-agent string, following robots.txt instructions, not attempting to bypass any security mechanisms, and not using your data for anything beyond populating their search results and therefore driving traffic to your system. They can, however, sometimes be aggressive in the rate of requests that they make, and as they will often make thousands of requests, this can create short-term pressure on underlying infrastructure. Some search engines allow you to define the rate of requests that is applied to your website to mitigate this impact.
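The crawl-rate controls mentioned above are typically expressed in robots.txt. A minimal sketch follows; note that the non-standard Crawl-delay directive is honored by some engines (such as Bing and Yandex) but ignored by Google, which instead takes crawl-rate settings via its own webmaster tools. The paths shown are illustrative assumptions.

```
User-agent: *
Crawl-delay: 10        # seconds between requests (honored by some engines, not Google)
Disallow: /search      # keep crawlers away from expensive dynamic pages

User-agent: Baiduspider
Disallow: /            # block a regional engine if you don't trade in its market
```

Remember that only well-behaved bots respect these instructions; robots.txt is a request, not an enforcement mechanism.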

Most people welcome search engine spiders and, in fact, put changes in place to optimize their sites for the needs of search engines as a primary driver of traffic, so typically this will not be a type of traffic that you will want to take any action against. If the overhead is too high, however, perhaps some action could be taken against regional search engines—for example, if your business does not operate in China, you might take action against the Chinese search engine Baidu.

Any action taken to optimize the experience of this important source of automated traffic must be taken very carefully, as most search engines, especially Google, are on the lookout for the practice of cloaking—that is, adjusting the experience and content returned to spiders so that it differs from that seen by real users visiting the same page, as this undermines the accuracy of their search results.

Content Theft

A variation of a search engine spider is a bot that is visiting a site to similarly extract data, but this time to use the content extracted elsewhere, without the consent of the website owner. Content, in this respect, can be either content you have created yourself (journalism, opinion pieces, thought leadership, etc.) or content that you have extracted from paid data feeds to display within your website (e.g., sports statistics, product information, etc.).

In some situations content theft involves bypassing a paywall, and there are several methods employed to do this.

First, by abusing the Google First Click Free policy, which said that the first three articles clicked through to your site from Google should be free and only subsequent clicks will activate the paywall. Bot traffic can generally bypass the methods put in place to enforce this process. Google has recently relaxed this policy to allow paywalled sites a wider set of options for how they can integrate with Google search results.

Second, by creation of fake accounts to take advantage of free trial periods (see "Bonus Abuse" on page 39).

And finally, by logging into a legitimate account by cracking someone's username and password (a full discussion of account takeover is presented in Chapter 9).

After the data is harvested, these bots can engage in a number of different activities, including the following:

• Using your content within their own competitive site to provide a similar experience to their customers without the cost of creating or purchasing the data.

• Using the content within a scam version of your site that uses their own advertising in place of yours. This can include specialist browser plug-ins that will intercept requests for your site content and substitute it with the alternative site content. These sites may include advertising for goods and services that you wouldn't want your brand associated with, so in addition to losing customers you could also experience brand damage.

• Distributing your content to wider groups of consumers than is allowed by the terms of use (typically, this is more common when there are paywalls in place on the site being poached from).

Although content theft bots are conceptually similar to spiders, they are much less well behaved. They will usually disguise themselves as human users, typically using a browser user agent, and will ignore instructions defined in robots.txt. Whereas spiders are designed to extract data from all sites, content theft is usually much more targeted, being tuned to extract specific content from target websites.

Content theft bots will often employ more sophisticated methodologies, such as rotating IP addresses and varying request rates and intervals, to evade whatever protections you have put in place.
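Because these bots ignore robots.txt, one simple detection trick (not described in the text, offered here as an illustrative sketch) is a honeypot link: a URL that is disallowed in robots.txt and invisible to human visitors, so any client that fetches it is either ignoring robots.txt or blindly following hidden links. The path name is an assumption.

```python
# Hypothetical trap URL: listed as Disallow in robots.txt and hidden
# from human users via CSS, so legitimate visitors never request it.
TRAP_PATH = "/internal/do-not-crawl"

flagged_clients = set()

def record_request(client_id: str, path: str) -> bool:
    """Return True if this request reveals a misbehaving bot."""
    if path == TRAP_PATH:
        flagged_clients.add(client_id)
        return True
    return False
```

Flagged clients can then be challenged or throttled rather than blocked outright, reducing the risk of punishing a false positive.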

Price Scraping

A more specific form of content theft is price scraping (or odds scraping in the gambling industry). This involves the extraction from a website of specific data relating to the pricing of goods (or equivalent, such as the odds being offered for placing bets on a specific outcome). Price scraping is often undertaken by competitors to implement an effective price matching strategy; in fact, many companies make public declarations that they will always offer the cheapest price available, sometimes even displaying the prices of competitors on their product pages.

Sophisticated price matching strategies take account of the availability of products on competitor sites and adjust pricing accordingly—that is, offering goods at a higher price when they are unavailable elsewhere and dropping the price when competitors have availability. Constant price scraping also allows for real-time reaction to competitor discounting and special offers. The more sophisticated ecommerce platforms will automate the price adjustments they make based on the data received from competitor price scraping, meaning that any discount applied by a competitor can be matched very quickly with no need for human interaction.

Price scrapers, to be effective, must take all actions they can to avoid detection, as the obvious defense against them would be to start responding with incorrect data, meaning your competitors are making invalid pricing decisions.
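The "respond with incorrect data" defense mentioned above could be sketched as follows. This is a minimal illustration under stated assumptions: how a client gets flagged as a suspected scraper is out of scope here, and the +/-15% jitter range is invented for the example.

```python
import random

def price_response(real_price: float, is_suspected_scraper: bool,
                   rng: random.Random = random.Random(0)) -> float:
    """Return the real price to trusted clients, a decoy to suspected scrapers."""
    if not is_suspected_scraper:
        return real_price
    # Perturb the price by up to +/-15% (illustrative range) so the
    # scraped value looks plausible but is useless for price matching.
    jitter = rng.uniform(-0.15, 0.15)
    return round(real_price * (1 + jitter), 2)
```

The decoy must stay plausible; a wildly wrong price would tip off the scraper operator that they have been detected.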

While traditionally these were usually internally developed programs aimed at specific competitors, they are increasingly being provided by specialist third parties as a service.


Content/Price Aggregation

This is a particular variant of content theft and price scraping, where the data being harvested is used to gather together groups of similar types of data to display in a single place for the benefit of users. Examples of this would be price comparison sites, where the prices from a range of different sites for the same product are gathered and displayed to the user, allowing them to make a buying decision without needing to visit many different sites.

When run ethically, these sites will display data as extracted from your site accurately and attribute the source of the data, with the objective being to drive users to your site to purchase the product or view the full content. In this sense they would argue that they are a positive benefit for you, as they will drive more users to your site. These types of service will usually not make any attempts to hide their activities, including using accurate user-agent identification, and should be willing to interact with an API, if you have one available, rather than scraping your site, or to remove your site from the list of sites they are aggregating at your request.

However, there is an evolution of these sites where, though they are aggregating information, they are less open about the source of that information. This can include content sites that correctly attribute the source of the content but display the content in full within their site, without linking through to the source of the content or displaying its advertising.

Alternatively, this could be sites that aggregate the prices for goods and then complete the purchase of goods via your website, but with the appearance that the purchase was via the third-party site. This may seem as if it is beneficial but can present several issues. First, it removes the possibility of providing any up-selling opportunities. Second, there is no assurance that any promises or product descriptions made ahead of the purchase are accurate. Lastly, it removes control of the sale of your goods from the strategy that you have defined as a company.

This type of site is most common in the travel industry, where certain airlines do not want to be part of any price comparison sites but those sites will scrape their content and embed it without permission. There are several examples of Ryanair successfully taking legal action against sites that have used their data in this way.


Affiliates

As discussed earlier, it is common to see bot traffic being initiated by affiliates in order to resell your goods or services. These bots should be open about their activity and willing to work with you to ensure that any automated activity does not result in any negative impact on your system.

Best practice is to provide an API with usage tracking to ensure that the actions of affiliates are in line with the best interests of your business.

Abuse of affiliate systems is discussed in Chapter 10.

User Data Harvesting

Sometimes content is harvested from your site, not for your data but to cull personal data from the user-generated content areas of your site, such as reviews, comments, discussions, pictures, and so on.

This data can be valuable to bad actors who are trying to build profiles of individuals, either to undertake some negative actions against those particular users or to take advantage of the information contained in the online profiles to undertake other activities, such as ad fraud, where a user with a similar profile would be more valuable.


CHAPTER 6

Checkout Abuse

Checkout abuse is when automated traffic looks to bypass or otherwise manipulate the checkout process on ecommerce sites to gain personal or business advantage. This will involve automating the process of interacting with website ecommerce processes, such as adding to baskets and completing checkout processes. As these processes vary from site to site, the automation has to be created and implemented for each site targeted.

These are segmented from credit card abuse bots because these are activities that are not, in themselves, aimed at defrauding the website owner (that is not to say that they are not violating some legal restrictions or terms and conditions of the website).

Good or Bad?

From a business point of view, most prominent examples of checkout abuse could be seen as positive, as they will effectively complete sales quickly and efficiently (this is assuming that the purchase is completed using valid and legal payment means). Large amounts of checkout abuse can result in much higher throughput of sales for high-demand items.

The negative side of checkout abuse can primarily be judged by the impact that it has on other users and therefore the negative brand damage that can be associated with this. There is often customer backlash, which is then picked up by the media, when items are sold out in very short periods of time to non-human traffic, especially as those items are often made available for sale via re-sale sites at vastly marked-up prices.

Some of the more complex variants of this, such as inventory grabbers and spinners, have a much clearer negative impact on business interests.

We’ll now look at some specific types of checkout abuse.

Scalpers

Scalpers are automated processes that will capture goods and complete the checkout process in a fully automated fashion. The most common example of this is in event ticketing (where high-demand tickets are purchased for re-sale at inflated prices), but it is also becoming increasingly common in the fashion industry as labels release limited edition items that are of interest to collectors.

Scalpers take advantage of the fact that they can complete processes much more quickly than human users, so have a much higher chance of being successful during busy periods. They will also usually launch many separate attempts to complete transactions.

These bots are traditionally difficult to identify, as they will disguise themselves as human users, using human user-agent identification and following the same process as human users. They are also only making a relatively small number of requests, unlike scrapers that are making hundreds or thousands of different requests. These kinds of low-volume attacks are easy to hide within the general high throughput of a busy on-sale.
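One signal that can still expose a scalper hiding in legitimate-looking traffic is velocity: completing a checkout faster than a human plausibly could. The sketch below illustrates that check; the 5-second threshold is an illustrative assumption and would need tuning against real user data.

```python
# Illustrative assumption: no human completes a full checkout within
# 5 seconds of first viewing the product page.
MIN_HUMAN_CHECKOUT_SECONDS = 5.0

def looks_automated(product_view_time: float, checkout_time: float) -> bool:
    """Flag a checkout completed implausibly soon after the product view."""
    return (checkout_time - product_view_time) < MIN_HUMAN_CHECKOUT_SECONDS
```

Velocity checks are best combined with other signals, since a sufficiently patient bot can simply add a delay to pass them.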

Big Business

The secondary ticketing market is estimated at $15 billion in the United States alone, with some estimates saying that 70% of tickets sold via these platforms are sold by touts or professional traders. A separate UK government report estimated that only 20%–40% of tickets for a major concert would be sold directly to fans.

Scalping is a well-established industry, with bots available for purchase designed to complete purchases on specific sites. These will include not only the automation of the purchase process (including bypass of protections such as CAPTCHA) but also the tracking of ticket purchases, automatic posting to re-sale sites, and the manipulation of the PDF ticketing afterward to reflect the details of the new purchaser.

There has also been a recent growth in online services that allow individuals to pay for attempts to purchase specific items from a limited-release sale via the use of automated processes.

Ticket scalpers are more of a publicity and brand protection concern than a profitability concern, with users getting unhappy because they miss out on ticket purchases, especially when the same tickets are available for re-sale at a marked-up price. Such situations lead to bad publicity, and often the artist whose tickets are involved will respond to the situation, on some occasions cancelling individual tickets seen as available for re-sale or even completely cancelling shows. This in turn leads to demands for political solutions and regulation to be put in place in several countries around the world.

For this reason there are a lot of innovations in the ticketing industry that aim to make the transfer of tickets after purchase much more difficult. It is yet to be seen whether these methods (or legal changes) will manage to bring this situation under control. These issues and responses are becoming more common in other industries.

Spinners

Spinners are an evolution of scalping bots, but a more insidious version. Rather than completing the purchase process, they hold the goods in a basket or equivalent, knowing that the product will remain assigned to them until the transaction is completed. While holding it in this state they advertise the product for re-sale on another site, and complete the initial transaction only if the product is re-sold.

The most common use of spinners is for ticket purchases, meaning that ticketing sites are reporting shows as sold out before all the tickets have actually been sold, and that the touts do not have to take the risk of paying for tickets up front until they know they have a buyer lined up.


Spinner bots will be specifically created for target systems to take advantage of weaknesses in the business logic, which has been designed to prevent users from losing out on tickets.

Inventory Exhaustion

Inventory exhaustion (also called inventory grabbers) is an even less palatable version of a spinner bot. Like a spinner bot, it aims to capture inventory within baskets and hold it there, but unlike a spinner bot, it does so with no intention of completing a purchase.

In this case the intention is simply to grab limited-availability goods and remove them from availability for anyone else, eventually leading websites to report those products as out of stock. This will typically be implemented by competitors who have identical products available, usually at a higher price. Rather than price scraping and ensuring that they are selling at a lower price than their competitor, they force their competitor to report items as out of stock and carry on selling at a higher price. This is often combined with other activities such as affiliate fraud (discussed in Chapter 10).

This behavior is clearly not in the interest of either website owner orconsumer.
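Both spinners and inventory grabbers exploit the fact that basketed stock stays reserved. A standard counter (offered here as an illustrative sketch, not something described in the text) is a time-to-live on basket holds, so abandoned reservations return to stock automatically. The 10-minute TTL is an assumed value.

```python
import heapq

HOLD_SECONDS = 600  # illustrative 10-minute basket TTL

class Inventory:
    def __init__(self, stock: int):
        self.stock = stock
        self.holds = []    # min-heap of (expiry_time, hold_id)
        self.active = set()

    def _expire(self, now: float):
        # Return stock from any holds whose TTL has elapsed.
        while self.holds and self.holds[0][0] <= now:
            _, hold_id = heapq.heappop(self.holds)
            if hold_id in self.active:
                self.active.discard(hold_id)
                self.stock += 1

    def reserve(self, hold_id: str, now: float) -> bool:
        """Place one unit in a basket; fails if nothing is in stock."""
        self._expire(now)
        if self.stock == 0:
            return False
        self.stock -= 1
        self.active.add(hold_id)
        heapq.heappush(self.holds, (now + HOLD_SECONDS, hold_id))
        return True

    def purchase(self, hold_id: str, now: float) -> bool:
        """Convert a still-valid hold into a completed sale."""
        self._expire(now)
        if hold_id in self.active:
            self.active.discard(hold_id)
            return True
        return False
```

The trade-off is choosing a TTL short enough to frustrate bots but long enough not to dump genuine shoppers' baskets mid-checkout.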

Snipers

Snipers are automated processes that monitor time-based online processes and submit information at the very last moment, removing the opportunity for other people to respond to that action. The most common example is last-second bidding on an auction item.

While it is true that a human user could manually carry out this same action, usually automated processes can complete it closer to the deadline, beating a human competitor.

From a site owner’s perspective, this activity creates two issues:

• It reduces the level that an auction could possibly have reached if bidding had been carried out without sniping.

• It is usually seen as being unfair competition by human users of the site, who will struggle to win an auction.


To counter sniping activity, some auction sites now extend the time of the auction if there are any late bids, to give competitors the chance to respond.
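The auction-extension rule described above can be sketched in a few lines. The 60-second window is an illustrative assumption; real sites choose their own window and often cap the number of extensions.

```python
EXTENSION_WINDOW = 60  # seconds; illustrative anti-sniping window

def end_time_after_bid(current_end: float, bid_time: float) -> float:
    """Return the (possibly extended) auction end time after a bid.

    Any bid landing within the final window pushes the end time back,
    so rivals always have a chance to respond to a last-moment bid.
    """
    if current_end - bid_time <= EXTENSION_WINDOW:
        return bid_time + EXTENSION_WINDOW
    return current_end
```

Because every late bid re-opens the window, a sniper's timing advantage disappears: the auction only closes once bidding has genuinely gone quiet.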

Discount Abuse

This is a variation on scalping that, rather than focusing on high-demand items, focuses on items that are being discounted, and completes purchases with the intention of reselling at a higher price. This is commonly seen across retail and ticketing.

This will involve running scraping technology, as described earlier, to regularly scrape site content to identify when items have been discounted, and then using scalping bots to complete the purchase.

Again, the impact on the retailer is mixed. On one hand, it is a sale, but it is a sale at a lower value than could potentially have been completed. The discount was aimed at attracting a certain type of user who was not willing to pay full price; the ultimate purchaser may have been willing to purchase at full price.


CHAPTER 7

Credit Card Fraud

Automated traffic categorized as credit card fraud is straightforward—it is the use of automated processes to fraudulently take advantage of credit card data.

As previously mentioned, there is often overlap between these activities and those of checkout abuse (discussed in Chapter 6), but these activities are separated due to the explicitly fraudulent intent involved.

Many of the processes put in place to protect against non-automated credit card fraud (such as Verified by Visa, 3D Secure, and post-transaction fraud checking services) will also act as protection against automated card fraud.

Good or Bad?

There is no disputing that these activities are universally bad. The objective is to use your site to process payments using credit cards without the cardholder’s permission, therefore making you liable to chargebacks.

Let’s now look at some specific examples of credit card fraud bots.

Card Validation

Often shortened to carding, this is the attempt to process small-value transactions to validate the continued validity of a set of credit card data, usually before selling that card data on at a higher value or using that card for higher-value transactions.

Technically these bots use the same technology as scalping bots and take the same precautions to cover their tracks and avoid detection.

Commonly, lists of card data will be processed in this way through a small number of known sites (usually sites that offer low-value purchases with limited traceability). Unusual spikes in these types of transactions can indicate that your website has been targeted with this activity.
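The spike check described above could be sketched as a simple comparison of low-value transaction counts against a historical baseline. The price ceiling and 3x multiplier below are illustrative assumptions; a real system would tune these against its own transaction history.

```python
# Illustrative assumptions: transactions at or below this amount look
# like carding probes, and 3x the usual count constitutes a spike.
LOW_VALUE_CEILING = 2.00
SPIKE_MULTIPLIER = 3.0

def is_carding_spike(amounts: list[float],
                     baseline_low_value_count: float) -> bool:
    """Flag a window whose low-value transaction count dwarfs the baseline."""
    low_value = sum(1 for a in amounts if a <= LOW_VALUE_CEILING)
    return low_value > SPIKE_MULTIPLIER * baseline_low_value_count
```

A flagged window would then warrant manual review or step-up verification rather than automatic blocking, since legitimate sales events can also shift the transaction mix.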

Card Cracking

Card cracking is a variation of carding, but where the complete set of card data is unknown. In this case the automated traffic will attempt purchases using variations of the missing data in order to get a complete set of card data for re-sale or re-use elsewhere.

Spikes in rejected credit card transactions and retries by users can indicate that this kind of activity is happening on your site, though bot operators are usually sophisticated enough to distribute the failed attempts across multiple separate websites.

Although traditionally used for cracking credit cards, an easier target can be cracking the codes allocated to gift cards.

Card Fraud

Also known as cashing out, this is the completion of a purchase with a fraudulent credit card. Unlike card validation, this will be carried out with the intention of fraudulently obtaining goods or services for personal gain.

Most websites are mandated to have protection against this activity as part of the compliance insisted on by payment providers.


CHAPTER 8

User-Generated Content (UGC) Abuse

Rather than attempting to capture data from your site, this class of automated traffic aims to use your website to publish content without your consent, via public areas of your site that allow user-generated content.

There is a separate class of bots that aims to inject content into your site that will exploit security weaknesses within your site (such as cross-site scripting or SQL injection). These are not discussed, as they should be captured by good programming practice or other security devices such as web application firewalls (WAFs).

Good or Bad?

While the negative impact of content injection is not as severe as many of the other types of automated traffic, it is all negative. The objective of the areas being targeted is to capture and display content from users, and that content should add value to the system as a whole. When bots automate injection of content they will be adding content with a different objective; at worst this can be promoting competitors, or goods and services that you would not want associated with your brand or which may encourage users to leave your site.


While there are a number of different types of UGC abuse bots, they are all essentially carrying out the same activity, but targeted at specific areas of a site. For example, a forum spammer will target posting to a forum and a comment spammer will target posting comments on an article or product—although they have different objectives, they are ultimately very similar, which is why I have grouped them all together in the description that follows.

Content Spammer

This is a specific type of automated traffic that will look to inject messages into the user-controlled areas of your site, such as forums, guestbooks, bulletin boards, and reviews or comments sections associated with articles.

At its simplest level these will appear as spam content that will seem out of place with the surrounding content, typically encouraging users to click through to a third-party site offering a separate service or product. These are relatively easy to deal with via a manual process of monitoring and deleting.

However, the intelligence behind these types of bots has increased. They are now more focused, and can analyze the content of articles and other comments and insert a comment that is much less apparently spam, recommending an alternative product in a manner that more realistically mimics a genuine user.

For example, in an article about Product A, a comment could be made saying, “I’ve used Product A and Product B before and always found Product B to be much more reliable.”

In this case the removal of spam comments becomes much more difficult, as it can lead to the deletion of legitimate comments and in turn to negative feedback about censorship.
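The simple, link-laden variant can be caught with crude heuristics. As a hypothetical sketch (the phrase list and threshold are invented for illustration), a first-pass scoring filter might look like this:

```python
import re

# Hypothetical illustration: a naive first-pass filter for comment spam.
# Real-world filtering uses trained classifiers; these phrases and
# thresholds are invented for the example.
SPAM_PHRASES = {"buy now", "click here", "free shipping", "limited offer"}

def spam_score(comment: str) -> int:
    """Score a comment: higher means more likely to be injected spam."""
    score = 0
    text = comment.lower()
    # Outbound links are the classic giveaway of simple content spam.
    score += 2 * len(re.findall(r"https?://", text))
    # Known sales phrases that rarely appear in genuine discussion.
    score += sum(1 for phrase in SPAM_PHRASES if phrase in text)
    return score

def is_probable_spam(comment: str, threshold: int = 3) -> bool:
    return spam_score(comment) >= threshold

print(is_probable_spam("Click here for free shipping: http://example.com"))
# → True
```

Note that the more human-like comment in the example above (no links, natural phrasing, a plausible product recommendation) would sail straight past exactly this kind of filter, which is why behavioural detection is needed.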


CHAPTER 9

Account Takeover

This category of automated traffic is focused on trying to gain control over user accounts within your system, whether control of existing accounts or creation of new accounts. It overlaps with the security threats I have excluded from the definition of automated traffic for the purposes of this discussion, the difference being that these activities are completed using entirely legitimate processes but to achieve illegitimate aims.

Good or Bad?
The bots identified in this section are all bad. There is no legitimate purpose, and nothing in the interests of your business is being achieved by these activities.

Let’s look at some examples of the types of activity undertaken by account takeover bots.

Credential Stuffing/Credential Cracking
The two most common forms of account takeover are variations on attempts to log into existing user accounts.

Credential cracking (also known as a brute-force attack) is trying multiple username and password combinations until a successful combination is discovered. Usually this means taking known usernames or email addresses and combining them with dictionaries of the most commonly used passwords.

Credential stuffing is an alternative approach that involves taking known lists of email and password combinations and seeing if they are also valid on other sites. It is a known weakness of many individuals’ approach to security that they will use a common password across multiple systems.

Both these actions aim to gain access to an existing user account for the following reasons:

• To use the services offered by that website to logged-in users (e.g., content behind a paywall).

• To extract money or other items of financial value (e.g., loyalty rewards or in-game features) from within that account.

• To harvest personal data for use or sale elsewhere.

• To validate the username/password combination as a known working combination, adding value to it when it is put up for resale.

It is important to note that the latter two of these options require that the visit be kept hidden, so as not to alert account owners that their accounts have been compromised, which would trigger them to take protective action against further attacks. This means that you may be unaware that any account takeover activity has been successful.

Account takeover has value even if your website does not provide anything that you think would be of direct value to a third party, which means that these attacks are common across all sites with login facilities (some research suggests that on any given site, 90% of all login attempts are aimed at account takeover).
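On the defending side, both patterns leave a statistical trace in login logs: cracking produces many failures against few accounts, while stuffing spreads failures across many accounts. A hypothetical sketch of distinguishing the two (the thresholds are illustrative, not recommendations):

```python
from collections import defaultdict

# Hypothetical sketch: flag source IPs whose login behaviour looks like
# credential cracking (many failures, few usernames) or credential
# stuffing (failures spread across many usernames).

def classify_login_sources(events, max_failures=10, max_usernames=5):
    """events: iterable of (ip, username, succeeded) tuples."""
    failures = defaultdict(int)
    usernames = defaultdict(set)
    for ip, user, ok in events:
        usernames[ip].add(user)
        if not ok:
            failures[ip] += 1
    suspects = {}
    for ip, fail_count in failures.items():
        if fail_count < max_failures:
            continue
        # Many failures against few accounts suggests brute force;
        # failures across many accounts suggests a breached credential list.
        kind = "cracking" if len(usernames[ip]) <= max_usernames else "stuffing"
        suspects[ip] = kind
    return suspects

events = ([("10.0.0.1", "alice", False)] * 12
          + [("10.0.0.2", f"user{i}", False) for i in range(30)]
          + [("10.0.0.3", "bob", True)])
print(classify_login_sources(events))
# → {'10.0.0.1': 'cracking', '10.0.0.2': 'stuffing'}
```

In practice attackers distribute attempts across many IPs precisely to stay under per-IP thresholds like these, so this is a starting point rather than a defense.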

Account Creation
Rather than attempting to crack the password for existing accounts, this type of automated traffic will create multiple new accounts. This includes automating the sign-up process, bypassing any security features, and completing any additional validation (such as receiving and clicking through from confirmation emails).


Some of the activities that can be undertaken after these accounts have been created include:

• As the basis for bonus abuse, taking advantage of free trials or sign-up bonuses given to new accounts.

• To provide a range of accounts for use in an inventory exhaustion attack on sites where items can only be held by logged-in users.

Bonus Abuse
Bonus abuse is the triggering of free trials or other bonuses multiple times by automated means. It is often combined with account creation to gain sign-up bonuses.

This kind of automated traffic is a particular problem in industries like sports betting and other gambling websites that offer a free bet on sign-up. By automating the sign-up and validation process, bot operators can keep creating accounts and taking the free bets until they manage to achieve a winning bet.


CHAPTER 10

Ad Fraud

This is the manipulation of advertising traffic to prevent ads being shown to the people they are intended for at the time they were intended to be shown.

Good or Bad?
Ad fraud is nearly always bad, though sometimes the negative impact can be difficult to see. The basic objective of advertising is to present details of a product to the right person, in the right place, at the right time. If done correctly, the advertiser is rewarded with increased income and the publisher is rewarded with an advertising fee. Ad fraud makes both of these less likely to happen: advertisers pay for their ads to be displayed to users who have no intention of purchasing their product, and publishers are less likely to get an advertising fee because a portion of the advertising spend goes to fraudulent actors. It is also bad for the exchanges, as they will often have to waive fees generated by traffic identified as automated.

As discussed in “Click Fraud” on page 45, there are some situations where it is in the interests of the bot traffic to drive up the revenues of sites that advertise with them, when those sites are themselves driven by advertising.

Ad fraud could, however, be seen as being good for the ad exchanges, as they benefit from the transactions between advertisers and publishers. Most exchanges will claim to put mechanisms in place to detect and remove automated traffic from all requests they receive.

Ad fraud is a multibillion-dollar industry that uses very sophisticated methods to ensure that the maximum value is extracted. It is often perpetrated using botnets, which means that the fraud can appear to be coming from real users.

Background to Internet Advertising
Understanding the activities of ad fraud bots requires an understanding of how the internet advertising industry works. Entire books could be written about the underlying technologies and methodologies used to match advertisers with publishers, but this section offers a high-level introduction to the actors and processes involved, providing enough background to understand the activities of ad fraud bots.

At its core, the internet advertising market is made up of advertisers, who want to pay for their product to be advertised to potential purchasers, and publishers, who want to be paid to display ads to those potential purchasers. Advertisers want to ensure that their ads are shown to the people they believe are most likely to purchase their product and will therefore pay the most to publishers who can make that happen.

However, advertisers and publishers now rarely interact with each other. In the complex, distributed world of internet advertising, an increasingly complex series of middlemen has been introduced, usually based around automated processes. This means that advertisers are increasingly remote from where their ads are being displayed, and publishers have a reduced ability to attract advertisers based on their public reputation.

Traditionally, if you wanted to advertise you would have a personal relationship with the agencies arranging the advertising. Some institutions with well-established reputations can still engage in direct ad sales, but the majority do not have that ability, and those that do choose instead to fill any unsold inventory with ads sold via ad exchanges. This is not inherently a bad thing, as it gives smaller publishers access to ad inventory, leading to an increasing democratization of the industry, but an unwelcome side effect has been the increased potential for fraud within the system.

The vastly expanded range of potential locations for advertising as the internet has grown means that there are now an enormous number of legitimate places for an advertiser to consider advertising. There is no way that they can personally validate the reputation of those locations or their suitability as an advertising source.

This is where ad networks and ad exchanges come in. They take the requests from advertisers and match them with relevant publishers, acting as middlemen between advertisers and publishers and aggregating supply and demand to remove the need for direct sales between the two. Ad networks were the original version, selling bulk advertising on behalf of publishers to advertisers; ad exchanges were an evolution, selling ads on the basis of a single click. It is worth noting that most ad networks are also ad exchanges.

Ad exchanges work like this: a publisher has a request from an end user to display a page from their website (the page contains a number of advertising slots). A request is then made to an ad exchange, including a cookie relating to that user. This cookie enables the ad exchange to extrapolate what it already knows about this user based on previous visits made to other sites within the network (visits may be to other sites advertising on the same network, or to other sites that gather usage data and submit it to the network in order to give richer information about the user, a practice known as audience extension). The ad exchange then submits the ad request to an automated auction process known as real-time bidding (RTB), where the ad spot is sold to the advertiser who has bid the highest for customers matching that profile.
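The bidding step can be illustrated with a simplified sketch. This models the classic second-price auction form for clarity; real exchanges vary (first-price auctions have become common), and the advertiser names and prices are invented:

```python
# Hypothetical sketch of the auction at the heart of RTB. Bids are in
# dollars CPM; names, amounts, and the floor price are invented.

def run_auction(bids, floor_price=0.0):
    """bids: dict of advertiser -> bid. Returns (winner, clearing_price),
    or (None, None) if no bid meets the floor."""
    valid = {adv: bid for adv, bid in bids.items() if bid >= floor_price}
    if not valid:
        return None, None
    ranked = sorted(valid.items(), key=lambda kv: kv[1], reverse=True)
    winner, _top_bid = ranked[0]
    # In a second-price auction the winner pays the second-highest bid
    # (or the floor, if theirs was the only qualifying bid).
    clearing_price = ranked[1][1] if len(ranked) > 1 else floor_price
    return winner, clearing_price

winner, price = run_auction({"adv_a": 2.50, "adv_b": 4.00, "adv_c": 3.10},
                            floor_price=1.00)
print(winner, price)  # → adv_b 3.1
```

The economics matter for fraud: whatever inflates the apparent desirability of the user behind the request (as the cookie stuffing section later describes) raises the bids, and therefore the fraudster's payout.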

However, as the number of ad networks and exchanges increased, both advertisers and publishers wanted to increase their reach and have the ability to access multiple ad exchanges and ad networks depending on the nature of the situation. This led to the development of demand-side platforms (DSPs) and supply-side platforms (SSPs). SSPs act as the first point of contact for a publisher wanting to display an ad, making the decision whether the user should be shown a pre-purchased bulk ad from an ad network or whether to go to an ad exchange for an RTB ad. The ad exchange will reach out to a range of DSPs to collect bids for the ad and return to the SSP the details of the highest bidder. DSPs will be responding to bids from multiple ad exchanges.

This is the process at its simplest, but it is becoming increasingly complex. There are often chains of requests going via multiple ad exchanges and DSPs, further distancing advertisers from publishers. This has led to the practice of ad arbitrage, where someone buys a bulk amount of ads from an ad network and then sells those ads via the exchanges in the hope that they can be sold for a higher value via RTB than they were purchased for in bulk. Ad arbitrage is a controversial practice in the ad industry, as it is seen as a way for agencies to take advantage of their clients.

This complex system is worth over $200 billion per year, and there are many ways that bot operators look to capture some of that income.

Terminology

In the ad industry automated traffic is often referred to as IVT (invalid traffic).

Examples of the types of activity undertaken by ad fraud bots are discussed in the following sections.

Banner Fraud
In the early days of the internet, publishers had a simple model: charging by the number of times an ad was displayed, i.e., cost per view (CPV) or, more commonly, CPM (cost per thousand views). Bots could easily be used to add virtual “views” and artificially increase the figures.

The common approach to this was to create a fake site that displayed ads and then use automated traffic to drive large quantities of traffic to it. As the threat of fraud increased, ad platforms started building in validation around viewability: checking that the ad was loaded, was visible to the user, and was above the fold.

This meant that bot operators started using botnets as the source for ad fraud. Botnets typically involve compromised browsers on real users’ computers and can therefore load pages in the background without the knowledge of the user.


Botnets for Hire
Botnets are available for hire: large numbers of machines can be made available for short time periods to execute specified programs, charged by usage.

Ad fraud is regarded as the biggest money-making activity for botnets.

To deliver banner fraud with optimal value it is essential to serve the maximum number of ads possible while not violating the viewability checks. This is usually done in one of two ways:

Ad stacking
Multiple ads are displayed on top of each other, meaning they are all in the viewable area but not all visible to human users.

Pixel stuffing
A complete website is loaded within a single-pixel iframe hidden on the page, thereby loading all the ads on that hidden website as well.

Banner fraud is no longer an issue if basic best practice is followed.

Click Fraud
As advertisers sought to engage a direct response rather than a passive view, they began to pay only for ads that the user actually clicked on, that is, CPC (cost per click). This meant that bots had to extend their functionality to mimic human clicks on a displayed ad. Many ad systems also look to validate human movement before the ad is clicked.

It is important to understand that the click fraud being perpetrated may extend beyond the site on which the fraud is actually taking place. Many content sites operate a simple arbitrage model: they make money by paying low CPC rates to drive low-cost traffic to their site and then displaying ads for a third site at a higher value. This third site will potentially be operating a similar model to drive users to a fourth site, or it may actually be looking to convert the users to a purchase at this point. Each of these sites is more interested in engaged users, as they reduce bounce rates and therefore encourage the advertising site to carry on advertising with them. This means that the bots accessing the fake sites at the bottom of the chain will carry on human-like usage on the sites that they click through to, including clicking through to their advertisers and acting human-like on those sites. This can result in legitimate sites having quantities of bot traffic from which the bot operators do not directly benefit but do benefit indirectly.

CPA Fraud
Many advertisers have moved to a cost per action (CPA) model that ignores clicks and only pays out after an item is purchased or, in some cases, after the item is added to the basket or a form is submitted.

In these cases the automated traffic can complete the necessary action, often in combination with data obtained by other automated methods such as Carding.

Cookie Stuffing
The sophistication of ad fraud does not just extend to making automated traffic activity look valid after the ad is requested and displayed; it begins before the ad is requested.

The value of an advertising space varies dramatically depending on the demographics and previous browsing history of the person to whom the ad will be displayed. This data is captured within a cookie stored on the user’s machine and is updated whenever the user visits any site within the network, including sites that don’t advertise but just share data with the ad network. Advertisers will pay extra if they know a user is actively looking for certain items, displaying competitive ads with competitive pricing right at the point of sale.

Therefore, if you are entering an auction for advertising revenue, it makes sense to do so masquerading as a very desirable user. Bot operators use cookie stuffing to ensure that the cookie they present is as relevant to advertisers as possible and therefore brings the biggest revenue.

There are generally two approaches to doing this:

46 | Chapter 10: Ad Fraud

Page 57: INTELLIGENT BOT MANAGEMENT | ACCOUNT TAKEOVER … · ADAPTIVE MACHINE LEARNING • BEHAVIOURAL ANOMALY DETECTION • FAST & ACCURATE • DEFEND DATA & IP AGAINST THEFT • PROTECT

Cookie creation
In this method bot operators automate the creation of a background for a user before they request an ad; this can include visiting a number of different sites and undertaking actions that will increase the user’s value (long visit times, adding items to a basket, etc.). The bot operators exploit the higher rates paid out by advertisers, repeatedly visiting target websites and loading products into the cart just to get the advertising conversion somewhere else. In doing so they not only invalidate your cart conversion rates but can also lead to inventory exhaustion and cart abandonment by legitimate buyers.

These cookies can be created using server-based automated traffic or from background tasks on botnets. This means that, as an advertiser, there may be large amounts of bot activity on your site building up profiles for future ad fraud against you or one of your competitors; removing this fraud will reduce your ad spend and increase the quality of the click-throughs.

When studying the activity of botnets, this method can be seen seasonally, as appropriate cookies are created for the time of year or the specific market being targeted.

Once created, the cookie can be shared across multiple machines in the botnet to maximize the value of the investment in its creation.

Cookie copying
By contrast, this method involves taking advantage of a cookie that has been created by a real human user. Typically this is gathered by malware on user machines that tracks their usage and shares the cookies they create, possibly augmenting those cookies with automated activity to increase their value.

The advantage of this approach is that these cookies are very hard to detect as bot activity, as they are based on actual human usage.

Affiliate Fraud
Affiliate networks are a variation on a CPA system, but rather than being ad based, they are separate websites that link through to a target site and receive commission for any sales achieved from referrals. Affiliate fraud sites masquerade as legitimate sites recommending products, with links through to sites where they are registered as affiliates. Payment may be on click-through, but the really valuable return is usually on sale.

For highly desirable items, such as fast-moving consumer goods, the commissions payable for sales can add up to significant sums.

The bot networks choose the goods with the highest CPA and perform a comparative search for the same item among competing vendors, which they then display as an ad. These ads are typically bought in bulk for a low cost, gambling that the value of the transaction will exceed that of the ad spend. However, bot operators don’t like to gamble, so the bot networks have to ensure they can actually get the sales and claim the reward. In order to guarantee this they simply game the system: by excluding competitors, misrepresenting prices, or, in the most malicious cases, using inventory exhaustion methods to show that no stock is available from the competitors. The transaction could be completed by credit card fraud, but an affiliate with large amounts of chargebacks would soon be excluded from any affiliate scheme as undesirable.

Arbitrage Fraud
Most of the approaches just described have been based on bot operators acting as a publisher and taking revenue from advertisers. This requires the overhead of creating and maintaining large numbers of publisher sites, and the necessity of limiting the profile of those sites so as not to draw attention to them and to keep under the radar of the ad exchanges.

An alternative approach is to act as a middleman, buying ads cheaply and selling them at a marked-up price. Ad arbitrage is the process of bulk purchasing ad inventory from a publisher and then reselling it for a higher value. There are companies that specialize in this area and will offer “value add” to the ad inventory that they buy in order to achieve the higher sell price. Ultimately these companies are taking a risk in buying inventory that they do not know for certain they will profit from.

Just as with affiliate fraud, botnet operators do not like to gamble, so they will game the system to ensure that they win, usually by buying inventory they know is available cheaply and then manufacturing high-value traffic to satisfy that inventory.


A typical process is to use automated scrapers to identify areas of an advertiser’s inventory that have low demand, and therefore low bulk-buy ad costs, and to purchase that ad inventory from publishers. The bot operators will then use cookie stuffing to create high-value users and will resell the ad inventory to advertisers to display ads to those high-value users.

This is where the chained nature of the ad network can be helpful, as the bot operator in this situation is acting as both SSP and DSP in the same transaction. The publisher requests an ad, which is satisfied by the bot operator under the bulk-buy agreement; instead of returning an ad to display, however, the bot operator makes another request to the ad exchange to get the ad via RTB from the advertiser.

Chaining of ad buys like this is not uncommon, and it makes it difficult to identify patterns of bot activity from stats packages.


CHAPTER 11

Monitors

While the majority of automated traffic types are triggered by others, there is also a category that is usually under your control. This is automated monitoring software that executes regular, scheduled requests to validate aspects of system behavior.

Good or Bad?
Generally these are examples of good automated traffic, directly arranged by website owners in order to validate elements of the performance or availability of their systems.

However, there are some examples of other organizations arranging monitoring traffic on websites without the consent of the website owner. There is a range of reasons why this may happen, including companies wanting to track the performance of their competitors, third parties gathering stats on site activity across a group of sites (e.g., an industry vertical), or sales teams gathering stats to use in pitches to potential customers.

Largely these external monitoring activities are harmless, at worst giving competitors some insight into your site’s availability and performance.

Let’s now take a look at some specific examples of monitoring traffic.


Availability
Availability monitoring usually takes the form of regular requests for a small range of pages to validate that those pages are still available, triggering alerts to site management when issues are observed. This could be simple requests for a single page or may include a series of page requests to replicate a representative user journey.

These systems are universally well behaved: they operate in the interest of site owners, are limited to the nature of requests defined by site owners, and do not make efforts to hide their existence, as they will typically want to ensure that they are not blocked by any protection mechanisms.
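A minimal availability check can be sketched in a few lines. This is a hypothetical illustration: the URL, timeout, and latency budget are placeholders, and a real monitor would add scheduling, retries, and alerting.

```python
import time
import urllib.request

# Hypothetical minimal availability monitor. The URL, timeout, and
# latency budget below are placeholders, not recommendations.

def check_page(url, timeout=10):
    """Request a page and report (ok, status, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception:
        # Connection failure, timeout, or an HTTP error response.
        return False, None, time.monotonic() - start
    return True, status, time.monotonic() - start

def is_healthy(ok, status, elapsed, max_seconds=5.0):
    """A check passes if it connected, returned 2xx, and was fast enough."""
    return ok and status is not None \
        and 200 <= status < 300 and elapsed <= max_seconds

# Example use (performs a real network request):
#   ok, status, elapsed = check_page("https://example.com/")
#   print("healthy" if is_healthy(ok, status, elapsed) else "unhealthy")
```

Note the good-citizen behaviour described above: such a monitor should identify itself honestly (for example via its user-agent string) rather than masquerade as a browser.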

Performance
An extension of availability monitoring is performance monitoring. This gathers a much wider range of statistics in order to determine the performance of a system. It may include completing a series of requests to validate a successful user journey through the system while capturing a wide range of statistics to assess its performance.

Like availability monitors, they are usually well behaved and open about their activities.

Other
There is a wide range of similar systems that monitor different aspects of the well-being of a site. These include monitors such as:

• SEO monitors
• Security monitors
• Network monitors
• Accessibility monitors

These will usually be open about who they are and the nature of the activity they are undertaking.


CHAPTER 12

Human-Triggered Automated Traffic

There is a new category of bot traffic currently emerging: automated traffic that is triggered by a direct human action via a non-browser interface. Increasingly, new interaction paradigms are being developed as people move away from standard desktop browser-based interaction. For several years mobile has been the dominant means of interaction, but in recent years there have been new innovations that look to define new voice and chat interactions, new devices (such as the Apple Watch), and new intelligent assistants to aid the automation of our interaction with systems.

Good or Bad?
Superficially these new interactions would all present themselves as good and in the interests of consumers getting the best value from your systems. However, as these systems evolve, it could well result in some of the practices described here being implemented in order to deliver a good customer experience.

For instance (I am using Amazon Alexa as an example here, but the same ideas could be applied to other digital assistants or chatbots):

• “Alexa, buy me tickets for x tomorrow at 9 a.m....” would produce a result similar to ticket scalping.


• “Alexa, keep checking all the electronics sites until you find x available for less than $y...” would produce a price scraping bot.

• “Alexa, get all the latest sports news from the Daily X, Y, and Z...” could end up close to our current description of content theft.

All of these would need skills to be developed to deliver the functionality, and none of them are things that couldn’t have been similarly developed for individuals to execute via installed applications or even mobile applications, but the voice/chat-based interface seems to promote this type of activity.

This is an emerging element of the automated traffic landscape and in the coming years will be an important consideration when creating a bot management strategy.


PART III

How to Effectively Handle Automated Traffic in Your Business

As identified in Part II, there are many types of automated traffic that may be accessing your website (with varying intents, from the benign to the malicious). As a website owner you will naturally want to start putting in place measures to take control of this traffic, optimizing the value you receive and minimizing the negative impact.

The chapters in this part go into more detail about the steps necessary to define and implement an effective bot management strategy.


CHAPTER 13

Identifying Automated Traffic

Any effective bot management strategy involves two separate steps: first, you must identify which traffic is non-human and, within that, whether it is non-human traffic that you need to take action against. After you have made that assessment, you will then need to decide what action to take.

This chapter discusses the first, and by far the most complex, of those two steps.

Indications of an Automated Traffic Problem
A common question website owners ask is “Are there any obvious symptoms I can look for to see if I have a bot problem?”

Keep in mind that automated traffic is often doing its best to avoid being noticed, so there may not be any obvious symptoms.

However, if you see any of the following things on your site, a more thorough investigation of automated traffic could be worthwhile:

• High infrastructure usage that is not associated with known or expected increases in traffic

• Traffic patterns that are unusually spiky or do not follow the pattern of usage that you would expect

• Increases in visitor numbers or infrastructure usage that are not associated with an increase in conversions

• Increased numbers of failed payments

• Increased numbers of failed logins

• Increased numbers of chargebacks

• Increased levels of IVT reported by ad exchanges and SSPs
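Several of these indicators can be checked directly from access logs. As an illustrative sketch (the two-standard-deviation threshold and the sample series are invented for the example), unusually spiky traffic can be flagged by comparing each minute's request count against the series as a whole:

```python
import statistics

# Hypothetical sketch: flag minutes whose request count sits far above
# the mean of the series. The sigma threshold is illustrative.

def spiky_minutes(counts, sigma=2.0):
    """counts: requests-per-minute series. Returns indices of spikes."""
    if len(counts) < 2:
        return []
    mean = statistics.fmean(counts)
    stdev = statistics.pstdev(counts)
    if stdev == 0:
        return []  # perfectly flat traffic: nothing to flag
    return [i for i, c in enumerate(counts)
            if (c - mean) / stdev > sigma]

# A steady baseline of ~100 req/min with one sudden burst.
series = [100, 98, 103, 101, 99, 1000, 102, 97]
print(spiky_minutes(series))  # → [5]
```

A real implementation would compare against a rolling baseline from the same hour of day and day of week, since legitimate traffic is itself far from flat.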

Challenges
The challenge of identifying non-human traffic is simply stated: there is no inherent difference between a request made by a human and one made by a non-human user. Each is just a standard HTTP request.

HTTP Requests

All requests to a website, whether made by a browser or by another automated program, are ultimately just a number of HTTP requests.

HTTP is a simple protocol that forms the basis of the web as we know it today. It works by a client making a request to a server, which returns a response. The request includes details of the file that is being requested and a set of headers to provide information the server may need to effectively execute the request.

The original intent of HTTP was to provide a simple protocol for requesting static content from a web server. Therefore, it was designed to be a simple stateless protocol—that is, each request is entirely independent of any previous requests. However, as website complexity grew, the requirement for tracking multiple requests from the same user emerged as essential for effective website delivery. For this reason, conventions around the use of HTTP headers grew up between server and browser makers (such as session tracking and cookies).

It is important to remember that these are just conventions: an HTTP request can be made without the use of a browser, and HTTP headers can be set to any values.

On an individual HTTP request there is a very limited amount of information that can be used to make an assessment:


IP address
Can be used to identify where the request has been initiated from.

User agent
A string that is included within the request for the requestor to identify themselves. A typical user-agent string sent by a browser request would look like Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36; this includes details about the browser being used (Chrome v63) and the platform and operating system. However, the content of this string is defined by the requestor and can be set to any value.

User-Agent String Complexity

As the preceding example illustrated, the standard user-agent strings sent by browsers are much more complex than would seem to be necessary. The reason for this can be traced back through the history of web development.

As early browsers started diverging in functionality, developers started using user-agent strings to identify which functionality would be available and varying the response based on this assessment. Newer browsers therefore started identifying the browsers they were compatible with within the user-agent string to ensure their users saw the full range of functionality. Hence, in the Chrome example just shown, it is announcing that it is compatible with Mozilla, AppleWebKit, KHTML, Gecko, and Safari!

For a more detailed description of the complex history of user-agent strings, see https://webaim.org/blog/user-agent-string-history/.

Other HTTP headers
Additional pieces of text passed as part of the request. The most commonly used of these is cookies. Again, all of these are within the control of the requestor to set to any values.

The protocol gives automated traffic the ability to correctly identify itself but makes it easy for automated traffic to appear as human.
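To make this concrete, the sketch below (Python, with an illustrative URL and header values) shows how trivially a client can set these headers to arbitrary values, including a browser-like user-agent string. Constructing the request sends nothing over the network; it simply demonstrates that the headers are entirely client-controlled.

```python
from urllib.request import Request

# A bot can claim to be any browser: every header is set by the client.
req = Request(
    "https://example.com/products",  # illustrative URL only
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/63.0.3239.108 Safari/537.36",
        "Cookie": "session=whatever-the-bot-chooses",
    },
)

# The server sees exactly the values the client chose to send.
print(req.get_header("User-agent"))
```

The same applies to any HTTP client library or a raw socket: nothing in the protocol verifies these claims.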

There have therefore been a number of methodologies applied over the years to identify automated traffic, and these can be broadly classified as going through four generations.


Generation 0: Genesis—robots.txt

The earliest approach to bot defense was the definition of a public policy, associated with a website, that defined how you would like bot traffic to behave. This is the robots.txt file that can be placed in the root directory of any website.

robots.txt

The robots.txt file was first proposed by Martijn Koster in 1994 after a web crawler caused an accidental denial-of-service attack on his server. It soon became a de facto standard.

robots.txt is a text file that contains a list of instructions to automated traffic on which files they shouldn’t access. These instructions can be applied to all traffic or to traffic related to a specified user agent.

The majority of legitimate automated traffic (including all major search engines) observes robots.txt policies.
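Such a policy can be checked programmatically. The sketch below uses Python’s standard urllib.robotparser against a hypothetical policy; the paths and the “ExampleBot” name are illustrative, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A minimal, hypothetical robots.txt: keep all bots out of /checkout/
# and ask one named crawler to slow down.
policy = """
User-agent: *
Disallow: /checkout/

User-agent: ExampleBot
Crawl-delay: 10
""".splitlines()

rp = RobotFileParser()
rp.parse(policy)
rp.modified()  # record that rules have been loaded (required before can_fetch)

print(rp.can_fetch("*", "https://example.com/checkout/basket"))  # disallowed
print(rp.can_fetch("*", "https://example.com/products"))         # allowed
```

Note that this check runs entirely on the client: it enforces nothing, which is exactly why robots.txt counts as Generation 0.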

I would categorize this as Generation 0 as it is not truly a form of bot defense, having no means of enforcement—it is simply a list of instructions that it is up to the requestor whether or not to follow.

Generation 1: Simple Blocking—Blacklisting and Whitelisting

The first generation of actual bot identification focused on using the data that is made available within HTTP headers, such as the IP address and user agent, in order to create blacklists and whitelists of known “good bots” and “bad bots.”

At its simplest level this will be a purely reactive policy: reviewing logs and monitoring output to identify any unwanted activity, then adding the IP address or user agent associated with that activity to the relevant blacklist.

The reactive nature of this approach is limited, however, as it will always miss the first instance of a problem. Therefore, better practice is to collect known blacklists and share them across multiple


users, essentially crowdsourcing the knowledge of known good or bad actors.

These shared blacklists are available as open source community projects or from commercial organizations. They formed the basis of most early bot defense services and still form an important part of them to this day.
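A minimal version of this generation of defense can be sketched in a few lines of Python. The addresses and user-agent substrings below are purely illustrative stand-ins for a shared blacklist (the ranges shown are documentation-reserved example addresses).

```python
import ipaddress

# Hypothetical shared blacklist: whole network ranges and single addresses.
BLACKLISTED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),   # e.g., a known data-center range
    ipaddress.ip_network("198.51.100.7/32"),  # a single known-bad address
]
BLACKLISTED_AGENT_SUBSTRINGS = ["badbot", "scrapy"]  # illustrative only

def is_blacklisted(ip, user_agent):
    """Generation 1 check: match request metadata against static lists."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLACKLISTED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(s in ua for s in BLACKLISTED_AGENT_SUBSTRINGS)

print(is_blacklisted("203.0.113.42", "Mozilla/5.0"))  # True: IP range match
print(is_blacklisted("192.0.2.1", "Scrapy/2.0"))      # True: user-agent match
print(is_blacklisted("192.0.2.1", "Mozilla/5.0"))     # False
```

As the chapter notes, this is trivially bypassed: the requestor controls the user agent entirely and can rotate IP addresses.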

However, this form of defense is easily bypassed. As mentioned, the user agent is simply a string that is defined by the client as part of the request; therefore, any automated traffic can send whatever value it wants, including the same value as sent by a legitimate browser. Most legitimate automated traffic will correctly identify itself using a user-agent string, although some operators (such as Google) will also send hidden requests to validate that the responses being sent to their search bot are the same as those being returned to human users.

Likewise, the growth of cloud providers and other services that give access to wide ranges of rotating IP addresses, as well as the creation of botnets, has made sending requests from a wide range of IP addresses much easier. In this situation, simply blacklisting IP addresses can have unexpected consequences, as those addresses can turn out to be used in the future by legitimate users.

Generation 2: Early Bot Identification—Symptom Monitoring

To handle some of the limitations of early bot identification, a new approach was taken. This involved looking at the activity seen on the server and identifying symptoms that may be representative of bot activity.

A very simple example of this would be to say that if a lot of requests are being made from a single IP address, then this would be a symptom of bot activity. Therefore, a rule should be applied to say that no session not identified on the whitelist of known “good bots” should be allowed to make more than x requests during a specified period, and any requests above that limit will be rejected.
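Such a rate-limit rule might be sketched as a sliding window per IP address. The window length and request threshold below are illustrative, not recommendations, and the address shown is a documentation-reserved example.

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds for the symptom rule described above.
WINDOW_SECONDS = 60
MAX_REQUESTS = 5

_history = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip, now=None):
    """Allow the request unless this IP exceeded the window threshold."""
    now = time.time() if now is None else now
    q = _history[ip]
    # Drop timestamps that have aged out of the sliding window.
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False  # symptom threshold exceeded: treat as suspected bot
    q.append(now)
    return True

# Eight requests from one IP within a single window: only five pass.
results = [allow_request("198.51.100.7", now=float(t)) for t in range(8)]
print(results)
```

Note how visible this rule is to an attacker: as the text goes on to explain, a bot can probe for exactly this threshold and then stay just below it.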

Symptoms can be related to server behavior or can be obtained by augmenting the IP address data or user agent data with additional information (e.g., the country the IP address is located within).


Other examples of the sort of data that can be looked at include:

• Whether the IP address is located within a data center.
• Whether the user agent relates to a recent version of browser software.
• Whether users are downloading the full range of files (HTML, JavaScript, CSS, images, etc.) or just HTML files.
• The regularity of the pattern of requests being made.

The limitations to all these approaches are as follows:

• They provide no understanding of whether the automated traffic is good or bad and what the intent of the bot operator is. This is reliant on the existing whitelists populated in simple blocking solutions.

• They do not validate that the user is not just a human user that is interacting in a similar manner to a bot.

• The settings being applied are easily identifiable, and therefore the bot operator can adjust their approach to bypass them. When looking at the interactions from more sophisticated bots, it is common to see them test systems to determine which mitigations are in place—for example, trying different rates of requests to validate that there is no rate limiting in place and, if there is, determining the optimal rate at which to make requests.

Generation 3: Improved Bot Identification—Real User Validation

The next generation of bot detection focused on putting methods in place to try to validate that there was a human involved, using a real browser, and to track requests that come from the same physical device, regardless of variations in IP address and user agent.

This generation was centered around two dominant approaches: real browser validation and fingerprinting.


Real Browser Validation

This is a group of techniques designed to ensure that the user is a human using an actual browser, rather than an automated tool or a headless browser.

This includes techniques such as the following:

Cookie injection
Injecting cookies into the response and validating that they exist in subsequent requests. This validates that the bot is not just executing a predefined series of requests.

JavaScript injection/honeypot traps
Injecting some JavaScript that will update cookies (or potentially other elements of an application) with the result of the JavaScript execution. This validates that the page has been processed by a platform capable of executing JavaScript.

Real-user tracking
Technology that validates that user interaction has occurred on the client side by tracking mouse movements and other means.
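As one hedged illustration of the cookie/JavaScript-injection idea, a server can issue a nonce that only a client actually executing the injected script can answer. The hashing scheme below is an invented example, not a description of any specific product; in practice the client-side half would run as JavaScript in the browser.

```python
import hashlib
import secrets

def issue_challenge():
    """Server side: generate a fresh nonce to embed in the page."""
    return secrets.token_hex(8)

def expected_answer(nonce):
    """What the injected script should compute and return in a cookie."""
    return hashlib.sha256(nonce.encode()).hexdigest()

def passes_challenge(nonce, cookie_value):
    """Server side: did the client execute the injected script?"""
    return cookie_value == expected_answer(nonce)

nonce = issue_challenge()
# A real browser runs the injected script and computes the answer...
browser_cookie = hashlib.sha256(nonce.encode()).hexdigest()
# ...while a bot replaying a canned request sends nothing useful.
print(passes_challenge(nonce, browser_cookie))  # True
print(passes_challenge(nonce, ""))              # False
```

A bot driving a full browser (or a headless one with JavaScript enabled) passes such a check, which is exactly the escalation the chapter describes next.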

Fingerprinting

Fingerprinting is the process of identifying a much wider range of attributes relating to the machine executing the request and tracking when that same combination of attributes is associated with future requests. There are multiple means of implementing this approach; one example is to use JavaScript to transmit a beacon, which includes details of the remote fingerprint, to one or more of a range of remote services.

In this way, tracking can be put in place to see when people are potentially trying to hide their activity behind multiple IP addresses or by using anonymous proxy systems. This leads to more accuracy than relying on IP addresses alone.

Fingerprinting has been implemented in ways that identify hundreds of different attributes, making tracking more accurate.
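In outline, a fingerprint is a stable digest over whatever attributes can be collected. The attribute names below are illustrative, and real implementations gather far more signals than this.

```python
import hashlib
import json

def fingerprint(attrs):
    """Combine client attributes into a short, stable identifier.

    Canonical JSON (sorted keys) ensures the same attributes always
    produce the same digest, regardless of dictionary ordering.
    """
    canonical = json.dumps(attrs, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Hypothetical signals collected client-side; real systems use many more.
device = {
    "user_agent": "Mozilla/5.0 ...",
    "screen": "2560x1440",
    "timezone": "Europe/London",
    "fonts": ["Arial", "Helvetica"],
}

# Same attributes => same fingerprint, even if the IP address rotates.
print(fingerprint(device) == fingerprint(dict(device)))  # True
```

The design choice here is that identity follows the *combination* of attributes rather than any single one, which is what defeats simple IP rotation.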

This generation of improved bot detection made it a lot more difficult for unsophisticated bots to operate, requiring a higher level of sophistication to either execute requests from within an actual


browser or to employ sophisticated approaches to replicate operating within a browser.

Developments in browser automation, combined with low-cost cloud computing capacity on which to execute it, have made this sophistication a lot more available to bot operators.

Generation 4: Sophisticated Bot Identification—Behavioral Analysis

The identification of bots had advanced considerably by this time but still faced two major problems:

• Bots were getting ever more sophisticated and were able to mimic human, browser-based interactions.

• The means to distinguish between good and bad bots was still a blunt tool as implemented by the bot detection tools of the time, with little appreciation of how the definition of good and bad varies from industry to industry and company to company.

The next generation of bot detection set out to answer both of these limitations, and did so using the technologies of behavioral analysis and machine learning.

This approach says that rather than looking only at the external data you have about a request or the impact it is having on the server, you need to understand the activities that users are undertaking within your site. Only by understanding their behavior can you understand their intent. Once you understand the intent of the bot operator, you can effectively categorize automated traffic and take a much more nuanced management approach.

These machine learning, behavioral approaches will typically combine three levels of intelligence in order to make an effective assessment of whether users are automated or not and what their intent is. At the lowest level these will be patterns that can be applied universally across all websites. These will then be augmented with patterns that are industry or bot-category specific, and lastly augmented again with site-specific patterns. All of these will then be combined into a learning engine that is constantly evolving as new data, both human and automated, is processed.


The major difficulty in any effective bot identification process is that there is no foolproof way of determining, even after the event, exactly which requests are bot and which are human, especially when looking at the most sophisticated bots, which are actively trying to disguise their activity as human or even integrating their activity with that of human users via botnets. This means that as the industry has moved through the generations of bot detection, identification has increasingly become an assessment rather than a definitive determination. When moving into machine learning this becomes even more so, as essentially the analysis engine is forming an opinion such as: “Based on the range of evidence provided, and taking into account the experience I have of seeing all other users on this system, I would assess that this user is an automated process that is attempting to scrape content from your site”.

For this reason, the leading bot detection systems will also provide an indication of the confidence they have in that assessment. This allows the site owner to vary the means of handling bot activity based on the confidence level, balancing the risk of the negative impact of bot activity against the risk of negatively affecting the experience of a legitimate human user.
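A confidence-tiered policy might be sketched like this; the thresholds and action names are invented for illustration, and any real deployment would tune them to its own tolerance for false positives.

```python
def choose_action(bot_score):
    """Map a detection confidence (0.0-1.0) to a handling policy.

    Thresholds are illustrative: higher confidence permits harsher
    action, lower confidence favors protecting the human experience.
    """
    if bot_score >= 0.95:
        return "block"         # near-certain automation
    if bot_score >= 0.70:
        return "captcha"       # challenge, in case it is a human
    if bot_score >= 0.40:
        return "serve_cached"  # limit infrastructure impact only
    return "allow"

for score in (0.99, 0.80, 0.50, 0.10):
    print(score, choose_action(score))
```

The point of the tiers is exactly the balancing act described above: the cost of a false positive shrinks as the action softens.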

The sophistication of modern automated traffic is such that only this generation 4 level of detection will be effective, especially as the bots making the most effort to stay hidden are the ones you are most likely to want to remove from your website.


CHAPTER 14

Managing Automated Traffic

This chapter will address the final step of the process. Once you have an understanding of the types of automated traffic and you’ve successfully identified which ones are impacting your site, what action can you take against them?

As will be discussed in this chapter, there is a range of potential actions that can be taken against bots, and it is important that you select the most appropriate means of handling them while taking the following points into consideration:

• The impact of false positives (i.e., wrongly identifying a human user as a bot): Is there a way for a human user wrongly identified as a bot to resolve that error, and if so, could a bot use the same resolution? The consequences of allowing a bot through might be so negative that you are willing to risk a number of humans being impacted to preserve the security of your system.

• The ability of bot traffic to bypass that means of defense.

• The signal you want to give to the bot operator: Do you want them to be made aware that you have detected them, or will you choose instead to let them assume that their request was successful? Making them aware can have two disparate consequences: on the one hand, they might move on to someone else once they realize you are no longer an easy target, but on the other, it might motivate them to increase the sophistication of their bot to try to bypass your defenses (this will largely depend on how targeted the particular attack is).


These considerations will allow you to define a bot-handling policy. Ideally this policy will vary depending on the type of bot traffic that you are identifying, enabling you to take a different approach depending on the threat-versus-opportunity assessment of the bot identified.

Blocking

The obvious thing to do with any bot identified is just to drop the connection. This is common practice for firewalls and other security devices.

There are two ways in which blocking can be applied. The first option is to stop processing the request and return an HTTP error code (typically “403 Forbidden”) to indicate to the requester why the request was not completed. The second option is to silently drop the request, closing the connection with no response returned.
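The two options can be sketched as a simple handler hook. Here `is_bot` stands in for whichever detection layer is in use, and the tuple-based responses are a simplification of a real HTTP stack.

```python
def respond(request, is_bot, silent_drop=False):
    """Apply a blocking policy to a request.

    Returns a (status, body) tuple for responses that are sent, or
    None to model silently closing the connection with no response.
    """
    if is_bot(request):
        if silent_drop:
            return None               # drop: no signal to the requester
        return (403, "Forbidden")     # explicit refusal with an error code
    return (200, "OK")

print(respond({}, lambda r: True))                     # blocked with 403
print(respond({}, lambda r: True, silent_drop=True))   # silently dropped
print(respond({}, lambda r: False))                    # served normally
```

The choice between the two branches is the signaling decision discussed earlier: a 403 tells the operator they were detected, while a silent drop leaves them guessing.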

Blocking is an effective policy when it comes to minimizing the impact of bot traffic and the potential risk of opening up your system to threats such as application-level DDoS, which is why it is usually employed by security devices.

However, when looking at a bot management strategy, there are some negative aspects to using blocking.

The feedback given to users is very limited: they will see an error page with a limited (if any) explanation of why the error has occurred—in the worst case, this will just be a standard browser failed-connection error page. This is acceptable in situations where you are confident that the automated traffic identification you have made is definitely accurate. If there is any element of uncertainty and a chance that you may have incorrectly identified a human user, then an ideal policy will give the user a means of being made aware of that mistake and how to rectify it.

Blocking bots also sends an obvious message to bot operators that you have identified them and have put a management strategy into place. This can simply trigger the introduction of more sophisticated approaches to bypass your identification systems.


Validation Requests

Rather than simply blocking requests, a second approach that removes the risk of false positives is to replace the requested response with an alternative response that gives users the chance to identify themselves as human, and therefore as having been incorrectly identified as automated traffic.

Validation can be implemented offline or inline.

Offline Validation

One approach is to provide the user with a web-based form, allowing them to submit their details and information about themselves to a system administrator. The system administrator will then assess the details submitted and determine whether an update to the system rules is needed to allow further connections received from this user to access the system.

The limitation of an offline validation process is that it carries a large human overhead, but it does mean that human intelligence can be applied. This can be especially useful when managing large blacklists.

A means of handling feedback about inaccurate bot identification is necessary, even if not provided via a dedicated web form.

Equally negative, from the human user’s point of view, is that it will be a slow process to get the situation resolved, meaning that this potential customer will likely not return to your website.

On the other hand, this is likely to be a deterrent to bot operators, as it is a more difficult process to bypass.

Inline Validation

Inline validation, by contrast, aims to give human users the ability to validate themselves as not being bots within the page that is returned, thereby gaining instant access to the system.

The validation will typically involve carrying out an activity that cannot be completed by bot traffic. This usually means completing a CAPTCHA-type test.


CAPTCHA

CAPTCHA (or Completely Automated Public Turing test to tell Computers and Humans Apart) is a term that was coined in 2003 for tests that are designed to be passable by humans but beyond the ability of computers.

CAPTCHA tests look to test the human intelligence abilities of variation recognition, segmentation, and understanding of context. To replicate these, a computer would have to use artificial intelligence.

The most common form of CAPTCHA is displaying a distorted string of characters and numbers and asking the human user to type in those characters. This type of test has been around since the late 1990s.

Recently, artificial intelligence has advanced to the point that the majority of these types of CAPTCHAs can now be solved by automated processes. This has led to a new generation of more advanced tests, such as answering questions about images (e.g., “Which one of these images contains a vehicle?”) or completing tasks (e.g., orienting a picture so that it displays properly on the screen).

The advantages of this from a user satisfaction point of view are evident. If users can identify themselves as human within a single page, they can access the website they want instantly, thereby preventing the website from losing that customer. This means that a stricter approach can be taken in which suspected bots are challenged.

However, that does mean that it is easier for bot traffic to bypass the protection by identifying itself as human. This can be done in two ways: first, by using systems such as artificial intelligence to pass any tests that are put in place; and second, by using human users to pass the tests and allowing bot activity to continue on the back of this identification. There are third-party companies that provide banks of people who will pass CAPTCHA tests on demand in return for payment (typically fractions of a penny per successful CAPTCHA completion).


Alternative Servers/Caching

The bot management approach you employ will be determined by the problem that automated traffic is causing for you. In some cases there may not be a requirement to stop the traffic, only to reduce the impact it is having on your infrastructure, and therefore on cost and performance for other users.

In this case, there are several approaches that can be employed to serve users suspected of being automated traffic. These look at identifying alternative sources for the data that is being requested.

One approach is to just serve cached content to connections identified as automated. This could be from a caching server (e.g., a CDN) or from within your web application. The viability of this approach will depend on the nature of the data being requested and the impact of serving potentially stale data to legitimate users.

A second variation is to redirect all traffic identified as automated to a separate system (or area of the main system), thereby ensuring that there is minimal impact on human users from the overhead of handling large amounts of automated traffic. This approach focuses on minimizing the impact on other users and infrastructure costs—it makes no effort to mitigate any of the threats that may be present in automated traffic.
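A routing decision of this kind might look like the following sketch; the host names and the confidence threshold are illustrative, not drawn from any real deployment.

```python
def choose_origin(is_suspected_bot, confidence):
    """Route suspected bots to a cached/secondary origin so human
    traffic keeps the primary infrastructure to itself."""
    if is_suspected_bot and confidence >= 0.6:  # illustrative threshold
        return "https://cache.example.com"   # stale-but-cheap responses
    return "https://origin.example.com"      # full dynamic responses

print(choose_origin(True, 0.9))   # suspected bot -> cache
print(choose_origin(False, 0.0))  # human -> primary origin
```

A misrouted human simply receives slightly stale content, which is why this approach tolerates lower detection confidence than outright blocking.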

Alternative Content

The final approach that can be taken, and in some ways the most proactive one in that it actually takes the fight back to the bot operators, is to start serving alternative content to bot traffic.

The most obvious example of this would be serving alternative prices to traffic you have identified as competitor price scrapers or unapproved affiliates.

This offers two benefits: first, it means that the bot operators are making business decisions based on inaccurate data; and second, it gives you the ability to track the data after it has been served. If you start to see the invalid content or data displayed on other systems, then you can be confident they have been using your site as a source of data and take action based on that knowledge.
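A deliberately simple sketch of such a decoy-price scheme follows; the markup factor is arbitrary and purely illustrative, and a real system would derive it per session so a decoy sighted elsewhere can be traced back.

```python
DECOY_MARKUP = 1.17  # arbitrary offset; vary per session for traceability

def price_for(request_is_scraper, real_price):
    """Serve the real price to humans and a traceable decoy to scrapers."""
    if request_is_scraper:
        return round(real_price * DECOY_MARKUP, 2)
    return real_price

print(price_for(True, 100.00))   # decoy price served to a scraper
print(price_for(False, 100.00))  # real price served to a human
```

If the decoy figure later appears on a competitor’s site, that is the “trap street” evidence described below.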


Trap Street

In map making this concept is known as a “trap street.” Map makers would traditionally add fictitious items to their maps, meaning that if they saw such an item in any other map, the maker of that map had been copying their maps.

This approach can be risky, as if users are misidentified, invalid content could be served to legitimate users.

Past techniques for falsely improving search engine ranking included returning different content to search engine bots than that returned to real users, a tactic known as cloaking. Search engines were obviously not happy with this, as it undermined the quality of their search results. They thus employed measures to detect cloaking (such as sending requests masquerading as non-search users and comparing the results) and penalize sites where cloaking is detected. Any alternative content-based strategy must therefore be careful not to have an unwanted negative SEO impact.


Conclusion

There is no doubt that automated traffic is a major factor on the modern internet, and estimates that it will continue to grow would seem to be accurate, as there are very large amounts of money involved—the value of ad fraud alone in 2015 was estimated at $6.3 billion.

Automated traffic activity is not conducted only by lone-wolf hackers—there are major organized criminal groups dedicating large amounts of time and effort to targeting specific systems where they feel they can optimize revenue. They have large-scale automated activity scouring the internet for potential targets.

It is essential, then, that website owners take this problem seriously and start to get an understanding of the amount of automated traffic that is hitting their systems and, much more importantly, what that traffic is doing. As has been shown in this book, there is a wide variety of different activities being undertaken for a wide variety of purposes, and only by understanding this can you assess the risk/opportunity this traffic presents.

However, identification of this traffic and its intent is complex, as automated traffic is actively trying to stay ahead of any defenses that have been put in place. Identifying automated traffic requires the latest generation of machine learning and behavioral-based analysis, and that process takes time and effort to do reliably.

Only by taking this seriously and ensuring that you take control of your traffic can you ensure that you maximize the value you get from automated traffic, while minimizing the threat impact.


About the Author

Andy Still (@andy_still) is cofounder of Intechnica, a vendor-independent IT development and performance consultancy and creators of TrafficDefender, the industry-leading website traffic management tool, which specializes in identification and management of automated traffic.

With over 15 years of experience in IT, Andy specializes in application architecture for high scalability and throughput. Andy currently works on building Intechnica’s product range, which has successfully maintained service for a wide range of companies during peak events, and also advises Intechnica clients on how to build and maintain highly performant systems as well as how to optimize development processes and delivery.

Andy is the author of several O’Reilly books and is one of the organizers of the Web Performance Group North UK. He blogs irregularly at https://internetperformanceexpert.com.