
When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended)

USC/ISI Technical Report ISI-TR-725

May 2018

Giovane C. M. Moura, SIDN Labs and TU Delft

John Heidemann, USC/Information Sciences Institute

Moritz Müller, SIDN Labs and University of Twente

Ricardo de O. Schmidt, University of Passo Fundo

Marco Davids, SIDN Labs

ABSTRACT
The Internet's Domain Name System (DNS) is a frequent target of Distributed Denial-of-Service (DDoS) attacks, but such attacks have had very different outcomes: some attacks have disabled major public websites, while the external effects of other attacks have been minimal. While on one hand the DNS protocol is relatively simple, the system has many moving parts, with multiple levels of caching and retries and replicated servers. This paper uses controlled experiments to examine how these mechanisms affect DNS resilience and latency, exploring both the client side's DNS user experience and server-side traffic. We find that, for about 30% of clients, caching is not effective. However, when caches are full they allow about half of clients to ride out server outages that last less than cache lifetimes; caching and retries together allow up to half of the clients to tolerate DDoS attacks longer than cache lifetimes with 90% query loss, and almost all clients to tolerate attacks resulting in 50% packet loss. While clients may get service during an attack, tail latency increases for clients. For servers, retries during DDoS attacks increase normal traffic up to 8×. Our findings about caching and retries help explain why users see service outages from some real-world DDoS events, but minimal visible effects from others.

KEYWORDS
DNS, recursive DNS servers, caching, DDoS attacks, authoritative servers

ACM Reference Format:
Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). USC/ISI Technical Report ISI-TR-725, May 2018. ACM, New York, NY, USA, 20 pages.

1 INTRODUCTION
DDoS attacks have been growing in frequency and intensity for more than a decade. Large attacks have grown from 100 Gb/s in 2012 [4] to over 1 Tb/s in 2017 [34], and 1.7 Tb/s in 2018 [18, 22].


Such attacks are sourced from large botnets (for example, with Mirai peaking at 600k hosts [3]), fueled by the continued deployment of new devices. Gigabit-size attacks are commodities today, selling for a few dollars via DDoS-as-a-Service [44].

The Internet's Domain Name System (DNS) is a popular target of DDoS attacks. DNS is a very visible target, since name resolution is a necessary step in almost any Internet activity. Root DNS servers have seen multiple attacks over more than a decade [23, 33, 41, 42, 54], as well as threats of attacks [49]. Other authoritative DNS servers have also been attacked, with the huge October 2016 attack against Dyn [14] resulting in disruptions at a number of prominent websites, including Twitter, Netflix, and the New York Times [34].

The outcome of these attacks on services has varied considerably. The October 2016 Dyn attack is noted for disruption to websites that were using Dyn as their DNS provider, and extortion attempts often include DDoS [35]. However, multiple attacks on the DNS Root have occurred with, as far as has been reported, no visible service outages [41, 42].

An important factor in DNS resilience is heavy use of caching: we believe that differences in use of DNS caching contribute to the very different outcomes when DNS is subject to DDoS attack. Yet understanding DNS caching is difficult, with requests traveling from stub resolvers in web browsers and at client computers, to recursive resolvers at ISPs, which in turn talk to multiple authoritative DNS servers. There are many parts involved to fully resolve a DNS name like www.example.com: while the goal is an IP address (an A or AAAA DNS record), multiple levels of the hierarchy (root, .com, and .example.com) are often on different servers (requiring NS records), and DNSSEC may require additional information (RRSIG, DNSKEY, and DS records). Each of these records may have different cache lifetimes (TTLs), by choice of the operator or because of DNS cache timeouts. We explore caching through controlled experiments (§3) and analysis of real-world use (§4).

Another factor in DNS resilience is recursives that retry queries when they do not receive an answer. Recursives fail to receive answers occasionally due to packet loss, but pervasively during a DDoS attack. We examine how retries interact with caching to mitigate loss during DDoS attacks (§5) and their effects on authoritatives (§6).

This paper assesses DNS resilience during DDoS attacks, with the goal of explaining different outcomes from different attacks (§8) through understanding the role of DNS caching, retries, and use of multiple DNS recursive resolvers.

It is common knowledge that these factors "help", but knowing how and how much each contributes builds confidence in defenses. We consider this question both as an operator of an authoritative server and as a user, defining the DNS user experience: the latency and reliability users should expect.

Our first contribution is to build an end-to-end understanding of DNS caching. Our key result is that caching often behaves as expected, but about 30% of the time clients do not benefit from caching. While prior work has shown DNS resolution infrastructure can be quite complex [48], we establish a baseline DNS user experience by assessing the prevalence of DNS caching in the "wild", through both active measurements (§3) and through analysis of passive data from two DNS zones (.nl and the root zone, §4).

Our second contribution is to show that the DNS mechanisms of caching and retries provide significant resilience to the client user experience during denial-of-service (DDoS) attacks (§5). For example, about half of the clients continue to receive service during a full outage if caches are filled and do not expire during the attack. Often DDoS attacks cause very high loss, but not a complete outage. When a few queries succeed, caches amplify their benefits, even for attacks that are longer than the cache lifetime. With very heavy query loss (90%) on all authoritatives, full caches protect half of the clients, and retries protect 30%. With a DDoS that causes 50% packet loss, nearly all clients succeed, although with greater latency than typical.

Third, we show that there is a large increase in legitimate traffic during DDoS attacks: up to 8× the number of queries (§6). While DNS servers are typically heavily overprovisioned, this result suggests the need to review by how much. It also shows the importance that stub and recursive resolvers follow best practices and exponentially back off queries after failure, so as to not add fuel to the DDoS fire.
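As an illustration of that back-off best practice, the following is a minimal sketch in Python (send() is a hypothetical callback standing in for a resolver's query function; it is not taken from any specific resolver implementation):

```python
import random
import time

def query_with_backoff(send, retries=4, base=1.0):
    # send() is a hypothetical callback: it returns an answer, or None
    # on timeout. After each failure, wait twice as long (with jitter),
    # so retries do not add fuel to the DDoS fire.
    for attempt in range(retries):
        answer = send()
        if answer is not None:
            return answer
        time.sleep(base * (2 ** attempt) * random.uniform(0.5, 1.5))
    return None  # give up; report "no answer"
```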

Our final contribution is to suggest why users have seen relatively little impact from root servers' DDoSes, while customers of some DNS providers quickly felt attacks (§8). When cache lifetimes are longer than the duration of a DDoS attack, many clients will see service for names popular enough to be cached. While many websites use short cache timeouts to support control with DNS-based load balancing, they may wish to consider longer timeouts as part of strategies for DDoS defense. Retries provide additional coverage, preventing failures during large attacks.

This technical report [25] follows the conference paper at ACM IMC 2018 [26], adding additional information in several appendices to fill in details about some experiments and add additional scenarios that supplement the paper. The technical report was first released in May 2018 and was updated in September 2018 to include input from the IMC reviewers and shepherd.

All public datasets from this paper are available [24], with our RIPE Atlas data also available from RIPE [38]. Privacy concerns prevent release of the .nl and Root data (§4).

2 BACKGROUND
As background, we briefly review the components of the DNS ecosystem and how they interact with IP anycast.

2.1 DNS Resolvers: Stubs, Recursives, and Authoritatives

Figure 1 shows the relationship between the three components of DNS resolution: stub resolvers, recursive resolvers, and authoritative servers.

Figure 1: Relationship between stub resolvers (yellow; e.g., the OS or applications), recursive resolvers (red; first level, e.g., a modem: R1a, R1b; nth level, e.g., an ISP resolver: Rna, Rnb) with their caches (blue; CR1a, CR1b, CRna, CRnb), and authoritative servers (green; AT1...ATn, e.g., ns1.example.nl).

Authoritative servers (authoritatives hereafter) are servers that know the contents of a given DNS zone and can answer queries without asking other servers [11].

Resolvers, on the other hand, are servers that can ask queries of other servers on behalf of others [20]. Stub resolvers run directly on clients and query one or a few recursive resolvers (shortened to stubs and recursives here). Recursives perform the full resolution of a domain name, querying one or more authoritatives, while caching responses to avoid repeatedly requesting popular domains (e.g., .com or google.com). Sometimes recursives operate in multiple tiers, with clients talking directly to R1 resolvers, which forward queries to other Rn resolvers, which ultimately contact authoritatives.
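The two halves of this chain can be seen in a few lines of Python (a sketch assuming the dnspython library; the authoritative address below is a documentation-prefix placeholder, not a real server for example.nl):

```python
import dns.message
import dns.query
import dns.resolver

# The stub's view: ask the OS-configured recursive, which walks the
# hierarchy (root, .nl, example.nl) and caches what it learns.
answer = dns.resolver.Resolver().resolve("example.nl", "A")
print(answer.rrset)

# The recursive's final step: query an authoritative directly.
# 192.0.2.53 is a placeholder standing in for ns1.example.nl.
query = dns.message.make_query("example.nl", "A")
response = dns.query.udp(query, "192.0.2.53", timeout=3)
print(response.answer)
```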

In practice, stubs are part of the client OS or browser, recursives are provided by ISPs, and authoritatives are run by DNS providers or large organizations. Multi-level recursives might have R1 at a home router and Rn in the ISP, or might occur in large public DNS providers.

2.2 Authoritative Replication and IP Anycast
Replication of a DNS service is important to support high reliability and capacity and to reduce latency. DNS has two complementary mechanisms to replicate service. First, the protocol itself supports nameserver replication of DNS service for a zone (.nl or example.nl), where multiple servers operate on different IP addresses, listed by that zone's NS records. Second, each of these servers can run from multiple physical locations with IP anycast, by announcing the same IP address from each and allowing Internet routing (BGP) to associate clients with each anycast site. Nameserver replication is recommended for all zones, and IP anycast is used by most large zones such as the DNS Root and most top-level domains [23, 43]. IP anycast is also widely used by public resolvers, recursive resolvers that are open for use by anyone on the Internet, such as Google Public DNS [12], OpenDNS [29], Quad9 [37], and 1.1.1.1 [1].

Figure 2 illustrates the relationship of elements in the DNS infrastructure when anycast is in place. The recursive resolver R is announced using an anycast prefix from five different anycast sites (R1 to R5). Each anycast site can have multiple servers, with DNS traffic designated to them by load-balancing algorithms.

Figure 2: Stub resolver and anycast recursive (anycast sites R1 to R5 of recursive R, authoritative AT; anycast and unicast paths shown).

When the stub resolver sends a DNS query to the anycast address of R, BGP forwards the query to the nearest site of R (R3 in this example); note that "nearest" metrics can vary [8]. The BGP mapping between source and anycast destination is known as the anycast catchment, which under normal conditions is very stable across the Internet [53]. On receiving the query, R3 contacts the pertinent authoritative AT to resolve the queried domain name. The communication from R3 to AT is done using R3's unicast address, ensuring the reply from AT is sent back to R3 instead of any other anycast site of R. On receiving the answer from AT, R3 replies to the stub resolver with the answer to its query, and for this reply R3 uses its anycast address as source.

2.3 DNS Caching with Time-to-Live (TTLs)
DNS depends on caching to reduce latency to users and load on servers. Authoritatives provide responses that are then cached in applications, stub resolvers, and recursive resolvers. We next describe its loose consistency model.

An authoritative server defines the lifetime of each result by its Time-to-Live (TTL); although TTLs are not usually exposed to users, this information is propagated through recursive resolvers.

Once cached by recursive resolvers, results cannot be removed; they can only be refreshed, by a new query and response, after the TTL expires.

Some recursive resolvers discard long-lived cache entries after a configurable timeout: BIND defaults to dropping entries after 1 week [17], and Unbound after 1 day [28].

Operators select TTLs carefully. Content delivery networks (CDNs) often use DNS to steer users to different content servers. They select very short TTLs (60 seconds or less) to force clients to re-query frequently, providing opportunities to redirect clients with DNS in response to changes in load or server availability [30]. Alternatively, DNS data for top-level domains often has TTLs of hours or days. Such long TTLs reduce latency for clients (the reply can be reused immediately if it is in the cache of a recursive resolver) and reduce load on servers for commonly used top-level domains and slowly changing DNSSEC information.
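The decrementing TTL that recursives expose is easy to observe; the sketch below (assuming dnspython, and using Quad9 as an example public resolver) queries the same name twice, where a warm cache reports a TTL below the zone's configured value:

```python
import time
import dns.resolver

resolver = dns.resolver.Resolver()
resolver.nameservers = ["9.9.9.9"]  # Quad9, one of the public resolvers above

first = resolver.resolve("example.com", "A")
time.sleep(10)
second = resolver.resolve("example.com", "A")

# A cold cache returns the zone's configured TTL; ten seconds later a
# warm cache returns roughly ttl-10, counting down toward expiry.
print(first.rrset.ttl, second.rrset.ttl)
```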

3 DNS CACHING IN CONTROLLED EXPERIMENTS

To understand the role of caching at recursive resolvers in protection during failure of authoritative servers, we first must understand how often cache lifetimes (TTLs) are honored.

In the best-case scenario, authoritative DNS operators may expect clients to be able to reach domains under their zones even if their authoritative servers are unreachable, for as long as cached values in the recursives remain "valid" (i.e., the TTL has not expired). Given the large variety of recursive implementations, we pose the following question from a user point-of-view: can we rely on recursives' caching when authoritatives fail?

To understand cache lifetimes in practice, we carry out controlled measurements from thousands of clients. These measurements determine how well caches work in the field, complementing our understanding of how open source implementations work from their source code. This study is important because operational software can vary, and large deployments often use heavily customized or closed source implementations [48].

3.1 Potential Impediments to Caching
Although DNS records should logically be cached for the full TTL, a number of factors can shorten cache lifetimes in practice: caches are of limited size, caches may be flushed prematurely, and large resolvers may have fragmented caches. We briefly describe these factors here; understanding how often they occur motivates the measurements we carry out.

Caches are of limited size: Unbound, for example, defaults to a 4 MB limit, but the value is configurable. In practice, DNS results are small enough and caches large enough that cache sizes are usually not a limiting factor. Recursive resolvers may also override record TTLs, imposing either a minimum or maximum value [52].

Caches can be flushed explicitly (at the request of the cache operator) or accidentally, on restart of the software or reboot of the machine running the cache.

Finally, some recursive resolvers handle very high request rates; consider a major ISP or public resolver [12, 29, 37]. Large recursive resolvers are often implemented as many separate recursives behind a load balancer or on IP anycast. In such cases the caches may be fragmented, with each machine operating an independent cache, or they may share a cache of common names. In practice, these may reduce the cache hit rate.

3.2 Measurement Design
To evaluate caching, we use controlled experiments where we query specific names, served by authoritative servers we run, from thousands of RIPE Atlas sites. Our goal is to measure whether the TTL we define for the RRs of our controlled domain is honored across recursives.

Authoritative servers: we deploy two authoritatives that answer for our new domain name (cachetest.nl). We place the authoritatives on virtual machines in the same datacenter (Amazon EC2 in Frankfurt, Germany), each at a distinct unicast IPv4 address. Each authoritative runs BIND 9.10.3. Since both authoritatives are in the same datacenter, they will have similar latencies to recursives, so we expect recursives to evenly distribute queries between both authoritative servers [27].

Vantage Points: we issue queries to our controlled domain from around 9k RIPE Atlas probes [39]. Atlas probes are distributed across 3.3k ASes, with about one third hosting multiple vantage points (VPs). Atlas software causes each probe to issue queries to each of its local recursive resolvers, so our VPs are the tuple of probe and recursive. The result is that we have more than 15k VPs (Table 1).

Table 1: Caching baseline experiments [38].

               TTL      60    1800    3600   86400  3600-10min
Probes               9173    9216    8971    9150        9189
Probes (valid)       8725    8788    8549    8750        8772
Probes (disc.)        448     428     422     400         417
VPs                 15330   15447   15052   15345       15397
Queries             94856   96095   93723   95780      191931
Answers             90525   91795   89470   91495      183388
Answers (valid)     90079   91461   89150   91172      182731
Answers (disc.)       446     334     323     323         657

Queries and Caching: we take several steps to ensure that caching does not interfere with queries. First, each query is for a name unique to the probe: each probe requests an AAAA record for {probeid}.cachetest.nl, where {probeid} is the probe's unique identifier. Each reply is also customized. In the AAAA reply we encode three fields that are used to determine the effectiveness of caching (§3.4). Each IPv6 address in the answer is the concatenation of four values (in hex):

prefix: a fixed 64-bit value (fd0f:3897:faf7:a375).
serial: an 8-bit value, incremented every 10 minutes (zone file rotation), allowing us to associate replies with specific query rounds.
probeid: the unique Atlas probeID [40], encoded in 32 bits, to associate the query with the reply.
ttl: a 16-bit value, the TTL value we configure per experiment.

We increment the serial number in each AAAA record and reload the zone (with a new zone serial number) every 10 minutes. The serial number in each reply allows us to distinguish cached results from prior rounds from fresh data in this round.

Atlas DNS queries time out after 5 seconds, reporting "no answer". We will see this occur in our emulated DDoS events.

For example, a VP with probeID 1414 will send a DNS query for the AAAA record of the domain name 1414.cachetest.nl, and should receive an AAAA answer in the format $PREFIX:1:586:3c, where $SERIAL=1 and $TTL=60.
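A sketch of this encoding in Python (the exact placement of the fields inside the low 64 bits is our assumption; the text above specifies only the field widths and order):

```python
import ipaddress

PREFIX = 0xfd0f3897faf7a375  # the fixed 64-bit prefix

def encode_answer(serial: int, probeid: int, ttl: int) -> str:
    # Assumed layout of the low 64 bits: serial | probeid | ttl.
    low = (serial << 48) | (probeid << 16) | ttl
    return str(ipaddress.IPv6Address((PREFIX << 64) | low))

def decode_answer(addr: str):
    low = int(ipaddress.IPv6Address(addr)) & ((1 << 64) - 1)
    return low >> 48, (low >> 16) & 0xFFFFFFFF, low & 0xFFFF

# Probe 1414 (0x586), serial 1, TTL 60 (0x3c):
print(encode_answer(1, 1414, 60))   # fd0f:3897:faf7:a375:1:0:586:3c
print(decode_answer("fd0f:3897:faf7:a375:1:0:586:3c"))  # (1, 1414, 60)
```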

We focus on DNS over UDP on IPv4, not TCP or IPv6. We use only IPv4 queries from Atlas probes and serve only IPv4 authoritatives, but IPv6 may be used inside multi-level recursives. Our work could extend to cover other protocols, but we did not want to complicate analysis with the orthogonal issue of protocol selection. We focus on DNS over UDP because it is by far the dominant transport protocol today (more than 97% of connections for .nl [50] and most Root DNS servers [16]).

Query Load: the query rate of our experiments is designed to explicitly test how queries intersect with TTL experimentation, not to reproduce real-world traffic rates. Popular domains such as .com will be queried much more frequently than our query rates, so our results represent lower bounds on caching. In §4 we examine caching rates with real-world names under .nl, testing a range of name popularities.

TTL: TTL values vary significantly in DNS, with top-level domains typically using 1-day TTLs, while CDNs often use short TTLs of 1 or 5 minutes. Given this diversity of configurations, we explicitly design experiments that cover the range from 1 minute to 1 day (60 s to 86400 s TTLs). Thus, rather than trying to capture a single TTL that represents all possible configurations, we study a range of TTLs to explore the full range of caching behavior. §4 examines real-world traffic to provide a view of how well caching works with the distribution of TTLs seen in actual queries.

Representativeness of Atlas Locations and Software: it is well known that the global distribution of RIPE Atlas probes is uneven; Europe has far more than elsewhere [5, 6, 46]. Although quantitative data analysis might be generally affected by this distribution bias, our qualitative analysis, contributions, and conclusions do not depend on the geographical location of probes.

Atlas probes use identical stub resolver software, but they are deployed in diverse locations (homes, businesses, universities) and so see a diverse set of recursives, vendors, and versions. Our study therefore represents Atlas "in the wild" and does not try to study specific software versions or vendors. Although we claim our study captures diverse recursive resolvers, we do not claim they are representative of a "typical" Internet client. It complements prior studies on caching by establishing what Atlas sees, a baseline needed when we study DDoS in §5.

3.3 Datasets
We carried out five experiments, varying the cache lifetime (TTL) and the probing frequency from the VPs. Table 1 lists the parameters of the experiments. In the first four measurements, the probing interval was fixed to 20 minutes and the TTL for each AAAA was set to 60, 1800, 3600, and 86400 seconds, all frequently used TTL values. For the fifth measurement, we fixed the TTL value to 3600 seconds and reduced the probing interval to 10 minutes to get better resolution of dynamics.

In each experiment, queries were sent from about 9k Atlas probes. We discard 400-448 of these ("probes (disc.)", about 4.4% to 4.9% of probes) that do not return an answer. Successful Atlas probes query multiple recursive resolvers, each a Vantage Point, so each experiment results in about 15k VPs. We also discard 323-657 answers ("answers (disc.)", about 0.35% to 0.49% of answers) because they report error codes (for example, SERVFAIL and REFUSED [21]) or they are referrals instead of the desired AAAA records [15]. (We provide more detail about referrals in Appendix A.)

Overall, we send about 93-96k queries to cachetest.nl from the 9k probes at 20-minute pacing, and about double that with 10-minute pacing. Experiments last two to three hours, with no interference between experiments due to the use of unique names. We ensure that experiments are isolated from each other. First, we space experiments about one day apart (details in RIPE [38]). Second, the IP addresses (and their records in cachetest.nl) of both authoritative name servers change in each experiment when we restart their VMs. Finally, we change the replies in the AAAA records so we can detect any stale results (see §3.2).

3.4 TTL distribution: expected vs. observed
We next investigate how often recursive resolvers honor the full TTL provided by authoritative servers. Our goal is to classify the valid DNS answers from Table 1 into four categories, based on where the answer comes from and where we expect it to come from:

AA: answers expected and correctly from the authoritative.
CC: answers expected and correctly from a recursive cache (cache hits).
AC: answers from the authoritative, but expected to be from the recursive's cache (a cache miss).
CA: answers from a recursive's cache, but expected from the authoritative (an extended cache).

Table 2: Valid DNS answers (expected/observed).

                 TTL      60    1800    3600   86400  3600-10m
Answers (valid)        90079   91461   89150   91172    182731
1-answer VPs              38      51      49      35        17
Warm-up (AAi)          15292   15396   15003   15310     15380
  Duplicates              25      23      25      22        23
  Unique               15267   15373   14978   15288     15357
  TTL as zone          14991   15046   14703   10618     15092
  TTL altered            276     327     275    4670       265
AA                     74435   21574   10230     681     11797
CC                       235   29616   39472   51667    107760
  CCdec                    4       5    1973    4045      9589
AC                        37   24645   24091   23202     47262
  TTL as zone              2   24584   23649   13487     43814
  TTL altered             35      61     442    9715      3448
CA                        42     179     305     277       515
  CAdec                    7       3      21      29        65

To determine if a query should be answered by the cache of the recursive, we track the state of prior queries and responses, and the estimated TTL. Tracking state is not hard, since we know the initial TTL and all queries to the zone, and we encode the serial number and the TTL in the AAAA reply (§3.2).
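A sketch of this classification logic (the data structure and field names are our own; a real implementation must keep state per VP, and §3.5 shows why per-recursive state matters):

```python
from typing import NamedTuple, Optional

class Answer(NamedTuple):
    time: float    # seconds since the experiment started
    serial: int    # zone-rotation serial encoded in the AAAA reply (§3.2)

def classify(prev: Optional[Answer], cur: Answer,
             cur_round: int, zone_ttl: int) -> str:
    # Expectation: if the previous answer is still within the zone TTL,
    # the recursive's cache should be able to answer this query.
    expect_cache = prev is not None and (cur.time - prev.time) < zone_ttl
    # Observation: a cached reply repeats an old serial, while a fresh
    # authoritative reply carries the current round's serial.
    from_cache = cur.serial != cur_round
    if expect_cache:
        return "CC" if from_cache else "AC"   # AC: an unexpected cache miss
    return "CA" if from_cache else "AA"       # CA: an extended cache
```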

Cold Caches and Rewriting TTLs: we first consider queries made against a cold cache (the first query of a unique name) to test how many recursives override the TTL. We know that this happens at some sites, such as at Amazon EC2, where the default recursive resolver for their virtual machines (VMs) caps all TTLs to 60 s [36].

Table 2 shows the results of our five experiments, in which we classify the valid answers from Table 1. Before classifying them, we first disregard VPs that had only one answer (1-answer VPs), since we cannot evaluate their cache's status from a single answer (at most 51 VPs out of 15,000, depending on the experiment). Then we classify the remaining first queries as warm-up queries (AAi), all of which are type AA (expected and answered by the authoritative server).

We see some duplicate responses; for these, we use the timestamp of the very first AAi received. We then classify each unique AAi by comparing the TTL value returned by the recursive with the expected TTL that is encoded in the AAAA answer (fixed per experiment). The "TTL as zone" line counts the answers we expect to get, while "TTL altered" shows that a few hundred recursive resolvers alter the TTL. If these two values differ by more than 10%, we report TTL altered.

We see that the vast majority of recursives honor small TTLs, with only about 2% truncating the TTL (275 to 327 of about 15,000, depending on the experiment's TTL). We and others (§7) see TTL truncation from multiple ASes. The exception is for queries with day-long TTLs (86400 s), where 4670 queries (30%) have shortened TTLs. (Prior work also reported that many public resolvers refresh at 1 day [51].) We conclude that wholesale TTL shortening does not occur for TTLs of an hour or less.

TTLs with Warm Cache: we next consider a warm cache, i.e., subsequent queries where we believe the recursive should have the prior answer cached, and classify them according to the proposed categories (AA, CC, AC, and CA).

Figure 3: Classification of subsequent answers with warm cache (cache-miss fractions: 0.0% at 60 s, 32.6% at 1800 s, 32.9% at 3600 s, 30.9% at 86400 s, 28.5% at 3600 s-10min).

Figure 3 shows a histogram of this classification (numbers shown in Table 2). We see that most answers we receive show expected caching behavior. For 60 s TTLs (the left bar), we expect no queries to be cached when we re-query 20 minutes (1200 s) later, and we see few cache hits (235 queries, the CC row of Table 2, which are due to TTL rewriting to values larger than 20 minutes). We see only a handful of CA-type replies, where we expect the authoritative to reply and the recursive does instead. We conclude that under normal operations (with authoritatives responding), recursive resolvers do not serve stale results (as has been proposed for when the authoritative cannot be reached [19]).

For longer TTLs we see cache-miss (AC) fractions of 28% to 33%, computed as AC / (Answers (valid) − (1-answer VPs + Warm-up)). Most of the AC answers did not alter the TTL ("TTL as zone" under AC in Table 2); i.e., the cache miss was not due to TTL manipulations. We do see 9715 TTL modifications (about 42% of ACs) when the TTL is 1 day (86400 s). These TTL truncations are consistent with recursive resolvers that limit cache durations, such as the default caps of 7 days in BIND [17] and 1 day in Unbound [28]. (We provide more detail about TTL manipulations in Appendix B.)

We conclude that DNS caches are fairly effective, with cache hits about 70% of the time. This estimate is likely a lower bound: we are the only users of our domain, and popular domains would see more cache hits due to requests from other users. We only see TTL truncation for day-long TTLs. This result will help us understand the role of caching when authoritatives are under stress.

3.5 Public Recursives and Cache Fragmentation

Although we showed that most requests are cached as expected, about 30% are not. We know that many DNS requests are served by public recursive resolvers today, several of which exist [1, 12, 29, 37]. We also know that public recursives often use anycast and load balancing [48], and that this can result in caches that are fragmented (not shared) across many servers. We next examine how many cache misses (type AC replies) are due to public recursives.

Although we control queriers and authoritative servers, there may be multiple levels of recursive resolvers in between. From Figure 1, we see the querier's first-hop recursive (R1) and the recursive that queries the authoritative (Rn). Fortunately, queries and replies are unique, so we can relate queries to the final recursive knowing the time (the query round) and the query source. For each query q, we extract the IP address of Rn and compare it against a list of IP addresses for 96 public recursives (Appendix C) that we obtain from a DuckDuckGo search for "public dns" done on 2018-01-15.

Table 3: AC answers, public resolver classification.

                 TTL      60    1800    3600   86400  3600-10m
AC Answers                37   24645   24091   23202     47262
Public R1                  0   12000   11359   10869     21955
  Google Public R1         0    9693    9026    8585     17325
  other Public R1          0    2307    2333    2284      4630
Non-Public R1             37   12645   12732   12333     25307
  Google Public Rn         0    1196    1091     248      1708
  other Rn                37   11449   11641   12085     23599

Table 3 reexamines the AC replies from Table 2. With the exception of the measurements with a TTL of 60 s, nearly half of AC answers (cache misses) are from queries to public R1 recursives, and about three-quarters of these are from Google's Public DNS. The other half of cache misses start at non-public recursives, but 10% of these eventually emerge from Google's DNS.

Besides identifying public recursives, we also see evidence of cache fragmentation in answers from caches (CC and CA). Sometimes we see serial numbers in consecutive answers decrease. For example, one VP reports serial numbers 1, 3, 3, 7, 3, 3, suggesting that it is querying different recursives, one with serial 3 and another with serial 7 in its cache. We show these occurrences in Table 2 as CCdec and CAdec. With longer TTLs we see more cache fragmentation, with 4.5% of answers showing fragmentation with day-long TTLs.
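Detecting this is straightforward once the serials are extracted from the answers; a small sketch:

```python
def looks_fragmented(serials):
    # A serial that decreases between consecutive answers means this VP
    # was answered by a second, older cache: the CCdec/CAdec cases above.
    return any(later < earlier
               for earlier, later in zip(serials, serials[1:]))

print(looks_fragmented([1, 3, 3, 7, 3, 3]))  # True: the example VP above
```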

From these observations we conclude that cache misses result from several causes: (1) use of load balancers or anycast, where servers lack shared caches; (2) first-level recursives that do not cache and have multiple second-level recursives; and (3) caches that may reset between the somewhat long probing intervals (10 or 20 minutes). Causes (1) and (2) occur in public resolvers (confirmed by Google [12]) and account for about half of the cache misses in our measurements.

4 CACHING PRODUCTION ZONES
In §3 we show that about one-third of queries do not conform with caching expectations, based on controlled experiments to our test domain. (Results may be better for caches that prioritize popular names.) We next examine this question for specific records in .nl, the country-code domain (ccTLD) for the Netherlands, and the Root (.) DNS zone. With traffic from "the wild" and a measurement target used by millions, this section uses a domain popular enough to stay in-cache at recursives.

4.1 Requests at .nl's Authoritatives
We apply this methodology to data for the .nl country-code top-level domain (ccTLD). We look specifically at the A records for the nameservers of .nl: ns[1-5].dns.nl.

Methodology: we use passive observations of traffic to the .nl authoritative servers.

For each target name in the zone and source (some recursive server identified by IP address), we build a timeseries of all requests and compute their interarrival time, ∆t. Following the classification from §3.4, we label queries as AC if ∆t < TTL, showing an unnecessary query to the authoritative, or AA if ∆t ≥ TTL, an expected or delayed cache refresh. (We do not see cache hits, and so there are no CC events.)

Figure 4: ECDF of the median ∆t for recursives with at least 5 queries to ns[1-5].dns.nl (TTL of 3600 s).
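A sketch of this labeling (the input format, a list of (source, name, timestamp) tuples, is our own choice, not the ENTRADA schema):

```python
from collections import defaultdict

def label_queries(queries, ttl=3600):
    # queries: iterable of (source, name, timestamp-in-seconds) tuples.
    # For each (source, name) stream, label every interarrival time:
    # AC if it is shorter than the TTL (an unnecessary re-query),
    # AA if it is at least the TTL (an expected or delayed refresh).
    last_seen = {}
    labels = defaultdict(list)
    for src, name, t in sorted(queries, key=lambda q: q[2]):
        key = (src, name)
        if key in last_seen:
            delta = t - last_seen[key]
            labels[key].append("AC" if delta < ttl else "AA")
        last_seen[key] = t
    return labels
```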

Dataset: at the time of our analysis (February 2018), there were 8 authoritative servers for the .nl zone. We collect traffic for the 4 unicast and one anycast authoritative servers, and store the data in ENTRADA [55] for analysis.

Since our data for .nl is incomplete, and we know recursives will query all authoritatives over time [27], our analysis represents a conservative estimate of TTL violations; we expect to miss some AC-type queries from resolvers to non-monitored authoritatives.

We collect data for a period of six hours on 2018-02-22, starting at 12:00 UTC. We only evaluate recursives that sent at least five queries for our domains of interest, omitting infrequent recursives (they do not change results noticeably). We discard duplicate queries, for example, a few retransmissions (less than 0.01% of the total queries). In total, we consider more than 485k queries from 7779 different recursives.

Results: Figure 4 shows the distribution of ∆t that we observe in our measurements, reporting the median ∆t for any resolver that sends at least 5 queries.

About 28% of queries are frequent, with an interarrival time of less than 10 s, and 32% of these are sent to multiple authoritatives. We believe these are due to recursives submitting queries in parallel to speed up replies (perhaps the "Happy Eyeballs" algorithm [45]).

Since these closely-timed queries are not related to recursive caching, we exclude them from analysis. The remaining data is 348k queries from 7703 different recursives.

The largest peak is at 3600 s, which is expected: the name was queried and cached for the full hour-long TTL, and the next request then causes the name to be re-fetched. These queries are all of type AA.

The smaller peak around 1800 s, as well as queries with other interarrival times less than 3600 s, correspond to type AC queries: queries that could have been supplied from the cache but were not. 22% of resolvers sent most of their queries within a time interval less than 3600 s, or even more frequently. These AC queries occur because of TTL limiting, cache fragmentation, or other reasons that clear the cache.

4.2 Requests at the DNS Root
In this section we perform a similar analysis as for §4.1, in which we look into DNS queries received at all Root DNS servers (except G-Root), and create a distribution of the number of queries received per source IP address (i.e., per recursive).

Figure 5: Distribution of the number of queries for the DS record of .nl received from each recursive. Dataset: DNS-OARC DITL, 2017-04-12t00:00Z, for 24 hours. All Root servers with similar distributions are shown as light-gray lines.

In this analysis we use data from the DITL (Day In The Life) dataset of 2017, available at DNS-OARC [9]. We look at all DNS queries for the DS record of the domain .nl received at the Root DNS servers over the entire day of April 12, 2017 (UTC). This dataset consists of queries from more than 703k unique recursives seen across all Root servers. Note that the DS record for .nl has a TTL of 86400 seconds (24 hours). That is, in theory, one could expect to see just one query per recursive arriving at a given root letter for the DS record of .nl within the 24-hour interval.

Each line in Figure 5 shows the distribution of the total number of queries received at the Root servers from individual recursives asking for the DS record of .nl. Besides F- and H-Root, the distribution is similar across all Root servers; these are plotted as light-gray lines. F-Root shows the "most friendly" behavior from recursives, where around 5% of them sent 5 or more queries for .nl. As opposed to F, H-Root (dotted red line) shows the "worst" behavior from recursives, where more than 10% of them sent 5 or more queries for .nl within the 24-hour period.

The solid black line in Figure 5 shows the distribution for all the queries across all Root servers. The majority (around 87%) of recursives send only one query within the 24-hour interval. However, considering all Root servers, we see around 13% of recursives that sent multiple queries. Note that the distributions shown in Figure 5 have (very) long tails, and we see up to more than 21.8k queries from a single recursive within the 24-hour period for the .nl DS record, i.e., roughly one query every 4 seconds from the same IP address for the same DS record.
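The quantity plotted in Figure 5 is simply a per-source count over the trace; a sketch (the record format is our own, not the DITL schema):

```python
from collections import Counter

def ds_queries_per_recursive(records):
    # records: iterable of (source_ip, qname, qtype) tuples from the trace.
    # If every recursive honored the 24-hour TTL, each count would be
    # close to 1 per root letter; the long tail shows many do not.
    return Counter(src for src, qname, qtype in records
                   if qname == "nl." and qtype == "DS")
```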

Discussion: we conclude that measurements of popular domains within .nl (§4.1) and the Roots (§4.2) show that about 63% and 87% of recursives honor the full TTL, respectively. These results are roughly in line with our observations with RIPE Atlas (§3).

5 THE CLIENT'S VIEW OF AUTHORITATIVES UNDER DDOS

We next use controlled experiments to evaluate how DDoS attacks on authoritative DNS servers impact client experience. Our studies of caching in controlled experiments (§3) and passive observations (§4) have shown that caching often works, but not always: about 70% of controlled experiments and 30% of passive observations see full cache lifetimes. Since results of specific experiments vary, we sweep the space of attack intensities to understand the range of response, from complete failure of authoritative servers to partial failures.

5.1 Emulating DDoS
To emulate DDoS attacks, we begin with the same test domain (cachetest.nl) we used for controlled experiments in §3. We run a normal DNS service for some time, querying from RIPE Atlas. After caches are warm, we then simulate a DDoS attack by dropping some fraction, or all, of the incoming DNS queries to each authoritative. (We drop incoming traffic randomly with Linux iptables; as such, packet drop is not biased towards any recursive.) After we begin dropping traffic, answers come either from caches at recursives or, for partial attacks, from a lucky query that passes through.
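A sketch of this emulation as we might script it on an authoritative, using iptables' statistic match (the invocation details are illustrative):

```python
import subprocess

def emulate_ddos(loss_fraction: float) -> None:
    # Randomly drop a fraction of inbound DNS-over-UDP queries,
    # approximating congestion loss at the attacked authoritative.
    subprocess.run([
        "iptables", "-A", "INPUT", "-p", "udp", "--dport", "53",
        "-m", "statistic", "--mode", "random",
        "--probability", str(loss_fraction),
        "-j", "DROP",
    ], check=True)

emulate_ddos(0.5)  # e.g., Experiment E: 50% of queries lost
```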

This emulation of DDoS captures the traffic loss that occurs in a DDoS attack as router queues overflow. The emulation is not perfect, since we simulate loss at the last-hop router, while in real DDoS attacks packets are often lost on access links near the target. Our emulation approximates this effect with one aggregate loss rate.

DDoS attacks are also accompanied by queueing delay, since buffers at and near the target are full. We do not model queueing delay, although we do observe latency increasing due to retries. In modern routers, queueing delay due to full router buffers should be less than the retry interval. In addition, observations during real-world DDoS events show that the few queries that are successful see response times that are not much higher than typical [23], suggesting that loss (and not delay) is the dominant effect of DDoS in practice. However, a study that adds queueing latency to the attack model is interesting future work.

5.2 Clients During Complete Authoritatives Failure

We first evaluate the worst-case scenario for a DNS operator: complete unreachability of all authoritative name servers. Our goal is to understand when and for how long caches cover such an outage.

Table 4 shows Experiments A, B, and C, which simulate complete failure. In Experiment A, each VP makes only one query before the DDoS begins. In Experiment B, we allow several queries to take place, and Experiment C allows several queries with a shorter TTL.

Caches Protect Some: we first consider Experiment A, with one query that warms the cache, immediately followed by the attack. Figure 6a shows these responses over time, with the onset of the attack at the first downward arrow (between 0 and 10 minutes) and the cache expired after the second downward arrow (between 60 and 70 minutes). We see that after the DDoS starts, but before the cache has fully expired (between the downward arrows), initially 30% and eventually 65% of queries fail, with either no answer or a SERVFAIL error. While not good, this does mean that 35% to 70% of queries during the DDoS are successfully served from the cache. By contrast, shortly after the cache expires, almost all queries fail (only 25 VPs, or 0.2% of the total, seem to provide stale answers).

Caches Fill at Different Times: in a more realistic scenario, VPs have filled their caches at different times. In Experiment A, caches are freshly filled and should last for a full hour after the start of the attack. Experiment B is designed for the opposite and worst case: we begin warming the cache one hour before the attack and query 6 times from each VP. Other parameters are the same, with the attack lasting for 60 minutes (also the cache duration), but then we restore the authoritatives to service.

Table 4: DDoS emulation experiments [38]. DDoS start, durations, and probe interval are given in minutes.

Parameters:
     TTL    DDoS   DDoS  queries  total  probe     failure
     (sec)  start  dur   before   dur    interval
A    3600   10     60    1        120    10        100% (both NSes)
B    3600   60     60    6        240    10        100% (both NSes)
C    1800   60     60    6        180    10        100% (both NSes)
D    1800   60     60    6        180    10        50% (one NS)
E    1800   60     60    6        180    10        50% (both NSes)
F    1800   60     60    6        180    10        75% (both NSes)
G    300    60     60    6        180    10        75% (both NSes)
H    1800   60     60    6        180    10        90% (both NSes)
I    60     60     60    6        180    10        90% (both NSes)

Results:
     Total   Valid    VPs    Queries  Total    Valid
     probes  probes                   answers  answers
A    9224    8727     15339  136423   76619    76181
B    9237    8827     15528  357102   293881   292564
C    9261    8847     15578  258695   199185   198197
D    9139    8708     15332  286231   273716   272231
E    9153    8708     15320  285325   270179   268786
F    9141    8727     15325  278741   259009   257740
G    9206    8771     15481  274755   249958   249042
H    9226    8778     15486  269030   242725   241569
I    9224    8735     15388  253228   218831   217979

Figure 6: Answers (OK, SERVFAIL, no answer) received during DDoS attacks. (a) Experiment A (3600-10min-1down); arrows indicate DDoS start and cache expiration. (b) Experiment B (3600-10min-1down-1up); arrows indicate DDoS start and recovery. (c) Experiment C (1800-10min-1down-1up); arrows indicate DDoS start, cache expiration, and recovery.

Figure 7: Timeseries of answers (AA, CC, CA) for Experiment B.

Figure 6b shows the results of Experiment B. While about 50% of VPs are served from the cache in the first 10-minute round after the DDoS starts, the fraction served drops quickly and is at only about 3% one hour later. Three factors are in play here. First, most caches were filled 60 minutes before the attack and are timing out in the first round: while the timeout and query rounds are both 60 minutes apart, Atlas intentionally spreads queries out over 5 minutes, so we expect that some queries happen after 59 minutes and others after 61 minutes.

Second, we know some large recursives have fragmented caches (§3.5), so we expect that some of the successes between times 70 and 110 minutes are due to caches that were filled between times 10 and 50 minutes. This can actually be seen in Figure 7, where we show a timeseries of the answers for Experiment B, and we see CC (correct cache responses) between times 60 and 90.

Third, we see an increase in the number of CA queries, which are answered by the cache with expired TTLs (Figure 7). This increase is due to servers serving stale content [19].

Caches Eventually All Expire: finally, we carry out a third emulation, but with half the cache lifetime (1800 s or 30 minutes, rather than the full hour). Figure 6c shows responses over time. These results are similar to Experiment B, with a rapid fall-off when the attack starts as caches age. After the attack has been underway for 30 minutes, all caches must have expired, and we see only a few (about 26) residual successes.


5.3 Discussion of Complete Failures
Overall, we see that caching is partially successful in protecting during a DDoS. With full, valid caches, half or more VPs get service. However, caches are filled at different times and expire, so an operator cannot count on a full cache duration for any customers, even for popular ("always in the cache") domains. The protection provided by caches depends on their state in the recursive resolver, something outside the operator's control. In addition, our evaluation of caching in §3 showed that caches will end early for some VPs.

Second, we were surprised that a tiny fraction of VPs are successful after all caches should have timed out (after the 80-minute mark in Experiment A, and between 90 and 110 minutes in Experiment C). These successes suggest an early deployment of "serve stale", something currently under review in the IETF [19]: the idea is to serve a previously known record beyond its TTL if authoritatives are unreachable, with the goal of improving resilience under DDoS. We investigated Experiment A, where we see 1048 such answers among the 1140 successes in the second half of the outage. These successes are from 471 VPs (and 215 recursives), most of them answered by OpenDNS and Google public DNS servers, suggesting experimentation that is not yet widespread. Out of these 1048 queries, 1031 return a TTL value equal to 0, as specified in the IETF serve-stale draft [19].

5.4 Client Reliability During Partial Authoritative Failure

The previous section examined DDoS attacks that result in complete failure of all authoritatives, but often DDoS attacks result in partial failure, with 50% or 90% packet loss at the authoritatives. (For example, consider the November 2015 DDoS attack on the DNS Root [23].) We next study experiments with partial failures, showing that caching and retries together nearly fully protect against 50%-loss DDoS events, and protect half of VPs even during 90%-loss events.

We carry out several experiments, D through I in Table 4. We follow the procedure outlined in §5.1, looking at DDoS-driven loss rates of 50%, 75%, and 90%, with TTLs of 1800 s, 300 s, and 60 s. Graphs omitted due to space can be found in Appendix D.

Near-Full Protection from Caches During Moderate Attacks: we first consider Experiment E, a "mild" DDoS with 50% loss, with VP success over time shown in Figure 8a. In spite of a loss rate that would be crippling to TCP, nearly all VPs are successful in DNS. This success is due to two factors. First, we know that many clients are served from caches, as was shown in Experiment A with full loss (Figure 6a). Second, most recursives retry queries, so they recover from loss of a single packet and are able to provide an answer. Together, these mean that the failure rate during the first 30 minutes of the event is 8.5%, only slightly higher than the 4.8% failure rate before the DDoS. For this experiment the TTL is 1800 s (30 minutes), so we might expect failures to increase halfway through the DDoS. We do not see any increase in failures because caching and retries are synergistic: a successful retried query will place the answer in a cache for a later query. The importance of this result is that DNS can survive moderate-size attacks when caching is possible. While a positive, retries do increase latency, something we study in §5.5.

Attack Intensity Matters: while clients do quite well with 50% loss at all authoritatives, failures increase with the intensity of the attack.

Figure 8: Answers (OK, SERVFAIL, no answer) received during DDoS attacks; the first and second vertical lines show the start and end of the DDoS. (a) Experiment E (1800-50p-10min): 50% packet loss. (b) Experiment F (1800-75p-10min): 75% packet loss. (c) Experiment H (1800-90p-10min): 90% packet loss. (d) Experiment I (60-90p-10min): 90% packet loss.


Experiments F and H, shown in Figure 8b and Figure 8c, increase the loss rate to 75% and 90%. We see the number of failures increase, to about 19% with 75% loss and 40.3% with 90% loss. It is important to note that roughly 60% of the clients are still served, even with 90% loss.

We also see that this level of success is consistent over the entire hour-long DDoS event, even though the cache duration is only 30 minutes. This consistency confirms the importance of caching and retries in combination.

To verify the effects of this interaction, Experiment I changes the caching duration to 60 s, less than one round of probing. Comparing Experiment I in Figure 8d to H in Figure 8c, we see that the failure rate increases from 30% to about 63%. However, even with no caching, about 37% of queries are still answered, due to resolvers that serve stale content and recursives' retries. We investigate retries in §6.

5.5 Client Latency During Partial Authoritative Failure

We showed that client reliability is higher than expected during failures (§5.4), due to a combination of caching and retries. We next consider client latency. Latency will increase during the DDoS because of retries and queueing delay, but we will show that latency increases less than one might expect, due to caching.

To examine latency we return to Experiments D through I (Table 4), but look at latency (time to complete a query) rather than success. For these experiments, clients time out after 5 s.

Figures 9a to 9d show latency during each emulated DDoS scenario (experiments with figures omitted here are in Appendix D). Latencies are not evenly distributed, since some requests get through immediately while others must be retried one or more times, so in addition to the mean, we show the 50th, 75th, and 90th percentiles to characterize the tail of the distribution.
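A sketch of these summary statistics (assuming numpy; as in our analysis, only answered queries contribute latencies):

```python
import numpy as np

def latency_summary(rtts_ms):
    # rtts_ms: round-trip times (ms) of answered queries only;
    # timed-out queries (5 s) count as failures, not latencies.
    r = np.asarray(rtts_ms, dtype=float)
    return {"median": np.median(r), "mean": r.mean(),
            "p75": np.percentile(r, 75), "p90": np.percentile(r, 90)}
```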

We emulate DDoS by dropping requests (§5.1), and hence latencies reflect retries and loss but not queueing delay, underrepresenting latency in real-world attacks. However, their shape (some low latency and a few long) is consistent with, and helps explain, what has been seen in the past [23].

Beginning with Experiment E, the moderate attack in Figure 9a, we see no change to median latency. This result is consistent with many queries being handled by the cache, and half of those not handled by the cache getting through anyway. We do see higher latency in the 90%ile tail, reflecting successful retries. This tail also increases the mean somewhat.

This trend increases in Experiment F, in Figure 9b, where 75% of queries are lost. Now we see the 75%ile tail has increased, as has the number of unanswered queries, and the 90%ile is twice as long as in Experiment E.

We see the same latency in Experiment H, with a DDoS causing 90% loss. We set the timeouts to 5 s, so the larger attack results in more unsuccessful queries, but latency for successful queries is not much worse than with 75% loss. Median latency is still low due to cached replies.

Finally, Experiment I greatly reduces opportunities for caching by reducing the cache lifetime to one minute. Figure 9d shows that loss of caching increases median RTT and significantly increases the tail latency. Compared with Figure 9c (same packet loss ratio, but 1800 s TTL), we can clearly see the benefits of caching in terms of latency (in addition to reliability): a half-hour TTL value reduced the latency from 1300 ms to 390 ms. Longer TTLs also help reduce tail latency relative to shorter TTLs (compare, for example, the 90%ile RTT in Experiments I vs. H in Figure 9).

[Figure 9: Latency results; the shaded area indicates the interval of an ongoing DDoS attack. Panels: (a) Experiment E, 50% packet loss (1800 s TTL); (b) Experiment F, 75% packet loss (1800 s TTL); (c) Experiment H, 90% packet loss (1800 s TTL); (d) Experiment I, 90% packet loss (60 s TTL). Each panel plots median, mean, 75%ile, and 90%ile RTT (ms) against minutes after start.]

Summary: DDoS effects often increase client latency. For moderate attacks, increased latency is seen only by a few "unlucky" clients who do not see a full cache and whose queries are lost. Caching has an important role in reducing latency during DDoS, but while it can often mitigate most reliability problems, it cannot avoid latency penalties for all VPs. Even when caching is not available, roughly 40% of clients get an answer, either by serving stale or by retries, as we investigate next.

6 THE AUTHORITATIVE'S PERSPECTIVE

Results of partial DDoS events (§5.4) show that DNS is surprisingly reliable: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers (Figure 8d) even with minimal-duration caches. These results are due to a combination of caching and retries. We next examine this from the perspective of the authoritative server.

6.1 Recursive-Authoritative Traffic During a DDoS

We first ask: are retries by recursive resolvers responsible for the success rates observed in §5.4? To investigate this question, we return to the partial DDoS experiments and look at how many queries are sent to the authoritative servers. We measure queries before they are dropped by our simulated DDoS. Recursives must make multiple queries to resolve a name. We break out each type of query: for the nameserver (NS), the nameserver's IPv4 and v6 addresses (A-for-NS and AAAA-for-NS), and finally the desired query (AAAA-for-PID). Note that the authoritative is IPv4-only, so AAAA-for-NS is non-existent and subject to negative caching, while the other records exist and use regular caching.

We begin with the DDoS causing 75% loss, in Figure 10a. For this experiment we observe 18407 unique IP addresses of recursives (Rn) querying for AAAA records directly to our authoritatives. During the DDoS, queries increase by about 3.5x. We expect 4 trials, since the expected number of tries until success with loss rate p is (1 - p)^-1. For this scenario, results are cached for up to 30 minutes, so successful queries are reused in recursive caches. This increase occurs both for the target AAAA record and also for the non-existent AAAA-for-NS records. Negative caching for our zone is configured to 60 s, making caching of NXDOMAINs for AAAA-for-NS less effective than positive caches.
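The expected-tries formula is the mean of a geometric distribution, and a two-line check (added here for illustration) reproduces the numbers used in the text:

    def expected_tries(p):
        # Mean of a geometric distribution: tries until the first success
        # when each try independently fails with probability p.
        return 1 / (1 - p)

    print(expected_tries(0.75))  # 4.0 tries expected at 75% loss
    print(expected_tries(0.90))  # 10.0 tries expected at 90% loss

The observed 3.5x increase sits just below the expected 4x, presumably because answers that do get through are reused from recursive caches.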

The offered load on the server increases further with more loss (90%), as shown in Experiment H (Figure 10b). The higher loss rate results in a much higher offered load on the server: on average, 8.2x normal.

Finally, in Figure 10c, we reduce the effects of caching with a 90% DDoS and a TTL of 60 s. Here we see about 8.1x more queries at the server than before the attack. Comparing this case to Experiment H, caching reduces the offered load on the server by about 40%.

[Figure 10: Number of queries received by the authoritative servers; the shaded area indicates the interval of an ongoing DDoS attack. Panels: (a) Experiment F (1800-75p-10min), 75% packet loss; (b) Experiment H (1800-90p-10min), 90% packet loss; (c) Experiment I (60-90p-10min), 90% packet loss. Each panel plots queries (NS, A-for-NS, AAAA-for-NS, AAAA-for-PID) against minutes after start.]

Implications: The implication of this analysis is that legitimate clients "hammer" the already-stressed server with retries during a DDoS. For clients, retries are important to get reliability, and each client independently chooses to retry.

The server is already under stress due to the DDoS, so these retries add to that stress. However, the DDoS traffic is almost certainly much larger than the retries of legitimate traffic. (A server experiencing a volumetric attack causing 90% loss must be receiving 10x its capacity. Regular traffic is a small fraction of normal capacity, so even 4x regular traffic is still much less than the attack traffic.) The multiplier for retried legitimate traffic depends on the implementations of stub and recursive resolvers, as well as application-level retries and defection (users hitting reload in their browser, and later giving up). Our experiment omits application-level retries and likely gives a lower bound. We next examine specific recursive implementations to see their behavior.
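The retry multiplier can also be estimated directly. The sketch below is an illustrative model we add, not the paper's measurement code: it computes the expected number of queries a single resolver sends per resolution when it retries after each loss, up to a fixed cap.

    def expected_queries(p, max_tries):
        # Truncated geometric distribution: the resolver stops after the
        # first success or after max_tries attempts, whichever comes first.
        e = 0.0
        for k in range(1, max_tries + 1):
            if k < max_tries:
                e += k * (p ** (k - 1)) * (1 - p)   # success on try k
            else:
                e += k * (p ** (k - 1))             # still trying at the cap
        return e

    for p in (0.50, 0.75, 0.90):
        print(p, round(expected_queries(p, 7), 2))  # ~2.0, ~3.5, ~5.2

Under this simple model, a single resolver retrying up to 7 times at 90% loss sends about 5x its normal traffic; the larger 8.2x multiplier observed in Experiment H is consistent with additional amplification from multi-level recursive deployments (§6.2).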

6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers "hammer" authoritatives when queries are dropped. We reexamine DNS software (first studied in 2012 [56]), and additionally show that deployments amplify retries.

Recursive Software: Prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], with evaluation of BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS, and PowerDNS. We studied retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records of all authoritatives and the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7x more requests before the recursives time out. (Exact numbers vary in different runs, but typically each request is made 6 or 7 times.) Such retries are appropriate, provided they are paced (both use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.
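As a rough illustration of the pacing the text calls appropriate, the sketch below shows a retry loop with exponential backoff and jitter. It is a schematic we add here, not BIND's or Unbound's actual logic; send_query is a hypothetical function that waits up to timeout seconds for an answer.

    import random

    def resolve_with_backoff(send_query, max_tries=7, base_timeout=0.8):
        timeout = base_timeout
        for _ in range(max_tries):
            if send_query(timeout):                   # answered within the timeout
                return True
            timeout *= 2 * random.uniform(0.9, 1.1)   # back off exponentially, with jitter
        return False                                  # give up: SERVFAIL to the client

Pacing matters because unpaced retries from millions of clients would add fuel to the DDoS fire, while exponentially backed-off retries bound each client's contribution.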

Recursive Deployment: Another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify amplification, we count both the number of Rn recursives and the number of AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values represent the amplification in two ways: during stress, more Rn recursives will be used for each probe ID, and these Rn will generate more queries to the already-stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4). The maximum rises to 39. The number of queries for each probe ID grows more than 3x, from 2 to 7. Worse, the 90%ile grows more than 6x (3 queries to 18). The maximum grows 53.5x, reaching up to 286 queries for one single probe ID. This value, however, is a lower bound, given that there are a large number of A and AAAA queries that ask for NS records and not for the probe ID (AAAA-for-NS and A-for-NS in Figure 10).

We can also look at the aggregate effects of retries created by the complex recursive infrastructure. Figure 12 shows the timeseries of unique IP addresses of Rn observed at the authoritatives. Before the DDoS period, for Experiment I with its TTL of 60 s, we see a constant number of recursives reaching our authoritatives, i.e., all queries should be answered by authoritatives (no caching at this TTL value). For Experiments F and H, both with TTL of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks are observed when caches expire, as expected.

During the DDoS we observe similar behavior for all three experiments in Figure 12: as packets are dropped at the authoritatives (at rates of 75%, 90%, and 90% for F, H, and I, respectively), we see an increase in the number of Rn recursives querying our authoritatives. For Experiments F and H we see drops when caching is expected, but not for Experiment I. The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which is amplified in the end. (We show this behavior for an individual probe in Appendix F, where we observe the growth in the number of queries received at the authoritatives and the number of recursives used.)

[Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs; median, 90%ile, and max of Rn-per-PID and AAAA-for-PID against minutes after start.]

[Figure 12: Unique Rn recursive addresses observed at the authoritatives over time, for Experiments F, H, and I.]

Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to be within a constant factor of single recursives, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and their operators to share information about their behavior.

7 RELATED WORK

Caching by Recursives: Several groups have shown that DNS caching can be imperfect. Hao and Wang analyzed the impact of nonce domains on DNS recursives' caches [13]. Using two weeks of data from two universities, they showed that filtering one-time domains improves cache hit rates. In two studies, Pang et al. [31, 32] reported that web clients and local recursives do not always honor TTL values provided by authoritatives. Almeida et al. [2] analyzed DNS traces of a mobile operator, and used a mobile application to see TTLs in practice. They find that most domains have short TTLs (less than 60 s), and report evidence of TTL manipulation by recursives. Schomp et al. [48] demonstrate widespread use of multi-level recursives by large operators, as well as TTL manipulation. Our work builds on this prior work, examining caching and TTL manipulation systematically and considering its effects on resilience.

DNS Client Behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, and found that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives During DDoS: We investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast and both latency and reachability, as seen from RIPE Atlas. Rather than look at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while the prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study has incomplete data and focuses on specific results of two events. These differences stem from their study of natural experiments from real-world events and our controlled experiments.

8 IMPLICATIONS

We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get to authoritative servers. The key implication of our study is to explain differences in the outcome of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative "server" implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: "There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability..." [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure, and immediately took steps to address service problems, there were reports of user-visible service disruption in the technical and even popular press [34]. Reports describe intermittent failure of prominent websites including "Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times", each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cachable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports 922 anycast instances operating across the 13 authoritative servers. Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 "facilities") and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components with TTLs that range from 120 to 300 s.

In addition to explaining the effects, our experiments help get to the root causes behind these outcomes. Users of the Root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs like com and country codes) are popular and certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) receive performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.
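A simplified model (our own back-of-envelope addition, ignoring retries and cache refreshes) makes the TTL difference concrete: if each client's cache was filled at a uniformly random time within one TTL before the attack begins, the fraction of clients whose cached record outlives the attack is at most 1 - duration/TTL.

    def cache_survival_fraction(ttl_s, attack_s):
        # Fraction of caches, filled at a uniformly random point in the
        # TTL window, that still hold the record when the attack ends.
        return max(0.0, 1.0 - attack_s / ttl_s)

    print(cache_survival_fraction(86400, 3600))  # 1-day, root-like TTL: ~0.96
    print(cache_survival_fraction(300, 3600))    # 300 s, CDN-like TTL: 0.0

With day-long TTLs, almost every warm cache rides out an hour-long attack; with 300 s TTLs, none do, leaving clients dependent on retries getting through.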

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to allow caching to help and to give DNS operators time to deploy defenses (Experiment H suggests 30 minutes; Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow for re-provisioning during DDoS attacks on web servers, but that leaves DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, with long DNS TTLs, may be preferred to short DNS TTLs, so that both layers can be robust. However, the complexity of interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize geographic dispersion of sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with the largest latency [27]. Prior work about resilience to DDoS attack has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic that site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be as resilient as the strongest individual authoritative. The reason for these opposite results is that in both cases recursive resolvers will try all authoritatives of a given service. For latency, they will sometimes choose a distant authoritative, but for resilience, they will continue until they find the most available authoritative.
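This "strongest authoritative" effect follows from recursives persisting across all of a service's nameservers. A tiny calculation (an illustration we add, assuming independent per-nameserver success probabilities) shows how one healthy authoritative dominates service availability:

    def service_availability(per_ns_success):
        # A resolution succeeds if any authoritative answers, since
        # recursives keep trying every nameserver of the service.
        p_all_fail = 1.0
        for p in per_ns_success:
            p_all_fail *= (1.0 - p)
        return 1.0 - p_all_fail

    # One badly hit NS (10% success) plus one healthy NS (95% success):
    print(service_availability([0.10, 0.95]))  # 0.955: near the strongest NS

By contrast, a recursive that queries all authoritatives for diversity will sometimes pay the slowest server's latency, which is why a single weak authoritative can hurt latency even while a single strong one preserves availability.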

9 CONCLUSIONS

This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole. In fact, they can largely cover over partial DDoS attacks for many users: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers (Figure 8d) even with minimal-duration caches.

The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users, with a few seeing much greater latency while others see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8x) during service interruption, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of our work is to quantify the importance of caching and retries in recursives to resilience, encouraging use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on paper drafts.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements number FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES
[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28-34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093-1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012 Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from Using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35-42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188-200.
[9] DNS OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181 Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36-42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary Of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis/
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC BIND. 2018. Chapter 6: BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report | GitHub Engineering. https://githubengineering.com/ddos-incident-report/
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency-02. Internet Draft. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain names - concepts and facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 2065, 2181, 2308, 2535, 4033, 4034, 4035, 4343, 4035, 4592, 5936, 8020.
[21] P.V. Mockapetris. 1987. Domain names - implementation and specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604, 7766.
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS". (May 2018). Web page. https://ant.isi.edu/datasets/dns/Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489-495. https://doi.org/10.1145/3131365.3131366
[28] NL Netlabs. 2018. NL Netlabs Documentation - Unbound - unbound.conf.5. https://nlnetlabs.nl/documentation/unbound/unbound.conf
[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide/
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An Overview of DNS-based Server Selections in Content Distribution Networks. Computer Networks 43, 6 (2003), 695-711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21-26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1-14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows
[36] Alec Peterson. 2017. EC2 Resolver Changing TTL on DNS Answers? Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy in a Few Easy Steps. https://quad9.net/
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671, TTL1800: 10507676, TTL3600: 10536725, TTL86400: 10579327, TTL3600-10min: 10581463, A: 10859822, B: 11102436, C: 11221270, D: 11804500, E: 11831403, F: 11831403, G: 12131707, H: 12177478, I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2-26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas - Raw Data Structure Documentation. https://atlas.ripe.net/docs/data_struct
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters: An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. Internet Request For Comments. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188-200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On Measuring the Client-side DNS Infrastructure. In Proceedings of the 2013 ACM Internet Measurement Conference. ACM, 77-90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl Stats and Data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A Study of Caching Behavior with Respect to Root Server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang Up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] M. Weinberg and D. Wessels. 2016. Review and Analysis of Attack Traffic Against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24, Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A High-performance Network Traffic Data Streaming Warehouse. In Network Operations and Management Symposium (NOMS), 2016 IEEE/IFIP. IEEE, 913-918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80-86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS?

Cache estimations in §3 and §4 assume each record has a clear TTL. However, glue records bridge zones that are in different authoritative servers, and can result in two different TTL values for a record. This section examines which TTL takes precedence.

A.1 The Problem of Glue Records

DNS employs glue records to identify name servers that are in the served zone itself. For example, the Canadian national domain ca is served from ca-servers.ca, but to look up ca-servers.ca one must know where ca is. To break this cycle, records to identify these servers ("glue records") are placed in the parent zone (the root in this case). With these records in two places, they may have different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net

;; AUTHORITY SECTION:
ca.    172800  IN  NS  any.ca-servers.ca.
ca.    172800  IN  NS  j.ca-servers.ca.
ca.    172800  IN  NS  x.ca-servers.ca.
ca.    172800  IN  NS  c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net.

Notice that the answer for the NS records is in the "AUTHORITY SECTION": this signals that this answer is known as a "referral", a type of answer in which "a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query" [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for ca, a recursive can ask the same query directly to any of the ca servers. Listing 2 shows the results:

$ dig ns ca @any.ca-servers.ca

;; ANSWER SECTION:
ca.    86400  IN  NS  c.ca-servers.ca.
ca.    86400  IN  NS  j.ca-servers.ca.
ca.    86400  IN  NS  x.ca-servers.ca.
ca.    86400  IN  NS  any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca.

Notice that the records in this response from any.ca-servers.ca are presented in the "ANSWER SECTION" (and not in the "AUTHORITY SECTION"), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTLs are different: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).

We refer to this as TTL duplication: the same record is defined both at the authoritative servers and at its respective parent zone's servers, as a glue record.

This duplication poses a challenge for recursive resolvers: which value should they trust? The one from the glue, or the one from the authoritative name server?

In theory, the answer from the authoritative should be the one used, since those are the servers that are authoritative for the zone; the Roots in this example are not "completely" authoritative [15]. This behavior can be observed in various other zones as well, not only at the Roots.

Assuming records like those for ca in this section, we investigate in the wild which ones resolvers actually use: the values provided by the parent, or by the authoritative. According to RFC 2181 [10] (§5.4.1), a recursive should use the authoritative answer.

Given the importance of TTLs to DNS resilience (cache duration), we investigate in this section which values recursives trust: the parent's or the authoritative name server's.


                   NS record   A record   Source
Total Answers        128382      98038
TTL > 3600               38         46    Unclear
TTL = 3600              260        233    Parent
60 < TTL < 3600        6890       4621    Parent/others
TTL = 60              60803      46820    Authoritative
TTL < 60              60391      46318    Authoritative

Table 5: Distribution of TTLs of DNS answers for cachetest.nl.

A records of amazon.com:
domain       IP               TTL
amazon.com   205.251.242.103  60
amazon.com   176.32.103.205   60
amazon.com   176.32.98.166    60

NS records of amazon.com:
domain       NS                    TTL-auth  TTL-com
amazon.com   ns1.p31.dynect.net    3600      172800
amazon.com   pdns6.ultradns.co.uk  3600      172800
amazon.com   ns2.p31.dynect.net    3600      172800
amazon.com   ns4.p31.dynect.net    3600      172800
amazon.com   pdns1.ultradns.net    3600      172800
amazon.com   ns3.p31.dynect.net    3600      172800

A records of the NS records:
domain                 IP             TTL
ns1.p31.dynect.net     208.78.70.31   86400
pdns6.ultradns.co.uk   204.74.115.1   3600
ns2.p31.dynect.net     204.13.250.31  86400
ns4.p31.dynect.net     204.13.251.31  86400
pdns1.ultradns.net     204.74.108.1   3600
ns3.p31.dynect.net     208.78.71.31   86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments

We use the same setup as in §3, and configure the TTL values of our name servers (ns[1,2].cachetest.nl) as follows:

• Referral value (at the nl zone): TTL of 3600 s.
• Answer value (at the authoritative servers, ns[1,2].cachetest.nl): TTL of 60 s.

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of the TTL values of the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the one provided by the parent nl server. As such, we can conclude from these two experiments that, for A and NS records, most recursives forward to their clients the TTL provided by the authoritatives of a zone.
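The classification behind Table 5 can be stated compactly. The helper below is our own restatement of the table's row logic (with the experiment's two configured TTLs as defaults), mapping an observed answer TTL to its likely source:

    def ttl_source(ttl, auth_ttl=60, parent_ttl=3600):
        # Mirrors Table 5: answers at or below the authoritative's TTL were
        # set (or decremented) from the authoritative; answers at the parent's
        # TTL, or decremented from it, came from the referral/glue.
        if ttl <= auth_ttl:
            return "authoritative"
        if ttl == parent_ttl:
            return "parent"
        if auth_ttl < ttl < parent_ttl:
            return "parent/others (decremented in a cache)"
        return "unclear"

    print(ttl_source(60), ttl_source(3600), ttl_source(1800), ttl_source(7200))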

A.3 Looking at a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives serve their clients with the TTL provided by the authoritative, and not by referrals.

In this section we look into the caching of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives, and 172800 s at the parent com server, as can be seen in Table 6.

To determine which specific record two popular recursive software packages use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send only one query: dig ns amazon.com. We then dump the contents of the caches of these servers and see which values they store.

Listing 3 and Listing 4 show the contents of the caches of both BIND and Unbound after sending the aforementioned query. As can be seen, the values stored in the cache are the ones provided by the authoritative server (decremented by a few seconds by the time we dumped the contents), and not the referral values provided by the parent's authoritative.

; authanswer
amazon.com.  3595  NS  ns1.p31.dynect.net.
amazon.com.  3595  NS  ns2.p31.dynect.net.
amazon.com.  3595  NS  ns3.p31.dynect.net.
amazon.com.  3595  NS  ns4.p31.dynect.net.
amazon.com.  3595  NS  pdns1.ultradns.net.
amazon.com.  3595  NS  pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com.

;rrset 3594 6 0 7 3
amazon.com.  3594  IN  NS  pdns1.ultradns.net.
amazon.com.  3594  IN  NS  ns1.p31.dynect.net.
amazon.com.  3594  IN  NS  ns3.p31.dynect.net.
amazon.com.  3594  IN  NS  pdns6.ultradns.co.uk.
amazon.com.  3594  IN  NS  ns4.p31.dynect.net.
amazon.com.  3594  IN  NS  ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com.

A.4 Discussion and Implications

Understanding how glue records affect TTLs is important to show that zone operators are in fact in charge of how long their NS records remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers: queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of the 60 s TTL, where all queries go to the authoritative (AA), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time. Panels: (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s (1 hour), probing interval T = 10 min.]

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72 Alternate DNS
23.253.163.53 Alternate DNS
205.204.88.60 BlockAid Public DNS (or PeerDNS)
178.21.23.150 BlockAid Public DNS (or PeerDNS)
91.239.100.100 Censurfridns
89.233.43.71 Censurfridns


2001:67c:28a4:: Censurfridns
2002:d596:2a92:1:71:53:: Censurfridns
213.73.91.35 Chaos Computer Club Berlin
209.59.210.167 Christoph Hochstätter
85.214.117.11 Christoph Hochstätter

212.82.225.7 ClaraNet
212.82.226.212 ClaraNet
8.26.56.26 Comodo Secure DNS
8.20.247.20 Comodo Secure DNS
84.200.69.80 DNS.Watch
84.200.70.40 DNS.Watch
2001:1608:10:25::1c04:b12f DNS.Watch
2001:1608:10:25::9249:d69b DNS.Watch
104.236.210.29 DNSReactor
45.55.155.25 DNSReactor
216.146.35.35 Dyn
216.146.36.36 Dyn
80.67.169.12 FDN
2001:910:800::12 FDN
85.214.73.63 FoeBud
87.118.111.215 FoolDNS
213.187.11.62 FoolDNS
37.235.1.174 FreeDNS
37.235.1.177 FreeDNS
80.80.80.80 Freenom World
80.80.81.81 Freenom World
87.118.100.175 German Privacy Foundation e.V.
94.75.228.29 German Privacy Foundation e.V.
85.25.251.254 German Privacy Foundation e.V.
62.141.58.13 German Privacy Foundation e.V.
8.8.8.8 Google Public DNS
8.8.4.4 Google Public DNS
2001:4860:4860::8888 Google Public DNS
2001:4860:4860::8844 Google Public DNS
81.218.119.11 GreenTeamDNS
209.88.198.133 GreenTeamDNS
74.82.42.42 Hurricane Electric
2001:470:20::2 Hurricane Electric
209.244.0.3 Level3
209.244.0.4 Level3
156.154.70.1 Neustar DNS Advantage
156.154.71.1 Neustar DNS Advantage
5.45.96.220 New Nations
185.82.22.133 New Nations
198.153.192.1 Norton DNS
198.153.194.1 Norton DNS
208.67.222.222 OpenDNS
208.67.220.220 OpenDNS
2620:0:ccc::2 OpenDNS
2620:0:ccd::2 OpenDNS
58.6.115.42 OpenNIC
58.6.115.43 OpenNIC
119.31.230.42 OpenNIC
200.252.98.162 OpenNIC
217.79.186.148 OpenNIC
81.89.98.6 OpenNIC
78.159.101.37 OpenNIC
203.167.220.153 OpenNIC
82.229.244.191 OpenNIC
216.87.84.211 OpenNIC
66.244.95.20 OpenNIC
207.192.69.155 OpenNIC
72.14.189.120 OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9 OpenNIC
2001:470:1f07:38b::1 OpenNIC
2001:470:1f10:c6::2001 OpenNIC
194.145.226.26 PowerNS



77.220.232.44 PowerNS
9.9.9.9 Quad9
2620:fe::fe Quad9
195.46.39.39 SafeDNS
195.46.39.40 SafeDNS
193.58.251.251 SkyDNS
208.76.50.50 SmartViper Public DNS
208.76.51.51 SmartViper Public DNS
78.46.89.147 ValiDOM
88.198.75.145 ValiDOM
64.6.64.6 Verisign
64.6.65.6 Verisign
2620:74:1b::1:1 Verisign
2620:74:1c::2:2 Verisign
77.109.148.136 Xiala.net
77.109.148.137 Xiala.net
2001:1620:2078:136:: Xiala.net
2001:1620:2078:137:: Xiala.net
77.88.8.88 Yandex.DNS
77.88.8.2 Yandex.DNS
2a02:6b8::feed:bad Yandex.DNS
2a02:6b8:0:1::feed:bad Yandex.DNS
109.69.8.51 puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for Experiments D and G.

Figure 14 shows the number of answers for Experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when one NS fails (with 50% loss), there is no significant change in the number of answered queries. Moreover, most users will not notice any difference in terms of latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (thus half of the probing interval). Similarly to Experiment I, this experiment should not have caching active, given the shorter TTL. Moreover, it has a packet loss rate of 75%. At this rate, we can see that the far majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

[Figure 14: Answers received during DDoS attacks (extra graphs): (a) Experiment D (1800-50p-10min-ns1); (b) Experiment G (300-75p-10min-2NSes); arrows indicate DDoS start and recovery; categories: OK, SERVFAIL, No answer.]

[Figure 15: Latency results (extra graphs); the shaded area indicates the interval of an ongoing DDoS attack: (a) Experiment D, 50% packet loss on one NS only (1800 s TTL); (b) Experiment G, 75% packet loss (300 s TTL); lines: median, mean, 75%ile, and 90%ile RTT.]

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur in current DNS software. These experiments confirm that prior work by Yu et al. [56] still holds. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) up, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from net and querying one or both based on their internal algorithms. All queries are made with a cold cache. In total,



we send 100 queries per experiment and present the results in this section.

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the net authoritative servers, and the authoritative servers for cachetest.net.

[Figure 16: Histogram of queries from recursive resolvers (BIND and Unbound, normal and under DDoS), broken down by target: the root, net, and cachetest.net.]

Figure 16 shows the results of this experiment, with normal behavior in the left two groups and the inaccessible (DDoS) attempts in the right two. From this figure we see that, in normal operation, a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the Root, one to net, and one to the cachetest.net authoritative servers) to resolve the AAAA records of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist (this difference can thus be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure. Both BIND and Unbound send far more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4x increase), while Unbound sends 46 queries in total (a 14.6x increase). Unbound keeps on trying to resolve the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 in total out of 46). If we disregard these queries, Unbound would then increase the total number of queries from 6 to 16, a 6.4x increase. (We repeated this experiment 100 times, and all results were consistent.)

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND will ask both parents (the Roots and net) for its records again (shown by the increase in the root and net portions of BIND DDoS vs. BIND non-DDoS in Figure 16). Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7x more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect: recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas Probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last layer" recursives (the ones that ultimately send queries to authoritatives: Rn1-Rn8), as can be seen in Figure 17.

[Figure 17: Probe 28477 on dataset I: the Atlas probe's stub resolver queries three first-layer recursives (r1a, r1b, r1c); all r1's are connected to all last-layer recursives (rn1-rn8), which are connected to both authoritative servers (at1, at2).]

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (RIPE Atlas probes) and the server side (authoritative servers). We show a summary of the results in Table 7. In this table, we highlight the probing intervals T in which the simulated DDoS was active (T = 7-12; dataset I, 90% packet drop on both authoritative servers; see Table 4).

Client side: Analyzing this table, we can learn the following from the client side.

During normal operations, we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1 used in the answers). By default, each RIPE Atlas probe sends a DNS query to each of its R1 recursives.

During the DDoS this changes slightly: we see 2 answers from 2 R1s; the other either times out or returns SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval).

This is, in fact, a good result given the simulated DDoS.

Authoritative side: Now let's consider the authoritative side.


     Client View (Atlas probe)    Authoritative Server View
T    Queries Answers #R1          Queries Answers #ATs #Rn #(Rn-AT) Max Que(Rn-AT) (% total)  Queries (top-2 Rn's)
1    3       3       3            3       3       2    2   3        1 (33.3%)                 2, 1
2    3       3       3            5       5       2    4   5        1 (20.0%)                 2, 1
3    3       3       3            6       6       2    6   6        1 (16.7%)                 1, 1
4    3       3       3            5       5       2    2   3        2 (40.0%)                 4, 1
5    3       3       3            3       3       2    3   3        1 (33.3%)                 1, 1
6    3       3       3            6       6       2    5   5        2 (33.3%)                 2, 1
7    3       2       2            29      3       2    2   4        10 (34.5%)                15, 14
8    3       2       2            22      6       2    7   9        7 (31.8%)                 10, 7
9    3       2       2            21      1       2    3   6        9 (42.9%)                 16, 3
10   3       2       2            11      4       2    3   4        6 (54.5%)                 8, 2
11   3       2       2            21      3       2    4   6        15 (71.4%)                17, 2
12   3       2       2            23      1       2    8   8        15 (65.2%)                15, 2
13   3       3       3            3       3       2    3   3        1 (33.3%)                 1, 1
14   3       3       3            4       2       2    4   4        1 (25.0%)                 1, 1
15   3       3       3            8       8       2    7   7        7 (87.5%)                 1, 1
16   3       3       3            9       9       2    3   3        6 (66.7%)                 6, 3
17   3       3       3            9       9       2    5   5        5 (55.6%)                 5, 1

Table 7: Client and authoritative views of AAAA queries (probe 28477, measurement I); rows T = 7-12 cover the simulated DDoS.

In normal operation (before the DDoS), we see that the 3 queries sent by the clients (client view) lead to 3 to 6 queries being sent by the Rn, which reach the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn used before the attack ranges from 2 to 6 (#Rn). This choice depends on how the R1* choose their upper-layer recursives (Rn-1), and ultimately on how Rn-1 chooses Rn.

We can also observe how each Rn chooses each authoritative (#(Rn-AT)): this metric shows the number of unique combinations of Rn and authoritatives observed. For example, for T = 1, we see that three queries reach the authoritatives, but only 2 Rn were used. By looking at the number of unique recursive-AT combinations, we see that there are three. As a consequence, one of the Rn must have used both authoritatives.

During the DDoS, however, things change: we see that the 3 queries still sent by probe 28477 (client side) now turn into 11-29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) An increase in the number of Rn used: we see that the number of Rn used increases a little (from 3.6 on average before the DDoS to 4.5 during it). (However, not all Rn are used all the time during the DDoS. We do not have enough data to say why this happens; it is surely a function of how the lower-layer recursives choose these ones, or how they are set up.)

(2) An increase in the number of authoritatives used by each Rn: each Rn, in turn, contacts an average of 1.13 authoritatives before the attack (the average of #(Rn-AT)/#Rn). During the DDoS, this number increases to 1.37.

(3) Retrials by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (the last column of Table 7), and see that, compared to the total number of queries, they comprise the far majority.

For this probe, the answer to why there are so many queries is a combination of factors, with retrials by the Rn being the most important one. From the client side, there is no strong indication that the client's queries are leading to such an increase in the number of queries to the authoritatives during a DDoS.


that these factors "help", but knowing how and how much each contributes builds confidence in defenses. We consider this question both as an operator of an authoritative server, and as a user, defining the DNS user experience: the latency and reliability users should expect.

Our first contribution is to build an end-to-end understanding of DNS caching. Our key result is that caching often behaves as expected, but about 30% of the time clients do not benefit from caching. While prior work has shown that the DNS resolution infrastructure can be quite complex [48], we establish a baseline DNS user experience by assessing the prevalence of DNS caching in the "wild", through both active measurements (§3) and through analysis of passive data from two DNS zones (nl and the root zone, §4).

Our second contribution is to show that the DNS mechanisms of caching and retries provide significant resilience in the client user experience during denial-of-service (DDoS) attacks (§5). For example, about half of the clients continue to receive service during a full outage if caches are filled and do not expire during the attack. Often DDoS attacks cause very high loss, but not a complete outage. When a few queries succeed, caches amplify their benefits, even for attacks that are longer than the cache lifetime. With very heavy query loss (90%) on all authoritatives, full caches protect half of the clients, and retries protect 30%. With a DDoS that causes 50% packet loss, nearly all clients succeed, although with greater latency than typical.

Third, we show that there is a large increase in legitimate traffic during DDoS attacks: up to 8× the number of queries (§6). While DNS servers are typically heavily overprovisioned, this result suggests the need to review by how much. It also shows the importance of stub and recursive resolvers following best practices and exponentially backing off queries after failure, so as to not add fuel to the DDoS fire.

Our final contribution is to suggest why users have seen relatively little impact from root server DDoSes, while customers of some DNS providers quickly felt attacks (§8). When cache lifetimes are longer than the duration of a DDoS attack, many clients will see service for names popular enough to be cached. While many websites use short cache timeouts to support control with DNS-based load balancing, they may wish to consider longer timeouts as part of strategies for DDoS defense. Retries provide additional coverage, preventing failures during large attacks.

This technical report [25] follows the conference paper at ACM IMC 2018 [26], adding additional information in several appendices to fill in details about some experiments and to add additional scenarios that supplement the paper. The technical report was first released in May 2018 and was updated in September 2018 to include input from the IMC reviewers and shepherd.

All public datasets from this paper are available [24], with our RIPE Atlas data also available from RIPE [38]. Privacy concerns prevent release of .nl and Root data (§4).

2 BACKGROUND

As background, we briefly review the components of the DNS ecosystem and how they interact with IP anycast.

2.1 DNS Resolvers: Stubs, Recursives, and Authoritatives

Figure 1 shows the relationship between the three components of DNS resolution: stub and recursive resolvers, and authoritative servers.

[Figure 1: Relationship between stub resolvers (yellow; e.g., OS/applications), recursive resolvers (red) with their caches (blue), and authoritative servers (green; e.g., ns1.example.nl). Stubs query first-level recursives (R1a, R1b; e.g., a modem), which forward to nth-level recursives (Rna, Rnb; e.g., an ISP resolver) with caches (CR1a...CRnb), which query the authoritatives (AT1...ATn).]

Authoritative servers (authoritatives hereafter) are servers that know the contents of a given DNS zone and can answer queries without asking other servers [11].

Resolvers, on the other hand, are servers that ask queries of other servers on behalf of others [20]. Stub resolvers run directly on clients and query one or a few recursive resolvers (shortened to stubs and recursives here). Recursives perform the full resolution of a domain name, querying one or more authoritatives, while caching responses to avoid repeatedly requesting popular domains (e.g., .com or google.com). Sometimes recursives operate in multiple tiers, with clients talking directly to R1 resolvers, which forward queries to other Rn resolvers, which ultimately contact authoritatives.

In practice, stubs are part of the client OS or browser, recursives are provided by ISPs, and authoritatives are run by DNS providers or large organizations. Multi-level recursives might have R1 at a home router and Rn in the ISP, or might occur in large public DNS providers.

2.2 Authoritative Replication and IP Anycast

Replication of a DNS service is important to support high reliability and capacity and to reduce latency. DNS has two complementary mechanisms to replicate service. First, the protocol itself supports nameserver replication of DNS service for a zone (.nl or example.nl), where multiple servers operate on different IP addresses, listed by that zone's NS records. Second, each of these servers can run from multiple physical locations with IP anycast, by announcing the same IP address from each and allowing Internet routing (BGP) to associate clients with each anycast site. Nameserver replication is recommended for all zones, and IP anycast is used by most large zones, such as the DNS Root and most top-level domains [23, 43]. IP anycast is also widely used by public resolvers, recursive resolvers that are open for use by anyone on the Internet, such as Google Public DNS [12], OpenDNS [29], Quad9 [37], and 1.1.1.1 [1].

Figure 2 illustrates the relationship of elements in the DNS infrastructure when anycast is in place. The recursive resolver R is announced using an anycast prefix from five different anycast sites (R1 to R5). Each anycast site can have multiple servers, with DNS traffic designated to them by load-balancing algorithms.

When the stub resolver sends a DNS query to the anycast address of R, BGP forwards the query to the nearest site (R3 in this


[Figure 2: Stub resolver and anycast recursive: the stub queries recursive R, announced via anycast at five sites (R1–R5); the chosen site contacts the authoritative AT via unicast.]

example) of R; note that "nearest" metrics can vary [8]. The BGP mapping between source and anycast destination is known as the anycast catchment, which under normal conditions is very stable across the Internet [53]. On receiving the query, R3 contacts the pertinent authoritative AT to resolve the queried domain name. The communication from R3 to AT is done using R3's unicast address, ensuring the reply from AT is sent back to R3 instead of to any other anycast site of R. On receiving the answer from AT, R3 replies to the stub resolver with the answer to its query, and for this reply R3 uses its anycast address as source.

2.3 DNS Caching with Time-to-Live (TTLs)

DNS depends on caching to reduce latency to users and load on servers. Authoritatives provide responses that are then cached in applications, stub resolvers, and recursive resolvers. We next describe DNS's loose consistency model.

An authoritative server defines the lifetime of each result by its Time-to-Live (TTL); although TTLs are not usually exposed to users, this information is propagated through recursive resolvers.

Once cached by recursive resolvers, results cannot be removed; they can only be refreshed, by a new query and response, after the TTL expires.

Some recursive resolvers discard long-lived cache entries after a configurable timeout: BIND defaults to dropping entries after 1 week [17], and Unbound after 1 day [28].

Operators select TTLs carefully. Content delivery networks (CDNs) often use DNS to steer users to different content servers. They select very short TTLs (60 seconds or less) to force clients to re-query frequently, providing opportunities to redirect clients with DNS in response to changes in load or server availability [30]. Alternatively, DNS data for top-level domains often has TTLs of hours or days. Such long TTLs reduce latency for clients (the reply can be reused immediately if it is in the cache of a recursive resolver) and reduce load on servers for commonly used top-level domains and slowly changing DNSSEC information.

3 DNS CACHING IN CONTROLLED EXPERIMENTS

To understand the role of caching at recursive resolvers in protection during failure of authoritative servers, we first must understand how often cache lifetimes (TTLs) are honored.

In the best-case scenario, authoritative DNS operators may expect clients to be able to reach domains under their zones even if their authoritative servers are unreachable, for as long as cached values in the recursives remain "valid" (i.e., the TTL has not expired). Given the large variety of recursive implementations, we pose the following question from a user point-of-view: can we rely on recursives' caching when authoritatives fail?

To understand cache lifetimes in practice, we carry out controlled measurements from thousands of clients. These measurements determine how well caches work in the field, complementing our understanding of how open-source implementations work from their source code. This study is important because operational software can vary, and large deployments often use heavily customized or closed-source implementations [48].

3.1 Potential Impediments to Caching

Although DNS records should logically be cached for the full TTL, a number of factors can shorten cache lifetimes in practice: caches are of limited size, caches may be flushed prematurely, and large resolvers may have fragmented caches. We briefly describe these factors here; understanding how often they occur motivates the measurements we carry out.

Caches are of limited size. Unbound, for example, defaults to a 4 MB limit, but the values are configurable. In practice, DNS results are small enough and caches large enough that cache sizes are usually not a limiting factor. Recursive resolvers may also override record TTLs, imposing either a minimum or maximum value [52].

Caches can be flushed explicitly (at the request of the cache operator) or accidentally, on restart of the software or reboot of the machine running the cache.

Finally, some recursive resolvers handle very high request rates; consider a major ISP or public resolver [12, 29, 37]. Large recursive resolvers are often implemented as many separate recursives behind a load balancer or on IP anycast. In such cases, the caches may be fragmented, with each machine operating an independent cache, or they may share a cache of common names. In practice, these designs may reduce the cache hit rate.

3.2 Measurement Design

To evaluate caching, we use controlled experiments, querying specific names served by authoritative servers we run, from thousands of RIPE Atlas sites. Our goal is to measure whether the TTL we define for the RRs of our controlled domain is honored across recursives.

Authoritative Servers: We deploy two authoritatives that answer for our new domain name (cachetest.nl). We place the authoritatives on virtual machines in the same datacenter (Amazon EC2 in Frankfurt, Germany), each at a distinct unicast IPv4 address. Each authoritative runs BIND 9.10.3. Since both authoritatives are in the same datacenter, they will have similar latencies to recursives, so we expect recursives to evenly distribute queries between both authoritative servers [27].

Vantage Points: We issue queries to our controlled domain from around 9k RIPE Atlas probes [39]. Atlas probes are distributed across 3.3k ASes, with about one third hosting multiple vantage points (VPs). Atlas software causes each probe to issue queries to each of its local recursive resolvers, so our VPs are the tuple of probe and recursive. The result is that we have more than 15k VPs (Table 1).

Queries and Caching: We take several steps to ensure that caching does not interfere with queries. First, each query is for a


TTL             60     1800   3600   86400  3600-10min
Probes          9173   9216   8971   9150   9189
Probes (val)    8725   8788   8549   8750   8772
Probes (disc)   448    428    422    400    417
VPs             15330  15447  15052  15345  15397
Queries         94856  96095  93723  95780  191931
Answers         90525  91795  89470  91495  183388
Answers (val)   90079  91461  89150  91172  182731
Answers (disc)  446    334    323    323    657

Table 1: Caching baseline experiments [38].

name unique to the probe: each probe requests an AAAA record for probeid.cachetest.nl, where probeid is the probe's unique identifier. Each reply is also customized. In the AAAA reply we encode three fields that are used to determine the effectiveness of caching (§3.4). Each IPv6 address in the answer is the concatenation of four values (in hex):

prefix: a fixed 64-bit value (fd0f:3897:faf7:a375).

serial: an 8-bit value, incremented every 10 minutes (zone file rotation), allowing us to associate replies with specific query rounds.

probeid: the unique Atlas probeID [40], encoded in 32 bits, to associate the query with the reply.

ttl: a 16-bit value of the TTL we configure per experiment.

We increment the serial number in each AAAA record and reload the zone (with a new zone serial number) every 10 minutes. The serial number in each reply allows us to distinguish cached results from prior rounds from fresh data in this round.

Atlas DNS queries time out after 5 seconds, reporting "no answer". We will see this occur in our emulated DDoS events.

For example, a VP with probeID 1414 sends a DNS query for the AAAA record of the domain name 1414.cachetest.nl, and should receive an AAAA answer of the form $PREFIX:1:586:3c, where $SERIAL=1, 586 is the probeID in hex, and 3c is the hex encoding of $TTL=60.
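To make the encoding concrete, below is a small sketch of packing and unpacking these fields in Python. The field order and widths follow the list above, but the exact bit positions within the address (and hence the precise textual form) are our assumption, not a specification from the measurement software.

    import ipaddress

    PREFIX = 0xfd0f3897faf7a375  # the fixed 64-bit prefix (fd0f:3897:faf7:a375)

    def encode(serial: int, probeid: int, ttl: int) -> str:
        # Pack serial (8 bits), probeid (32 bits), and ttl (16 bits)
        # into the low 64 bits of the AAAA answer; spare bits stay zero.
        low = (serial << 48) | (probeid << 16) | ttl
        return str(ipaddress.IPv6Address((PREFIX << 64) | low))

    def decode(addr: str) -> tuple[int, int, int]:
        # Recover (serial, probeid, ttl) from an AAAA answer.
        low = int(ipaddress.IPv6Address(addr)) & ((1 << 64) - 1)
        return (low >> 48) & 0xFF, (low >> 16) & 0xFFFFFFFF, low & 0xFFFF

    print(encode(1, 1414, 60))                        # fd0f:3897:faf7:a375:1:0:586:3c
    print(decode("fd0f:3897:faf7:a375:1:0:586:3c"))   # (1, 1414, 60)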

We focus on DNS over UDP on IPv4, not TCP or IPv6. We use only IPv4 queries from Atlas probes and serve only IPv4 authoritatives, but IPv6 may be used inside multi-level recursives. Our work could extend to cover other protocols, but we did not want to complicate the analysis with the orthogonal issue of protocol selection. We focus on DNS over UDP because it is by far the dominant transport protocol today (more than 97% of connections for .nl [50] and most Root DNS servers [16]).

Query Load: The query rate of our experiments is designed to explicitly test how queries intersect with TTL expiration, not to reproduce real-world traffic rates. Popular domains such as .com will be queried much more frequently than our query rates, so our results represent lower bounds on caching. In §4 we examine caching rates with real-world names under .nl, testing a range of name popularities.

TTL: TTL values vary significantly in DNS, with top-level domains typically using 1-day TTLs, while CDNs often use short TTLs of 1 or 5 minutes. Given this diversity of configurations, we explicitly design experiments that cover the range from 1 minute to 1 day (60 s and 86400 s TTLs). Thus, rather than trying to capture a single TTL that represents all possible configurations, we study a range of TTLs to explore the full range of caching behavior; §4 examines real-world traffic to provide a view of how well caching works with the distribution of TTLs seen in actual queries.

Representativeness of Atlas Locations and Software: It is well known that the global distribution of RIPE Atlas probes is uneven; Europe has far more than elsewhere [5, 6, 46]. Although quantitative data analysis might be generally affected by this distribution bias, our qualitative analysis, contributions, and conclusions do not depend on the geographical location of probes.

Atlas probes use identical stub resolver software, but they are deployed in diverse locations (homes, businesses, universities) and so see a diverse set of recursives, vendors, and versions. Our study therefore represents Atlas "in the wild" and does not try to study specific software versions or vendors. Although we claim our study captures diverse recursive resolvers, we do not claim they are representative of a "typical" Internet client. It complements prior studies on caching by establishing what Atlas sees, a baseline needed when we study DDoS in §5.

3.3 Datasets

We carried out five experiments, varying the cache lifetime (TTL) and the probing frequency from the VPs. Table 1 lists the parameters of the experiments. In the first four measurements, the probing interval was fixed to 20 minutes and the TTL for each AAAA was set to 60, 1800, 3600, or 86400 seconds, all frequently used TTL values. For the fifth measurement, we fixed the TTL value to 3600 seconds and reduced the probing interval to 10 minutes to get better resolution of dynamics.

In each experiment, queries were sent from about 9k Atlas probes. We discard 400–448 of these ("probes (disc)", about 4.4% to 4.9% of probes) that do not return an answer. Successful Atlas probes query multiple recursive resolvers, each a Vantage Point, so each experiment results in about 15k VPs. We also discard 323–657 answers ("answers (disc)", about 0.35% to 0.49% of answers) because they report error codes (for example, SERVFAIL and REFUSED [21]) or because they are referrals instead of the desired AAAA records [15]. (We provide more detail about referrals in Appendix A.)

Overall, we see about 93–96k queries to cachetest.nl from the 9k probes at 20-minute pacing, and about double that with 10-minute pacing. Experiments last two to three hours, with no interference between experiments due to the use of unique names. We ensure that experiments are isolated from each other. First, we space experiments about one day apart (details in RIPE [38]). Second, the IP addresses (and their records in cachetest.nl) of both authoritative name servers change in each experiment, when we restart their VMs. Finally, we change the replies in the AAAA records so we can detect any stale results (see §3.2).

3.4 TTL distribution: expected vs. observed

We next investigate how often recursive resolvers honor the full TTL provided by authoritative servers. Our goal is to classify the valid DNS answers from Table 1 into four categories, based on where the answer comes from and where we expect it to come from:

AA: answers expected and correctly received from the authoritative.

CC: answers expected and correctly received from a recursive's cache (cache hits).

AC: answers from the authoritative, but expected to be from the recursive's cache (a cache miss).


TTL               60     1800   3600   86400  3600-10m
Answers (valid)   90079  91461  89150  91172  182731
1-answer VPs      38     51     49     35     17
Warm-up (AAi)     15292  15396  15003  15310  15380
  Duplicates      25     23     25     22     23
  Unique          15267  15373  14978  15288  15357
  TTL as zone     14991  15046  14703  10618  15092
  TTL altered     276    327    275    4670   265
AA                74435  21574  10230  681    11797
CC                235    29616  39472  51667  107760
  CCdec           4      5      1973   4045   9589
AC                37     24645  24091  23202  47262
  TTL as zone     2      24584  23649  13487  43814
  TTL altered     35     61     442    9715   3448
CA                42     179    305    277    515
  CAdec           7      3      21     29     65

Table 2: Valid DNS answers (expected/observed).

CA: answers from a recursive's cache, but expected from the authoritative (an extended cache).

To determine if a query should be answered by the cache of the recursive, we track the state of prior queries and responses, and the estimated TTL. Tracking state is not hard, since we know the initial TTL and all queries to the zone, and we encode the serial number and the TTL in the AAAA reply (§3.2).
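As an illustration of this bookkeeping, here is a minimal sketch of the classification logic (our own rendering, not the authors' analysis code): given the serial encoded in the reply, the zone's current serial, and the time since this VP's previous answer, the four categories fall out of two booleans.

    def classify(reply_serial: int, zone_serial: int,
                 since_last_s: float, ttl: int) -> str:
        # Classify one answer as AA, CC, AC, or CA (categories of Sec. 3.4).
        expect_cached = since_last_s < ttl        # prior answer still valid?
        from_cache = reply_serial < zone_serial   # stale serial => cached reply
        if expect_cached:
            return "CC" if from_cache else "AC"   # AC is a cache miss
        return "CA" if from_cache else "AA"       # CA is an extended cache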

Cold Caches and Rewriting TTLs: We first consider queries made against a cold cache (the first query of a unique name) to test how many recursives override the TTL. We know that this happens at some sites, such as at Amazon EC2, where the default recursive resolver for their virtual machines (VMs) caps all TTLs to 60 s [36].

Table 2 shows the results of our five experiments, in which we classify the valid answers from Table 1. Before classifying them, we first disregard VPs that had only one answer (1-answer VPs), since we cannot evaluate their cache status with only one answer (at most 51 VPs, out of about 15,000, per experiment). We then classify the remaining queries, starting with the warm-up queries (AAi), all of which are type AA (expected and answered by the authoritative server).

We see some duplicate responses; for these, we use the timestamp of the very first AAi received. We then classify each unique AAi by comparing the TTL value returned by the recursive with the expected TTL encoded in the AAAA answer (fixed per experiment); if these two values differ by more than 10%, we report TTL altered. The TTL as zone line counts the answers we expect to get, while the TTL altered line shows that a few hundred recursive resolvers alter the TTL.

We see that the vast majority of recursives honor small TTLs, with only about 2% truncating the TTL (275 to 327 of about 15,000, depending on the experiment's TTL). We and others (§7) see TTL truncation from multiple ASes. The exception is for queries with day-long TTLs (86400 s), where 4670 queries (30%) have shortened TTLs. (Prior work also reported that many public resolvers refresh at 1 day [51].) We conclude that wholesale TTL shortening does not occur for TTLs of an hour or less.

TTLs with Warm Cache: We next consider a warm cache: subsequent queries, where we believe the recursive should have the prior answer cached, classified according to the proposed categories (AA, CC, AC, and CA).

Figure 3 shows a histogram of these classifications (numbers shown in Table 2). We see that most answers we receive show

[Figure 3: Classification of subsequent answers with warm cache (AA, CC, AC, CA); cache-miss fractions per experiment: 0.0% (60 s), 32.6% (1800 s), 32.9% (3600 s), 30.9% (86400 s), and 28.5% (3600 s-10m).]

expected caching behavior. For 60 s TTLs (the left bar), we expect no queries to be cached when we re-query 20 minutes (1200 s) later, and we see few cache hits (235 queries, the CC row in Table 2, which are due to TTL rewriting to values larger than 20 minutes). We see only a handful of CA-type replies, where we expect the authoritative to reply and the recursive does instead. We conclude that under normal operations (with authoritatives responding), recursive resolvers do not serve stale results (as has been proposed for when the authoritative cannot be reached [19]).

For longer TTLs, we see cache-miss (AC response) fractions of 28% to 33%, computed as AC/(Answers (valid) − (1-answer VPs + Warm-up)). Most of the AC answers did not alter the TTL (the TTL as zone line under AC), i.e., the cache miss was not due to TTL manipulations (Table 2). We do see 9715 TTL modifications (about 42% of ACs) when the TTL is day-long (86400 s). These TTL truncations are consistent with recursive resolvers that limit cache durations, such as default caps of 7 days in BIND [17] and 1 day in Unbound [28]. (We provide more detail about TTL manipulations in Appendix B.)
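To make the arithmetic concrete, a short sketch computing that miss fraction from the Table 2 counts (here the 1800 s column; the formula is the one above):

    # Cache-miss fraction: AC / (Answers (valid) - (1-answer VPs + Warm-up)).
    valid, one_answer, warmup, ac = 91461, 51, 15396, 24645
    subsequent = valid - (one_answer + warmup)  # answers against a warm cache
    print(f"{ac / subsequent:.1%}")             # 32.4%, within the 28-33% range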

We conclude that DNS caches are fairly effective, with cache hits about 70% of the time. This estimate is likely a lower bound: we are the only users of our domain, and popular domains would see more cache hits due to requests from other users. We only see TTL truncation for day-long TTLs. This result will help us understand the role of caching when authoritatives are under stress.

3.5 Public Recursives and Cache Fragmentation

Although we showed that most requests are cached as expected, about 30% are not. We know that many DNS requests today are served by public recursive resolvers, several of which exist [1, 12, 29, 37]. We also know that public recursives often use anycast and load balancing [48], and that this can result in caches that are fragmented (not shared) across many servers. We next examine how many cache misses (type AC replies) are due to public recursives.

Although we control the queriers and authoritative servers, there may be multiple levels of recursive resolvers in between. From Figure 1, we see the querier's first-hop recursive (R1) and the recursive that queries the authoritative (Rn). Fortunately, queries and replies are unique, so we can relate queries to the final recursive, knowing the time (the query round) and the query source. For each query q, we extract the IP address of Rn and compare it against a list of IP


TTL                 60  1800   3600   86400  3600-10m
AC Answers          37  24645  24091  23202  47262
Public R1           0   12000  11359  10869  21955
  Google Public R1  0   9693   9026   8585   17325
  other Public R1   0   2307   2333   2284   4630
Non-Public R1       37  12645  12732  12333  25307
  Google Public Rn  0   1196   1091   248    1708
  other Rn          37  11449  11641  12085  23599

Table 3: AC answers, public resolver classification.

addresses for 96 public recursives (Appendix C), which we obtained from a DuckDuckGo search for "public dns" done on 2018-01-15.

Table 3 reexamines the AC replies from Table 2. With the exception of the measurements with a TTL of 60 s, nearly half of AC answers (cache misses) are from queries to public R1 recursives, and about three-quarters of these are from Google's Public DNS. The other half of cache misses start at non-public recursives, but about 10% of these eventually emerge from Google's DNS.

Besides identifying public recursives, we also see evidence of cache fragmentation in answers from caches (CC and CA). Sometimes we see the serial numbers in consecutive answers decrease. For example, one VP reports serial numbers 1, 3, 3, 7, 3, 3, suggesting that it is querying different recursives, one with serial 3 and another with serial 7 in its cache. We show these occurrences in Table 2 as CCdec and CAdec. With longer TTLs we see more cache fragmentation, with 4.5% of answers showing fragmentation for day-long TTLs.
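Such serial regressions are easy to detect mechanically; a small sketch (our own illustration) flags a VP whose successive answers carry a decreasing serial:

    def fragmented(serials: list[int]) -> bool:
        # A serial that decreases between consecutive answers implies the VP
        # was served by (at least) two recursives with independent caches.
        return any(b < a for a, b in zip(serials, serials[1:]))

    print(fragmented([1, 3, 3, 7, 3, 3]))  # True: the example VP above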

From these observations, we conclude that cache misses result from several causes: (1) use of load balancers or anycast, where servers lack shared caches; (2) first-level recursives that do not cache and have multiple second-level recursives; and (3) caches that may reset between the somewhat long probing intervals (10 or 20 minutes). Causes (1) and (2) occur in public resolvers (confirmed by Google [12]) and account for about half of the cache misses in our measurements.

4 CACHING PRODUCTION ZONES

In §3 we showed that about one-third of queries do not conform to caching expectations, based on controlled experiments to our test domain. (Results may be better for caches that prioritize popular names.) We next examine this question for specific records in .nl, the country-code domain (ccTLD) for the Netherlands, and in the Root (.) DNS zone. With traffic from "the wild" and a measurement target used by millions, this section uses domains popular enough to stay in-cache at recursives.

4.1 Requests at .nl's Authoritatives

We apply this methodology to data for the .nl country-code top-level domain (ccTLD). We look specifically at the A records for the nameservers of .nl: ns[1-5].dns.nl.

Methodology: We use passive observations of traffic to the .nl authoritative servers.

For each target name in the zone and each source (some recursive server identified by IP address), we build a timeseries of all requests and compute their interarrival time ∆. Following the classification from §3.4, we label queries as AC if ∆ < TTL, showing an unnecessary query to the authoritative, and AA if ∆ ≥ TTL, an expected or

[Figure 4: ECDF of the median ∆t for recursives with at least 5 queries to ns1–ns5.dns.nl (TTL of 3600 s).]

delayed cache refresh. (We do not see cache hits, and so there are no CC events.)
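A sketch of this labeling, assuming as input the sorted query timestamps from one recursive for one record (a hypothetical format; the actual data lives in ENTRADA):

    def label_queries(timestamps: list[float], ttl: int) -> list[str]:
        # Label each query after the first as AC (arrived before the prior
        # answer's TTL expired: an unnecessary re-query) or AA (expected refresh).
        return ["AC" if (cur - prev) < ttl else "AA"
                for prev, cur in zip(timestamps, timestamps[1:])]

    # one recursive's queries for ns1.dns.nl (TTL 3600 s), seconds since start:
    print(label_queries([0, 1800, 5400, 9000], 3600))  # ['AC', 'AA', 'AA']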

Dataset: At the time of our analysis (February 2018) there were 8 authoritative servers for the .nl zone. We collect traffic for the 4 unicast and one anycast authoritative servers, and store the data in ENTRADA [55] for analysis.

Since our data for .nl is incomplete, and we know recursives will query all authoritatives over time [27], our analysis represents a conservative estimate of TTL violations; we expect to miss some AC-type queries from resolvers to non-monitored authoritatives.

We collect data for a period of six hours on 2018-02-22, starting at 12:00 UTC. We only evaluate recursives that sent at least five queries for our domains of interest, omitting infrequent recursives (they do not change the results noticeably). We discard duplicate queries, for example from a few retransmissions (less than 0.01% of the total queries). In total, we consider more than 485k queries from 7779 different recursives.

Results: Figure 4 shows the distribution of the ∆t that we observe in our measurements, reporting the median ∆t for any resolver that sends at least 5 queries.

About 28% of queries are frequent, with an interarrival of less than 10 s, and 32% of these are sent to multiple authoritatives. We believe these are due to recursives submitting queries in parallel to speed up replies (perhaps the "Happy Eyeballs" algorithm [45]).

Since these closely timed queries are not related to recursive caching, we exclude them from the analysis. The remaining data is 348k queries from 7703 different recursives.

The largest peak is at 3600 s, which is expected: the name was queried and cached for the full hour TTL, then the next request causes the name to be re-fetched. These queries are all of type AA.

The smaller peak around 1800 s, as well as queries with other times less than 3600 s, corresponds to type AC queries: queries that could have been supplied from the cache but were not. 22% of resolvers sent most of their queries within a time interval of less than 3600 s, or even more frequently. These AC queries occur because of TTL limiting, cache fragmentation, or other reasons that clear the cache.

4.2 Requests at the DNS Root

In this section we perform an analysis similar to §4.1, in which we look at DNS queries received at all Root DNS servers (except


[Figure 5: Distribution of the number of queries for the DS record of .nl received from each recursive. Dataset: DNS-OARC DITL on 2017-04-12T00:00Z for 24 hours. F-Root, H-Root, and all roots are highlighted; all other Root servers, with similar distributions, are shown in light-gray lines.]

G-Root), and create a distribution of the number of queries received per source IP address (i.e., per recursive).

In this analysis we use data from the DITL (Day In The Life) dataset of 2017, available at DNS-OARC [9]. We look at all DNS queries for the DS record of the domain .nl received at the Root DNS servers over the entire day of April 12, 2017 (UTC). This dataset consists of queries from more than 703k unique recursives, seen across all Root servers. Note that the DS record for .nl has a TTL of 86400 seconds (24 hours). In theory, then, one could expect to see just one query per recursive arriving at a given root letter for the DS record of .nl within the 24-hour interval.

Each line in Figure 5 shows the distribution of the total number of queries received at the Root servers from individual recursives asking for the DS record of .nl. Besides F- and H-Root, the distribution is similar across all Root servers; these are plotted in light-gray lines. F-Root shows the "most friendly" behavior from recursives, where around 5% of them sent 5 or more queries for .nl. As opposed to F, H-Root (dotted red line) shows the "worst" behavior from recursives, where more than 10% of them sent 5 or more queries for .nl within the 24-hour period.

The solid black line in Figure 5 shows the distribution for all the queries across all Root servers. The majority (around 87%) of recursives send only one query within the 24-hour interval. However, considering all Root servers, we see around 13% of recursives that sent multiple queries. Note that the distributions shown in Figure 5 have (very) long tails, and we see up to more than 21.8k queries from a single recursive within the 24-hour period for the .nl DS record, i.e., roughly one query every 4 seconds from the same IP address for the same DS record.

Discussion: We conclude that measurements of popular domains within .nl (§4.1) and the Roots (§4.2) show that about 63% and 87% of recursives, respectively, honor the full TTL. These results are roughly in line with our observations with RIPE Atlas (§3).

5 THE CLIENT'S VIEW OF AUTHORITATIVES UNDER DDOS

We next use controlled experiments to evaluate how DDoS attacks at authoritative DNS servers impact the client experience. Our studies of caching in controlled experiments (§3) and passive observations (§4) have shown that caching often works, but not always: about 70% of controlled experiments and 30% of passive observations see full cache lifetimes. Since the results of specific experiments vary, we sweep the space of attack intensities to understand the range of responses, from complete failure of authoritative servers to partial failures.

5.1 Emulating DDoS

To emulate DDoS attacks, we begin with the same test domain (cachetest.nl) we used for controlled experiments in §3. We run a normal DNS service for some time, querying from RIPE Atlas. After caches are warm, we then emulate a DDoS attack by dropping some fraction, or all, of incoming DNS queries to each authoritative. (We drop incoming traffic randomly with Linux iptables, so packet drop is not biased towards any recursive.) After we begin dropping traffic, answers come either from caches at recursives or, for partial attacks, from a lucky query that passes through.
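The text states only that incoming queries are dropped randomly with Linux iptables; one way to express such a rule (an assumption on our part, not the authors' exact configuration) uses the iptables statistic match, here wrapped in Python:

    import subprocess

    def emulate_ddos(loss: float) -> None:
        # Randomly drop a fraction `loss` of incoming DNS-over-UDP queries
        # (requires Linux and root; uses iptables' "statistic" match).
        subprocess.run(
            ["iptables", "-A", "INPUT", "-p", "udp", "--dport", "53",
             "-m", "statistic", "--mode", "random",
             "--probability", str(loss), "-j", "DROP"],
            check=True)

    emulate_ddos(0.75)  # e.g., Experiment F: 75% loss at this authoritative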

This emulation of DDoS captures the traffic loss that occurs in a DDoS attack as router queues overflow. The emulation is not perfect, since we simulate loss at the last-hop router, while in real DDoS attacks packets are often lost on access links near the target. Our emulation approximates this effect with one aggregate loss rate.

DDoS attacks are also accompanied by queueing delay, since buffers at and near the target are full. We do not model queueing delay, although we do observe latency increasing due to retries. In modern routers, queueing delay due to full router buffers should be less than the retry interval. In addition, observations during real-world DDoS events show that the few queries that are successful see response times that are not much higher than typical [23], suggesting that loss (and not delay) is the dominant effect of DDoS in practice. However, a study that adds queueing latency to the attack model is interesting future work.

5.2 Clients During Complete Authoritatives Failure

We first evaluate the worst-case scenario for a DNS operator: complete unreachability of all authoritative name servers. Our goal is to understand when and for how long caches cover such an outage.

Table 4 shows Experiments A, B, and C, which emulate complete failure. In Experiment A, each VP makes only one query before the DDoS begins. In Experiment B we allow several queries to take place, and Experiment C allows several queries with a shorter TTL.

Caches Protect Some: We first consider Experiment A, with one query that warms the cache, immediately followed by the attack. Figure 6a shows these responses over time, with the onset of the attack at the first downward arrow (between 0 and 10 minutes) and the cache expired after the second downward arrow (between 60 and 70 minutes). We see that after the DDoS starts, but before the cache has fully expired (between the downward arrows), initially 30% and eventually 65% of queries fail, with either no answer or a SERVFAIL error. While not good, this does mean that 35% to 70% of queries during the DDoS are successfully served from the cache. By contrast, shortly after the cache expires, almost all queries fail (only 25 VPs, or 0.2% of the total, seem to provide stale answers).

Caches Fill at Different Times: In a more realistic scenario, VPs have filled their caches at different times. In Experiment A, caches are freshly filled and should last for a full hour after the start of the attack. Experiment B is designed for the opposite and worst


Experiment parameters:

Exp.  TTL (s)  DDoS start  DDoS dur.  queries before  total dur.  probe interval  failure
A     3600     10          60         1               120         10              100% (both NSes)
B     3600     60          60         6               240         10              100% (both NSes)
C     1800     60          60         6               180         10              100% (both NSes)
D     1800     60          60         6               180         10              50% (one NS)
E     1800     60          60         6               180         10              50% (both NSes)
F     1800     60          60         6               180         10              75% (both NSes)
G     300      60          60         6               180         10              75% (both NSes)
H     1800     60          60         6               180         10              90% (both NSes)
I     60       60          60         6               180         10              90% (both NSes)

Results:

Exp.  Total probes  Valid probes  VPs    Queries  Total answers  Valid answers
A     9224          8727          15339  136423   76619          76181
B     9237          8827          15528  357102   293881         292564
C     9261          8847          15578  258695   199185         198197
D     9139          8708          15332  286231   273716         272231
E     9153          8708          15320  285325   270179         268786
F     9141          8727          15325  278741   259009         257740
G     9206          8771          15481  274755   249958         249042
H     9226          8778          15486  269030   242725         241569
I     9224          8735          15388  253228   218831         217979

Table 4: DDoS emulation experiments [38]. DDoS start, durations, and probe interval are given in minutes.

[Figure 6: Answers received during DDoS attacks (OK, SERVFAIL, no answer). (a) Experiment A (3600-10min-1down): arrows indicate DDoS start and cache expiration. (b) Experiment B (3600-10min-1down-1up): arrows indicate DDoS start and recovery. (c) Experiment C (1800-10min-1down-1up): arrows indicate DDoS start, cache expiration, and recovery.]

[Figure 7: Timeseries of answers (AA, CC, CA) for Experiment B.]

case: we begin warming the cache one hour before the attack and query 6 times from each VP. Other parameters are the same, with the attack lasting for 60 minutes (also the cache duration), but then we restore the authoritatives to service.

Figure 6b shows the results of Experiment B. While about 50% of VPs are served from the cache in the first 10-minute round after the DDoS starts, the fraction served drops quickly and is at only about 3% one hour later. Three factors are at play here: most caches were filled 60 minutes before the attack and are timing out in the first round. While the timeout and query rounds are both 60 minutes apart, Atlas intentionally spreads queries out over 5 minutes, so we expect that some queries happen after 59 minutes and others after 61 minutes.

Second, we know some large recursives have fragmented caches (§3.5), so we expect that some of the successes between times 70 and 110 minutes are due to caches that were filled between times 10 and 50 minutes. This can be seen in Figure 7, where we show a timeseries of the answers for Experiment B: we see CC (correct cache responses) between times 60 and 90.

Third, we see an increase in the number of CA queries, which are answered by the cache with expired TTLs (Figure 7). This increase is due to servers serving stale content [19].

Caches Eventually All Expire: Finally, we carry out a third emulation, but with half the cache lifetime (1800 s, or 30 minutes, rather than the full hour). Figure 6c shows responses over time. These results are similar to Experiment B, with a rapid fall-off when the attack starts, as caches age. After the attack has been underway for 30 minutes, all caches must have expired, and we see only a few (about 26) residual successes.


5.3 Discussion of Complete Failures

Overall, we see that caching is partially successful in protecting clients during a DDoS. With full, valid caches, half or more of VPs get service. However, caches are filled at different times and expire, so an operator cannot count on a full cache duration for any customer, even for popular ("always in the cache") domains. The protection provided by caches depends on their state in the recursive resolver, something outside the operator's control. In addition, our evaluation of caching in §3 showed that caches will end early for some VPs.

Second, we were surprised that a tiny fraction of VPs are successful after all caches should have timed out (after the 80-minute period in Experiment A, and between 90 and 110 minutes in Experiment C). These successes suggest an early deployment of "serve stale", something currently under review in the IETF [19]: serving a previously known record beyond its TTL if authoritatives are unreachable, with the goal of improving resilience under DDoS. We investigated Experiment A, where we see that 1048 answers of the 1140 successes in the second half of the outage come from 471 VPs (and 215 recursives), most of them answered by OpenDNS and Google public DNS servers, suggesting experimentation that is not yet widespread. Out of these 1048 queries, 1031 return a TTL value equal to 0, as specified in the IETF serve-stale draft [19].

5.4 Client Reliability During Partial Authoritative Failure

The previous section examined DDoS attacks that result in complete failure of all authoritatives, but often DDoS attacks result in partial failure, with 50% or 90% packet loss at the authoritatives. (For example, consider the November 2015 DDoS attack on the DNS Root [23].) We next study experiments with partial failures, showing that caching and retries together nearly fully protect clients during 50%-loss DDoS events, and protect half of VPs even during 90%-loss events.

We carry out several experiments, D to I in Table 4. We follow the procedure outlined in §5.1, looking at DDoS-driven loss rates of 50%, 75%, and 90%, with TTLs of 1800 s, 300 s, and 60 s. Graphs omitted here due to space can be found in Appendix D.

Near-Full Protection from Caches During Moderate Attacks: We first consider Experiment E, a "mild" DDoS with 50% loss, with VP success over time shown in Figure 8a. In spite of a loss rate that would be crippling to TCP, nearly all VPs are successful in DNS. This success is due to two factors: first, we know that many clients are served from caches, as was shown in Experiment A with full loss (Figure 6a). Second, most recursives retry queries, so they recover from the loss of a single packet and are able to provide an answer. Together, these mean that the failure rate during the first 30 minutes of the event is 8.5%, only slightly higher than the 4.8% failure rate before the DDoS. For this experiment the TTL is 1800 s (30 minutes), so we might expect failures to increase halfway through the DDoS. We do not see any increase in failures because caching and retries are synergistic: a successfully retried query will place the answer in a cache for a later query. The importance of this result is that DNS can survive moderate-size attacks when caching is possible. While a positive, retries do increase latency, something we study in §5.5.
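The synergy can be seen in a back-of-the-envelope model (our own illustration, not a measured result): a query fails only if it misses the cache and all of a recursive's tries are lost.

    def p_success(loss: float, cache_hit: float, tries: int) -> float:
        # P(answer) = cache hit, or at least one of `tries` queries survives.
        return cache_hit + (1 - cache_hit) * (1 - loss ** tries)

    # Experiment E conditions: 50% loss, ~70% warm-cache hit rate (Sec. 3.4),
    # and a recursive making ~3 tries:
    print(round(p_success(0.50, 0.70, 3), 3))  # ~0.96, near the observed 91.5%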

Attack Intensity Matters: While clients do quite well with 50% loss at all authoritatives, failures increase with the intensity of the attack.

[Figure 8: Answers received during DDoS attacks (OK, SERVFAIL, no answer); the first and second vertical lines show the start and end of the DDoS. (a) Experiment E (1800-50p-10min): 50% packet loss. (b) Experiment F (1800-75p-10min): 75% packet loss. (c) Experiment H (1800-90p-10min): 90% packet loss. (d) Experiment I (60-90p-10min): 90% packet loss.]


Experiments F and H, shown in Figure 8b and Figure 8c, increase the loss rate to 75% and 90%. We see the number of failures increase, to about 19.0% with 75% loss and 40.3% with 90% loss. It is important to note that roughly 60% of the clients are still served, even with 90% loss.

We also see that this level of success is consistent over the entire hour-long DDoS event, even though the cache duration is only 30 minutes. This consistency confirms the importance of caching and retries in combination.

To verify the effects of this interaction, Experiment I changes the caching duration to 60 s, less than one round of probing. Comparing Experiment I in Figure 8d to H in Figure 8c, we see that the failure rate increases from 30% to about 63%. However, even with no caching, about 37% of queries are still answered, due to resolvers that serve stale content and to recursives' retries. We investigate retries in §6.

5.5 Client Latency During Partial Authoritative Failure

We showed that client reliability is higher than expected during failures (§5.4), due to a combination of caching and retries. We next consider client latency. Latency will increase during the DDoS because of retries and queueing delay, but we will show that latency increases less than one might expect, due to caching.

To examine latency, we return to Experiments D through I (Table 4), but look at latency (time to complete a query) rather than success. For these experiments, clients time out after 5 s.

Figures 9a to 9d show latency during each emulated DDoS scenario (experiments with figures omitted here are in Appendix D). Latencies are not evenly distributed, since some requests get through immediately while others must be retried one or more times, so in addition to the mean we show the 50, 75, and 90%iles to characterize the tail of the distribution.

We emulate DDoS by dropping requests (§5.1), and hence latencies reflect retries and loss but not queueing delay, underrepresenting latency in real-world attacks. However, their shape (some low latency and a few long) is consistent with, and helps explain, what has been seen in the past [23].

Beginning with Experiment E, the moderate attack in Figure 9a, we see no change to median latency. This result is consistent with many queries being handled by the cache, and half of those not handled by the cache getting through anyway. We do see higher latency in the 90%ile tail, reflecting successful retries. This tail also increases the mean somewhat.

This trend increases in Experiment F (Figure 9b), where 75% of queries are lost. Now we see that the 75%ile tail has increased, as has the number of unanswered queries, and the 90%ile is twice as long as in Experiment E.

We see the same latency in Experiment H, with the DDoS causing 90% loss. We set the timeouts to 5 s, so the larger attack results in more unsuccessful queries, but the latency for successful queries is not much worse than with 75% loss. Median latency is still low, due to cached replies.

Finally, Experiment I greatly reduces the opportunities for caching by reducing the cache lifetime to one minute. Figure 9d shows that the loss of caching increases the median RTT and significantly increases the tail latency. Compared with Figure 9c (same packet loss ratio but

[Figure 9: Latency results (median, mean, 75%ile, and 90%ile RTT); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment E: 50% packet loss (1800 s TTL). (b) Experiment F: 75% packet loss (1800 s TTL). (c) Experiment H: 90% packet loss (1800 s TTL). (d) Experiment I: 90% packet loss (60 s TTL).]


1800 s TTL), we can clearly see the benefits of caching in terms of latency (in addition to reliability): a half-hour TTL reduced the latency from 1300 ms to 390 ms. Longer TTLs also help reduce tail latency relative to shorter TTLs (compare, for example, the 90%ile RTT in Experiments I and H in Figure 9).

Summary: DDoS effects often increase client latency. For moderate attacks, increased latency is seen only by a few "unlucky" clients who do not see a full cache and whose queries are lost. Caching has an important role in reducing latency during DDoS, but while it can often mitigate most reliability problems, it cannot avoid latency penalties for all VPs. Even when caching is not available, roughly 40% of clients get an answer, either by serve-stale or by retries, as we investigate next.

6 THE AUTHORITATIVE'S PERSPECTIVE

Results of partial DDoS events (§5.4) show that DNS is surprisingly reliable: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d). These results are due to a combination of caching and retries. We next examine this from the perspective of the authoritative server.

6.1 Recursive-Authoritative Traffic during a DDoS

We first ask: are retries by recursive resolvers responsible for the success rates observed in §5.4? To investigate this question, we return to the partial DDoS experiments and look at how many queries are sent to the authoritative servers. We measure queries before they are dropped by our emulated DDoS. Recursives must make multiple queries to resolve a name, so we break out each type of query: for the nameserver (NS), for the nameserver's IPv4 and IPv6 addresses (A-for-NS and AAAA-for-NS), and finally for the desired query (AAAA-for-PID). Note that the authoritative is IPv4-only, so AAAA-for-NS is non-existent and subject to negative caching, while the other records exist and use regular caching.

We begin with the DDoS causing 75% loss, in Figure 10a. For this experiment, we observe 18407 unique IP addresses of recursives (Rn) querying for AAAA records directly at our authoritatives. During the DDoS, queries increase by about 3.5×. We expect about 4 trials, since the expected number of tries until success with loss rate p is (1 − p)^(−1). For this scenario, results are cached for up to 30 minutes, so successful queries are reused from recursive caches. This increase occurs both for the target AAAA record and also for the non-existent AAAA-for-NS records. Negative caching for our zone is configured at 60 s, making caching of NXDOMAINs for AAAA-for-NS less effective than positive caches.
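For completeness, the expected number of tries is the mean of a geometric distribution, assuming independent drops at loss rate p:

    \[ E[\text{tries}] = \sum_{k=1}^{\infty} k\,p^{k-1}(1-p) = \frac{1}{1-p}, \qquad p = 0.75 \Rightarrow E[\text{tries}] = 4. \]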

The offered load on the server increases further with more loss (90%), as shown in Experiment H (Figure 10b). The higher loss rate results in a much higher offered load on the server, averaging 8.2× normal.

Finally, in Figure 10c, we reduce the effects of caching with a 90% DDoS and a TTL of 60 s. Here we also see about 8.1× more queries at the server than before the attack. Comparing this case to Experiment H, caching reduces the offered load on the server by about 40%.

[Figure 10: Number of queries received by the authoritative servers (NS, A-for-NS, AAAA-for-NS, AAAA-for-PID); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment F (1800-75p-10min): 75% packet loss. (b) Experiment H (1800-90p-10min): 90% packet loss. (c) Experiment I (60-90p-10min): 90% packet loss.]

Implications: The implication of this analysis is that legitimate clients "hammer" the already-stressed server with retries during a DDoS. For clients, retries are important to obtain reliability, and each client independently chooses to retry.

The server is already under stress due to the DDoS, so these retries add to that stress. However, the DDoS traffic is almost certainly much larger than the retries of legitimate traffic. (A server experiencing a volumetric attack causing 90% loss must be receiving 10× its capacity. Regular traffic is a small fraction of normal capacity, so even 4× regular traffic is still much less than the attack traffic.) The multiplier for retried legitimate traffic depends on the implementations of the stub and recursive resolvers, as well as application-level retries and defection (users hitting reload in their browser, and later giving up). Our experiment omits application-level retries and likely gives a


lower bound. We next examine specific recursive implementations to see their behavior.

6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers "hammer" authoritatives when queries are dropped. We reexamine DNS software (first studied in 2012 [56]), and additionally show that deployments amplify retries.

Recursive Software: Prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], with an evaluation of BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS, and PowerDNS. We studied retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records of all authoritatives and the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7× more requests before the recursives time out. (Exact numbers vary in different runs, but typically each request is made 6 or 7 times.) Such retries are appropriate, provided they are paced (both use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.

Recursive Deployment: Another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex, multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify amplification, we count both the number of Rn recursives and the number of AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values represent the amplification in two ways: during stress, more Rn recursives will be used for each probe ID, and these Rn will generate more queries to the already-stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4). The maximum rises to 39. The number of queries for each probe ID grows more than 3×, from 2 to 7. Worse, the 90%ile grows more than 6× (from 3 queries to 18). The maximum grows 53.5×, reaching up to 286 queries for one single probe ID. This value, however, is a lower bound, given that there are a large number of A and AAAA queries that ask for NS records and not for the probe ID (AAAA- and A-for-NS in Figure 10).

We can also look at the aggregate effects of the retries created by this complex recursive infrastructure. Figure 12 shows the timeseries of unique IP addresses of Rn observed at the authoritatives. Before the DDoS period, for Experiment I with its TTL of 60 s, we see a constant number of recursives reaching our authoritatives, i.e., all queries should be answered by authoritatives (no caching at this TTL value). For Experiments F and H, both with a TTL of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks are observed when caches expire, as expected.

During the DDoS, we observe similar behavior for all three experiments in Figure 12: as packets are dropped at the authoritatives (at rates of 75%, 90%, and 90% for F, H, and I, respectively), we see an increase in the number of Rn recursives querying our authoritatives; for Experiments F and H, we see drops when caching

[Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs (median, 90%ile, and max).]

[Figure 12: Unique Rn recursive addresses observed at the authoritatives (Experiments F, H, and I).]

is expected, but not for Experiment I. The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which amplifies traffic in the end. (We show this behavior for an individual probe in Appendix F, where we observe the growth in both the number of queries received at the authoritatives and the number of recursives used.)

Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to be within a constant factor of single recursives, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and we encourage their operators to share information about their behavior.

7 RELATED WORK

Caching by Recursives: Several groups have shown that DNS caching can be imperfect. Hao and Wang analyzed the impact of nonce domains on DNS recursives' caches [13]; using two weeks of data from two universities, they showed that filtering one-time domains improves cache hit rates. In two studies, Pang et al. [31, 32] reported that web clients and local recursives do not always honor TTL values provided by authoritatives. Almeida et al. [2] analyzed DNS traces of a mobile operator, and used a mobile application to see TTLs in practice. They find that most domains have short TTLs (less than 60 s), and report evidence of TTL manipulation by recursives. Schomp et al. [48] demonstrate widespread use of


multi-level recursives by large operators, as well as TTL manipulation. Our work builds on this prior work, examining caching and TTL manipulation systematically and considering their effects on resilience.

DNS client behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, finding that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives during DDoS: we investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast with both latency and reachability, as seen from RIPE Atlas. Rather than looking at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while that prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study has incomplete data and focuses on specific results of two events. These differences stem from its study of natural experiments from real-world events, in contrast to our controlled experiments.

8 IMPLICATIONS

We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get to authoritative servers. The key implication of our study is to explain differences in the outcomes of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative "server" implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: "There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability..." [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure and immediately took steps to address the service problems, there were reports of user-visible service disruption in the technical and even popular press [34]. Reports describe intermittent failure of prominent websites including "Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times", each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cacheable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports 922 anycast instances operating across the 13 authoritative servers. Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 "facilities") and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components with TTLs that range from 120 to 300 s.

In addition to explaining the effects, our experiments help get to the root causes behind these outcomes. Users of the Root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs like com and country codes) are popular and certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) receive performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to let caching help and to give DNS operators time to deploy defenses (Experiment H suggests 30 minutes; Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow for re-provisioning during DDoS attacks on web servers, but they leave DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, combined with long DNS TTLs, may be preferable to short DNS TTLs, so that both layers can be robust. However, the complexity of interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize the geographic dispersion of its sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with the largest latency [27]. Prior work on resilience to DDoS attacks has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic each site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to match that of the strongest individual authoritative. The reason for these opposite results is that, in both cases, recursive resolvers will try all authoritatives of a given service: for latency, they will sometimes choose a distant authoritative, but for resilience, they will keep trying until they find the most available authoritative.

9 CONCLUSIONS

This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole. In fact, they can largely cover over partial DDoS attacks for many users: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d).

The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users: a few see much greater latency, while some see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruption, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of this work is to quantify the importance of caching and retries in recursives for resilience, encouraging the use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on paper drafts.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements number FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES
[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012 Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from Using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188–200.
[9] DNS-OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181 Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns/
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis/
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC BIND. 2018. Chapter 6. BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report | GitHub Engineering. https://githubengineering.com/ddos-incident-report/
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency-02. Internet Draft. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain Names: Concepts and Facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 2065, 2181, 2308, 2535, 4033, 4034, 4035, 4343, 4592, 5936, 8020.
[21] P.V. Mockapetris. 1987. Domain Names: Implementation and Specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604, 7766.
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; the Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS" (May 2018). Web page. https://ant.isi.edu/datasets/dns/Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366
[28] NLnet Labs. 2018. NLnet Labs Documentation - Unbound - unbound.conf.5. https://nlnetlabs.nl/documentation/unbound/unbound.conf/
[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide/
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An Overview of DNS-based Server Selections in Content Distribution Networks. Computer Networks 43, 6 (2003), 695–711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows/
[36] Alec Peterson. 2017. EC2 Resolver Changing TTL on DNS Answers? Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy in a Few Easy Steps. https://quad9.net/
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671; TTL1800: 10507676; TTL3600: 10536725; TTL86400: 10579327; TTL3600-10min: 10581463; A: 10859822; B: 11102436; C: 11221270; D: 11804500; E: 11831403; F: 11831403; G: 12131707; H: 12177478; I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas: Raw Data Structure Documentation. https://atlas.ripe.net/docs/data_struct/
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters: An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. Internet Request for Comments. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons from the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On Measuring the Client-side DNS Infrastructure. In Proceedings of the 2013 ACM Internet Measurement Conference. ACM, 77–90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl Stats and Data. http://stats.sidnlabs.nl/
[51] Matthew Thomas and Duane Wessels. 2015. A Study of Caching Behavior with Respect to Root Server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374/
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang Up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] M. Weinberg and D. Wessels. 2016. Review and Analysis of Attack Traffic Against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24, Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A High-performance Network Traffic Data Streaming Warehouse. In Network Operations and Management Symposium (NOMS), 2016 IEEE/IFIP. IEEE, 913–918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80–86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS?

The cache estimations in §3 and §4 assume each record has a clear TTL. However, glue records bridge zones that are served by different authoritative servers, and they can result in two different TTL values for the same record. This section examines which TTL takes precedence.

A.1 The Problem of Glue Records

DNS employs glue records to identify name servers that are in the served zone itself. For example, the Canadian national domain ca is served from ca-servers.ca, but to look up ca-servers.ca one must know where ca is. To break this cycle, records that identify these servers ("glue records") are placed in the parent zone (the root, in this case). With these records in two places, they may have different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net
;; AUTHORITY SECTION:
ca.    172800  IN  NS  any.ca-servers.ca.
ca.    172800  IN  NS  j.ca-servers.ca.
ca.    172800  IN  NS  x.ca-servers.ca.
ca.    172800  IN  NS  c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net.

Notice that the answer for the NS records is in the "AUTHORITY SECTION": this signals that this answer is known as a "referral", a type of answer in which "a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query" [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for ca, a recursive can ask the same query directly to any of the ca servers. Listing 2 shows the results, an authoritative answer:

$ dig ns ca @any.ca-servers.ca
;; ANSWER SECTION:
ca.    86400  IN  NS  c.ca-servers.ca.
ca.    86400  IN  NS  j.ca-servers.ca.
ca.    86400  IN  NS  x.ca-servers.ca.
ca.    86400  IN  NS  any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca.

Notice that the records in this response from any.ca-servers.ca are presented in the "ANSWER SECTION" (and not in the "AUTHORITY SECTION"), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTLs are different: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).

We refer to this as TTL duplication: the same record is defined both at the zone's authoritative servers and at its parent, as a glue record.

This duplication poses a challenge for recursive resolvers: which value should they trust, the one from the glue or the one from the authoritative name server?

In theory, the answer from the authoritative should be the one used, since those servers are the ones authoritative for the zone; the Roots, in this example, are not "completely" authoritative [15]. This behavior can be observed in various zones as well, not only at the Roots.

Given records duplicated like those for ca, we investigate in the wild which values resolvers actually use: those provided by the parent or those provided by the authoritative. According to RFC 2181 [10], a recursive should use the authoritative answer (§5.4.1).

Given the importance of TTLs to DNS resilience (they bound cache duration), we investigate in this section which values recursives trust: the parent's or the authoritative name server's.


                 NS record  A record  Source
Total answers    128382     98038
TTL > 3600       38         46        Unclear
TTL = 3600       260        233       Parent
60 < TTL < 3600  6890       4621      Parent/others
TTL = 60         60803      46820     Authoritative
TTL < 60         60391      46318     Authoritative

Table 5: Distribution of TTL of DNS answers for cachetest.nl.

A records of amazon.com:
domain        IP               TTL
amazon.com    205.251.242.103  60
amazon.com    176.32.103.205   60
amazon.com    176.32.98.166    60

NS records of amazon.com:
domain        NS                    TTL-auth  TTL-com
amazon.com    ns1.p31.dynect.net    3600      172800
amazon.com    pdns6.ultradns.co.uk  3600      172800
amazon.com    ns2.p31.dynect.net    3600      172800
amazon.com    ns4.p31.dynect.net    3600      172800
amazon.com    pdns1.ultradns.net    3600      172800
amazon.com    ns3.p31.dynect.net    3600      172800

A records of the NS records:
domain                IP             TTL
ns1.p31.dynect.net    208.78.70.31   86400
pdns6.ultradns.co.uk  204.74.115.1   3600
ns2.p31.dynect.net    204.13.250.31  86400
ns4.p31.dynect.net    204.13.251.31  86400
pdns1.ultradns.net    204.74.108.1   3600
ns3.p31.dynect.net    208.78.71.31   86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments

We use the same setup as in §3, and configure the TTL values of our name servers (ns[1-2].cachetest.nl) as follows:

• Referral value (at the nl zone): TTL of 3600 s.
• Answer value (at the authoritative servers ns[1-2].cachetest.nl): TTL of 60 s.

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of the TTL values of the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the value provided by the parent nl server. We conclude from these two experiments, for A and NS records, that most recursives forward to their clients the TTL provided by the authoritatives of a zone.

A.3 Looking at a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritative, and not the one provided by referrals.

In this section we look into the caching of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives and 172800 s at the parent com server, as can be seen in Table 6.

To determine which specific record two popular recursive implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send only one query: dig ns amazon.com. We then dump the contents of the cache of each server (e.g., with rndc dumpdb -cache for BIND and unbound-control dump_cache for Unbound) and see which values they store.

Listing 3 and Listing 4 show the contents of the caches of both BIND and Unbound after sending the aforementioned query. As can be seen, the values stored in the caches are the ones provided by the authoritative server (decremented by a few seconds by the time we dumped the contents), and not the ones provided by the parent's referral:

; authanswer
amazon.com.  3595  NS  ns1.p31.dynect.net.
             3595  NS  ns2.p31.dynect.net.
             3595  NS  ns3.p31.dynect.net.
             3595  NS  ns4.p31.dynect.net.
             3595  NS  pdns1.ultradns.net.
             3595  NS  pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com.

;rrset 3594 6 0 7 3
amazon.com.  3594  IN  NS  pdns1.ultradns.net.
amazon.com.  3594  IN  NS  ns1.p31.dynect.net.
amazon.com.  3594  IN  NS  ns3.p31.dynect.net.
amazon.com.  3594  IN  NS  pdns6.ultradns.co.uk.
amazon.com.  3594  IN  NS  ns4.p31.dynect.net.
amazon.com.  3594  IN  NS  ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com.

A.4 Discussion and Implications

Understanding how glue records affect TTLs is important to show that zone operators are, in fact, in charge of how long their NS records will remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers: queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of 60 s (where all queries go to the authoritative), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time, for (a) TTL 60 s, (b) TTL 1800 s, (c) TTL 3600 s, (d) TTL 86400 s (1 day), and (e) TTL 3600 s with a 10-minute probing interval.

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72 Alternate DNS
23.253.163.53 Alternate DNS
205.204.88.60 BlockAid Public DNS (or PeerDNS)
178.21.23.150 BlockAid Public DNS (or PeerDNS)
91.239.100.100 Censurfridns
89.233.43.71 Censurfridns


2001:67c:28a4:: Censurfridns
2002:d596:2a92:1:71:53:: Censurfridns
213.73.91.35 Chaos Computer Club Berlin
209.59.210.167 Christoph Hochstätter
85.214.117.11 Christoph Hochstätter

212.82.225.7 ClaraNet
212.82.226.212 ClaraNet
8.26.56.26 Comodo Secure DNS
8.20.247.20 Comodo Secure DNS
84.200.69.80 DNS.Watch
84.200.70.40 DNS.Watch
2001:1608:10:25::1c04:b12f DNS.Watch
2001:1608:10:25::9249:d69b DNS.Watch
104.236.210.29 DNSReactor
45.55.155.25 DNSReactor
216.146.35.35 Dyn
216.146.36.36 Dyn
80.67.169.12 FDN
2001:910:800::12 FDN
85.214.73.63 FoeBuD
87.118.111.215 FoolDNS
213.187.11.62 FoolDNS
37.235.1.174 FreeDNS
37.235.1.177 FreeDNS
80.80.80.80 Freenom World
80.80.81.81 Freenom World
87.118.100.175 German Privacy Foundation e.V.
94.75.228.29 German Privacy Foundation e.V.
85.25.251.254 German Privacy Foundation e.V.
62.141.58.13 German Privacy Foundation e.V.
8.8.8.8 Google Public DNS
8.8.4.4 Google Public DNS
2001:4860:4860::8888 Google Public DNS
2001:4860:4860::8844 Google Public DNS
81.218.119.11 GreenTeamDNS
209.88.198.133 GreenTeamDNS
74.82.42.42 Hurricane Electric
2001:470:20::2 Hurricane Electric
209.244.0.3 Level3
209.244.0.4 Level3
156.154.70.1 Neustar DNS Advantage
156.154.71.1 Neustar DNS Advantage
5.45.96.220 New Nations
185.82.22.133 New Nations
198.153.192.1 Norton DNS
198.153.194.1 Norton DNS
208.67.222.222 OpenDNS
208.67.220.220 OpenDNS
2620:0:ccc::2 OpenDNS
2620:0:ccd::2 OpenDNS
58.6.115.42 OpenNIC
58.6.115.43 OpenNIC
119.31.230.42 OpenNIC
200.252.98.162 OpenNIC
217.79.186.148 OpenNIC
81.89.98.6 OpenNIC
78.159.101.37 OpenNIC
203.167.220.153 OpenNIC
82.229.244.191 OpenNIC
216.87.84.211 OpenNIC
66.244.95.20 OpenNIC
207.192.69.155 OpenNIC
72.14.189.120 OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9 OpenNIC
2001:470:1f07:38b::1 OpenNIC
2001:470:1f10:c6::2001 OpenNIC
194.145.226.26 PowerNS



77.220.232.44 PowerNS
9.9.9.9 Quad9
2620:fe::fe Quad9
195.46.39.39 SafeDNS
195.46.39.40 SafeDNS
193.58.251.251 SkyDNS
208.76.50.50 SmartViper Public DNS
208.76.51.51 SmartViper Public DNS
78.46.89.147 ValiDOM
88.198.75.145 ValiDOM
64.6.64.6 Verisign
64.6.65.6 Verisign
2620:74:1b::1:1 Verisign
2620:74:1c::2:2 Verisign
77.109.148.136 Xiala.net
77.109.148.137 Xiala.net
2001:1620:2078::136 Xiala.net
2001:1620:2078::137 Xiala.net
77.88.8.88 Yandex.DNS
77.88.8.2 Yandex.DNS
2a02:6b8::feed:bad Yandex.DNS
2a02:6b8:0:1::feed:bad Yandex.DNS
109.69.8.51 puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for Experiments D and G.

Figure 14 shows the number of answers for Experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when 1 NS fails (50% packet loss), there is no significant change in the number of answered queries. Moreover, most users will not notice in terms of latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (thus half of the probing interval). Similarly to Experiment I, this experiment should not have caching active, given the shorter TTL. Moreover, it has a packet loss rate of 75%. At this rate, we can see that the far majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

Figure 14: Answers received during DDoS attacks (extra graphs): (a) Experiment D, 1800-50p-10min-ns1; (b) Experiment G, 300-75p-10min-2NSes. Arrows indicate DDoS start and recovery.

Figure 15: Latency results (extra graphs): (a) Experiment D, 50% packet loss on 1 NS only (1800 s TTL); (b) Experiment G, 75% packet loss (300 s TTL). The shaded area indicates the interval of an ongoing DDoS attack.

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur with current DNS software. These experiments confirm that prior work by Yu et al. [56] still holds. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) first up, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from net and querying one or both based on their internal algorithms. All queries are made with a cold cache. In total, we send 100 queries per experiment and present the results in this section.

Figure 17: Probe 28477 on dataset I: the stub resolver (Atlas probe) queries three local recursives (r1a, r1b, r1c); all r1's are connected to all last-layer recursives (rn1 to rn8), which are connected to both authoritative servers (at1, at2).

Figure 16: Histogram of queries from recursive resolvers to the Root, net, and cachetest.net authoritatives, for BIND and Unbound, in normal operation and under DDoS.

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the net authoritative servers, and the authoritative servers for cachetest.net.

Figure 16 shows the results of this experiment, with normal behavior in the two left groups and the inaccessible (DDoS) condition in the two right groups. From this figure we see that, in normal operation, only a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the Root, one to net, and one to the cachetest.net authoritative servers) to resolve the AAAA records of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for the AAAA records of their NS records, which do not exist (this difference can thus be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure. Both BIND and Unbound send far more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps trying to resolve both the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 queries in total out of 46). If we disregard these queries, Unbound increases its total number of queries from 6 to 16, roughly a 2.7× increase. (We repeated this experiment 100 times, and all results were consistent.)

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND asks both parents (the Roots and net) for its records again (shown by the increase in the root and net portions of the BIND DDoS bars vs. the BIND non-DDoS bars in Figure 16); Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect: recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last layer" recursives (the ones that ultimately send queries to authoritatives, Rn1 to Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (RIPE Atlas probes) and the server side (authoritative servers). We show a summary of the results in Table 7. In this table we mark the probing intervals T in which the simulated DDoS was active (dataset I: 90% packet drop on both authoritative servers; see Table 4).

Client side: analyzing this table, we can learn the following from the client side.

During normal operations, we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1s used in the answers). By default, each RIPE Atlas probe sends a DNS query to each of its R1 recursives.

During the DDoS this changes slightly: we see 2 answers from 2 R1s; the third query either times out or results in SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval). This is in fact a good result, given the severity of the simulated DDoS.

Authoritative side: now let's consider the authoritative side.


           Client view (Atlas probe)   Authoritative server view
T          Queries Answers #R1         Queries Answers #ATs #Rn #(Rn-AT)  Max Que(Rn-AT) (% total)  Queries (top-2 Rn)
1          3       3       3           3       3       2    2   3         1 (33.3%)                 2, 1
2          3       3       3           5       5       2    4   5         1 (20.0%)                 2, 1
3          3       3       3           6       6       2    6   6         1 (16.7%)                 1, 1
4          3       3       3           5       5       2    2   3         2 (40.0%)                 4, 1
5          3       3       3           3       3       2    3   3         1 (33.3%)                 1, 1
6          3       3       3           6       6       2    5   5         2 (33.3%)                 2, 1
7 (DDoS)   3       2       2           29      3       2    2   4         10 (34.5%)                15, 14
8 (DDoS)   3       2       2           22      6       2    7   9         7 (31.8%)                 10, 7
9 (DDoS)   3       2       2           21      1       2    3   6         9 (42.9%)                 16, 3
10 (DDoS)  3       2       2           11      4       2    3   4         6 (54.5%)                 8, 2
11 (DDoS)  3       2       2           21      3       2    4   6         15 (71.4%)                17, 2
12 (DDoS)  3       2       2           23      1       2    8   8         15 (65.2%)                15, 2
13         3       3       3           3       3       2    3   3         1 (33.3%)                 1, 1
14         3       3       3           4       2       2    4   4         1 (25.0%)                 1, 1
15         3       3       3           8       8       2    7   7         7 (87.5%)                 1, 1
16         3       3       3           9       9       2    3   3         6 (66.7%)                 6, 3
17         3       3       3           9       9       2    5   5         5 (55.6%)                 5, 1

Table 7: Client and authoritative view of AAAA queries (probe 28477, measurement I). Rows marked DDoS are the probing intervals during the simulated attack.

In normal operation (before the DDoS), the 3 queries sent by the client (client view) lead to 3 to 6 queries sent by the Rn recursives, all reaching the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn used before the attack varies from 2 to 6 (#Rn). This choice depends on how the R1 recursives choose their upper-layer recursives (Rn-1), and ultimately on how each Rn-1 chooses its Rn.

We can also observe how each Rn chooses an authoritative (#(Rn-AT)): this metric shows the number of unique combinations of Rn and authoritative observed. For example, for T=1 we see that three queries reach the authoritatives but only 2 Rn are used; by looking at the number of unique recursive-authoritative combinations, we see that there are three, so one of the Rn must have used both authoritatives.

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11–29 queries received at the authoritatives, up from a maximum of 6 before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) Increase in the number of Rn used: the number of Rn employed increases slightly (3.6 on average before the attack, 4.5 during). (Not all Rn are used all the time during the DDoS, however. We do not have enough data to explain why this happens; it surely depends on how the lower-layer recursives choose their upper layers, or on how they are set up.)

(2) Increase in the number of authoritatives used by each Rn: each Rn in turn contacts an average of 1.13 authoritatives before the attack (the average of Uniq(Rn-AT)/Uniq(Rn)). During the DDoS, this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (Queries (top-2 Rn) in Table 7) and see that, compared to the total number of queries, they comprise the far majority. (A sketch of these computations follows this list.)

For this probe, the explanation for the extra queries is a combination of factors, with retries by the Rn recursives the most important one. From the client side, there is no strong indication that the client's queries are causing such an increase in the number of queries to authoritatives during a DDoS.



Figure 2: Stub resolver and anycast recursive (the stub queries anycast recursive R with sites R1 to R5; the chosen site queries authoritative AT over unicast).

example) of R; note that "nearest" metrics can vary [8]. The BGP mapping between source and anycast destination is known as the anycast catchment, which under normal conditions is very stable across the Internet [53]. On receiving the query, R3 contacts the pertinent authoritative AT to resolve the queried domain name. The communication from R3 to AT uses R3's unicast address, ensuring the reply from AT is sent back to R3 instead of to any other anycast site of R. On receiving the answer from AT, R3 replies to the stub resolver with the answer to its query, using its anycast address as source.

2.3 DNS Caching with Time-to-Live (TTLs)

DNS depends on caching to reduce latency to users and load on servers. Authoritatives provide responses that are then cached in applications, stub resolvers, and recursive resolvers. We next describe its loose consistency model.

An authoritative server defines the lifetime of each result by its Time-to-Live (TTL); although TTLs are not usually exposed to users, this information is propagated through recursive resolvers.

Once cached by recursive resolvers, results cannot be removed; they can only be refreshed, by a new query and response, after the TTL expires.

Some recursive resolvers discard long-lived cache entries after a configurable timeout: BIND defaults to dropping entries after 1 week [17], and Unbound after 1 day [28].

Operators select TTLs carefully. Content delivery networks (CDNs) often use DNS to steer users to different content servers. They select very short TTLs (60 seconds or less) to force clients to re-query frequently, providing opportunities to redirect clients with DNS in response to changes in load or server availability [30]. Alternatively, DNS data for top-level domains often has TTLs of hours or days. Such long TTLs reduce latency for clients (the reply can be reused immediately if it is in the cache of a recursive resolver) and reduce load on servers for commonly used top-level domains and slowly changing DNSSEC information.

3 DNS CACHING IN CONTROLLED EXPERIMENTS

To understand the role of caching at recursive resolvers in protection during failure of authoritative servers, we first must understand how often cache lifetimes (TTLs) are honored.

In the best-case scenario, authoritative DNS operators may expect clients to be able to reach domains under their zones, even if their authoritative servers are unreachable, for as long as cached values in the recursives remain "valid" (i.e., the TTL has not expired). Given the large variety of recursive implementations, we pose the following question from a user's point of view: can we rely on recursives' caching when authoritatives fail?

To understand cache lifetimes in practice, we carry out controlled measurements from thousands of clients. These measurements determine how well caches work in the field, complementing our understanding of how open source implementations work from their source code. This study is important because operational software can vary, and large deployments often use heavy customization or closed source implementations [48].

3.1 Potential Impediments to Caching

Although DNS records should logically be cached for the full TTL, a number of factors can shorten cache lifetimes in practice: caches are of limited size, caches may be flushed prematurely, and large resolvers may have fragmented caches. We briefly describe these factors here; understanding how often they occur motivates the measurements we carry out.

Caches are of limited size. Unbound, for example, defaults to a 4 MB limit, but the values are configurable. In practice, DNS results are small enough and caches large enough that cache size is usually not a limiting factor. Recursive resolvers may also override record TTLs, imposing either a minimum or a maximum value [52].

Caches can be flushed explicitly (at the request of the cache operator) or accidentally, on restart of the software or reboot of the machine running the cache.

Finally, some recursive resolvers handle very high request rates (consider a major ISP or public resolver [12, 29, 37]). Large recursive resolvers are often implemented as many separate recursives behind a load balancer or IP anycast. In such cases the caches may be fragmented, with each machine operating an independent cache, or the machines may share a cache of common names. In practice, this may reduce the cache hit rate.

3.2 Measurement Design

To evaluate caching, we use controlled experiments in which we query for specific names against authoritative servers we run, from thousands of RIPE Atlas sites. Our goal is to measure whether the TTL we define for the RRs of our controlled domain is honored across recursives.

Authoritative servers: we deploy two authoritatives that answer for our new domain name (cachetest.nl). We place the authoritatives on virtual machines in the same datacenter (Amazon EC2 in Frankfurt, Germany), each at a distinct unicast IPv4 address. Each authoritative runs BIND 9.10.3. Since both authoritatives are in the same datacenter, they have similar latencies to recursives, so we expect recursives to evenly distribute queries between both authoritative servers [27].

Vantage points: we issue queries to our controlled domain from around 9k RIPE Atlas probes [39]. Atlas probes are distributed across 3.3k ASes, with about one third hosting multiple vantage points (VPs). The Atlas software causes each probe to issue queries to each of its local recursive resolvers, so our VPs are the tuple of probe and recursive. The result is that we have more than 15k VPs (Table 1).

TTL             60     1800   3600   86400  3600-10min
Probes          9173   9216   8971   9150   9189
Probes (val)    8725   8788   8549   8750   8772
Probes (disc)   448    428    422    400    417
VPs             15330  15447  15052  15345  15397
Queries         94856  96095  93723  95780  191931
Answers         90525  91795  89470  91495  183388
Answers (val)   90079  91461  89150  91172  182731
Answers (disc)  446    334    323    323    657

Table 1: Caching baseline experiments [38].

Queries and caching: we take several steps to ensure that caching does not interfere with queries. First, each query is for a name unique to the probe: each probe requests an AAAA record for {probeid}.cachetest.nl, where probeid is the probe's unique identifier. Each reply is also customized: in the AAAA reply we encode three fields that are used to determine the effectiveness of caching (§3.4). Each IPv6 address in the answer is the concatenation of four values (in hex):

prefix is a fixed 64-bit value (fd0f:3897:faf7:a375).

serial is an 8-bit value, incremented every 10 minutes (at zone-file rotation), allowing us to associate replies with specific query rounds.

probeid is the unique Atlas probe ID [40], encoded in 16 bits, to associate the query with the reply.

ttl is a 16-bit value carrying the TTL we configure per experiment.

We increment the serial number in each AAAA record and reload the zone (with a new zone serial number) every 10 minutes. The serial number in each reply allows us to distinguish cached results from prior rounds from fresh data in the current round.

Atlas DNS queries time out after 5 seconds, reporting "no answer". We will see this occur in our emulated DDoS events.

For example, the VP with probe ID 1414 sends a DNS query for the AAAA record of the domain name 1414.cachetest.nl, and should receive an AAAA answer of the form $PREFIX:1:586:3c, where 1 is the serial, 0x586 (1414) is the probe ID, and 0x3c (60) is the TTL.

We focus on DNS over UDP on IPv4, not TCP or IPv6. We use only IPv4 queries from Atlas probes and serve only IPv4 authoritatives, although IPv6 may be used inside multi-level recursives. Our work could extend to cover other protocols, but we did not want to complicate the analysis with the orthogonal issue of protocol selection. We focus on DNS over UDP because it is by far the dominant transport protocol today (more than 97% of connections for nl [50] and most Root DNS servers [16]).

Query load: the query rate of our experiments is designed to explicitly test how queries intersect with TTL experimentation, not to reproduce real-world traffic rates. Popular domains such as com will be queried much more frequently than our query rates, so our results represent lower bounds on caching. In §4 we examine caching rates with real-world names under nl, testing a range of name popularities.

TTLs: TTL values vary significantly in DNS, with top-level domains typically using 1-day TTLs while CDNs often use short TTLs of 1 or 5 minutes. Given this diversity of configurations, we explicitly design experiments that cover the range from 1 minute to 1 day (60 s and 86400 s TTLs). Thus, rather than trying to capture a single TTL that represents all possible configurations, we study a range of TTLs to explore the full range of caching behavior; §4 examines real-world traffic to provide a view of how well caching works with the distribution of TTLs seen in actual queries.

Representativeness of Atlas locations and software: it is well known that the global distribution of RIPE Atlas probes is uneven; Europe has far more probes than elsewhere [5, 6, 46]. Although quantitative data analysis might generally be affected by this distribution bias, our qualitative analysis, contributions, and conclusions do not depend on the geographical location of probes.

Atlas probes use identical stub resolver software, but they are deployed in diverse locations (homes, businesses, universities) and so see a diverse set of recursives, vendors, and versions. Our study therefore represents Atlas "in the wild" and does not try to study specific software versions or vendors. Although we claim our study captures diverse recursive resolvers, we do not claim they are representative of a "typical" Internet client. It complements prior studies on caching by establishing what Atlas sees, a baseline needed when we study DDoS in §5.

3.3 Datasets

We carried out five experiments, varying the cache lifetime (TTL) and the probing frequency from the VPs. Table 1 lists the parameters of the experiments. In the first four measurements, the probing interval was fixed at 20 minutes and the TTL for each AAAA record was set to 60, 1800, 3600, or 86400 seconds, all frequently used TTL values. For the fifth measurement, we fixed the TTL value at 3600 seconds and reduced the probing interval to 10 minutes, to get better resolution of dynamics.

In each experiment, queries were sent from about 9k Atlas probes. We discard 400–448 of these ("Probes (disc)", about 4.4 to 4.9% of probes) that do not return an answer. Successful Atlas probes query multiple recursive resolvers, each a vantage point, so each experiment results in about 15k VPs. We also discard 323–657 answers ("Answers (disc)", about 0.35 to 0.49% of answers) because they report error codes (for example, SERVFAIL and REFUSED [21]) or because they are referrals instead of the desired AAAA records [15]. (We provide more detail about referrals in Appendix A.)

Overall, we see about 93–96k queries to cachetest.nl from the 9k probes at 20-minute pacing, and about double that with 10-minute pacing. Experiments last two to three hours, with no interference between experiments due to the use of unique names. We ensure that experiments are isolated from each other: first, we space experiments about one day apart (details in RIPE [38]); second, the IP addresses (and their records in cachetest.nl) of both authoritative name servers change in each experiment, when we restart their VMs; finally, we change the replies in the AAAA records so we can detect any stale results (see §3.2).

3.4 TTL Distribution: Expected vs. Observed

We next investigate how often recursive resolvers honor the full TTL provided by authoritative servers. Our goal is to classify the valid DNS answers from Table 1 into four categories, based on where the answer comes from and where we expect it to come from:

AA: answers expected and correctly received from the authoritative.

CC: answers expected and correctly received from a recursive's cache (cache hits).

AC: answers from the authoritative, but expected to be from the recursive's cache (a cache miss).

CA: answers from a recursive's cache, but expected from the authoritative (an extended cache).

TTL              60     1800   3600   86400  3600-10m
Answers (valid)  90079  91461  89150  91172  182731
1-answer VPs     38     51     49     35     17
Warm-up (AAi)    15292  15396  15003  15310  15380
  Duplicates     25     23     25     22     23
  Unique         15267  15373  14978  15288  15357
  TTL as zone    14991  15046  14703  10618  15092
  TTL altered    276    327    275    4670   265
AA               74435  21574  10230  681    11797
CC               235    29616  39472  51667  107760
  CCdec          4      5      1973   4045   9589
AC               37     24645  24091  23202  47262
  TTL as zone    2      24584  23649  13487  43814
  TTL altered    35     61     442    9715   3448
CA               42     179    305    277    515
  CAdec          7      3      21     29     65

Table 2: Valid DNS answers (expected/observed).

To determine if a query should be answered by the cache of the recursive, we track the state of prior queries and responses and the estimated TTL. Tracking state is not hard, since we know the initial TTL and all queries to the zone, and we encode the serial number and the TTL in the AAAA reply (§3.2).

Cold caches and rewriting TTLs: we first consider queries made against a cold cache (the first query for a unique name) to test how many recursives override the TTL. We know that this happens at some sites, such as Amazon EC2, where the virtual machines' default recursive resolver caps all TTLs at 60 s [36].

Table 2 shows the results of our five experiments, in which we classify the valid answers from Table 1. Before classifying them, we first disregard VPs that had only one answer (1-answer VPs), since we cannot evaluate their cache status with only one answer (at most 51 VPs out of 15,000, depending on the experiment). Then we classify the remaining queries as Warm-up queries (AAi), all of which are type AA (expected and answered by the authoritative server).

We see some duplicate responses; for these, we use the timestamp of the very first AAi received. We then classify each unique AAi by comparing the TTL value returned by the recursive with the expected TTL that is encoded in the AAAA answer (fixed per experiment). The “TTL as zone” line counts the answers we expect to get, while “TTL altered” shows that a few hundred recursive resolvers alter the TTL: if these two values differ by more than 10%, we report the answer as TTL altered.

We see that the vast majority of recursives honor small TTLs, with only about 2% truncating the TTL (275 to 327 of about 15,000, depending on the experiment's TTL). We and others (§7) see TTL truncation from multiple ASes. The exception is for queries with day-long TTLs (86400 s), where 4670 queries (30%) have shortened TTLs. (Prior work also reported that many public resolvers refresh at 1 day [51].) We conclude that wholesale TTL shortening does not occur for TTLs of an hour or less.

TTLs with Warm Cache: We next consider a warm cache: subsequent queries where we believe the recursive should have the prior answer cached. We classify them according to the proposed categories (AA, CC, AC, and CA).

Figure 3 shows a histogram of these classifications (numbers shown in Table 2).

[Figure 3: Classification of subsequent answers with warm cache. For each experiment (60 s, 1800 s, 3600 s, 86400 s, 3600 s-10min), remaining queries are broken down into AA, CC, AC, and CA; cache-miss fractions are 0.0%, 32.6%, 32.9%, 30.9%, and 28.5%, respectively.]

We see that most answers we receive show expected caching behavior. For 60 s TTLs (the left bar), we expect no queries to be cached when we re-query 20 minutes (1200 s) later, and we see few cache hits (235 queries, the CC row in Table 2, which are due to TTL rewriting to values larger than 20 minutes). We see only a handful of CA-type replies, where we expect the authoritative to reply and the recursive does instead. We conclude that, under normal operations (with authoritatives responding), recursive resolvers do not serve stale results (as has been proposed for when the authoritative cannot be reached [19]).

For longer TTLs, we see cache-miss (AC) fractions of 28% to 33% (computed as AC / (Answers (valid) − (1-answer VPs + Warm-up))). Most of the AC answers did not alter the TTL, i.e., the cache miss was not due to TTL manipulations (Table 2). We do see 9715 TTL modifications (about 42% of ACs) when the TTL is one day (86400 s). These TTL truncations are consistent with recursive resolvers that limit cache durations, such as default caps of 7 days in BIND [17] and 1 day in Unbound [28]. (We provide more detail about TTL manipulations in Appendix B.)

We conclude that DNS caches are fairly effective, with cache hits about 70% of the time. This estimate is likely a lower bound: we are the only users of our domain, and popular domains would see more cache hits due to requests from other users. We only see TTL truncation for day-long TTLs. This result will help us understand the role of caching when authoritatives are under stress.

3.5 Public Recursives and Cache Fragmentation

Although we showed that most requests are cached as expected, about 30% are not. We know that many DNS requests today are served by public recursive resolvers, several of which exist [1, 12, 29, 37]. We also know that public recursives often use anycast and load balancing [48], and that this can result in caches that are fragmented (not shared) across many servers. We next examine how many cache misses (type AC replies) are due to public recursives.

Although we control queriers and authoritative servers, there may be multiple levels of recursive resolvers in between. From Figure 1, we see the querier's first-hop recursive (R1) and the recursive that queries the authoritative (Rn). Fortunately, queries and replies are unique, so we can relate queries to the final recursive knowing the time (the query round) and the query source. For each query q, we extract the IP address of Rn and compare it against a list of IP addresses for 96 public recursives (Appendix C), obtained from a DuckDuckGo search for “public dns” done on 2018-01-15.


TTL                     60   1800   3600  86400  3600-10m
AC Answers              37  24645  24091  23202     47262
Public R1                0  12000  11359  10869     21955
  Google Public R1       0   9693   9026   8585     17325
  other Public R1        0   2307   2333   2284      4630
Non-Public R1           37  12645  12732  12333     25307
  Google Public Rn       0   1196   1091    248      1708
  other Rn              37  11449  11641  12085     23599

Table 3: AC answers, public resolver classification.

Table 3 reexamines the AC replies from Table 2. With the exception of the measurements with TTL of 60 s, nearly half of the AC answers (cache misses) are from queries to public R1 recursives, and about three-quarters of these are from Google's Public DNS. The other half of cache misses start at non-public recursives, but 10% of these eventually emerge from Google's DNS.

Besides identifying public recursives, we also see evidence of cache fragmentation in answers from caches (CC and CA). Sometimes we see serial numbers in consecutive answers decrease. For example, one VP reports serial numbers 1, 3, 3, 7, 3, 3, suggesting that it is querying different recursives, one with serial 3 and another with serial 7 in its cache. We show these occurrences in Table 2 as CCdec and CAdec. With longer TTLs we see more cache fragmentation, with 4.5% of answers showing fragmentation with day-long TTLs.
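Detecting this fragmentation is straightforward: within one VP's timeline, a serial number lower than one seen earlier implies the answer came from a different (older) cache. A minimal sketch, assuming each answer's encoded serial has already been extracted:

def fragmented(serials):
    """Return True if the serial sequence ever decreases, e.g., the VP
    that reported 1, 3, 3, 7, 3, 3 is talking to at least two recursive
    caches (one holding serial 3, another serial 7)."""
    highest = None
    for s in serials:
        if highest is not None and s < highest:
            return True
        highest = s if highest is None else max(highest, s)
    return False

print(fragmented([1, 3, 3, 7, 3, 3]))  # -> True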

From these observations, we conclude that cache misses result from several causes: (1) use of load balancers or anycast, where servers lack shared caches; (2) first-level recursives that do not cache and have multiple second-level recursives; and (3) caches that may reset between the somewhat long probing intervals (10 or 20 minutes). Causes (1) and (2) occur in public resolvers (confirmed by Google [12]) and account for about half of the cache misses in our measurements.

4 CACHING PRODUCTION ZONES
In §3 we showed that about one-third of queries do not conform with caching expectations, based on controlled experiments against our test domain. (Results may be better for caches that prioritize popular names.) We next examine this question for specific records in .nl, the country-code domain (ccTLD) for the Netherlands, and the Root (.) DNS zone. With traffic from “the wild” and measurement targets used by millions, this section uses domains popular enough to stay in-cache at recursives.

4.1 Requests at .nl's Authoritatives
We apply this methodology to data for the .nl country-code top-level domain (ccTLD). We look specifically at the A-records for the name servers of .nl: ns[1-5].dns.nl.

Methodology: We use passive observations of traffic to the .nl authoritative servers.

For each target name in the zone and source (some recursive server identified by IP address), we build a timeseries of all requests and compute their inter-arrival time Δ. Following the classification from §3.4, we label queries as AC if Δ < TTL, showing an unnecessary query to the authoritative, and AA if Δ ≥ TTL, an expected or delayed cache refresh. (We do not see cache hits, and so there are no CC events.)
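In code, this labeling is a single pass over each source's sorted query timestamps. A minimal sketch (ours; the extraction of per-source timestamps from the traces is assumed):

# Sketch: label queries seen at the authoritative as AA or AC from
# inter-arrival times, as in Section 4.1. `timestamps` holds the query
# arrival times (seconds) of one recursive for one name; `ttl` is the
# zone value (3600 s for ns[1-5].dns.nl).

def label_queries(timestamps, ttl=3600):
    timestamps = sorted(timestamps)
    labels = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        # AC: re-queried while the record should still be cached;
        # AA: expected (or delayed) cache refresh after expiry.
        labels.append("AC" if delta < ttl else "AA")
    return labels

print(label_queries([0, 1800, 5400, 10800]))  # -> ['AC', 'AA', 'AA']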

[Figure 4: ECDF of the median Δt for recursives with at least 5 queries to ns1-ns5.dns.nl (TTL of 3600 s).]

Dataset: At the time of our analysis (February 2018) there were 8 authoritative servers for the .nl zone. We collect traffic for the four unicast and one anycast authoritative servers, and store the data in ENTRADA [55] for analysis.

Since our data for .nl is incomplete, and we know recursives will query all authoritatives over time [27], our analysis represents a conservative estimate of TTL violations: we expect to miss some AC-type queries from resolvers to non-monitored authoritatives.

We collect data for a period of six hours on 2018-02-22, starting at 12:00 UTC. We only evaluate recursives that sent at least five queries for our domains of interest, omitting infrequent recursives (they do not change results noticeably). We discard duplicate queries, for example a few retransmissions (less than 0.01% of the total queries). In total we consider more than 485k queries from 7779 different recursives.

Results: Figure 4 shows the distribution of Δt that we observe in our measurements, reporting the median Δt for any resolver that sends at least 5 queries.

About 28% of queries are frequent, with an inter-arrival time of less than 10 s, and 32% of these are sent to multiple authoritatives. We believe these are due to recursives submitting queries in parallel to speed up replies (perhaps the “Happy Eyeballs” algorithm [45]).

Since these closely-timed queries are not related to recursive caching, we exclude them from analysis. The remaining data is 348k queries from 7703 different recursives.

The largest peak is at 3600 s, which is expected: the name was queried and cached for the full hour of the TTL, and the next request then causes the name to be re-fetched. These queries are all of type AA.

The smaller peak around 1800 s, as well as queries with other times less than 3600 s, correspond to type AC queries: queries that could have been supplied from the cache but were not. 22% of resolvers sent most of their queries within a time interval less than 3600 s, or even more frequently. These AC queries occur because of TTL limiting, cache fragmentation, or other reasons that clear the cache.

4.2 Requests at the DNS Root
In this section we perform a similar analysis as for §4.1, in which we look into DNS queries received at all Root DNS servers (except G-Root) and create a distribution of the number of queries received per source IP address (i.e., per recursive).


[Figure 5: Distribution of the number of queries for the DS record of .nl received for each recursive. Dataset: DNS-OARC DITL on 2017-04-12t00:00Z for 24 hours. All Root servers with similar distributions are shown in light-gray lines.]

In this analysis we use data from the DITL (Day In The Life) dataset of 2017, available at DNS-OARC [9]. We look at all DNS queries for the DS record of the domain .nl received at the Root DNS servers over the entire day of April 12, 2017 (UTC). This dataset consists of queries from more than 703k unique recursives seen across all Root servers. Note that the DS record for .nl has a TTL of 86400 seconds (24 hours). That is, in theory, one could expect to see just one query per recursive arriving at a given root letter for the DS record of .nl within the 24-hour interval.
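Producing Figure 5 then amounts to counting, per source address, the queries for the DS record of .nl and taking the distribution. A minimal sketch, assuming the trace has already been reduced to (source, qname, qtype) tuples (that field extraction is ours, not part of the DITL format):

from collections import Counter

def queries_per_recursive(records):
    """records: iterable of (src_ip, qname, qtype) tuples from one day
    of root-server traces; returns per-source query counts for nl/DS."""
    counts = Counter()
    for src_ip, qname, qtype in records:
        if qname == "nl." and qtype == "DS":
            counts[src_ip] += 1
    return counts

# With a 24-hour TTL, each recursive would ideally appear once; values
# above 1 quantify early re-queries (the tail of Figure 5).
demo = [("192.0.2.1", "nl.", "DS"), ("192.0.2.1", "nl.", "DS"),
        ("198.51.100.7", "nl.", "DS")]
print(queries_per_recursive(demo))
# -> Counter({'192.0.2.1': 2, '198.51.100.7': 1})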

Each line in Figure 5 shows the distribution of the total number of queries received at the Root servers from individual recursives asking for the DS record of .nl. Besides F- and H-Root, the distribution is similar across all Root servers; these are plotted in light-gray lines. F-Root shows the “most friendly” behavior from recursives, where around 5% of them sent 5 or more queries for .nl. As opposed to F, H-Root (dotted red line) shows the “worst” behavior from recursives, where more than 10% of them sent 5 or more queries for .nl within the 24-hour period.

The solid black line in Figure 5 shows the distribution for all the queries across all Root servers. The majority (around 87%) of recursives sends only one query within the 24-hour interval. However, considering all Root servers, we see around 13% of recursives that have sent multiple queries. Note that the distributions shown in Figure 5 have (very) long tails, and we see up to more than 21.8k queries from a single recursive within the 24-hour period for the .nl DS record, i.e., roughly one query every 4 seconds from the same IP address for the same DS record.

Discussion: we conclude that measurements of popular domains within .nl (§4.1) and the Roots (§4.2) show that about 63% and 87% of recursives, respectively, honor the full TTL. These results are roughly in line with our observations with RIPE Atlas (§3).

5 THE CLIENT'S VIEW OF AUTHORITATIVES UNDER DDOS

We next use controlled experiments to evaluate how DDoS attacks at authoritative DNS servers impact the client experience. Our studies of caching in controlled experiments (§3) and passive observations (§4) have shown that caching often works, but not always: about 70% of controlled experiments and 30% of passive observations see full cache lifetimes. Since results of specific experiments vary, we sweep the space of attack intensities to understand the range of responses, from complete failure of authoritative servers to partial failures.


5.1 Emulating DDoS
To emulate DDoS attacks, we begin with the same test domain (cachetest.nl) we used for controlled experiments in §3. We run a normal DNS service for some time, querying from RIPE Atlas. After caches are warm, we then simulate a DDoS attack by dropping some fraction, or all, of the incoming DNS queries to each authoritative. (We drop incoming traffic randomly with Linux iptables; as such, packet drop is not biased towards any recursive.) After we begin dropping traffic, answers come either from caches at recursives or, for partial attacks, from a lucky query that passes through.
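For concreteness, random drop of this kind can be installed with iptables' statistic match, which drops each matching packet independently with a fixed probability. The sketch below is our illustration of the mechanism (the paper does not show its exact rules), wrapped in Python for scripting an experiment:

import subprocess

def emulate_ddos(loss_probability):
    """Drop incoming DNS queries at this authoritative with the given
    probability, using iptables' `statistic` match so that each packet
    is dropped independently (no bias toward any recursive)."""
    subprocess.run([
        "iptables", "-A", "INPUT",
        "-p", "udp", "--dport", "53",
        "-m", "statistic", "--mode", "random",
        "--probability", str(loss_probability),
        "-j", "DROP",
    ], check=True)

# emulate_ddos(0.75)  # e.g., Experiment F: 75% packet loss at both NSes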

This emulation of DDoS captures the traffic loss that occurs in a DDoS attack as router queues overflow. The emulation is not perfect, since we simulate loss at the last-hop router, while in real DDoS attacks packets are often lost on access links near the target. Our emulation approximates this effect with one aggregate loss rate.

DDoS attacks are also accompanied by queueing delay, since buffers at and near the target are full. We do not model queueing delay, although we do observe latency increasing due to retries. In modern routers, queueing delay due to full router buffers should be less than the retry interval. In addition, observations during real-world DDoS events show that the few queries that are successful see response times that are not much higher than typical [23], suggesting that loss (and not delay) is the dominant effect of DDoS in practice. However, a study that adds queueing latency to the attack model is interesting future work.

5.2 Clients During Complete Authoritatives Failure

We first evaluate the worst-case scenario for a DNS operator: complete unreachability of all authoritative name servers. Our goal is to understand when and for how long caches cover such an outage.

Table 4 shows Experiments A, B, and C, which simulate complete failure. In Experiment A, each VP makes only one query before the DDoS begins. In Experiment B we allow several queries to take place, and Experiment C allows several queries with a shorter TTL.

Caches Protect Some: We first consider Experiment A, with one query that warms the cache, immediately followed by the attack. Figure 6a shows these responses over time, with the onset of the attack at the first downward arrow, between 0 and 10 minutes, and with the cache expired after the second downward arrow, between 60 and 70 minutes. We see that after the DDoS starts, but before the cache has fully expired (between the downward arrows), initially 30% and eventually 65% of queries fail with either no answer or a SERVFAIL error. While not good, this does mean that 35% to 70% of queries during the DDoS are successfully served from the cache. By contrast, shortly after the cache expires, almost all queries fail (only 25 VPs, or 0.2% of the total, seem to provide stale answers).

Caches Fill at Different Times: In a more realistic scenario, VPs have filled their caches at different times. In Experiment A, caches are freshly filled and should last for a full hour after the start of the attack. Experiment B is designed for the opposite, worst case: we begin warming the cache one hour before the attack and query 6 times from each VP. Other parameters are the same, with the attack lasting for 60 minutes (also the cache duration), but then we restore the authoritatives to service.


Experiment parameters:
     TTL    DDoS   DDoS  queries  total  probe     failure
     (sec)  start  dur.  before   dur.   interval
A    3600    10     60      1      120     10      100% (both NSes)
B    3600    60     60      6      240     10      100% (both NSes)
C    1800    60     60      6      180     10      100% (both NSes)
D    1800    60     60      6      180     10       50% (one NS)
E    1800    60     60      6      180     10       50% (both NSes)
F    1800    60     60      6      180     10       75% (both NSes)
G     300    60     60      6      180     10       75% (both NSes)
H    1800    60     60      6      180     10       90% (both NSes)
I      60    60     60      6      180     10       90% (both NSes)

Results:
     Total   Valid    VPs     Queries  Total    Valid
     probes  probes            answers  answers
A    9224    8727    15339   136423   76619    76181
B    9237    8827    15528   357102   293881   292564
C    9261    8847    15578   258695   199185   198197
D    9139    8708    15332   286231   273716   272231
E    9153    8708    15320   285325   270179   268786
F    9141    8727    15325   278741   259009   257740
G    9206    8771    15481   274755   249958   249042
H    9226    8778    15486   269030   242725   241569
I    9224    8735    15388   253228   218831   217979

Table 4: DDoS emulation experiments [38]; DDoS start, durations, and probe interval are given in minutes.

[Figure 6: Answers received (OK, SERVFAIL, no answer) during DDoS attacks. (a) Experiment A (3600-10min-1down); arrows indicate DDoS start and cache expiration. (b) Experiment B (3600-10min-1down-1up); arrows indicate DDoS start and recovery. (c) Experiment C (1800-10min-1down-1up); arrows indicate DDoS start, cache expiration, and recovery.]

[Figure 7: Timeseries of answers (AA, CC, CA) for Experiment B.]


Figure 6b shows the results of Experiment B. While about 50% of VPs are served from the cache in the first 10-minute round after the DDoS starts, the fraction served drops quickly and is at only about 3% one hour later. Three factors are in play here. First, most caches were filled 60 minutes before the attack and are timing out in the first round: while the timeout and query rounds are both 60 minutes apart, Atlas intentionally spreads queries out over 5 minutes, so we expect that some queries happen after 59 minutes and others after 61 minutes.

Second, we know some large recursives have fragmented caches (§3.5), so we expect that some of the successes between times 70 and 110 minutes are due to caches that were filled between times 10 and 50 minutes. This can actually be seen in Figure 7, where we show a timeseries of the answers for Experiment B: we see CC (correct cache responses) between times 60 and 90.

Third, we see an increase in the number of CA queries, which are answered by the cache with expired TTLs (Figure 7). This increase is due to servers serving stale content [19].

Caches Eventually All Expire: Finally, we carry out a third emulation, but with half the cache lifetime (1800 s, or 30 minutes, rather than the full hour). Figure 6c shows responses over time. These results are similar to Experiment B, with rapid fall-off as caches age once the attack starts. After the attack has been underway for 30 minutes, all caches must have expired, and we see only a few (about 26) residual successes.


5.3 Discussion of Complete Failures
Overall, we see that caching is partially successful in protecting during a DDoS. With full, valid caches, half or more VPs get service. However, caches are filled at different times and expire, so an operator cannot count on a full cache duration for any customer, even for popular (“always in the cache”) domains. The protection provided by caches depends on their state in the recursive resolver, something outside the operator's control. In addition, our evaluation of caching in §3 showed that caches will end early for some VPs.

Second, we were surprised that a tiny fraction of VPs are successful after all caches should have timed out (after the 80-minute period in Experiment A, and between 90 and 110 minutes in Experiment C). These successes suggest an early deployment of “serve stale”, something currently under review in the IETF [19]: serving a previously known record beyond its TTL if authoritatives are unreachable, with the goal of improving resilience under DDoS. We investigated Experiment A, where we see 1048 answers out of the 1140 successes in the second half of the outage. These successes are from 471 VPs (and 215 recursives), most of them answered by OpenDNS and Google public DNS servers, suggesting experimentation that is not yet widespread. Out of these 1048 queries, 1031 return a TTL value equal to 0, as specified in the IETF serve-stale draft [19].
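Identifying such answers is a simple filter over the responses collected during the outage window; the sketch below is ours, with hypothetical field names for the parsed Atlas results:

def serve_stale_answers(answers, outage_start, outage_end):
    """answers: list of dicts with arrival time, rcode, and reply TTL.
    During a complete outage, a successful answer carrying TTL == 0 is
    the signature of serve-stale as specified in the IETF draft [19]."""
    return [a for a in answers
            if outage_start <= a["time"] <= outage_end
            and a["rcode"] == "NOERROR" and a["ttl"] == 0]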

5.4 Client Reliability During Partial Authoritative Failure

The previous section examined DDoS attacks that result in complete failure of all authoritatives, but often DDoS attacks result in partial failure, with 50% or 90% packet loss at the authoritatives. (For example, consider the November 2015 DDoS attack on the DNS Root [23].) We next study experiments with partial failures, showing that caching and retries together nearly fully protect against 50% DDoS events, and protect half of VPs even during 90% events.

We carry out several experiments, D to I in Table 4. We follow the procedure outlined in §5.1, looking at DDoS-driven loss rates of 50%, 75%, and 90%, with TTLs of 1800 s, 300 s, and 60 s. Graphs omitted due to space can be found in Appendix D.

Near-Full Protection from Caches During Moderate Attacks: We first consider Experiment E, a “mild” DDoS with 50% loss, with VP success over time in Figure 8a. In spite of a loss rate that would be crippling to TCP, nearly all VPs are successful in DNS. This success is due to two factors: first, we know that many clients are served from caches, as was shown in Experiment A with full loss (Figure 6a); second, most recursives retry queries, so they recover from the loss of a single packet and are able to provide an answer. Together, these mean that the failure rate during the first 30 minutes of the event is 8.5%, only slightly higher than the 4.8% failure rate before the DDoS. For this experiment the TTL is 1800 s (30 minutes), so we might expect failures to increase halfway through the DDoS. We do not see any increase in failures because caching and retries are synergistic: a successful retried query will place the answer in a cache for a later query. The importance of this result is that DNS can survive moderate-size attacks when caching is possible. While a positive, retries do increase latency, something we study in §5.5.
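This synergy between caching and retries is easy to reproduce in a small simulation (our illustrative model, not the paper's): in each round a VP answers from cache if its cached record is still valid; otherwise its recursive retries against an authoritative that drops each query independently with probability p, and any success refills the cache.

import random

def success_rate(p_loss, ttl, query_interval, retries=3,
                 rounds=6, trials=10000):
    """Fraction of successful query rounds under loss rate p_loss,
    with caching (ttl seconds) and per-round recursive retries."""
    ok = total = 0
    for _ in range(trials):
        expiry = -1.0  # cache starts cold
        for r in range(rounds):
            now = r * query_interval
            total += 1
            if now < expiry:  # answered from cache
                ok += 1
            elif any(random.random() > p_loss for _ in range(retries)):
                ok += 1
                expiry = now + ttl  # success refills the cache
    return ok / total

# 50% loss, 1800 s TTL, 600 s probing (cf. Experiment E):
# nearly all query rounds succeed.
print(round(success_rate(0.5, 1800, 600), 3))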

Attack Intensity Matters: While clients do quite well with 50% loss at all authoritatives, failures increase with the intensity of the attack.

[Figure 8: Answers received (OK, SERVFAIL, no answer) during DDoS attacks; the first and second vertical lines show the start and end of the DDoS. (a) Experiment E (1800-50p-10min), 50% packet loss. (b) Experiment F (1800-75p-10min), 75% packet loss. (c) Experiment H (1800-90p-10min), 90% packet loss. (d) Experiment I (60-90p-10min), 90% packet loss.]


Experiments F and H, shown in Figure 8b and Figure 8c, increase the loss rate to 75% and 90%. We see the number of failures increase, to about 19.0% with 75% loss and 40.3% with 90% loss. It is important to note that roughly 60% of the clients are still served even with 90% loss.

We also see that this level of success is consistent over the entire hour-long DDoS event, even though the cache duration is only 30 minutes. This consistency confirms the importance of caching and retries in combination.

To verify the effects of this interaction, Experiment I changes the caching duration to 60 s, less than one round of probing. Comparing Experiment I in Figure 8d to H in Figure 8c, we see that the failure rate increases from 30% to about 63%. However, even with no caching, about 37% of queries are still answered, due to resolvers that serve stale content and to recursive retries. We investigate retries in §6.

5.5 Client Latency During Partial Authoritative Failure

We showed that client reliability is higher than expected during failures (§5.4), due to a combination of caching and retries. We next consider client latency. Latency will increase during the DDoS because of retries and queueing delay, but we will show that latency increases less than one might expect, due to caching.

To examine latency we return to Experiments D through I (Table 4), but look at latency (time to complete a query) rather than success. For these experiments, clients time out after 5 s.

Figures 9a to 9d show latency during each emulated DDoS scenario (experiments with figures omitted here are in Appendix D). Latencies are not evenly distributed, since some requests get through immediately while others must be retried one or more times, so in addition to the mean we show the 50%, 75%, and 90% quantiles to characterize the tail of the distribution.

We emulate DDoS by dropping requests (§5.1), and hence latencies reflect retries and loss but not queueing delay, underrepresenting latency in real-world attacks. However, their shape (mostly low latency, with a few long) is consistent with, and helps explain, what has been seen in the past [23].

Beginning with Experiment E, the moderate attack in Figure 9a, we see no change to median latency. This result is consistent with many queries being handled by the cache, and half of those not handled by the cache getting through anyway. We do see higher latency in the 90%ile tail, reflecting successful retries. This tail also increases the mean somewhat.

This trend increases in Experiment F in Figure 9b, where 75% of queries are lost. Now we see that the 75%ile tail has increased, as has the number of unanswered queries, and the 90%ile is twice as long as in Experiment E.

We see the same latency in Experiment H, with the DDoS causing 90% loss. We set the timeouts to 5 s, so the larger attack results in more unsuccessful queries, but latency for successful queries is not much worse than with 75% loss. Median latency is still low due to cached replies.

Finally, Experiment I greatly reduces opportunities for caching by reducing the cache lifetime to one minute. Figure 9d shows that the loss of caching increases median RTT and significantly increases the tail latency. Compared with Figure 9c (same packet loss ratio, but 1800 s TTL), we can clearly see the benefits of caching in terms of latency (in addition to reliability): a half-hour TTL value reduced the latency from 1300 ms to 390 ms. Longer TTLs also help reduce tail latency relative to shorter TTLs (compare, for example, the 90%ile RTT in Experiments I vs. H in Figure 9).

[Figure 9: Latency results (median, mean, 75%ile, and 90%ile RTT); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment E, 50% packet loss (1800 s TTL). (b) Experiment F, 75% packet loss (1800 s TTL). (c) Experiment H, 90% packet loss (1800 s TTL). (d) Experiment I, 90% packet loss (60 s TTL).]



Summary: DDoS effects often increase client latency. For moderate attacks, increased latency is seen only by a few “unlucky” clients who do not see a full cache and whose queries are lost. Caching has an important role in reducing latency during DDoS, but while it can often mitigate most reliability problems, it cannot avoid latency penalties for all VPs. Even when caching is not available, roughly 40% of clients get an answer, either by serve-stale or retries, as we investigate next.

6 THE AUTHORITATIVE'S PERSPECTIVE
Results of partial DDoS events (§5.4) show that DNS is surprisingly reliable: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d). These results are due to a combination of caching and retries. We next examine this from the perspective of the authoritative server.

6.1 Recursive-Authoritative Traffic during a DDoS

We first ask: are retries by recursive resolvers responsible for the success rates observed in §5.4? To investigate this question, we return to the partial DDoS experiments and look at how many queries are sent to the authoritative servers. We measure queries before they are dropped by our simulated DDoS. Recursives must make multiple queries to resolve a name. We break out each type of query: for the nameserver (NS), for the nameserver's IPv4 and IPv6 addresses (A-for-NS and AAAA-for-NS), and finally for the desired query (AAAA-for-PID). Note that the authoritative is IPv4-only, so AAAA-for-NS is non-existent and subject to negative caching, while the other records exist and use regular caching.

We begin with the DDoS causing 75% loss, in Figure 10a. For this experiment, we observe 18,407 unique IP addresses of recursives (Rn) querying for AAAA records directly at our authoritatives. During the DDoS, queries increase by about 3.5×. We expect 4 trials, since the expected number of tries until success with loss rate p is (1 − p)⁻¹. For this scenario, results are cached for up to 30 minutes, so successful queries are reused in recursive caches. This increase occurs both for the target AAAA record and also for the non-existent AAAA-for-NS records. Negative caching for our zone is configured to 60 s, making caching of NXDOMAINs for AAAA-for-NS less effective than positive caches.
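As a check on the arithmetic (a standard derivation, not from the paper): the number of independent tries until first success under loss rate p is geometric, so

E[tries] = \sum_{k=1}^{\infty} k \, p^{k-1} (1-p) = \frac{1}{1-p},

giving 4 expected tries at p = 0.75 and 10 at p = 0.9. The observed increases (3.5× and 8.2×) sit just below these bounds, plausibly because answers that do get through are cached and reused, and because recursives cap their retries.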

The offered load on the server increases further with more loss (90%), as shown in Experiment H (Figure 10b): the higher loss rate results in a much higher offered load on the server, on average 8.2× normal.

Finally, in Figure 10c, we reduce the effects of caching, with a 90% DDoS and a TTL of 60 s. Here we see about 8.1× more queries at the server than before the attack. Comparing this case to Experiment H, caching reduces the offered load on the server by about 40%.

[Figure 10: Number of queries received by the authoritative servers (NS, A-for-NS, AAAA-for-NS, AAAA-for-PID); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment F (1800-75p-10min), 75% packet loss. (b) Experiment H (1800-90p-10min), 90% packet loss. (c) Experiment I (60-90p-10min), 90% packet loss.]

Implications: The implication of this analysis is that legitimate clients “hammer” the already-stressed server with retries during a DDoS. For clients, retries are important to get reliability, and each client independently chooses to retry.

The server is already under stress due to the DDoS, so these retries add to that stress. However, the DDoS traffic is almost certainly much larger than the retries of legitimate traffic. (A server experiencing a volumetric attack causing 90% loss must be receiving 10× its capacity. Regular traffic is a small fraction of normal capacity, so even 4× regular traffic is still much less than the attack traffic.) The multiplier for retried legitimate traffic depends on the implementations of stub and recursive resolvers, as well as application-level retries and defection (users hitting reload in their browser, and later giving up). Our experiment omits application-level retries, and likely gives a lower bound. We next examine specific recursive implementations to see their behavior.



6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers “hammer” authoritatives when queries are dropped. We reexamine DNS software (first studied in 2012 [56]), and additionally show that deployments amplify retries.

Recursive Software: Prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], with evaluation of BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS, and PowerDNS. We studied retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records for all authoritatives and the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7× more requests before the recursives time out. (Exact numbers vary in different runs, but typically each request is made 6 or 7 times.) Such retries are appropriate, provided they are paced (both use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.

Recursive Deployment: Another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex, multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify amplification, we count both the number of Rn recursives and the AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values represent the amplification in two ways: during stress, more Rn recursives will be used for each probe ID, and these Rn will generate more queries to the already-stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4). The maximum rises to 39. The number of queries for each probe ID grows more than 3×, from 2 to 7. Worse, the 90%ile grows more than 6× (from 3 queries to 18). The maximum grows 53.5×, reaching up to 286 queries for one single probe ID. This value, however, is a lower bound, given that there are a large number of A and AAAA queries that ask for NS records and not the probe ID (AAAA- and A-for-NS in Figure 10).

We can also look at the aggregate effects of retries created by the complex recursive infrastructure. Figure 12 shows the timeseries of unique IP addresses of Rn observed at the authoritatives. Before the DDoS period, for Experiment I, with a TTL of 60 s, we see a constant number of recursives reaching our authoritatives, i.e., all queries should be answered by authoritatives (there is no caching at this TTL value). For Experiments F and H, both with TTLs of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks are observed when caches expire, as expected.

During the DDoS, we observe similar behavior for all three experiments in Figure 12: as packets are dropped at the authoritative (at rates of 75%, 90%, and 90% for F, H, and I, respectively), we see an increase in the number of Rn recursives querying our authoritatives. For Experiments F and H we see drops when caching is expected, but not for Experiment I. The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which is amplified in the end. (We show this behavior for an individual probe in Appendix F, where we observe the growth in the number of queries received at the authoritatives and in the number of recursives used.)

[Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs (median, 90%ile, and maximum).]

[Figure 12: Unique Rn recursive addresses observed at the authoritatives, for Experiments F, H, and I.]

Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to be within a constant factor of single recursives, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and their operators to share information about their behavior.

7 RELATED WORK
Caching by Recursives: Several groups have shown that DNS caching can be imperfect. Hao and Wang analyzed the impact of nonce domains on DNS recursives' caches [13]; using two weeks of data from two universities, they showed that filtering one-time domains improves cache hit rates. In two studies, Pang et al. [31, 32] reported that web clients and local recursives do not always honor TTL values provided by authoritatives. Almeida et al. [2] analyzed DNS traces of a mobile operator, and used a mobile application to see TTLs in practice. They find that most domains have short TTLs (less than 60 s), and report evidence of TTL manipulation by recursives. Schomp et al. [48] demonstrate widespread use of multi-level recursives by large operators, as well as TTL manipulation. Our work builds on this prior work, examining caching and TTL manipulation systematically and considering their effects on resilience.



DNS client behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, finding that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives during DDoS: We investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast and both latency and reachability, as seen from RIPE Atlas. Rather than looking at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while this prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study has incomplete data and focuses on specific results of two events. These differences stem from their study of natural experiments from real-world events, and our controlled experiments.

8 IMPLICATIONS
We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get to authoritative servers. The key implication of our study is to explain differences in the outcomes of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative “server” implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: “There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability...” [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure and immediately took steps to address service problems, there were reports of user-visible service disruption in the technical and even popular press [34]. Reports describe intermittent failure of prominent websites including “Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times”, each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cachable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports 922 anycast instances operating across the 13 authoritative servers.

Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 “facilities”) and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components with TTLs that range from 120 to 300 s.

In addition to explaining the effects, our experiments help get at the root causes behind these outcomes. Users of the Root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs like .com and country codes) are popular and certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) receive performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to let caching help and to give DNS operators time to deploy defenses; Experiment H suggests 30 minutes (Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow for re-provisioning during DDoS attacks on web servers, but that leaves DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, combined with long DNS TTLs, may be preferable to short DNS TTLs, so that both layers can be robust. However, the complexity of interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize geographic dispersion of sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with the largest latency [27]. Prior work about resilience to DDoS attacks has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic that site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be as resilient as the strongest individual authoritative. The reason for these opposite results is that, in both cases, recursive resolvers will try all authoritatives of a given service. For latency, they will sometimes choose a distant authoritative, but for resilience, they will continue until they find the most available authoritative.

9 CONCLUSIONS
This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole. In fact, they can largely cover over partial DDoS attacks for many users: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d).



The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users: a few see much greater latency, while some see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruptions, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of this work is to quantify the importance of caching and retries at recursives to resilience, encouraging the use of at least moderate TTLs wherever possible.

Acknowledgments
The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on paper drafts.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements number FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES
[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012, Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from Using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188–200.
[9] DNS OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181. Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary Of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC BIND. 2018. Chapter 6: BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report. Github Engineering. https://githubengineering.com/ddos-incident-report
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency-02. Internet Draft. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain names - concepts and facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034
[21] P.V. Mockapetris. 1987. Domain names - implementation and specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from “When the Dike Breaks: Dissecting DNS Defenses During DDoS” (May 2018). Web page. https://ant.isi.edu/datasets/dns/Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366
[28] NLnet Labs. 2018. NLnet Labs Documentation - Unbound - unbound.conf(5). https://nlnetlabs.nl/documentation/unbound/unbound.conf
[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An overview of DNS-based server selections in content distribution networks. Computer Networks 43, 6 (2003), 695–711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows
[36] Alec Peterson. 2017. EC2 resolver changing TTL on DNS answers? Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy In a Few Easy Steps. https://quad9.net
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671; TTL1800: 10507676; TTL3600: 10536725; TTL86400: 10579327; TTL3600-10min: 10581463; A: 10859822; B: 11102436; C: 11221270; D: 11804500; E: 11831403; F: 11831403; G: 12131707; H: 12177478; I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas - Raw data structure documentation. https://atlas.ripe.net/docs/data_struct
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters – An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. Internet Request For Comments. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On Measuring the Client-Side DNS Infrastructure. In Proceedings of the ACM Internet Measurement Conference. ACM, 77–90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl stats and data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A study of caching behavior with respect to root server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] M. Weinberg and D. Wessels. 2016. Review and analysis of attack traffic against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24 – Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A high-performance network traffic data streaming warehouse. In Network Operations and Management Symposium (NOMS), 2016 IEEE/IFIP. IEEE, 913–918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80–86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS?

The cache estimations in §3 and §4 assume each record has a clear TTL. However, glue records bridge zones that live on different authoritative servers, and they can result in two different TTL values for the same record. This section examines which TTL takes precedence.

A.1 The Problem of Glue Records

DNS employs glue records to identify name servers that are in the served zone itself. For example, the Canadian national domain .ca is served from ca-servers.ca, but to look up ca-servers.ca one must know where .ca is. To break this cycle, records that identify these servers (“glue records”) are placed in the parent zone (the root, in this case). With these records in two places, they may have different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of .ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net
;; AUTHORITY SECTION:
ca.    172800  IN  NS  any.ca-servers.ca.
ca.    172800  IN  NS  j.ca-servers.ca.
ca.    172800  IN  NS  x.ca-servers.ca.
ca.    172800  IN  NS  c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net

Notice that the answer for the NS records is in the “Authority Section”; this signals a “referral”, a type of answer in which “a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query” [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for .ca, a recursive can ask the same query directly to any of the .ca servers. Listing 2 shows the resulting authoritative answer:

$ dig ns ca @any.ca-servers.ca
;; ANSWER SECTION:
ca.    86400  IN  NS  c.ca-servers.ca.
ca.    86400  IN  NS  j.ca-servers.ca.
ca.    86400  IN  NS  x.ca-servers.ca.
ca.    86400  IN  NS  any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca

Notice that the records in this response from any.ca-servers.ca are presented in the “ANSWER SECTION” (and not in the “AUTHORITY SECTION”), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTLs differ: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the .ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).

We refer to this as TTL duplication: the same record is defined both at the zone's own authoritative servers and, as a glue record, at its parent's servers.
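This comparison can be scripted. Below is a minimal sketch, assuming the dnspython package, that queries a root server and a .ca server for the same NS records and prints the AA flag and TTL of each response (the helper name ns_ttl is ours):

import dns.flags
import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

def ns_ttl(server_name, zone="ca."):
    # Resolve the server's address, then ask it directly for the zone's
    # NS records (the equivalent of: dig ns ca @server).
    server_ip = dns.resolver.resolve(server_name, "A")[0].address
    response = dns.query.udp(dns.message.make_query(zone, "NS"),
                             server_ip, timeout=5)
    # A referral carries the NS records in the authority section with the
    # AA bit clear; an authoritative answer carries them in the answer
    # section with the AA bit set.
    aa = bool(response.flags & dns.flags.AA)
    for rrset in response.answer + response.authority:
        if rrset.rdtype == dns.rdatatype.NS:
            print(f"{server_name}: AA={aa}, TTL={rrset.ttl}")

ns_ttl("b.root-servers.net")  # parent: expect AA=False, TTL=172800 (2 days)
ns_ttl("any.ca-servers.ca")   # child:  expect AA=True,  TTL=86400  (1 day)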

This duplication poses a question for recursive resolvers: which value should they trust, the one from the glue or the one from the authoritative name server?

In theory, the answer from the authoritative servers should be the one used, since they are the servers that are authoritative for the zone; the Roots in this example are not “completely” authoritative [15]. This behavior can be observed in various other zones as well, not only at the Roots.

Given the importance of TTLs for DNS resilience (they set cache durations), we investigate in the wild, for records that are duplicated at parent and child as with .ca, which of the two values recursives actually use: the one provided by the parent or the one provided by the authoritative. According to RFC 2181 [10], a recursive should prefer the authoritative answer (§5.4.1).


                  NS record   A record   Source
Total Answers       128382      98038
TTL > 3600              38         46    Unclear
TTL = 3600             260        233    Parent
60 < TTL < 3600       6890       4621    Parent/others
TTL = 60             60803      46820    Authoritative
TTL < 60             60391      46318    Authoritative

Table 5: Distribution of TTLs of DNS answers for cachetest.nl.

A records of amazon.com
domain        IP               TTL
amazon.com    205.251.242.103  60
amazon.com    176.32.103.205   60
amazon.com    176.32.98.166    60

NS records of amazon.com
domain        NS                     TTL-auth  TTL-com
amazon.com    ns1.p31.dynect.net     3600      172800
amazon.com    pdns6.ultradns.co.uk   3600      172800
amazon.com    ns2.p31.dynect.net     3600      172800
amazon.com    ns4.p31.dynect.net     3600      172800
amazon.com    pdns1.ultradns.net     3600      172800
amazon.com    ns3.p31.dynect.net     3600      172800

A records of the NS records
domain                 IP              TTL
ns1.p31.dynect.net     208.78.70.31    86400
pdns6.ultradns.co.uk   204.74.115.1    3600
ns2.p31.dynect.net     204.13.250.31   86400
ns4.p31.dynect.net     204.13.251.31   86400
pdns1.ultradns.net     204.74.108.1    3600
ns3.p31.dynect.net     208.78.71.31    86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments

We use the same setup as in §3 and configure the TTL values of our name servers (ns[1-2].cachetest.nl) as follows:

• Referral value (at the .nl zone): TTL of 3600 s.
• Answer value (at the authoritative servers ns[1-2].cachetest.nl): TTL of 60 s.

We then use ∼10,000 Atlas probes (yielding ∼15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE Atlas measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of the TTL values of the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. The large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the one provided by the parent .nl server. From these two experiments we conclude that, for A and NS records, most recursives forward to their clients the TTL provided by the zone's authoritatives.
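The bucketing behind Table 5 is mechanical. A minimal sketch (classify_ttl is our illustrative helper; the thresholds mirror the configured referral TTL of 3600 s and answer TTL of 60 s):

def classify_ttl(observed_ttl):
    # Thresholds mirror the rows of Table 5.
    if observed_ttl > 3600:
        return "unclear"        # larger than either configured value
    if observed_ttl == 3600:
        return "parent"         # full referral TTL from the .nl zone
    if observed_ttl > 60:
        return "parent/others"  # decremented referral TTL (or rewriting)
    return "authoritative"      # full or decremented answer TTL

counts = {}
for ttl in (60, 58, 3600, 1795, 60, 7200):  # example observed TTLs
    label = classify_ttl(ttl)
    counts[label] = counts.get(label, 0) + 1
print(counts)  # {'authoritative': 3, 'parent': 1, 'parent/others': 1, 'unclear': 1}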

A.3 Looking at a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritative answers, not the one carried in referrals.

In this section we look into what recursive DNS servers cache when we query for the NS records of amazon.com. Similarly to the .ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives and 172800 s at the parent .com servers, as can be seen in Table 6.

To determine which record two popular recursive implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with cold caches and send a single query: dig ns amazon.com. We then dump the contents of the caches of both servers and examine which values they store.

Listing 3 and Listing 4 show the contents of the caches of BIND and Unbound after sending the aforementioned query. As can be seen, the values stored in the cache are the ones provided by the authoritative servers (decremented by the few seconds that passed before we dumped the contents), and not the referral values provided by the parent.

; authanswer
amazon.com.  3595  NS  ns1.p31.dynect.net.
             3595  NS  ns2.p31.dynect.net.
             3595  NS  ns3.p31.dynect.net.
             3595  NS  ns4.p31.dynect.net.
             3595  NS  pdns1.ultradns.net.
             3595  NS  pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com

;rrset 3594 6 0 7 3
amazon.com.  3594  IN  NS  pdns1.ultradns.net.
amazon.com.  3594  IN  NS  ns1.p31.dynect.net.
amazon.com.  3594  IN  NS  ns3.p31.dynect.net.
amazon.com.  3594  IN  NS  pdns6.ultradns.co.uk.
amazon.com.  3594  IN  NS  ns4.p31.dynect.net.
amazon.com.  3594  IN  NS  ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com
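For reference, such dumps can be collected with the standard control tools, unbound-control dump_cache and rndc dumpdb -cache. A sketch, assuming a local Unbound with remote control enabled; the BIND dump path is an assumption, since it depends on the named configuration:

import subprocess

# Unbound: dump_cache prints the whole in-memory cache to stdout.
dump = subprocess.run(["unbound-control", "dump_cache"],
                      capture_output=True, text=True, check=True)
for line in dump.stdout.splitlines():
    if line.startswith("amazon.com.") and "NS" in line.split():
        print("unbound:", line)

# BIND: rndc dumpdb -cache writes the cache to named_dump.db in the
# server's working directory (the path below is an assumption).
subprocess.run(["rndc", "dumpdb", "-cache"], check=True)
with open("/var/cache/bind/named_dump.db") as dumpfile:
    for line in dumpfile:
        if "amazon.com" in line and "NS" in line.split():
            print("bind:", line.rstrip())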

A.4 Discussion and Implications

Understanding how glue records affect TTLs matters because it shows that zone operators are, in fact, in charge of how long their NS records remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers—queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we see that, with the exception of the 60 s TTL (where all queries go to the authoritative, AA), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see AA and CC values alternating: as cache entries expire, new queries are forwarded to the authoritatives.

[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time, in minutes after start, for each dataset: (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s (1 hour) with probing interval T=10 min.]

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for “public” and “dns” on 2018-01-06:

198.101.242.72 Alternate DNS
23.253.163.53 Alternate DNS
205.204.88.60 BlockAid Public DNS (or PeerDNS)
178.21.23.150 BlockAid Public DNS (or PeerDNS)
91.239.100.100 Censurfridns
89.233.43.71 Censurfridns


2001:67c:28a4:: Censurfridns
2002:d596:2a92:1:71:53:: Censurfridns
213.73.91.35 Chaos Computer Club Berlin
209.59.210.167 Christoph Hochstätter
85.214.117.11 Christoph Hochstätter

212.82.225.7 ClaraNet
212.82.226.212 ClaraNet
8.26.56.26 Comodo Secure DNS
8.20.247.20 Comodo Secure DNS
84.200.69.80 DNS.Watch
84.200.70.40 DNS.Watch
2001:1608:10:25::1c04:b12f DNS.Watch
2001:1608:10:25::9249:d69b DNS.Watch
104.236.210.29 DNSReactor
45.55.155.25 DNSReactor
216.146.35.35 Dyn
216.146.36.36 Dyn
80.67.169.12 FDN
2001:910:800::12 FDN
85.214.73.63 FoeBud
87.118.111.215 FoolDNS
213.187.11.62 FoolDNS
37.235.1.174 FreeDNS
37.235.1.177 FreeDNS
80.80.80.80 Freenom World
80.80.81.81 Freenom World
87.118.100.175 German Privacy Foundation e.V.
94.75.228.29 German Privacy Foundation e.V.
85.25.251.254 German Privacy Foundation e.V.
62.141.58.13 German Privacy Foundation e.V.
8.8.8.8 Google Public DNS
8.8.4.4 Google Public DNS
2001:4860:4860::8888 Google Public DNS
2001:4860:4860::8844 Google Public DNS
81.218.119.11 GreenTeamDNS
209.88.198.133 GreenTeamDNS
74.82.42.42 Hurricane Electric
2001:470:20::2 Hurricane Electric
209.244.0.3 Level3
209.244.0.4 Level3
156.154.70.1 Neustar DNS Advantage
156.154.71.1 Neustar DNS Advantage
5.45.96.220 New Nations
185.82.22.133 New Nations
198.153.192.1 Norton DNS
198.153.194.1 Norton DNS
208.67.222.222 OpenDNS
208.67.220.220 OpenDNS
2620:0:ccc::2 OpenDNS
2620:0:ccd::2 OpenDNS
58.6.115.42 OpenNIC
58.6.115.43 OpenNIC
119.31.230.42 OpenNIC
200.252.98.162 OpenNIC
217.79.186.148 OpenNIC
81.89.98.6 OpenNIC
78.159.101.37 OpenNIC
203.167.220.153 OpenNIC
82.229.244.191 OpenNIC
216.87.84.211 OpenNIC
66.244.95.20 OpenNIC
207.192.69.155 OpenNIC
72.14.189.120 OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9 OpenNIC
2001:470:1f07:38b::1 OpenNIC
2001:470:1f10:c6::2001 OpenNIC
194.145.226.26 PowerNS


77.220.232.44 PowerNS
9.9.9.9 Quad9
2620:fe::fe Quad9
195.46.39.39 SafeDNS
195.46.39.40 SafeDNS
193.58.251.251 SkyDNS
208.76.50.50 SmartViper Public DNS
208.76.51.51 SmartViper Public DNS
78.46.89.147 ValiDOM
88.198.75.145 ValiDOM
64.6.64.6 Verisign
64.6.65.6 Verisign
2620:74:1b::1:1 Verisign
2620:74:1c::2:2 Verisign
77.109.148.136 Xiala.net
77.109.148.137 Xiala.net
2001:1620:2078::136 Xiala.net
2001:1620:2078::137 Xiala.net
77.88.8.88 Yandex.DNS
77.88.8.2 Yandex.DNS
2a02:6b8::feed:bad Yandex.DNS
2a02:6b8:0:1::feed:bad Yandex.DNS
109.69.8.51 puntCAT
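A sketch of how such a list can be used: normalize each address with the ipaddress module and test whether a recursive seen at the authoritatives is a known public resolver (the set below is an illustrative subset of the list above; is_public is our helper):

import ipaddress

# Illustrative subset of the list above.
PUBLIC_RESOLVERS = {ipaddress.ip_address(a) for a in [
    "8.8.8.8", "8.8.4.4", "2001:4860:4860::8888",  # Google Public DNS
    "208.67.222.222", "208.67.220.220",            # OpenDNS
    "9.9.9.9", "2620:fe::fe",                      # Quad9
]}

def is_public(address):
    # Normalizing through ip_address makes "2620:fe:0::fe" and
    # "2620:fe::fe" compare equal.
    return ipaddress.ip_address(address) in PUBLIC_RESOLVERS

print(is_public("8.8.8.8"))     # True
print(is_public("192.0.2.53"))  # False (example address)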

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all the experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for Experiments D and G.

Figure 14 shows the number of answers for Experiments D and G, while Figure 15 shows the latency results. In Figure 14a we see that when one NS fails (50% packet loss), there is no significant change in the number of answered queries. Moreover, most users will not notice it in terms of latency either (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (half the probing interval). Similarly to Experiment I, caching should not be active here, given that the TTL is shorter than the probing interval. Moreover, this experiment has a packet loss rate of 75%. Even at this rate, the far majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

[Figure 14: Answers received (OK, SERVFAIL, no answer) during DDoS attacks (extra graphs); arrows indicate DDoS start and recovery: (a) Experiment D, 1800-50p-10min-ns1; (b) Experiment G, 300-75p-10min-2NSes.]

[Figure 15: Latency results (median, mean, 75%ile, and 90%ile RTT vs. minutes after start); the shaded area indicates the interval of an ongoing DDoS attack (extra graphs): (a) Experiment D, 50% packet loss on one NS only (1800 s TTL); (b) Experiment G, 75% packet loss (300 s TTL).]

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized the sources of retries. Here we provide additional experiments to understand how many retries occur with current DNS software; these experiments confirm that the prior findings of Yu et al. [56] still hold. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record their traffic first with the authoritatives (for cachetest.net, not cachetest.nl as in the rest of the paper) up, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from .net and querying one or both based on their internal algorithms. All queries are made with a cold cache.


[Figure 17: Probe 28477 on dataset I: the stub resolver (Atlas probe) queries three local recursives (R1a, R1b, R1c); all R1's are connected to all last-layer recursives (Rn1–Rn8), which are connected to both authoritative servers (AT1, AT2).]

[Figure 16: Histogram of queries sent by the recursive resolvers (BIND and Unbound, under normal operation and under DDoS), broken down by target: Root, .net, and cachetest.net authoritatives.]

In total, we send 100 queries per experiment and present the results in this section.

We start by analyzing the traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the .net authoritative servers, and the authoritative servers for cachetest.net.
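A sketch of this per-group accounting, attributing each captured query to a server group by destination address (the Root and .net address sets are small illustrative subsets, and the cachetest.net addresses are placeholders from TEST-NET space, since our servers' real addresses changed per experiment):

from collections import Counter

ROOT_SERVERS = {"198.41.0.4", "199.9.14.201"}   # a.root, b.root (subset)
NET_SERVERS = {"192.5.6.30"}                    # a.gtld-servers.net (subset)
CACHETEST_NS = {"192.0.2.1", "192.0.2.2"}       # assumed ns1/ns2 addresses

def server_group(dst_ip):
    if dst_ip in ROOT_SERVERS:
        return "root"
    if dst_ip in NET_SERVERS:
        return "net"
    if dst_ip in CACHETEST_NS:
        return "cachetest.net"
    return "other"

# In practice the destination addresses come from the packet capture of
# recursive-to-authoritative traffic; here a short example.
dst_ips = ["198.41.0.4", "192.5.6.30", "192.0.2.1", "192.0.2.1"]
print(Counter(server_group(ip) for ip in dst_ips))
# Counter({'cachetest.net': 2, 'root': 1, 'net': 1})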

Figure 16 shows the results of this experiment, with normal behavior in the left two groups and the inaccessible (DDoS) case in the right two. From this figure we see that in normal operation only a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the .net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the Root, one to .net, and one to the cachetest.net authoritative servers) to resolve the AAAA records of sub.cachetest.net. Unbound, in turn, takes 5 queries—it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist (this difference can be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show far more aggressive querying during server failure: both BIND and Unbound send many more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps trying to resolve both the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 of the 46 queries in total). If we disregard these queries, Unbound still increases its total number of queries from 6 to 16, a 6.4× increase. (We repeated this experiment 100 times and all results were consistent.)

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND asks both parents (the Roots and .net) for its records again (shown by the increase in the Root and .net portions of the BIND DDoS bars vs. the BIND non-DDoS bars in Figure 16); Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect—recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three “local recursives” (R1a, R1b, R1c) and 8 “last-layer” recursives (the ones that ultimately send queries to the authoritatives, Rn1–Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (RIPE Atlas probes) and the server side (authoritative servers). We summarize the results in Table 7; the probing intervals T during which the emulated DDoS was active (dataset I: 90% packet drop on both authoritative servers; see Table 4) are rounds 7–12.

Client side: Analyzing this table, we can learn the following from the client side.

During normal operations we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1's used in the answers); by default, each RIPE Atlas probe sends a DNS query to each of its local recursives (R1).

During the DDoS this changes slightly: we see 2 answers from 2 R1's, while the third query either times out or returns SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval). This is, in fact, a good result given the emulated DDoS.

Authoritative side: Now let us consider the authoritative side.


     Client view (Atlas probe)   Authoritative server view
T    Queries  Answers  #R1       Queries  Answers  #ATs  #Rn  #(Rn-AT)  Max Que(Rn-AT) (% total)  Queries (top-2 Rn's)
1    3        3        3         3        3        2     2    3         1 (33.3%)                 2, 1
2    3        3        3         5        5        2     4    5         1 (20.0%)                 2, 1
3    3        3        3         6        6        2     6    6         1 (16.7%)                 1, 1
4    3        3        3         5        5        2     2    3         2 (40.0%)                 4, 1
5    3        3        3         3        3        2     3    3         1 (33.3%)                 1, 1
6    3        3        3         6        6        2     5    5         2 (33.3%)                 2, 1
7    3        2        2         29       3        2     2    4         10 (34.5%)                15, 14
8    3        2        2         22       6        2     7    9         7 (31.8%)                 10, 7
9    3        2        2         21       1        2     3    6         9 (42.9%)                 16, 3
10   3        2        2         11       4        2     3    4         6 (54.5%)                 8, 2
11   3        2        2         21       3        2     4    6         15 (71.4%)                17, 2
12   3        2        2         23       1        2     8    8         15 (65.2%)                15, 2
13   3        3        3         3        3        2     3    3         1 (33.3%)                 1, 1
14   3        3        3         4        2        2     4    4         1 (25.0%)                 1, 1
15   3        3        3         8        8        2     7    7         7 (87.5%)                 1, 1
16   3        3        3         9        9        2     3    3         6 (66.7%)                 6, 3
17   3        3        3         9        9        2     5    5         5 (55.6%)                 5, 1

Table 7: Client and authoritative view of AAAA queries (probe 28477, measurement I); rounds T=7–12 are during the emulated DDoS.

In normal operation (before the DDoS), we see that the 3 queries sent by the client (client view) lead to 3 to 6 queries sent by the Rn's that reach the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn's used before the attack varies from 2 to 6 (#Rn). This depends on how the R1's choose their upper-layer recursives, and ultimately on how Rn−1 chooses Rn.

We can also observe how each Rn chooses the authoritatives (#(Rn−AT)): this metric shows the number of unique combinations of Rn and authoritative observed. For example, for T=1 we see that three queries reach the authoritatives but only 2 Rn's were used. Since there are three unique recursive–authoritative combinations, one of the Rn's must have used both authoritatives.
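These per-round metrics can be computed directly from the (Rn, authoritative) pairs observed in one probing interval. A minimal sketch (round_metrics is our helper), with the example input reproducing round T=1:

from collections import Counter

def round_metrics(pairs):
    # pairs: list of (rn, authoritative) tuples for one probing interval T.
    queries = len(pairs)
    n_rn = len({rn for rn, _ in pairs})              # #Rn
    n_at = len({at for _, at in pairs})              # #ATs
    n_rn_at = len(set(pairs))                        # unique Rn-AT combinations
    max_share = max(Counter(pairs).values()) / queries   # Max (% total)
    per_rn = Counter(rn for rn, _ in pairs)
    top2 = [count for _, count in per_rn.most_common(2)]  # top-2 Rn's
    return queries, n_rn, n_at, n_rn_at, max_share, top2

# Round T=1: three queries, two Rn's, three unique Rn-AT combinations.
t1 = [("rn1", "at1"), ("rn1", "at2"), ("rn2", "at1")]
print(round_metrics(t1))  # (3, 2, 2, 3, 0.333..., [2, 1])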

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11–29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) Increase in the number of Rn's used: the number of Rn's used increases somewhat (3.6 on average before the attack vs. 4.5 during). (Not all Rn's are used all the time during the DDoS. We do not have enough data to explain why, but it is presumably a consequence of how the lower-layer recursives choose them or of how they are configured.)

(2) Increase in the number of authoritatives used by each Rn: each Rn contacts an average of 1.13 authoritatives before the attack (the average of #(Rn−AT)/#Rn). During the DDoS, this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (last column of Table 7) and see that, compared to the total number of queries, they comprise the far majority.

For this probe, the explanation for the many queries is thus a combination of factors, with retries by the Rn's being the most important one. From the client side, there is no indication that the client's own queries are driving such an increase in the number of queries to the authoritatives during a DDoS.

ISI-TR-724b May 2018 (updated September 2018) G C M Moura et al

TTL 60 1800 3600 86400 3600-10minProbes 9173 9216 8971 9150 9189Probes (val) 8725 8788 8549 8750 8772Probes (disc) 448 428 422 400 417

VPs 15330 15447 15052 15345 15397Queries 94856 96095 93723 95780 191931Answers 90525 91795 89470 91495 183388Answer (val) 90079 91461 89150 91172 182731Answers (disc) 446 334 323 323 657Table 1 Caching baseline experiments [38]

name unique to the probe each probe requests an AAAA recordfor probeidcachetestnl where probeid is the probersquos the uniqueidentifier Each reply is also customized In the AAAA reply weencode three fields that are used to determine the effectiveness ofcaching (sect34) Each IPv6 address in the answer is the concatenationof four values (in hex)

prefix is a fixed 64-bit value (fd0f3897faf7a375)serial is a 8-bit value incremented every 10 minutes (zone file

rotation) allowing us to associate replies with specific queryrounds

probeid is the unique Atlas probeID [40] encoded in 8 bits toassociate the query with the reply

ttl is a 16-bit value of the TTL value we configure per experi-ment

We increment the serial number in each AAAA record and reloadthe zone (with a new zone serial number) every 10 minutes Theserial number in each reply allows us to distinguish cached resultsfrom prior rounds from fresh data in this round

Atlas DNS queries timeout after 5 seconds reporting ldquono answerrdquoWe will see this occur in our emulated DDoS events

For example a VP with probeID 1414 shall send a DNS query forAAAA record for the domain name 1414cachetestnl and shouldreceive AAAA answer in the format $PREFIX15863c where$SERIAL=1and $TTL=60

We focus onDNS over UDP on IPv4 not TCP or IPv6We use onlyIPv4 queries from Atlas Probes and serve only IPv4 authoritativesbut the IPv6 may be used inside multi-level recursives Our workcould extend to cover other protocols but we did not want tocomplicate analysis the orthogonal issue of protocol selection Wefocus on DNS over UDP because it is by far the dominant transportprotocol today (more than 97 of connections for nl [50] and mostRoot DNS servers [16])

Query Load The query rate of our experiments is designed toexplicitly test how queries intersect with TTL experimentation andnot to reproduce real-world traffic rates Popular domains such ascom will be queried much more frequently than our query rates soour results represent lower-bounds on caching In sect4 we examinecaching rates with real-world names under nl testing a range ofname popularities

TTL TTL values vary significantly in DNS with top-leveldomains typically using 1 day TTLs while CDNs often use shortTTLs of 1 or 5 minutes Given this diversity of configurations weexplicitly design experiments that cover the range from 1 minute to1 day (60 s and 86400 s TTLs) Thus rather than trying to capturea single TTL that represents all possible configurations we studya range of TTLs to explore the full range of caching behavior sect4

examines real-world traffic to provide a view of how well cachingworks with the distribution of TTLs seen in actual queries

Representativeness of Atlas Locations and Software It iswell known that the global distribution of RIPE Atlas probes isuneven Europe has far more than elsewhere [5 6 46] Althoughquantitative data analysis might be generally affected by this distri-bution bias our qualitative analysis contributions and conclusionsdo not depend on the geographical location of probes

Atlas probes use identical stub resolver software but they aredeployed in diverse locations (homes businesses universities) andso see a diverse set of recursives vendors and versions Our studytherefore represents Atlas ldquoin the wildrdquo and does not try to studyspecific software versions or vendors Although we claim our studycaptures diverse recursive resolvers we do not claim they are repre-sentative of a ldquotypicalrdquo Internet client It complements prior studieson caching by establishing what Atlas sees an baseline neededwhen we study DDoS in sect5

33 DatasetsWe carried out five experiments varying the cache lifetime (TTL)and probing frequency from the VPs Table 1 lists the parameters ofexperiments In the first four measurements the probing intervalwas fixed to 20 minutes and TTL for each AAAA was set to 601800 3600 and 86400 seconds all frequently used TTL values Forthe fifth measurement we fixed the TTL value to 3600 seconds andreduced the probing interval to 10 minutes to get better resolutionof dynamics

In each experiment queries were sent from about 9k Atlas probesWe discard 400ndash448 of these (ldquoprobes (disc)rdquo about 44 to 49 ofprobes) that do not return an answer Successful Atlas probes querymultiple recursive resolvers each a Vantage Point so each experi-ment results in about 15k VPs We also discard 323ndash657 answers(ldquoanswers (disc)rdquo about 35 to 49 of answers) because they reporterror codes (for example SERVFAIL and REFUSED [21]) or theyare referrals instead of the desired AAAA records [15] (We providemore detail about referrals in Appendix A)

Overall about 93ndash96k queries to cachetestnl from the 9k probesat 20 minute pacing and about double that with 10 minute pacingExperiments last two to three hours with no interference betweenexperiments due to use of unique names We ensure that exper-iments are isolated from each other First we space experimentsabout one day apart (details in RIPE [38]) Second the IP addresses(and their records in cachetestnl) of both authoritative name serverschange in each experiment when we restart their VMs Finally wechange the replies in the AAAA records so we can detect any staleresults (see sect32)

34 TTL distribution expected vs observedWe next investigate how often recursive resolvers honor the fullTTL provided by authoritative servers Our goal is to classify thevalid DNS answers from Table 1 into four categories based onwhere the answer comes from and where we expect it to comefrom

AA answers expected and correctly from the authoritativeCC expected and correct from a recursive cache (cache hits)AC answers from the authoritative but expected to be from

the recursiversquos cache (a cache miss)

When the Dike Breaks Dissecting DNS Defenses During DDoS (extended) ISI-TR-724b May 2018 (updated September 2018)

TTL 60 1800 3600 86400 3600-10mAnswers (valid) 90079 91461 89150 91172 1827311-answer VPs 38 51 49 35 17Warm-up (AAi) 15292 15396 15003 15310 15380Duplicates 25 23 25 22 23Unique 15267 15373 14978 15288 15357TTL as zone 14991 15046 14703 10618 15092TTL altered 276 327 275 4670 265

AA 74435 21574 10230 681 11797CC 235 29616 39472 51667 107760CCdec 4 5 1973 4045 9589

AC 37 24645 24091 23202 47262TTL as zone 2 24584 23649 13487 43814TTL altered 35 61 442 9715 3448

CA 42 179 305 277 515CAdec 7 3 21 29 65

Table 2 Valid DNS answers (expectedobserved)

CA answers from a recursiversquos cache but expected from theauthoritative (an extended cache)

To determine if a query should be answered by the cache of therecursive we track the state of prior queries and responses and theestimated TTL Tracking state is not hard since we know the initialTTL and all queries to the zone and we encode the serial numberand the TTL in the AAAA reply (sect32)

Cold Caches and Rewriting TTLsWe first consider queriesmade against a cold cache (the first query of a unique name) totest how many recursives override the TTL We know that thishappens at some sites such as at Amazon EC2 where their virtualmachines (VMs) default recursive resolver caps all TTLs to 60 s [36]

Table 2 shows the results of our five experiments in which weclassify the valid answers from Table 1 Before classifying themwe first disregard VPs that had only one answer (1-answer VPs)since we cannot evaluate their caches status with one answer only(maximum 51 VPs out of 15000 for the experiments) Then weclassify the remaining queries asWarm-up queries AAi all of whichare type AA (expected and answered by the authoritative server)

We see some duplicate responses for these we use the times-tamp of the very first AAi received We then classify each uniqueAAi by comparing the TTL value returned by the recursive withthe expected TTL that is encoded in the AAAA answer (fixed perexperiment) The TTL as zone line counts the answers we expect toget while TTL altered shows that a few hundred recursive resolversalter the TTL If these two values differ by more than 10 we reportTTL altered

We see that the vast majority of recursives honor small TTLswith only about 2 truncating the TTL (275 to 327 of about 15000depending on the experimentrsquos TTL) We and others (sect7) see TTLtruncation from multiple ASes The exception is for queries withday-long TTLs (86400 s) where 4670 queries (30) have shortenedTTLs (Prior work also reported that many public resolvers refreshesat 1 day [51]) We conclude that wholesale TTL shortening doesnot occur for TTLs of an hour or less

TTLs with Warm Cache We next consider a warm cachemdashsubsequent queries where we believe the recursive should have theprior answer cached and classify them according to the proposedcategories (AA CC AC and CC)

Figure 3 shows a histogram of this classifications (numbersshown on Table 2) We see that most answers we receive show

0

20000

40000

60000

80000

100000

120000

60s 1800s 3600s 86400s 3600s-10m

Miss 00

Miss 326

Miss 329

Miss 309

Miss 285

rem

ain

ing q

ueries

Experiment

AACC

ACCA

Figure 3 Classification of subsequent answers with warmcache

expected caching behavior For 60 s TTLs (the left bar) we expectno queries to be cached when we re-query 20minutes (1200 s) laterand we see few cache hits (235 queries ndash CC row on Table 2 ndash whichare due to TTL rewriting to values larger than 20min) We see onlya handful of CA-type replies where we expect the authoritativeto reply and the recursive does instead We conclude that undernormal operations (with authoritatives responding) recursive re-solvers do not serve stale results (as has been proposed when theauthoritative cannot be reached [19])

For longer TTLs we see cache misses (AC responses) fractions of28 to 33 (AC(Answer_(valid) minus (1minusAnswers+Warm-up)) Mostof the AC answers did not alter the TTL (AC-over) ie the cachemiss was not due to TTL manipulations (Table 2) We do see 9715TTL modifications (about 42 of ACs) when the TTL is 1 dayTTLs (86400 s) These TTL truncations are consistent with recur-sive resolvers that limit cache durations such as caps of 7 days inBIND [17] and 1 in unbound [28] by default (We provide moredetail about TTL manipulations in Appendix B)

We conclude that DNS caches are fairly effective with cachehits about 70 of the time This estimate is likely a lower boundwe are the only users of our domain and popular domains wouldsee cache hits due to requests from other users We only see TTLtruncation for day-long TTLs This result will help us understandthe role of caching when authoritatives are under stress

35 Public Recursives and CacheFragmentation

Although we showed that most requests are cached as expectedabout 30 are not We know that many DNS requests are served bypublic recursive resolvers today several of which exist [1 12 29 37]We also know that public recursives often use anycast and loadbalancing [48] and that that can result in caches that are fragmented(not shared) across many serversWe next examine howmany cachemisses (type AC replies) are due to public recursives

Although we control queriers and authoritative servers theremay be multiple levels of recursive resolvers in between From Fig-ure 1 we see the querierrsquos first-hop recursive (R1) and the recursivethat queries the authoritative (Rn) Fortunately queries and repliesare unique so we can relate queries to the final recursive knowingthe time (the query round) and the query source For each queryq we extract the IP address of Rn and compare against a list of IP

ISI-TR-724b May 2018 (updated September 2018) G C M Moura et al

TTL 60 1800 3600 86400 3600-10mAC Answers 37 24645 24091 23202 47262Public R1 0 12000 11359 10869 21955Google Public R1 0 9693 9026 8585 17325other Public R1 0 2307 2333 2284 4630

Non-Public R1 37 12645 12732 12333 25307Google Public Rn 0 1196 1091 248 1708other Rn 37 11449 11641 12085 23599

Table 3 AC answers public resolver classification

addresses for 96 public recursives (Appendix C) we obtain fromDuckDuckGo search for ldquopublic dnsrdquo done on 2018-01-15

Table 3 reexamines the AC replies from Table 2 With the ex-ception of the measurements with TTL of 60 s nearly half of ACanswers (cache misses) are from queries to public R1 recursivesand about three-quarters of these are from Googlersquos Public DNSThe other half of cache misses start at non-public recursives but10 of these eventually emerge from Googlersquos DNS

Besides identifying public recursives we also see evidence ofcache fragmentation in answers from caches (CC and CA) Some-times we see serial numbers in consecutive answers decrease Forexample one VP reports serial numbers 1 3 3 7 3 3 suggestingthat it is querying different recursives one with serial 3 and anotherwith serial 7 in its cache We show these occurrences in Table 2 asCCdec and CAdec With longer TTLs we see more cache fragmen-tation with 45 of answers showing fragmentation with day-longTTLs

From these observations we conclude that cache misses resultfrom several causes (1) use of load balancers or anycast whereservers lack shared caches (2) first-level recursives that do notcache and have multiple second-level recursives and (3) cachesmay reset between the somewhat long probing interval (10 or 20minutes) Causes (1) and (2) occur in public resolvers (confirmedby Google [12]) and account for about half of the cache misses inour measurements

4 CACHING PRODUCTION ZONESIn sect3 we show that about one-third of queries do not conform withcaching expectations based on controlled experiments to our testdomain (Results may be better for caches that prioritize popularnames) We next examine this question for specific records in nlthe country code domain (ccTLD) for the Netherlands and the Root() DNS zone With traffic from ldquothe wildrdquo and a measurement targetused by millions this section uses a domain popular enough to stayin-cache at recursives

41 Requests at nlrsquos AuthoritativesWe apply this methodology to data for nl country-code top-leveldomain (ccTLD) We look specifically at the A-records for the name-servers of nl ns[1-5]dnsnl

MethodologyWe use passive observations of traffic to the nlauthoritative servers

For each target name in the zone and source (some recursiveserver identified by IP address) we build a timeseries of all requestsand compute their interarrival time ∆ Following the classificationfrom sect34 we label queries as AC if ∆ lt TTL showing an unnec-essary query to the authoritative AA if ∆ ge TTL an expected or

0 01 02 03 04 05 06 07 08 09

1

0 2000 4000 6000 8000 10000

CD

F

Δ t

Figure 4 ECDF of the median ∆t for recursives with at least5 queries to ns1-ns5dnsnl (TTL of 3600 s)

delayed cache refresh (We do not see cache hits and so there areno CC events)

Dataset At the time of our analysis (February 2018) there were8 authoritative servers for the nl zone We collect traffic for the 4unicast and one anycast authoritative servers and store the data inENTRADA [55] for analysis

Since our data for nl is incomplete and we know recursives willquery all authoritatives over time [27] our analysis represents aconservative estimate of TTL violationsmdashwe expect to miss someCA-type queries from resolvers to non-monitored authoritatives

We collect data for a period of six hours on 2018-02-22 starting at1200 UTCWe only evaluate recursives that sent at least five queriesfor our domains of interest omitting infrequent recursives (theydo not change results noticeably) We discard duplicate queries forexample a few retransmissions (less than 001 of the total queries)In total we consider more than 485k queries from 7779 differentrecursives

Results Figure 4 shows the distribution of ∆t that we observein our measurements reporting the median ∆t for any resolver thatsends at least 5 queries

About 28 of queries are frequent with an inter-arrival less than10 s and 32 of these are sent to multiple authoritatives We believethese are due to recursives submitting queries in parallel to speedup replies (perhaps the ldquoHappy Eyeballsrdquo algorithm [45])

Since these closely-timed queries are not related to recursivecaching we exclude them from analysis The remaining data is 348kqueries from 7703 different recursives

The largest peak is at 3600 s what was expected the name wasqueried and cached for the full hour TTL then the next requestcauses the name to be re-fetched These queries are all of type AA

The smaller peak around 1800 s as well as queries with othertimes less than 3600 s correspond to type AC-queriesmdashqueries thatcould have been supplied from the cache but were not 22 ofresolvers sent most of their queries within an time interval thatis less than 3600 s or even more frequent These AC queries occurbecause of TTL limiting cache fragmentation or other reasons thatclear the cache

42 Requests at the DNS RootIn this section we perform a similar analysis as for sect41 in whichwe look into DNS queries received at all Root DNS servers (except

When the Dike Breaks Dissecting DNS Defenses During DDoS (extended) ISI-TR-724b May 2018 (updated September 2018)

075

08

085

09

095

1

5 10 15 20 25 30

CD

F

number of queries

F-Root

H-Root

All roots

Figure 5 Distribution of the number of queries for the DSrecord of nl received for each recursive Dataset DNS-OARCDITL on 2017-04-12t0000Z for 24 hours All Root serverswith similar distributions are shown in light-gray lines

G-Root) and create a distribution of the number of queries receivedper source IP address (ie per recursive)

In this analysis we use data from the DITL (Day In The Life)dataset of 2017 available at DNS-OARC [9] We look at all DNSqueries received for the DS record of the domain nl received at theRoot DNS servers along the entire day on April 12 2017 (UTC) Thisdataset consists of queries from more than 703k unique recursivesseen across all Root servers Note that the DS record for nl hasa TTL of 86400 seconds (24 hours) That is in theory one couldexpect to see just one query per recursive arriving at a given rootletter for the DS record of nl within the 24-hour interval

Each line in Figure 5 shows the distribution of the total numberof queries received at the Root servers from individual recursivesasking for the DS record of nl Besides F- and H-Root the distribu-tion is similar across all Root servers these are plotted in light-graylines F-Root shows the ldquomost friendlyrdquo behavior from recursiveswhere around 5 of them sent 5 or more queries for nl As opposedto F H-Root (dotted red line) shows the ldquoworstrdquo behavior fromrecursives where more than 10 of them sent 5 or more queriesfor nl within the 24-hour period

The solid black line in Figure 5 shows the distribution for allthe queries across all Root servers The majority (around 87) ofrecursives does send only one query within the 24-hour intervalHowever considering all Root servers we see around 13 of recur-sives that have sent multiple queries Note that the distributionsshown in Figure 5 have (very) long tails and we see up to more than218k queries from a single recursive within the 24-hour period forthe nl DS record ie roughly one query every 4 seconds from thesame IP address for the same DS record

Discussion we conclude that measurements of popular domainswithin nl (sect41) and the Roots (sect42) show that about 63 and 87of recursives honor the full TTL respectively These results areroughly in-line with our observations with RIPE Atlas (sect3)

5 THE CLIENTrsquoS VIEW OF AUTHORITATIVESUNDER DDOS

We next use controlled experiments to evaluate how DDoS attacksat authoritative DNS servers impacts client experience Our studiesof caching in controlled experiments (sect3) and passive observations(sect4) have shown that caching often works but not alwaysmdashabout

70 of controlled experiments and 30 of passive observations seefull cache lifetimes Since results of specific experiments vary wesweep the space of attack intensities to understand the range ofresponse from complete failure of authoritative servers to partialfailures

51 Emulating DDoSTo emulate DDoS attacks we begin with the same test domain(cachetestnl) we used for controlled experiments in sect3 We run anormal DNS service for some time querying from RIPE Atlas Aftercaches are warm we then simulate a DDoS attack by droppingsome fraction or all incoming DNS queries to each authoritative(We drop incoming traffic randomly with Linux iptables As suchpacket drop is not biased towards any recursive) After we begindropping traffic answers come either from caches at recursives orfor partial attacks from a lucky query that passes through

This emulation of DDoS captures traffic loss that occurs in DDoSattack as router queues overflow This emulation is not perfectsince we simulate loss at the last hop-router but in real DDoSattacks packets are often lost on access links near the target Ouremulation approximates this effect with one aggregate loss rate

DDoS attacks are also accompanied by queueing delay sincebuffers at and near the target are full We do not model queueingdelay although we do observe latency increasing due to retries Inmodern routers queueing delay due to full router buffers should beless than the retry interval In addition observations during real-world DDoS events show that the few queries that are successfulsee response times that are not much higher than typical [23]suggesting that loss (and not delay) is the dominant effect of DDoSin practice However a study that adds queueing latency to theattack model is interesting future work

52 Clients During Complete AuthoritativesFailure

We first evaluate the worst-case scenario for a DNS operator com-plete unreachability of all authoritative name servers Our goal isto understand when and for how long caches cover such an outage

Table 4 shows Experiments A B and C which simulate completefailure In Experiment A each VP makes only one query beforethe DDoS begins In Experiment B we allow several queries to takeplace and Experiment C allows several queries with a shorter TTL

Caches Protect Some We first consider Experiment A withone query that warms the cache immediately followed by the attackFigure 6a shows these responses over time with the onset of theattack the first downward arrow between 0 and 10 minutes andwith the cache expired after the second downward arrow between60 and 70 minutes We see that after the DDoS starts but before thecache has fully expired (between the downward arrows) initially30 and eventually 65 of queries fail with either no answer or aSERVFAIL error While not good this does mean that 35 to 70of queries during the DDoS are successfully served from the cacheBy contrast shortly after the cache expires almost all queries fail(only 25 VPs or 02 of the total seem to provide stale answers)

Caches Fill at Different Times In a more realistic scenarioVPs have filled their caches at different times In Experiment Acaches are freshly filled and should last for a full hour after thestart of attack Experiment B is designed for the opposite and worst

ISI-TR-724b May 2018 (updated September 2018) G C M Moura et al

Experiment ParametersTTL DDoS DDoS queries total probe failurein sec start dur before dur interval

A 3600 10 60 1 120 10 100 (both NSes)B 3600 60 60 6 240 10 100 (both NSes)C 1800 60 60 6 180 10 100 (both NSes)D 1800 60 60 6 180 10 50 (one NS)E 1800 60 60 6 180 10 50 (both NSes)F 1800 60 60 6 180 10 75 (both NSes)G 300 60 60 6 180 10 75 (both NSes)H 1800 60 60 6 180 10 90 (both NSes)I 60 60 60 6 180 10 90 (both NSes)

ResultsTotal Valid VPs Queries Total Validprobes probes answers answers

A 9224 8727 15339 136423 76619 76181B 9237 8827 15528 357102 293881 292564C 9261 8847 15578 258695 199185 198197D 9139 8708 15332 286231 273716 272231E 9153 8708 15320 285325 270179 268786F 9141 8727 15325 278741 259009 257740G 9206 8771 15481 274755 249958 249042H 9226 8778 15486 269030 242725 241569I 9224 8735 15388 253228 218831 217979

Table 4 DDoS emulation experiments [38] DDoS start durations and probe interval are given in minutes

0

5000

10000

15000

20000

0 10 20 30 40 50 60 70 80 90 100 110

cache-only cache-expired

answ

ers

minutes after start

OK SERVFAIL No answer

(a) Experiment A 3600-10min-1down arrows indicate DDoS start andcache expiration

0

5000

10000

15000

20000

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170

cache-onlynormal normal

answ

ers

minutes after start

OK SERVFAIL No answer

(b) Experiment B 3600-10min-1down-1up arrows indicate DDoS startand recovery

0

5000

10000

15000

20000

0 10 20 30 40 50 60 70 80 90 100110120130140150160170

normal normal

answ

ers

minutes after start

OK SERVFAIL No answer

cache-only

cache-expired

(c) Experiment C 1800-10min-1down-1up arrows indicate DDoS startcache expiration and recovery

Figure 6 Answers received during DDoS attacks

0

4000

8000

12000

16000

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170answ

ers

minutes after start

AA

CC

CA

Figure 7 Timeseries of answers for Experiment B

case we begin warming the cache one hour before the attack andquery 6 times from each VP Other parameters are the same withthe attack lasting for 60 minutes (also the cache duration) but thenwe restore the authoritatives to service

Figure 6b shows the results of Experiment B While about 50 ofVPs are served from the cache in the first 10 minute round after theDDoS starts the fraction served drops quickly and is at only about3 one hour later Three factors are in play here most caches werefilled 60 minutes before the attack and are timing out in the firstround While the timeout and query rounds are both 60 minutesapart Atlas intentionally spreads queries out over 5 minutes sowe expect that some queries happen after 59 minutes and others 61minutes

Second we know some large recursives have fragmented caches(sect35) so we expect that some of the successes between times 70and 110 minutes are due to caches that were filled between times10 and 50 minutes This can actually be seen in Figure 7 where weshow a timeseries of the answers for Experiment B where we seeCC (correct cache responses) between times 60 and 90

Third we see an increase in the number of CA queries that areanswered by the cache with expired TTLs (Figure 7) This increaseis due to servers serving stale content [19]

Caches Eventually All Expire Finally we carry out a thirdemulation but with half the cache lifetime (1800 s or 30minutesrather than the full hour) Figure 6c shows response over timeThese results are similar to Experiment B with rapid fall-off whenthe attack starts as caches age After the attack has been underwayfor 30minutes all caches must have expired and we see only a few(about 26) residual successes

When the Dike Breaks Dissecting DNS Defenses During DDoS (extended) ISI-TR-724b May 2018 (updated September 2018)

53 Discussion of Complete FailuresOverall we see that caching is partially successful in protecting dur-ing a DDoS With full valid caches half or more VPs get serviceHowever caches are filled at different times and expire so an op-erator cannot count on a full cache duration for any customerseven for popular (ldquoalways in the cacherdquo) domains The protectionprovided by caches depends on their state in the recursive resolversomething outside the operatorrsquos control In addition our evalua-tion of caching in sect3 showed that caches will end early for someVPs

Second we were surprised that a tiny fraction of VPs are suc-cessful after all caches should have timed out (after the 80 minutesperiod in Experiment A and between 90 and 110 minutes in Exper-iment C) These successes suggest an early deployment of ldquoservestalerdquo something currently under review in the IETF [19] is to servea previously known record beyond its TTL if authoritatives areunreachable with the goal of improving resilience under DDoS Weinvestigated the Experiment A where see that 1048 answers of the1140 successes in the second half of the outage These successesare from 471 VPs (and 215 recursives) most of them answered byOpenDNS and Google public DNS servers suggesting experimen-tation not yet widespread Out of these 1048 queries 1031 return aTTL value equals to 0 as specified in the IETF stale draft [19]

54 Client Reliability During PartialAuthoritative Failure

The previous section examined DDoS attacks that result in com-plete failure of all authoritatives but often DDoS attacks result inpartial failure with 50 or 90 packet loss at the authoritatives(For example consider the November 2015 DDoS attack on theDNS Root [23]) We next study experiments with partial failuresshowing that caching and retries together nearly fully protect 50DDoS events and protect half of VPs even during 90 events

We carry out several Experiments D to I in Table 4 We follow theprocedure outlined in sect51 looking at the DDoS-driven loss ratesof 50 75 and 90 with TTLs of 1800 s 300 s and 60 s Graphsomitted due to space can be found in Appendix D

Near-Full Protection from Caches During Moderate At-tacks We first consider Experiment E a ldquomildrdquo DDoS with 50loss with VP success over time in Figure 8a In spite of a loss ratethat would be crippling to TCP nearly all VPs are successful in DNSThis success is due to two factors first we know that many clientsare served from caches as was shown in Experiment A with fullloss (Figure 6a) Second most recursives retry queries so they re-cover from loss of a single packet and are able to provide an answerTogether these mean that failures during the first 30 minutes of theevent is 85 slightly higher than the 48 fraction of failures beforethe DDoS For this experiment the TTL is 1800 s (30minutes) sowe might expect failures to increase halfway through the DDoSWe do not see any increase in failures because caching and retriesare synergistic a successful retried query will place the answer in acache for a later query The importance of this result is that DNScan survive moderate-size attacks when caching is possible While apositive retries do increase latency something we study in sect55

Attack IntensityMattersWhile clients do quite well with 50loss at all authoritatives failures increase with the intensity of theattack

[Figure 8: Answers received during DDoS attacks; the first and second vertical lines show the start and end of the DDoS. (a) Experiment E (1800-50p-10min), 50% packet loss. (b) Experiment F (1800-75p-10min), 75% packet loss. (c) Experiment H (1800-90p-10min), 90% packet loss. (d) Experiment I (60-90p-10min), 90% packet loss.]


Experiments F and H, shown in Figure 8b and Figure 8c, increase the loss rate to 75% and 90%. We see the number of failures increases to about 19.0% with 75% loss and 40.3% with 90% loss. It is important to note that roughly 60% of the clients are still served even with 90% loss.

We also see that this level of success is consistent over the entire hour-long DDoS event, even though the cache duration is only 30 minutes. This consistency confirms the importance of caching and retries in combination.

To verify the effects of this interaction, Experiment I changes the caching duration to 60 s, less than one round of probing. Comparing Experiment I in Figure 8d to H in Figure 8c, we see that the failure rate increases from 30% to about 63%. However, even with no caching, about 37% of queries are still answered, due to resolvers that serve stale content and recursives' retries. We investigate retries in §6.

5.5 Client Latency During Partial Authoritative Failure

We showed that client reliability is higher than expected during failures (§5.4), due to a combination of caching and retries. We next consider client latency. Latency will increase during the DDoS because of retries and queueing delay, but we will show that latency increases less than one might expect, due to caching.

To examine latency we return to Experiments D through I (Table 4), but look at latency (time to complete a query) rather than success. For these experiments, clients time out after 5 s.

Figures 9a to 9d show latency during each emulated DDoS scenario (experiments with figures omitted here are in Appendix D). Latencies are not evenly distributed, since some requests get through immediately while others must be retried one or more times, so in addition to the mean we show the 50%, 75%, and 90% quantiles to characterize the tail of the distribution.

We emulate DDoS by dropping requests (§5.1), and hence latencies reflect retries and loss but not queueing delay, underrepresenting latency in real-world attacks. However, their shape (some low latency and a few long) is consistent with, and helps explain, what has been seen in the past [23].

Beginning with Experiment E, the moderate attack in Figure 9a, we see no change to median latency. This result is consistent with many queries being handled by the cache, and half of those not handled by the cache getting through anyway. We do see higher latency in the 90%ile tail, reflecting successful retries. This tail also increases the mean somewhat.

This trend increases in Experiment F, in Figure 9b, where 75% of queries are lost. Now we see the 75%ile tail has increased, as has the number of unanswered queries, and the 90%ile is twice as long as in Experiment E.

We see the same latency in Experiment H, with the DDoS causing 90% loss. We set the timeouts to 5 s, so the larger attack results in more unsuccessful queries, but latency for successful queries is not much worse than with 75% loss. Median latency is still low due to cached replies.

Finally, Experiment I greatly reduces opportunities for caching by reducing the cache lifetime to one minute. Figure 9d shows that loss of caching increases the median RTT and significantly increases the tail latency.

[Figure 9: Latency results; the shaded area indicates the interval of an ongoing DDoS attack. Each panel plots median, mean, 75%ile, and 90%ile RTT (ms) over minutes after start. (a) Experiment E, 50% packet loss (1800 s TTL). (b) Experiment F, 75% packet loss (1800 s TTL). (c) Experiment H, 90% packet loss (1800 s TTL). (d) Experiment I, 90% packet loss (60 s TTL).]


Compared with Figure 9c (the same packet loss ratio, but a 1800 s TTL), we can clearly see the benefits of caching in terms of latency (in addition to reliability): a half-hour TTL value reduced the latency from 1300 ms to 390 ms. Longer TTLs also help reduce tail latency relative to shorter TTLs (compare, for example, the 90%ile RTT in Experiments I and H in Figure 9).

Summary: DDoS effects often increase client latency. For moderate attacks, increased latency is seen only by a few "unlucky" clients who do not see a full cache and whose queries are lost. Caching has an important role in reducing latency during DDoS, but while it can often mitigate most reliability problems, it cannot avoid latency penalties for all VPs. Even when caching is not available, roughly 40% of clients get an answer, either by serving stale or by retries, as we investigate next.

6 THE AUTHORITATIVE'S PERSPECTIVE

Results of partial DDoS events (§5.4) show that DNS is surprisingly reliable: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers (Figure 8d) even with minimal-duration caches. These results are due to a combination of caching and retries. We next examine this from the perspective of the authoritative server.

6.1 Recursive-Authoritative Traffic During a DDoS

We first ask: are retries by recursive resolvers responsible for the success rates observed in §5.4? To investigate this question, we return to the partial DDoS experiments and look at how many queries are sent to the authoritative servers. We measure queries before they are dropped by our simulated DDoS. Recursives must make multiple queries to resolve a name. We break out each type of query: for the nameserver (NS), the nameserver's IPv4 and IPv6 addresses (A-for-NS and AAAA-for-NS), and finally the desired query (AAAA-for-PID). Note that the authoritative is IPv4-only, so AAAA-for-NS is non-existent and subject to negative caching, while the other records exist and use regular caching.

We begin with the DDoS causing 75% loss, in Figure 10a. For this experiment we observe 18,407 unique IP addresses of recursives (Rn) querying for AAAA records directly at our authoritatives. During the DDoS, queries increase by about 3.5×. We expect 4 trials, since the expected number of tries until success with loss rate p is (1 − p)^−1. For this scenario, results are cached for up to 30 minutes, so successful queries are reused in recursive caches. This increase occurs both for the target AAAA record and also for the non-existent AAAA-for-NS records. Negative caching for our zone is configured to 60 s, making caching of NXDOMAINs for AAAA-for-NS less effective than positive caches.
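The retry arithmetic is easy to check; this small sketch (ours) evaluates (1 − p)^−1 for the loss rates used in our experiments:

# Expected number of tries until success with independent losses at
# rate p is 1 / (1 - p).
for p in (0.50, 0.75, 0.90):
    expected_tries = 1 / (1 - p)
    print(f"loss={p:.0%}: expected tries until success = {expected_tries:.1f}")
# loss=50%: 2.0; loss=75%: 4.0 (the ~4 trials expected here);
# loss=90%: 10.0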

The offered load on the server increases further with more loss (90%), as shown in Experiment H (Figure 10b). The higher loss rate results in a much higher offered load on the server: on average, 8.2× normal.

Finally, in Figure 10c, we reduce the effects of caching with a 90% DDoS and a TTL of 60 s. Here we also see about 8.1× more queries at the server than before the attack. Comparing this case to Experiment H, caching reduces the offered load on the server by about 40%.

[Figure 10: Number of queries received by the authoritative servers, broken out by type (NS, A-for-NS, AAAA-for-NS, AAAA-for-PID); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment F (1800-75p-10min), 75% packet loss. (b) Experiment H (1800-90p-10min), 90% packet loss. (c) Experiment I (60-90p-10min), 90% packet loss.]

Implications: The implication of this analysis is that legitimate clients "hammer" the already-stressed server with retries during a DDoS. For clients, retries are important to get reliability, and each client independently chooses to retry.

The server is already under stress due to the DDoS, so these retries add to that stress. However, the DDoS traffic is almost certainly much larger than the retries of legitimate traffic. (A server experiencing a volumetric attack causing 90% loss must be receiving 10× its capacity. Regular traffic is a small fraction of normal capacity, so even 4× regular traffic is still much less than the attack traffic.) The multiplier for retried legitimate traffic depends on the implementations of stub and recursive resolvers, as well as application-level retries and defection (users hitting reload in their browser, and later giving up). Our experiment omits application-level retries and likely gives a lower bound.
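The following back-of-the-envelope sketch restates this argument; the 10× and 4× figures are from the text, while the 10%-of-capacity normal load is an illustrative assumption:

# Normalized load comparison: retried legitimate traffic vs. attack.
capacity = 1.0           # server capacity (normalized)
normal_load = 0.1        # assumed: regular traffic is 10% of capacity
retry_multiplier = 4.0   # legitimate traffic grows ~4x from retries
attack_load = 10.0       # 90% loss implies >= 10x capacity arriving

legit_during_attack = normal_load * retry_multiplier    # 0.4x capacity
print(f"legitimate (retried): {legit_during_attack:.1f}x capacity")
print(f"attack:               {attack_load:.1f}x capacity")
print(f"attack / legitimate:  {attack_load / legit_during_attack:.0f}x")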


We next examine specific recursive implementations to see their behavior.

6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers "hammer" authoritatives when queries are dropped. We re-examine DNS software (first studied in 2012 [56]), and additionally show that deployments amplify retries.

Recursive Software: Prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], with evaluation of BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS, and PowerDNS. We studied retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records for all authoritatives and the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7× more requests before the recursives time out. (Exact numbers vary in different runs, but typically each request is made 6 or 7 times.) Such retries are appropriate provided they are paced (both use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.
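The sketch below illustrates the kind of paced retry loop described here; it is a generic exponential-backoff illustration, not BIND's or Unbound's actual algorithm:

import random
import time

def resolve_with_backoff(send_query, max_tries=7, base_timeout=0.1):
    """Try a query up to max_tries times, doubling the wait each round.
    send_query() returns True if an answer arrived in time."""
    timeout = base_timeout
    for attempt in range(max_tries):
        if send_query():
            return True
        time.sleep(timeout)   # pace retries instead of hammering
        timeout *= 2          # exponential backoff
    return False

# Example: emulate an authoritative dropping 90% of queries (each call
# succeeds with probability 0.1); a short base timeout keeps the demo fast.
ok = resolve_with_backoff(lambda: random.random() > 0.9, base_timeout=0.01)
print("answered" if ok else "gave up")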

Recursive Deployment: Another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify amplification, we count both the number of Rn recursives and AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values represent the amplification in two ways: during stress, more Rn recursives will be used for each probe ID, and these Rn will generate more queries to the already-stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4); the maximum rises to 39. The number of queries for each probe ID grows more than 3×, from 2 to 7. Worse, the 90%ile grows more than 6× (3 queries to 18). The maximum grows 53.5×, reaching up to 286 queries for one single probe ID. This value, however, is a lower bound, given that there are a large number of A and AAAA queries that ask for NS records and not for the probe ID (AAAA- and A-for-NS in Figure 10).

We can also look at the aggregate effects of retries created by the complex recursive infrastructure. Figure 12 shows the timeseries of unique IP addresses of Rn observed at the authoritatives. Before the DDoS period, for Experiment I with its TTL of 60 s, we see a constant number of recursives reaching our authoritatives, i.e., all queries should be answered by authoritatives (no caching at this TTL value). For experiments F and H, both with a TTL of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks are observed when caches expire, as expected.

During the DDoS we observe a similar behavior for all three experiments in Figure 12: as packets are dropped at the authoritatives (at rates of 75%, 90%, and 90% for F, H, and I, respectively), we see an increase in the number of Rn recursives querying our authoritatives. For experiments F and H we see drops when caching is expected, but not for experiment I.

[Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs (median, 90%ile, and maximum of Rn-per-PID and AAAA-for-PID over minutes after start).]

[Figure 12: Unique Rn recursive addresses observed at the authoritatives over time, for Experiments F, H, and I.]

The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which is amplified in the end. (We show this behavior for an individual probe in Appendix F, where we observe the growth in the number of queries received at the authoritatives and in the number of recursives used.)

Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to be within a constant factor of single recursives, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and their operators to share information about their behavior.

7 RELATED WORK

Caching by Recursives: Several groups have shown that DNS caching can be imperfect. Hao and Wang analyzed the impact of nonce domains on DNS recursives' caches [13]. Using two weeks of data from two universities, they showed that filtering one-time domains improves cache hit rates. In two studies, Pang et al. [31, 32] reported that web clients and local recursives do not always honor TTL values provided by authoritatives. Almeida et al. [2] analyzed DNS traces of a mobile operator, and used a mobile application to see TTLs in practice. They find that most domains have short TTLs (less than 60 s), and report evidence of TTL manipulation by recursives. Schomp et al. [48] demonstrate widespread use of


multi-level recursives by large operators, as well as TTL manipulation. Our work builds on this prior work, examining caching and TTL manipulation systematically and considering their effects on resilience.

DNS Client Behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, and found that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives During DDoS: We investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast and both latency and reachability, as seen from RIPE Atlas. Rather than look at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while this prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study has incomplete data and focuses on specific results of two events. These differences stem from their study of natural experiments from real-world events and our controlled experiments.

8 IMPLICATIONS

We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get to authoritative servers. The key implication of our study is to explain differences in the outcomes of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative "server" implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: "There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability..." [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure and immediately took steps to address service problems, there were reports of user-visible service disruption in the technical and even popular press [34]. Reports describe intermittent failure of prominent websites including "Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times", each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cacheable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports

922 anycast instances operating across the 13 authoritative servers. Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 "facilities"), and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components with TTLs that range from 120 to 300 s.

In addition to explaining the effects, our experiments help get to the root causes behind these outcomes. Users of the Root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs like com and country codes) are popular and certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) receive performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.
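A rough sketch of this arithmetic: a record cached just before the attack remains answerable from a full cache for its TTL, so the fraction of an attack covered by that cache entry is roughly TTL over attack duration (capped at 1). The one-hour attack length below is an illustrative assumption; the TTL values are from the text:

ATTACK_HOURS = 1.0  # assumed attack length, for illustration only

for name, ttl_s in [("root zone record (1-day TTL)", 86400),
                    ("CDN-hosted record (120 s TTL)", 120),
                    ("CDN-hosted record (300 s TTL)", 300)]:
    covered = min(1.0, ttl_s / (ATTACK_HOURS * 3600))
    print(f"{name}: cache covers {covered:.0%} of the attack")
# 1-day TTL: 100%; 120 s: 3%; 300 s: 8%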

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to allow caching to help and to give DNS operators time to deploy defenses (Experiment H suggests 30 minutes, Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow for re-provisioning during DDoS attacks on web servers, but that leaves DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, with long DNS TTLs, may be preferred to short DNS TTLs, so that both layers can be robust. However, the complexity of interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize geographic dispersion of sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with the largest latency [27]. Prior work about resilience to DDoS attack has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic that each site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be as good as that of the strongest individual authoritative. The reason for these opposite results is that, in both cases, recursive resolvers will try all authoritatives of a given service. For latency, they will sometimes choose a distant authoritative, but for resilience, they will continue until they find the most available authoritative.

9 CONCLUSIONS

This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole. In


fact, they can largely cover over partial DDoS attacks for many users: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers (Figure 8d) even with minimal-duration caches.

The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users, with a few getting much greater latency while some see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruption, and that this effect is magnified by complex multi-layer recursive resolver systems. The key outcome of this work is to quantify the importance of caching and retries in recursives to resilience, encouraging the use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on paper drafts.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements number FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES

[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012, Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188–200.
[9] DNS OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181 Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary Of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis/
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC BIND. 2018. Chapter 6: BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report | GitHub Engineering. https://githubengineering.com/ddos-incident-report/
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency-02. Internet Draft. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain names - concepts and facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 2065, 2181, 2308, 2535, 4033, 4034, 4035, 4343, 4592, 5936, 8020.
[21] P.V. Mockapetris. 1987. Domain names - implementation and specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604, 7766.
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS" (May 2018). Web page. https://ant.isi.edu/datasets/dns/
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366
[28] NL Netlabs. 2018. NL Netlabs Documentation - Unbound - unbound.conf.5. https://nlnetlabs.nl/documentation/unbound/unbound.conf
[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide/
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An overview of DNS-based server selections in content distribution networks. Computer Networks 43, 6 (2003), 695–711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows/
[36] Alec Peterson. 2017. EC2 resolver changing TTL on DNS answers? Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy In a Few Easy Steps. https://quad9.net
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671, TTL1800: 10507676, TTL3600: 10536725, TTL86400: 10579327, TTL3600-10min: 10581463, A: 10859822, B: 11102436, C: 11221270, D: 11804500, E: 11831403, F: 11831403, G: 12131707, H: 12177478, I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas - Raw data structure documentation. https://atlas.ripe.net/docs/data_struct/
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters: An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. Internet Request For Comments. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On measuring the client-side DNS infrastructure. In Proceedings of the 2013 ACM Internet Measurement Conference. ACM, 77–90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl stats and data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A study of caching behavior with respect to root server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374/
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] M. Weinberg and D. Wessels. 2016. Review and analysis of attack traffic against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24, Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A high-performance network traffic data streaming warehouse. In Network Operations and Management Symposium (NOMS), 2016 IEEE/IFIP. IEEE, 913–918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80–86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS?

Cache estimations in §3 and §4 assume each record has a clear TTL. However, glue records bridge zones that are served by different authoritative servers, and can result in two different TTL values for a record. This section examines which TTL takes precedence.

A.1 The Problem of Glue Records

DNS employs glue records to identify name servers that are in the served zone itself. For example, the Canadian national domain ca is served from ca-servers.ca, but to look up ca-servers.ca one must know where ca is. To break this cycle, records that identify these servers ("glue records") are placed in the parent zone (the root, in

this case). With these records in two places, they may have different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net
;; AUTHORITY SECTION:
ca.    172800    IN    NS    any.ca-servers.ca.
ca.    172800    IN    NS    j.ca-servers.ca.
ca.    172800    IN    NS    x.ca-servers.ca.
ca.    172800    IN    NS    c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net.

Notice that the answer for the NS records is in the "AUTHORITY SECTION": this signals that this answer is a "referral", a type of answer "in which a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query" [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for ca, a recursive can ask the same query directly to any of the ca servers. Listing 2 shows the result, an authoritative answer:

$ dig ns ca @any.ca-servers.ca
;; ANSWER SECTION:
ca.    86400    IN    NS    c.ca-servers.ca.
ca.    86400    IN    NS    j.ca-servers.ca.
ca.    86400    IN    NS    x.ca-servers.ca.
ca.    86400    IN    NS    any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca.

Notice that the records in this response from any.ca-servers.ca are presented in the "ANSWER SECTION" (and not in the "AUTHORITY SECTION"), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTLs are different: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).
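The same comparison can be scripted; the sketch below (assuming the dnspython package is installed: pip install dnspython) queries a root server and a ca server, and reports the NS RRset TTL and the AA bit from each:

import socket
import dns.flags
import dns.message
import dns.query
import dns.rdatatype

def ns_ttl(server_name, qname="ca."):
    """Return (TTL of the NS RRset, AA bit) as seen at the given server.
    Referrals carry the RRset in the AUTHORITY section with AA clear;
    authoritative answers carry it in the ANSWER section with AA set."""
    query = dns.message.make_query(qname, dns.rdatatype.NS)
    resp = dns.query.udp(query, socket.gethostbyname(server_name), timeout=5)
    section = resp.answer if resp.answer else resp.authority
    aa = bool(resp.flags & dns.flags.AA)
    return section[0].ttl, aa

print(ns_ttl("b.root-servers.net"))   # e.g. (172800, False): referral
print(ns_ttl("any.ca-servers.ca"))    # e.g. (86400, True): authoritative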

We refer to this as TTL duplication: the same record is defined both at the zone's authoritative servers and at its parent's servers, as a glue record.

This duplication poses a challenge for recursive resolvers: which value should they trust, the one from the glue or the one from the authoritative name server?

In theory, the answer from the authoritative should be the one used, since those are the servers that are authoritative for the zone; the Roots, in this example, are not "completely" authoritative [15]. This behavior can be observed in various zones as well, not only at the Roots.

Given that duplicated records can differ, as for ca, we investigate in the wild which values resolvers actually use: those provided by the parent or those provided by the authoritative. According to RFC 2181 [10] (§5.4.1), a recursive should use the authoritative answer.

Given the importance of TTLs to DNS resilience (cache duration), we investigate in this section which values recursives trust: the parent's or the authoritative name server's.


                   NS record   A record   Source
Total Answers      128,382     98,038
TTL > 3600         38          46         Unclear
TTL = 3600         260         233        Parent
60 < TTL < 3600    6,890       4,621      Parent/others
TTL = 60           60,803      46,820     Authoritative
TTL < 60           60,391      46,318     Authoritative

Table 5: Distribution of TTLs of DNS answers for cachetest.nl.

A records of amazon.com:
domain          IP                TTL
amazon.com      205.251.242.103   60
amazon.com      176.32.103.205    60
amazon.com      176.32.98.166     60

NS records of amazon.com:
domain       NS                     TTL-auth   TTL-com
amazon.com   ns1.p31.dynect.net     3600       172800
amazon.com   pdns6.ultradns.co.uk   3600       172800
amazon.com   ns2.p31.dynect.net     3600       172800
amazon.com   ns4.p31.dynect.net     3600       172800
amazon.com   pdns1.ultradns.net     3600       172800
amazon.com   ns3.p31.dynect.net     3600       172800

A records of the NS records:
domain                 IP              TTL
ns1.p31.dynect.net     208.78.70.31    86400
pdns6.ultradns.co.uk   204.74.115.1    3600
ns2.p31.dynect.net     204.13.250.31   86400
ns4.p31.dynect.net     204.13.251.31   86400
pdns1.ultradns.net     204.74.108.1    3600
ns3.p31.dynect.net     208.78.71.31    86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments

We use the same setup as in §3, and configure the TTL values of our name servers (ns[12].cachetest.nl) as follows:

• Referral value (at the nl zone): TTL of 3600 s.
• Answer value (at the authoritative servers ns[12].cachetest.nl): TTL of 60 s.

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of TTL values of the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the value provided by the parent nl server. As such, we can conclude from these two experiments that, for A and NS records, most recursives forward to their clients the TTL provided by the authoritatives of a zone.
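Our reading of Table 5's classification rules can be expressed as a small function (a sketch; the thresholds follow the configured referral TTL of 3600 s and authoritative TTL of 60 s):

def classify_ttl_source(ttl):
    """Map an observed TTL for cachetest.nl to its likely source."""
    if ttl > 3600:
        return "Unclear"         # larger than either configured value
    if ttl == 3600:
        return "Parent"          # exactly the referral TTL
    if 60 < ttl < 3600:
        return "Parent/others"   # referral TTL counted down (or altered)
    return "Authoritative"       # 60 s, or 60 s counted down

for ttl in (7200, 3600, 1800, 60, 42):
    print(ttl, classify_ttl_source(ttl))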

A.3 Looking At a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives serve their clients with the TTL provided by the authoritative, and not by referrals.

In this section we look into the caching of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives and 172800 s at the parent com server, as can be seen in Table 6.

To determine which specific record two popular recursive software packages use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send only one query: dig ns amazon.com. We then dump the contents of the cache of these servers and see which values they store.

Listing 3 and Listing 4 show the contents of the caches of both BIND and Unbound after sending the aforementioned query. As can be seen, the values stored in the cache are the ones provided by the authoritative server (decremented by a few seconds by the time we dumped the contents), and not the referral values from the parent's authoritatives:

; authanswer
amazon.com.    3595    NS    ns1.p31.dynect.net.
               3595    NS    ns2.p31.dynect.net.
               3595    NS    ns3.p31.dynect.net.
               3595    NS    ns4.p31.dynect.net.
               3595    NS    pdns1.ultradns.net.
               3595    NS    pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com.

;rrset 3594 6 0 7 3
amazon.com.    3594    IN    NS    pdns1.ultradns.net.
amazon.com.    3594    IN    NS    ns1.p31.dynect.net.
amazon.com.    3594    IN    NS    ns3.p31.dynect.net.
amazon.com.    3594    IN    NS    pdns6.ultradns.co.uk.
amazon.com.    3594    IN    NS    ns4.p31.dynect.net.
amazon.com.    3594    IN    NS    ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com.

A.4 Discussion and Implications

Understanding how glue records affect TTLs is important: it shows that zone operators are, in fact, in charge of how long their NS records will remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers: queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of 60 s (where all queries go to the authoritative), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time. (a) TTL 60 s. (b) TTL 1800 s. (c) TTL 3600 s. (d) TTL 86400 s (1 day). (e) TTL 3600 s (1 hour), probing interval T=10 min.]

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72 Alternate DNS
23.253.163.53 Alternate DNS
205.204.88.60 BlockAid Public DNS (or PeerDNS)
178.21.23.150 BlockAid Public DNS (or PeerDNS)
91.239.100.100 Censurfridns
89.233.43.71 Censurfridns

2001:67c:28a4:: Censurfridns
2002:d596:2a92:1:71:53:: Censurfridns
213.73.91.35 Chaos Computer Club Berlin
209.59.210.167 Christoph Hochstätter
85.214.117.11 Christoph Hochstätter

212.82.225.7 ClaraNet
212.82.226.212 ClaraNet
8.26.56.26 Comodo Secure DNS
8.20.247.20 Comodo Secure DNS
84.200.69.80 DNSWatch
84.200.70.40 DNSWatch
2001:1608:10:25::1c04:b12f DNSWatch
2001:1608:10:25::9249:d69b DNSWatch
104.236.210.29 DNSReactor
45.55.155.25 DNSReactor
216.146.35.35 Dyn
216.146.36.36 Dyn
80.67.169.12 FDN
2001:910:800::12 FDN
85.214.73.63 FoeBud
87.118.111.215 FoolDNS
213.187.11.62 FoolDNS
37.235.1.174 FreeDNS
37.235.1.177 FreeDNS
80.80.80.80 Freenom World
80.80.81.81 Freenom World
87.118.100.175 German Privacy Foundation e.V.
94.75.228.29 German Privacy Foundation e.V.
85.25.251.254 German Privacy Foundation e.V.
62.141.58.13 German Privacy Foundation e.V.
8.8.8.8 Google Public DNS
8.8.4.4 Google Public DNS
2001:4860:4860::8888 Google Public DNS
2001:4860:4860::8844 Google Public DNS
81.218.119.11 GreenTeamDNS
209.88.198.133 GreenTeamDNS
74.82.42.42 Hurricane Electric
2001:470:20::2 Hurricane Electric
209.244.0.3 Level3
209.244.0.4 Level3
156.154.70.1 Neustar DNS Advantage
156.154.71.1 Neustar DNS Advantage
5.45.96.220 New Nations
185.82.22.133 New Nations
198.153.192.1 Norton DNS
198.153.194.1 Norton DNS
208.67.222.222 OpenDNS
208.67.220.220 OpenDNS
2620:0:ccc::2 OpenDNS
2620:0:ccd::2 OpenDNS
58.6.115.42 OpenNIC
58.6.115.43 OpenNIC
119.31.230.42 OpenNIC
200.252.98.162 OpenNIC
217.79.186.148 OpenNIC
81.89.98.6 OpenNIC
78.159.101.37 OpenNIC
203.167.220.153 OpenNIC
82.229.244.191 OpenNIC
216.87.84.211 OpenNIC
66.244.95.20 OpenNIC
207.192.69.155 OpenNIC
72.14.189.120 OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9 OpenNIC
2001:470:1f07:38b::1 OpenNIC
2001:470:1f10:c6::2001 OpenNIC
194.145.226.26 PowerNS


77.220.232.44 PowerNS
9.9.9.9 Quad9
2620:fe::fe Quad9
195.46.39.39 SafeDNS
195.46.39.40 SafeDNS
193.58.251.251 SkyDNS
208.76.50.50 SmartViper Public DNS
208.76.51.51 SmartViper Public DNS
78.46.89.147 ValiDOM
88.198.75.145 ValiDOM
64.6.64.6 Verisign
64.6.65.6 Verisign
2620:74:1b::1:1 Verisign
2620:74:1c::2:2 Verisign
77.109.148.136 Xiala.net
77.109.148.137 Xiala.net
2001:1620:2078::136 Xiala.net
2001:1620:2078::137 Xiala.net
77.88.8.88 Yandex.DNS
77.88.8.2 Yandex.DNS
2a02:6b8::feed:bad Yandex.DNS
2a02:6b8:0:1::feed:bad Yandex.DNS
109.69.8.51 puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all the experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for experiments D and G.

Figure 14 shows the number of answers for experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when 1 NS fails (50% loss), there are no significant changes in the number of answered queries. Moreover, most users will not notice it in terms of latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (thus half of the probing interval). Similarly to Experiment I, this experiment should not have caching active, given the shorter TTL. Moreover, it has a packet loss rate of 75%. At this rate, we can see that the far majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

[Figure 14: Answers received during DDoS attacks (extra graphs); arrows indicate DDoS start and recovery. (a) Experiment D (1800-50p-10min-ns1). (b) Experiment G (300-75p-10min-2NSes).]

[Figure 15: Latency results (extra graphs); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment D, 50% packet loss on 1 NS only (1800 s TTL). (b) Experiment G, 75% packet loss (300 s TTL).]

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur in current DNS software. These experiments confirm that prior work by Yu et al. [56] still holds. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) up, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from net and querying one or both based on their internal algorithms. All queries are made with a cold cache.


[Figure 17: Probe 28477 on dataset I: the stub resolver on the Atlas probe queries three local recursives (r1a, r1b, r1c), all of which are connected to all last-layer recursives (rn1–rn8), which in turn are connected to both authoritative servers (at1, at2).]

[Figure 16: Histogram of queries from recursive resolvers, for BIND and Unbound under normal conditions and under DDoS, broken out by target (root, net, cachetest.net).]

In total, we send 100 queries per experiment, and present the results in this section.

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the net authoritative servers, and the authoritative servers for cachetest.net.

Figure 16 shows the results of this experiment, with normal behavior in the left two groups and the inaccessible (DDoS) attempts in the right two. From this figure we see that, in normal operation, only a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the Root, one to net, and one to the cachetest.net authoritative servers) to resolve the AAAA records of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist (this difference can be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure. Both BIND and Unbound send far more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps trying to resolve the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 in total out of 46). If we disregard these queries, Unbound would still increase the total number of queries from 6 to 16. (We repeated this experiment 100 times and all results were consistent.)

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND will ask both parents (the Roots and net) for its records again (shown by the increase in the root and net portions of BIND under DDoS vs. BIND under normal conditions in Figure 16). Unbound does not retry these queries.

This experiment is consistent with our observation that we see up to 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect: recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last layer" recursives (the ones that ultimately send queries to the authoritatives, Rn1–Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (the RIPE Atlas probe) and the server side (the authoritative servers). We show a summary of the results in Table 7. In this table, we mark (with *) the probing intervals T during which the simulated DDoS was active (dataset I: 90% packet drop on both authoritative servers, Table 4).

Client side: Analyzing this table, we can learn the following from the client side.

During normal operations, we see 3-3-3 at the client side (3 queries, 3 answers, and 3 R1's used in the answers). By default, each RIPE Atlas probe sends a DNS query to each of its R1's.

During the DDoS this changes slightly: we see 2 answers, from 2 R1's; the third query either times out or returns SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval).

This is, in fact, a good result given the simulated DDoS.

Authoritative side: Now let's consider the authoritative side.


      Client view (Atlas probe)      Authoritative server view
T     Queries  Answers  R1           Queries  Answers  ATs  Rn  (Rn-AT)  Max Que (% total)  Queries (top-2 Rn's)
1     3        3        3            3        3        2    2   3        1 (33.3%)          2, 1
2     3        3        3            5        5        2    4   5        1 (20.0%)          2, 1
3     3        3        3            6        6        2    6   6        1 (16.7%)          1, 1
4     3        3        3            5        5        2    2   3        2 (40.0%)          4, 1
5     3        3        3            3        3        2    3   3        1 (33.3%)          1, 1
6     3        3        3            6        6        2    5   5        2 (33.3%)          2, 1
7*    3        2        2            29       3        2    2   4        10 (34.5%)         15, 14
8*    3        2        2            22       6        2    7   9        7 (31.8%)          10, 7
9*    3        2        2            21       1        2    3   6        9 (42.9%)          16, 3
10*   3        2        2            11       4        2    3   4        6 (54.5%)          8, 2
11*   3        2        2            21       3        2    4   6        15 (71.4%)         17, 2
12*   3        2        2            23       1        2    8   8        15 (65.2%)         15, 2
13    3        3        3            3        3        2    3   3        1 (33.3%)          1, 1
14    3        3        3            4        2        2    4   4        1 (25.0%)          1, 1
15    3        3        3            8        8        2    7   7        7 (87.5%)          1, 1
16    3        3        3            9        9        2    3   3        6 (66.7%)          6, 3
17    3        3        3            9        9        2    5   5        5 (55.6%)          5, 1

Table 7: Client and authoritative view of AAAA queries for probe 28477 (measurement I); rows marked * are probing intervals during the simulated DDoS.

In normal operation (before the DDoS), we see that the 3 queries sent by the client (client view) lead to 3 to 6 queries being sent by the Rn's, which reach the authoritatives. Every query is answered, and both authoritatives are used (ATs).

We also see that the number of Rn's used before the attack varies from 2 to 6 (Rn). This choice depends on how the R1's choose their upper-layer recursives (Rn−1), and ultimately on how Rn−1 chooses Rn.

We can also observe how each Rn chooses each authoritative (Rn-AT): this metric shows the number of unique combinations of Rn and authoritatives observed. For example, for T=1 we see that three queries reach the authoritatives, but only 2 Rn's were used. By looking at the number of unique recursive-authoritative combinations, we see that there are three; as a consequence, one of the Rn's must have used both authoritatives.
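This bookkeeping is easy to reproduce from the query log; the sketch below (with hypothetical log entries) counts unique recursives, authoritatives, and (Rn, AT) pairs per probing interval:

from collections import defaultdict

# Hypothetical log entries: (probing interval T, recursive, authoritative).
log = [(1, "rn1", "at1"), (1, "rn1", "at2"), (1, "rn5", "at1")]

per_interval = defaultdict(set)
for t, rn, at in log:
    per_interval[t].add((rn, at))

for t, pairs in sorted(per_interval.items()):
    rns = {rn for rn, _ in pairs}
    ats = {at for _, at in pairs}
    # Matches the T=1 reasoning: 3 queries, 2 Rn's, but 3 unique (Rn, AT)
    # pairs, so one Rn must have used both authoritatives.
    print(f"T={t}: Rn={len(rns)} ATs={len(ats)} Rn-AT pairs={len(pairs)}")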

During the DDoS, however, things change: we see that the still 3 queries sent by probe 28477 (client side) now turn into 11–29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) Increase in the number of Rn's used: we see that the number of Rn's used also increases a little (3.6 on average before the attack, 4.5 during). (However, not all Rn's are used all the time during the DDoS. We do not have enough data to say why this happens; it is surely a function of how the lower-layer recursives choose them, or of how they are set up.)

(2) Increase in the number of authoritatives used by each Rn: each Rn, in turn, contacts an average of 1.13 authoritatives before the attack (the average of unique Rn-AT combinations over unique Rn's). During the DDoS, this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top 2 recursives (Queries, top-2 Rn's), and see that, compared to the total number of queries, they comprise the far majority.

For this probe, the answer to why there are so many queries is a combination of factors, with retries by the Rn's being the most important one. From the client side, there is no strong indication that the client's queries are leading to such an increase in the number of queries to the authoritatives during a DDoS.



TTL                  60    1800    3600   86400  3600-10m
Answers (valid)   90079   91461   89150   91172    182731
1-answer VPs         38      51      49      35        17
Warm-up (AAi)     15292   15396   15003   15310     15380
  Duplicates         25      23      25      22        23
  Unique          15267   15373   14978   15288     15357
  TTL as zone     14991   15046   14703   10618     15092
  TTL altered       276     327     275    4670       265
AA                74435   21574   10230     681     11797
CC                  235   29616   39472   51667    107760
  CCdec               4       5    1973    4045      9589
AC                   37   24645   24091   23202     47262
  TTL as zone         2   24584   23649   13487     43814
  TTL altered        35      61     442    9715      3448
CA                   42     179     305     277       515
  CAdec               7       3      21      29        65

Table 2: Valid DNS answers (expected/observed).

CA: answers from a recursive's cache, but expected from the authoritative (an extended cache).

To determine if a query should be answered by the cache of the recursive, we track the state of prior queries and responses and the estimated TTL. Tracking state is not hard, since we know the initial TTL and all queries to the zone, and we encode the serial number and the TTL in the AAAA reply (§3.2).
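To make the bookkeeping concrete, here is a minimal sketch of the classification logic (our illustration under the stated assumptions, not the paper's analysis code; the argument names are hypothetical):

def classify_answer(now, last_auth_answer, zone_ttl, from_cache,
                    ttl_returned, ttl_expected):
    # First letter: where we EXPECT the answer from, given the tracked
    # state (C if a prior authoritative answer should still be cached).
    expected = "C" if (last_auth_answer is not None
                       and now - last_auth_answer < zone_ttl) else "A"
    # Second letter: where the answer was OBSERVED to come from; the serial
    # number encoded in the AAAA record distinguishes cached copies.
    observed = "C" if from_cache else "A"
    # TTL manipulation: returned and expected TTLs differ by more than 10%.
    ttl_altered = abs(ttl_returned - ttl_expected) > 0.10 * ttl_expected
    return expected + observed, ttl_altered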

Cold Caches and Rewriting TTLs: We first consider queries made against a cold cache (the first query of a unique name), to test how many recursives override the TTL. We know that this happens at some sites, such as at Amazon EC2, where the virtual machines' (VMs) default recursive resolver caps all TTLs to 60 s [36].

Table 2 shows the results of our five experiments, in which we classify the valid answers from Table 1. Before classifying them, we first disregard VPs that had only one answer (1-answer VPs), since we cannot evaluate their cache status with one answer only (at most 51 VPs out of about 15,000, depending on the experiment). Then we classify the remaining queries as warm-up queries (AAi), all of which are type AA (expected and answered by the authoritative server).

We see some duplicate responses; for these, we use the timestamp of the very first AAi received. We then classify each unique AAi by comparing the TTL value returned by the recursive with the expected TTL that is encoded in the AAAA answer (fixed per experiment). If these two values differ by more than 10%, we report TTL altered. The "TTL as zone" line counts the answers with the TTL we expect, while "TTL altered" shows that a few hundred recursive resolvers alter the TTL.

We see that the vast majority of recursives honor small TTLs, with only about 2% truncating the TTL (275 to 327 of about 15,000, depending on the experiment's TTL). We and others (§7) see TTL truncation from multiple ASes. The exception is for queries with day-long TTLs (86400 s), where 4670 queries (30%) have shortened TTLs. (Prior work also reported that many public resolvers refresh at 1 day [51].) We conclude that wholesale TTL shortening does not occur for TTLs of an hour or less.

TTLs with Warm Cache: We next consider a warm cache, i.e., subsequent queries where we believe the recursive should have the prior answer cached, and classify them according to the proposed categories (AA, CC, AC and CA).

Figure 3 shows a histogram of these classifications (numbers shown in Table 2). We see that most answers we receive show expected caching behavior.

[Figure 3: Classification of subsequent answers with warm cache. Stacked bars of remaining queries for each experiment (60 s, 1800 s, 3600 s, 86400 s, 3600 s-10m), broken into AA, CC, AC, and CA; the cache-miss fractions are 0.0%, 32.6%, 32.9%, 30.9%, and 28.5%, respectively.]

For 60 s TTLs (the left bar), we expect no queries to be cached when we re-query 20 minutes (1200 s) later, and we see few cache hits (235 queries, the CC row in Table 2, which are due to TTL rewriting to values larger than 20 min). We see only a handful of CA-type replies, where we expect the authoritative to reply and the recursive does instead. We conclude that, under normal operations (with authoritatives responding), recursive resolvers do not serve stale results (as has been proposed for when the authoritative cannot be reached [19]).

For longer TTLs we see cache-miss (AC response) fractions of 28% to 33% (computed as AC/(Answers(valid) − (1-answer VPs + Warm-up))). Most of the AC answers did not alter the TTL, i.e., the cache miss was not due to TTL manipulation (Table 2). We do see 9715 TTL modifications (about 42% of ACs) when the TTL is 1 day (86400 s). These TTL truncations are consistent with recursive resolvers that limit cache durations, such as default caps of 7 days in BIND [17] and 1 day in Unbound [28]. (We provide more detail about TTL manipulations in Appendix B.)

We conclude that DNS caches are fairly effective, with cache hits about 70% of the time. This estimate is likely a lower bound: we are the only users of our domain, and popular domains would see more cache hits due to requests from other users. We only see TTL truncation for day-long TTLs. This result will help us understand the role of caching when authoritatives are under stress.

3.5 Public Recursives and Cache Fragmentation

Although we showed that most requests are cached as expected, about 30% are not. We know that many DNS requests are served by public recursive resolvers today, several of which exist [1, 12, 29, 37]. We also know that public recursives often use anycast and load balancing [48], and that this can result in caches that are fragmented (not shared) across many servers. We next examine how many cache misses (type AC replies) are due to public recursives.

Although we control queriers and authoritative servers, there may be multiple levels of recursive resolvers in between. From Figure 1, we see the querier's first-hop recursive (R1) and the recursive that queries the authoritative (Rn). Fortunately, queries and replies are unique, so we can relate queries to the final recursive, knowing the time (the query round) and the query source. For each query q we extract the IP address of Rn and compare it against a list of IP addresses for 96 public recursives (Appendix C) that we obtained from a DuckDuckGo search for "public dns" done on 2018-01-15.


TTL                  60    1800    3600   86400  3600-10m
AC Answers           37   24645   24091   23202     47262
Public R1             0   12000   11359   10869     21955
  Google Public R1    0    9693    9026    8585     17325
  other Public R1     0    2307    2333    2284      4630
Non-Public R1        37   12645   12732   12333     25307
  Google Public Rn    0    1196    1091     248      1708
  other Rn           37   11449   11641   12085     23599

Table 3: AC answers, public resolver classification.


Table 3 re-examines the AC replies from Table 2. With the exception of the measurements with a TTL of 60 s, nearly half of AC answers (cache misses) are from queries to public R1 recursives, and about three-quarters of those are from Google's Public DNS. The other half of cache misses start at non-public recursives, but 10% of these eventually emerge from Google's DNS.

Besides identifying public recursives, we also see evidence of cache fragmentation in answers from caches (CC and CA). Sometimes we see serial numbers in consecutive answers decrease. For example, one VP reports serial numbers 1, 3, 3, 7, 3, 3, suggesting that it is querying different recursives, one with serial 3 and another with serial 7 in its cache. We show these occurrences in Table 2 as CCdec and CAdec. With longer TTLs we see more cache fragmentation, with 4.5% of answers showing fragmentation with day-long TTLs.
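As an illustration (a minimal sketch, not the paper's code), fragmentation can be flagged by scanning each VP's answer stream for decreasing serials:

def count_serial_decreases(serials):
    # A decrease means a later answer came from a different cache holding
    # an older copy: [1, 3, 3, 7, 3, 3] -> 1 decrease (the 7 -> 3 step).
    return sum(1 for prev, cur in zip(serials, serials[1:]) if cur < prev)

Any VP with a non-zero count must have been served by at least two distinct, unsynchronized caches.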

From these observations we conclude that cache misses result from several causes: (1) use of load balancers or anycast, where servers lack shared caches; (2) first-level recursives that do not cache and have multiple second-level recursives; and (3) caches that may reset during the somewhat long probing interval (10 or 20 minutes). Causes (1) and (2) occur in public resolvers (confirmed by Google [12]) and account for about half of the cache misses in our measurements.

4 CACHING PRODUCTION ZONES

In §3 we showed that about one-third of queries do not conform with caching expectations, based on controlled experiments with our test domain. (Results may be better for caches that prioritize popular names.) We next examine this question for specific records in .nl, the country-code domain (ccTLD) for the Netherlands, and the Root (.) DNS zone. With traffic from "the wild" and a measurement target used by millions, this section uses domains popular enough to stay in-cache at recursives.

4.1 Requests at .nl's Authoritatives

We apply this methodology to data for the .nl country-code top-level domain (ccTLD). We look specifically at the A-records for the nameservers of .nl: ns[1-5].dns.nl.

Methodology: We use passive observations of traffic to the .nl authoritative servers.

For each target name in the zone and each source (some recursive server identified by IP address), we build a timeseries of all requests and compute their interarrival time ∆. Following the classification from §3.4, we label queries as AC if ∆ < TTL, showing an unnecessary query to the authoritative, or AA if ∆ ≥ TTL, an expected or delayed cache refresh. (We do not see cache hits, and so there are no CC events.)
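A minimal sketch of this labeling (our illustration; it assumes a hypothetical list of query timestamps, in seconds, from one recursive for one name):

def label_queries(timestamps, ttl):
    # Interarrival below the TTL means the recursive re-queried while the
    # record should still have been cached (AC); at or above the TTL it is
    # an expected or delayed refresh (AA).
    return ["AC" if cur - prev < ttl else "AA"
            for prev, cur in zip(timestamps, timestamps[1:])]

# Example: label_queries([0, 1700, 5400], 3600) returns ["AC", "AA"].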

[Figure 4: ECDF of the median ∆t for recursives with at least 5 queries to ns1–ns5.dns.nl (TTL of 3600 s).]


Dataset: At the time of our analysis (February 2018) there were 8 authoritative servers for the .nl zone. We collect traffic for the 4 unicast and one anycast authoritative servers, and store the data in ENTRADA [55] for analysis.

Since our data for .nl is incomplete, and we know recursives will query all authoritatives over time [27], our analysis represents a conservative estimate of TTL violations: we expect to miss some AC-type queries from resolvers to non-monitored authoritatives.

We collect data for a period of six hours on 2018-02-22, starting at 12:00 UTC. We only evaluate recursives that sent at least five queries for our domains of interest, omitting infrequent recursives (they do not change the results noticeably). We discard duplicate queries, for example a few retransmissions (less than 0.01% of the total queries). In total we consider more than 485k queries from 7779 different recursives.

Results: Figure 4 shows the distribution of ∆t that we observe in our measurements, reporting the median ∆t for each resolver that sends at least 5 queries.

About 28% of queries are frequent, with an interarrival time of less than 10 s, and 32% of these are sent to multiple authoritatives. We believe these are due to recursives submitting queries in parallel to speed up replies (perhaps the "Happy Eyeballs" algorithm [45]).

Since these closely-timed queries are not related to recursive caching, we exclude them from the analysis. The remaining data is 348k queries from 7703 different recursives.

The largest peak is at 3600 s, which is expected: the name was queried and cached for the full hour TTL, and the next request then causes the name to be re-fetched. These queries are all of type AA.

The smaller peak around 1800 s, as well as queries with other interarrival times less than 3600 s, correspond to type AC queries: queries that could have been supplied from the cache, but were not. 22% of resolvers sent most of their queries within a time interval that is less than 3600 s, or even more frequently. These AC queries occur because of TTL limiting, cache fragmentation, or other reasons that clear the cache.

4.2 Requests at the DNS Root

In this section we perform a similar analysis as for §4.1, in which we look into DNS queries received at all Root DNS servers (except G-Root) and create a distribution of the number of queries received per source IP address (i.e., per recursive).


[Figure 5: Distribution of the number of queries for the DS record of .nl received from each recursive. Dataset: DNS-OARC DITL, 2017-04-12t00:00Z, for 24 hours. F-Root and H-Root are highlighted; all other Root servers, with similar distributions, are shown as light-gray lines.]


In this analysis we use data from the DITL (Day In The Life) dataset of 2017, available at DNS-OARC [9]. We look at all DNS queries for the DS record of the domain .nl received at the Root DNS servers over the entire day of April 12, 2017 (UTC). This dataset consists of queries from more than 703k unique recursives seen across all Root servers. Note that the DS record for .nl has a TTL of 86400 seconds (24 hours). That is, in theory, one could expect to see just one query per recursive arriving at a given root letter for the DS record of .nl within the 24-hour interval.

Each line in Figure 5 shows the distribution of the total number of queries received at the Root servers from individual recursives asking for the DS record of .nl. Besides F- and H-Root, the distribution is similar across all Root servers; these are plotted as light-gray lines. F-Root shows the "most friendly" behavior from recursives, where around 5% of them sent 5 or more queries for .nl. As opposed to F, H-Root (dotted red line) shows the "worst" behavior from recursives, where more than 10% of them sent 5 or more queries for .nl within the 24-hour period.

The solid black line in Figure 5 shows the distribution of all the queries across all Root servers. The majority (around 87%) of recursives send only one query within the 24-hour interval. However, considering all Root servers, we see around 13% of recursives that sent multiple queries. Note that the distributions shown in Figure 5 have (very) long tails, and we see up to more than 21.8k queries from a single recursive within the 24-hour period for the .nl DS record, i.e., roughly one query every 4 seconds from the same IP address for the same DS record.

Discussion: we conclude that measurements of popular domains within .nl (§4.1) and the Roots (§4.2) show that about 63% and 87% of recursives honor the full TTL, respectively. These results are roughly in line with our observations with RIPE Atlas (§3).

5 THE CLIENT'S VIEW OF AUTHORITATIVES UNDER DDOS

We next use controlled experiments to evaluate how DDoS attacks on authoritative DNS servers impact client experience. Our studies of caching in controlled experiments (§3) and passive observations (§4) have shown that caching often works, but not always: about 70% of controlled experiments and 30% of passive observations see full cache lifetimes. Since the results of specific experiments vary, we sweep the space of attack intensities to understand the range of responses, from complete failure of authoritative servers to partial failures.

5.1 Emulating DDoS

To emulate DDoS attacks, we begin with the same test domain (cachetest.nl) we used for controlled experiments in §3. We run a normal DNS service for some time, querying from RIPE Atlas. After caches are warm, we then simulate a DDoS attack by dropping some fraction, or all, of the incoming DNS queries to each authoritative. (We drop incoming traffic randomly with Linux iptables; as such, packet drop is not biased towards any recursive.) After we begin dropping traffic, answers come either from caches at recursives or, for partial attacks, from a lucky query that passes through.
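The paper states only that incoming traffic is dropped randomly with Linux iptables; one plausible realization (a sketch under that assumption, not necessarily the authors' exact rules) uses the iptables "statistic" match:

import subprocess

def start_emulated_ddos(loss_fraction):
    # Randomly drop the given fraction of incoming DNS queries (UDP/53),
    # unbiased with respect to which recursive sent them.
    subprocess.run(["iptables", "-A", "INPUT", "-p", "udp", "--dport", "53",
                    "-m", "statistic", "--mode", "random",
                    "--probability", str(loss_fraction), "-j", "DROP"],
                   check=True)

def stop_emulated_ddos():
    # Flush the INPUT chain to restore normal service after the experiment.
    subprocess.run(["iptables", "-F", "INPUT"], check=True)

For a complete failure (Experiments A–C) the probability would be 1.0; for the partial failures (Experiments D–I) it would be 0.5, 0.75, or 0.9.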

This emulation of DDoS captures the traffic loss that occurs in a DDoS attack as router queues overflow. The emulation is not perfect, since we simulate loss at the last-hop router, while in real DDoS attacks packets are often lost on access links near the target. Our emulation approximates this effect with one aggregate loss rate.

DDoS attacks are also accompanied by queueing delay, since buffers at and near the target are full. We do not model queueing delay, although we do observe latency increasing due to retries. In modern routers, queueing delay due to full router buffers should be less than the retry interval. In addition, observations during real-world DDoS events show that the few queries that are successful see response times that are not much higher than typical [23], suggesting that loss (and not delay) is the dominant effect of DDoS in practice. However, a study that adds queueing latency to the attack model is interesting future work.

5.2 Clients During Complete Authoritatives Failure

We first evaluate the worst-case scenario for a DNS operator: complete unreachability of all authoritative name servers. Our goal is to understand when, and for how long, caches cover such an outage.

Table 4 shows Experiments A, B and C, which simulate complete failure. In Experiment A, each VP makes only one query before the DDoS begins. In Experiment B we allow several queries to take place, and Experiment C allows several queries with a shorter TTL.

Caches Protect Some: We first consider Experiment A, with one query that warms the cache, immediately followed by the attack. Figure 6a shows these responses over time, with the onset of the attack at the first downward arrow, between 0 and 10 minutes, and with the cache expired after the second downward arrow, between 60 and 70 minutes. We see that after the DDoS starts, but before the cache has fully expired (between the downward arrows), initially 30% and eventually 65% of queries fail, with either no answer or a SERVFAIL error. While not good, this does mean that 35 to 70% of queries during the DDoS are successfully served from the cache. By contrast, shortly after the cache expires, almost all queries fail (only 25 VPs, or 0.2% of the total, seem to provide stale answers).

Caches Fill at Different Times: In a more realistic scenario, VPs have filled their caches at different times. In Experiment A, caches are freshly filled and should last for a full hour after the start of the attack. Experiment B is designed for the opposite and worst case: we begin warming the cache one hour before the attack and query 6 times from each VP. Other parameters are the same, with the attack lasting for 60 minutes (also the cache duration), but then we restore the authoritatives to service.


Parameters:
Exp.  TTL    DDoS   DDoS  queries  total  probe     failure
      (sec)  start  dur.  before   dur.   interval
A     3600   10     60    1        120    10        100% (both NSes)
B     3600   60     60    6        240    10        100% (both NSes)
C     1800   60     60    6        180    10        100% (both NSes)
D     1800   60     60    6        180    10        50% (one NS)
E     1800   60     60    6        180    10        50% (both NSes)
F     1800   60     60    6        180    10        75% (both NSes)
G     300    60     60    6        180    10        75% (both NSes)
H     1800   60     60    6        180    10        90% (both NSes)
I     60     60     60    6        180    10        90% (both NSes)

Results:
Exp.  Total   Valid   VPs    Queries  Total    Valid
      probes  probes                  answers  answers
A     9224    8727    15339  136423   76619    76181
B     9237    8827    15528  357102   293881   292564
C     9261    8847    15578  258695   199185   198197
D     9139    8708    15332  286231   273716   272231
E     9153    8708    15320  285325   270179   268786
F     9141    8727    15325  278741   259009   257740
G     9206    8771    15481  274755   249958   249042
H     9226    8778    15486  269030   242725   241569
I     9224    8735    15388  253228   218831   217979

Table 4: DDoS emulation experiments [38]. DDoS start, durations, and the probe interval are given in minutes.

[Figure 6: Answers received during DDoS attacks (OK, SERVFAIL, or no answer): (a) Experiment A (3600-10min-1down), arrows indicate DDoS start and cache expiration; (b) Experiment B (3600-10min-1down-1up), arrows indicate DDoS start and recovery; (c) Experiment C (1800-10min-1down-1up), arrows indicate DDoS start, cache expiration, and recovery.]

[Figure 7: Timeseries of answers for Experiment B (AA, CC, and CA).]


Figure 6b shows the results of Experiment B. While about 50% of VPs are served from the cache in the first 10-minute round after the DDoS starts, the fraction served drops quickly, and is at only about 3% one hour later. Three factors are at play here. First, most caches were filled 60 minutes before the attack and are timing out in the first round. While the timeout and query rounds are both 60 minutes apart, Atlas intentionally spreads queries out over 5 minutes, so we expect that some queries happen after 59 minutes and others after 61 minutes.

Second, we know some large recursives have fragmented caches (§3.5), so we expect that some of the successes between times 70 and 110 minutes are due to caches that were filled between times 10 and 50 minutes. This can actually be seen in Figure 7, where we show a timeseries of the answers for Experiment B, with CC (correct cache responses) visible between times 60 and 90.

Third, we see an increase in the number of CA queries, answered by the cache with expired TTLs (Figure 7). This increase is due to servers serving stale content [19].

Caches Eventually All Expire: Finally, we carry out a third emulation, but with half the cache lifetime (1800 s or 30 minutes, rather than the full hour). Figure 6c shows responses over time. These results are similar to Experiment B, with a rapid fall-off as caches age once the attack starts. After the attack has been underway for 30 minutes, all caches must have expired, and we see only a few (about 2.6%) residual successes.


5.3 Discussion of Complete Failures

Overall, we see that caching is partially successful in protecting during a DDoS. With full, valid caches, half or more of the VPs get service. However, caches are filled at different times and expire, so an operator cannot count on a full cache duration for any customer, even for popular ("always in the cache") domains. The protection provided by caches depends on their state in the recursive resolver, something outside the operator's control. In addition, our evaluation of caching in §3 showed that caches will end early for some VPs.

Second, we were surprised that a tiny fraction of VPs are successful after all caches should have timed out (after the 80-minute period in Experiment A, and between 90 and 110 minutes in Experiment C). These successes suggest an early deployment of "serve stale", something currently under review in the IETF [19]: serving a previously known record beyond its TTL if the authoritatives are unreachable, with the goal of improving resilience under DDoS. We investigated Experiment A, where we see that 1048 of the 1140 successes in the second half of the outage come from 471 VPs (and 215 recursives), most of them answered by OpenDNS and Google public DNS servers, suggesting experimentation that is not yet widespread. Of these 1048 queries, 1031 return a TTL value equal to 0, as specified in the IETF serve-stale draft [19].
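A sketch of how such answers can be flagged (our illustration; the answer-record fields are hypothetical):

def likely_stale_answers(answers, attack_start, zone_ttl):
    # After attack_start + zone_ttl every legitimately cached copy has
    # expired, so any later successful answer suggests stale serving;
    # answers carrying TTL 0 match the IETF serve-stale draft [19].
    cache_deadline = attack_start + zone_ttl
    return [(t, ttl) for (t, rcode, ttl) in answers
            if t > cache_deadline and rcode == "NOERROR"]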

5.4 Client Reliability During Partial Authoritative Failure

The previous section examined DDoS attacks that result in complete failure of all authoritatives, but DDoS attacks often result in partial failure, with 50% or 90% packet loss at the authoritatives. (For example, consider the November 2015 DDoS attack on the DNS Root [23].) We next study experiments with partial failures, showing that caching and retries together nearly fully protect against 50% DDoS events, and protect half of the VPs even during 90% events.

We carry out several experiments, D to I in Table 4. We follow the procedure outlined in §5.1, looking at DDoS-driven loss rates of 50%, 75% and 90%, with TTLs of 1800 s, 300 s and 60 s. Graphs omitted here due to space can be found in Appendix D.

Near-Full Protection from Caches During Moderate Attacks: We first consider Experiment E, a "mild" DDoS with 50% loss, with VP success over time shown in Figure 8a. In spite of a loss rate that would be crippling to TCP, nearly all VPs are successful in DNS. This success is due to two factors: first, we know that many clients are served from caches, as was shown in Experiment A with full loss (Figure 6a). Second, most recursives retry queries, so they recover from the loss of a single packet and are able to provide an answer. Together, these mean that the failure rate during the first 30 minutes of the event is 8.5%, only slightly higher than the 4.8% failure rate before the DDoS. For this experiment the TTL is 1800 s (30 minutes), so we might expect failures to increase halfway through the DDoS. We do not see any increase in failures because caching and retries are synergistic: a successful retried query will place the answer in a cache for a later query. The importance of this result is that DNS can survive moderate-size attacks when caching is possible. While a positive, retries do increase latency, something we study in §5.5.

Attack Intensity Matters: While clients do quite well with 50% loss at all authoritatives, failures increase with the intensity of the attack.

[Figure 8: Answers received during DDoS attacks (OK, SERVFAIL, or no answer); the first and second vertical lines show the start and end of the DDoS: (a) Experiment E (1800-50p-10min), 50% packet loss; (b) Experiment F (1800-75p-10min), 75% packet loss; (c) Experiment H (1800-90p-10min), 90% packet loss; (d) Experiment I (60-90p-10min), 90% packet loss.]


Experiments F and H, shown in Figure 8b and Figure 8c, increase the loss rate to 75% and 90%. We see the number of failures increase, to about 19.0% with 75% loss and 40.3% with 90% loss. It is important to note that roughly 60% of the clients are still served even with 90% loss.

We also see that this level of success is consistent over the entire hour-long DDoS event, even though the cache duration is only 30 minutes. This consistency confirms the importance of caching and retries in combination.

To verify the effects of this interaction, Experiment I changes the caching duration to 60 s, less than one round of probing. Comparing Experiment I in Figure 8d to H in Figure 8c, we see that the failure rate increases from 30% to about 63%. However, even with no caching, about 37% of queries are still answered, due to resolvers that serve stale content and to recursive retries. We investigate retries in §6.

5.5 Client Latency During Partial Authoritative Failure

We showed that client reliability is higher than expected during failures (§5.4), due to a combination of caching and retries. We next consider client latency. Latency will increase during the DDoS because of retries and queueing delay, but we will show that latency increases less than one might expect, due to caching.

To examine latency we return to Experiments D through I (Table 4), but look at latency (time to complete a query) rather than success. For these experiments, clients time out after 5 s.

Figures 9a to 9d show latency during each emulated DDoS scenario (experiments with figures omitted here are in Appendix D). Latencies are not evenly distributed, since some requests get through immediately while others must be retried one or more times, so in addition to the mean we show the 50, 75 and 90% quantiles to characterize the tail of the distribution.

We emulate DDoS by dropping requests (§5.1), and hence latencies reflect retries and loss but not queueing delay, underrepresenting latency in real-world attacks. However, their shape (mostly low latency, with a few long) is consistent with, and helps explain, what has been seen in the past [23].

Beginning with Experiment E, the moderate attack in Figure 9a, we see no change to median latency. This result is consistent with many queries being handled by the cache, and half of those not handled by the cache getting through anyway. We do see higher latency in the 90%ile tail, reflecting successful retries. This tail also increases the mean somewhat.

This trend increases in Experiment F (Figure 9b), where 75% of queries are lost. Now we see the 75%ile tail has increased, as has the number of unanswered queries, and the 90%ile is twice as long as in Experiment E.

We see the same latency in Experiment H, with the DDoS causing 90% loss. We set the timeouts to 5 s, so the larger attack results in more unsuccessful queries, but latency for successful queries is not much worse than with 75% loss. Median latency is still low due to cached replies.

Finally, Experiment I greatly reduces the opportunities for caching by reducing the cache lifetime to one minute. Figure 9d shows that loss of caching increases median RTT and significantly increases the tail latency. Compared with Figure 9c (same packet loss ratio but 1800 s TTL), we can clearly see the benefits of caching in terms of latency (in addition to reliability): a half-hour TTL value reduced the latency from 1300 ms to 390 ms. Longer TTLs also help reduce tail latency relative to shorter TTLs (compare, for example, the 90%ile RTT in Experiments I vs. H in Figure 9).

[Figure 9: Latency results (median, mean, 75%ile, and 90%ile RTT in ms); the shaded area indicates the interval of an ongoing DDoS attack: (a) Experiment E, 50% packet loss (1800 s TTL); (b) Experiment F, 75% packet loss (1800 s TTL); (c) Experiment H, 90% packet loss (1800 s TTL); (d) Experiment I, 90% packet loss (60 s TTL).]



Summary: DDoS effects often increase client latency. For moderate attacks, increased latency is seen only by a few "unlucky" clients who do not see a full cache and whose queries are lost. Caching has an important role in reducing latency during DDoS, but while it can often mitigate most reliability problems, it cannot avoid latency penalties for all VPs. Even when caching is not available, roughly 40% of clients get an answer, either through serving stale or through retries, as we investigate next.

6 THE AUTHORITATIVE'S PERSPECTIVE

Results of partial DDoS events (§5.4) show that DNS is surprisingly reliable: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of the VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d). These results are due to a combination of caching and retries. We next examine this from the perspective of the authoritative server.

6.1 Recursive-Authoritative Traffic during a DDoS

We first ask: are retries by recursive resolvers responsible for the success rates observed in §5.4? To investigate this question, we return to the partial DDoS experiments and look at how many queries are sent to the authoritative servers. We measure queries before they are dropped by our simulated DDoS. Recursives must make multiple queries to resolve a name. We break out each type of query: for the nameserver (NS), for the nameserver's IPv4 and IPv6 addresses (A-for-NS and AAAA-for-NS), and finally the desired query (AAAA-for-PID). Note that the authoritative is IPv4-only, so AAAA-for-NS is non-existent and subject to negative caching, while the other records exist and use regular caching.

We begin with the DDoS causing 75% loss, in Figure 10a. For this experiment we observe 18407 unique IP addresses of recursives (Rn) querying for AAAA records directly at our authoritatives. During the DDoS, queries increase by about 3.5×. We expect 4 trials, since the expected number of tries until success with loss rate p is (1 − p)^−1. For this scenario, results are cached for up to 30 minutes, so successful queries are reused from recursive caches. This increase occurs both for the target AAAA record and also for the non-existent AAAA-for-NS records. Negative caching for our zone is configured to 60 s, making caching of NXDOMAINs for AAAA-for-NS less effective than positive caches.
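The retry arithmetic follows from the standard geometric distribution; a short derivation (ours, shown for completeness, and consistent with the paper's numbers):

\[
E[\text{tries}] \;=\; \sum_{k=1}^{\infty} k\,(1-p)\,p^{\,k-1} \;=\; \frac{1}{1-p},
\qquad p=0.75 \Rightarrow E[\text{tries}]=4,
\qquad p=0.90 \Rightarrow E[\text{tries}]=10 .
\]

For the 90%-loss experiments this predicts up to 10 tries per query; the observed 8.2× increase (below) plausibly falls short of that bound because recursives cap their retries and eventually time out (§6.2), and because caching absorbs some queries.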

The offered load on the server increases further with more loss (90%), as shown in Experiment H (Figure 10b). The higher loss rate results in a much higher offered load on the server: on average, 8.2× normal.

Finally, in Figure 10c, we reduce the effects of caching, with a 90% DDoS and a TTL of 60 s. Here we see about 8.1× more queries at the server than before the attack. Comparing this case to Experiment H, caching reduces the offered load on the server by about 40%.

[Figure 10: Number of queries received by the authoritative servers, broken out by type (NS, A-for-NS, AAAA-for-NS, AAAA-for-PID); the shaded area indicates the interval of an ongoing DDoS attack: (a) Experiment F (1800-75p-10min), 75% packet loss; (b) Experiment H (1800-90p-10min), 90% packet loss; (c) Experiment I (60-90p-10min), 90% packet loss.]

Implications: The implication of this analysis is that legitimate clients "hammer" the already-stressed server with retries during a DDoS. For clients, retries are important to get reliability, and each client independently chooses to retry.

The server is already under stress due to the DDoS, so these retries add to that stress. However, the DDoS traffic is almost certainly much larger than the retries of legitimate traffic. (A server experiencing a volumetric attack causing 90% loss must be receiving 10× its capacity. Regular traffic is a small fraction of normal capacity, so even 4× regular traffic is still much less than the attack traffic.) The multiplier for retried legitimate traffic depends on the implementations of stub and recursive resolvers, as well as on application-level retries and defection (users hitting reload in their browser, and later giving up). Our experiment omits application-level retries, and so likely gives a lower bound. We next examine specific recursive implementations to see their behavior.



6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers "hammer" authoritatives when queries are dropped. We re-examine DNS software (first studied in 2012 [56]), and additionally show that deployments amplify retries.

Recursive Software: Prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], with evaluation of BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS and PowerDNS. We studied retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records of all authoritatives and the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7× more requests before the recursives time out. (Exact numbers vary in different runs, but typically each request is made 6 or 7 times.) Such retries are appropriate provided they are paced (both use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.

Recursive Deployment: Another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex, multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify amplification, we count both the number of Rn recursives and the number of AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values capture the amplification in two ways: during stress, more Rn recursives will be used for each probe ID, and these Rn will generate more queries to the already-stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4). The maximum rises to 39. The number of queries for each probe ID grows more than 3×, from 2 to 7. Worse, the 90%ile grows more than 6× (3 queries to 18). The maximum grows 5.35×, reaching up to 286 queries for one single probe ID. This value, however, is a lower bound, given that a large number of A and AAAA queries ask for NS records rather than the probe ID (AAAA- and A-for-NS in Figure 10).
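A minimal sketch of this counting (our illustration; the input layout is hypothetical):

from collections import defaultdict

def per_probe_amplification(rows):
    # rows: (probe_id, rn_address) pairs for the AAAA-for-PID queries seen
    # at the authoritatives during one round.
    rn_used = defaultdict(set)
    queries = defaultdict(int)
    for probe_id, rn in rows:
        rn_used[probe_id].add(rn)
        queries[probe_id] += 1
    # Per probe ID: distinct Rn used and queries generated; medians,
    # 90%iles, and maxima over probes give the curves in Figure 11.
    return {p: (len(rn_used[p]), queries[p]) for p in queries}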

We can also look at the aggregate effects of retries created by the complex recursive infrastructure. Figure 12 shows a timeseries of the unique IP addresses of Rn observed at the authoritatives. Before the DDoS period, for Experiment I with its TTL of 60 s, we see a constant number of recursives reaching our authoritatives, i.e., all queries should be answered by the authoritatives (no caching at this TTL value). For Experiments F and H, both with TTLs of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks are observed when caches expire, as expected.

During the DDoS we observe similar behavior for all three experiments in Figure 12: as packets are dropped at the authoritatives (at rates of 75%, 90% and 90% for F, H and I, respectively), we see an increase in the number of Rn recursives querying our authoritatives. For Experiments F and H we see drops when caching is expected, but not for Experiment I. The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which is amplified in the end. (We show this behavior for an individual probe in Appendix F, where we observe the growth in the number of queries received at the authoritatives and in the number of recursives used.)

[Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs (median, 90%ile, and maximum for Rn-per-PID and AAAA-for-PID).]

[Figure 12: Unique Rn recursive addresses observed at the authoritatives for Experiments F, H, and I.]


Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to be within a constant factor of single recursives, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and their operators to share information about their behavior.

7 RELATED WORK

Caching by Recursives: Several groups have shown that DNS caching can be imperfect. Hao and Wang analyzed the impact of nonce domains on DNS recursives' caches [13]. Using two weeks of data from two universities, they showed that filtering one-time domains improves cache hit rates. In two studies, Pang et al. [31, 32] reported that web clients and local recursives do not always honor the TTL values provided by authoritatives. Almeida et al. [2] analyzed DNS traces of a mobile operator, and used a mobile application to see TTLs in practice. They find that most domains have short TTLs (less than 60 s), and report evidence of TTL manipulation by recursives. Schomp et al. [48] demonstrate widespread use of multi-level recursives by large operators, as well as TTL manipulation. Our work builds on this prior work, examining caching and TTL manipulation systematically and considering their effects on resilience.



DNS client behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, and found that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives during DDoS: We investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast and both latency and reachability, as seen from RIPE Atlas. Rather than looking at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while the prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study has incomplete data and focuses on specific results of two events. These differences stem from their study of natural experiments from real-world events and our controlled experiments.

8 IMPLICATIONS

We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get through to the authoritative servers. The key implication of our study is to explain the differences in the outcomes of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative "server" implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: "There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability..." [41]

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure, and immediately took steps to address the service problems, there were reports of user-visible service disruption in the technical and even the popular press [34]. Reports describe intermittent failure of prominent websites including "Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times", each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cachable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports 922 anycast instances operating across the 13 authoritative servers.

Dyn also operates a large infrastructure (https://dyn.com/dns/network-map reports 20 "facilities"), and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components with TTLs that range from 120 to 300 s.

In addition to explaining the effects, our experiments help get to the root causes behind these outcomes. Users of the Root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs like .com and country codes) are popular and certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or through a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) received performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to allow caching to help and to give DNS operators time to deploy defenses; Experiment H suggests 30 minutes (Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow for re-provisioning during DDoS attacks on web servers, but that leaves the DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, with long DNS TTLs, may be preferred to short DNS TTLs, so that both layers can be robust. However, the complexity of interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize the geographic dispersion of its sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with the largest latency [27]. Prior work about resilience to DDoS attack has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic that each site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be as resilient as the strongest individual authoritative. The reason for these opposite results is that in both cases recursive resolvers will try all authoritatives of a given service. For latency, they will sometimes choose a distant authoritative, but for resilience, they will continue until they find the most available authoritative.

9 CONCLUSIONS

This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole. In fact, they can largely cover over partial DDoS attacks for many users: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of the VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d).



The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users, with a few seeing much greater latency while others see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruption, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of our work is to quantify the importance of caching and retries at recursives to resilience, encouraging the use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on drafts of this paper.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements number FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES

[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/

[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375

[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf

[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012 Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report

[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.

[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from Using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf

[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717

[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188–200.

[9] DNS OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018

[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181 Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.

[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182

[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns/

[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032

[14] Scott Hilton. 2016. Dyn Analysis Summary Of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/

[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis/?include_text=1

[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf

[17] ISC BIND. 2018. Chapter 6. BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html

[18] Sam Kottler. 2018. February 28th DDoS Incident Report | Github Engineering. https://githubengineering.com/ddos-incident-report/

[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency-02. Internet Draft. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt

[20] P.V. Mockapetris. 1987. Domain names - concepts and facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 2065, 2181, 2308, 2535, 4033, 4034, 4035, 4343, 4035, 4592, 5936, 8020.

[21] P.V. Mockapetris. 1987. Domain names - implementation and specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604, 7766.

[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/

[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446

[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS". (May 2018). Web page. https://ant.isi.edu/datasets/dns/Moura18a_data

[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).

[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html

[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366

[28] NLnet Labs. 2018. NLnet Labs Documentation - Unbound - unbound.conf.5. https://nlnetlabs.nl/documentation/unbound/unbound.conf

[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide/

[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An overview of DNS-based server selections in content distribution networks. Computer Networks 43, 6 (2003), 695–711.

[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792

[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790

[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt

[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html

[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows

[36] Alec Peterson. 2017. EC2 resolver changing TTL on DNS answers. Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html

[37] Quad9. 2018. Quad9 | Internet Security & Privacy In a Few Easy Steps. https://quad9.net

[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID. TTL60: 10443671; TTL1800: 10507676; TTL3600: 10536725; TTL86400: 10579327; TTL3600-10min: 10581463; A: 10859822; B: 11102436; C: 11221270; D: 11804500; E: 11831403; F: 11831403; G: 12131707; H: 12177478; I: 12209843.

[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.

[40] RIPE Network Coordination Centre. 2018. RIPE Atlas - Raw data structure documentation. https://atlas.ripe.net/docs/data_struct

[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt

[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt

[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/

[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters - An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.

[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. Internet Request For Comments. https://doi.org/10.17487/RFC8305

[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html

[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html

[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On measuring the client-side DNS infrastructure. In Proceedings of the 2013 ACM Internet Measurement Conference. ACM, 77–90.

[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html

[50] SIDN Labs. 2017. .nl stats and data. http://stats.sidnlabs.nl

[51] Matthew Thomas and Duane Wessels. 2015. A study of caching behavior with respect to root server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374

[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html

[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.

[54] M. Weinberg and D. Wessels. 2016. Review and analysis of attack traffic against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24. Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7

[55] Maarten Wullink Giovane CM Moura Moritz Muumlller and Cristian Hesselman2016 ENTRADA A high-performance network traffic data streaming warehouseIn Network Operations and Management Symposium (NOMS) 2016 IEEEIFIP IEEE913ndash918

[56] Yingdi Yu Duane Wessels Matt Larson and Lixia Zhang 2012 Authority ServerSelection in DNS Caching Resolvers SIGCOMM Comput Commun Rev 42 2(March 2012) 80ndash86 httpsdoiorg10114521853762185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS?

Cache estimates in §3 and §4 assume each record has a clear TTL. However, glue records bridge zones that are served by different authoritative servers, and can result in two different TTL values for a record. This section examines which TTL takes precedence.

A.1 The Problem of Glue Records

DNS employs glue records to identify name servers that are in the served zone itself. For example, the Canadian national domain ca is served from ca-servers.ca, but to look up ca-servers.ca one must know where ca is. To break this cycle, records that identify these servers ("glue records") are placed in the parent zone (the root in this case). With these records in two places, they may have different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net
;; AUTHORITY SECTION:
ca.  172800  IN  NS  any.ca-servers.ca.
ca.  172800  IN  NS  j.ca-servers.ca.
ca.  172800  IN  NS  x.ca-servers.ca.
ca.  172800  IN  NS  c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net

Notice that the answer for the NS records is in the "AUTHORITY SECTION"; this signals that this answer is a "referral", a type of answer "in which a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query" [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for ca, a recursive can ask the same query directly to any of the ca servers. Listing 2 shows the results:

$ dig ns ca @any.ca-servers.ca
;; ANSWER SECTION:
ca.  86400  IN  NS  c.ca-servers.ca.
ca.  86400  IN  NS  j.ca-servers.ca.
ca.  86400  IN  NS  x.ca-servers.ca.
ca.  86400  IN  NS  any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca

Notice that the records in this response from any.ca-servers.ca are presented in the "ANSWER SECTION" (and not in the "AUTHORITY SECTION"), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTLs are different: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).

We refer to this as TTL duplication: the same record is defined both at its authoritative servers and, as a glue record, at the servers of its parent zone.

This duplication poses a challenge for recursive resolvers: which value should they trust, the one from the glue or the one from the authoritative name server?

In theory, the answer from the authoritative should be the one used, since those are the servers that are authoritative for the zone; the Roots in this example are not "completely" authoritative [15]. This behavior can be observed in various other zones as well, not only at the Roots.

Assuming records duplicated as for ca, we investigate in this section which values resolvers actually use in the wild: those provided by the parent or those provided by the authoritative. According to RFC 2181 [10] (§5.4.1), a recursive should use the authoritative answer. Given the importance of TTLs to DNS resilience (cache duration), this question matters for operators: do recursives trust the parent or the authoritative name server?


                   NS record   A Record   Source
Total Answers      128382      98038
TTL > 3600         38          46         Unclear
TTL = 3600         260         233        Parent
60 < TTL < 3600    6890        4621       Parent/others
TTL = 60           60803       46820      Authoritative
TTL < 60           60391       46318      Authoritative

Table 5: Distribution of TTLs of DNS answers for cachetest.nl.

A records of amazon.com:
domain        IP               TTL
amazon.com    205.251.242.103  60
amazon.com    176.32.103.205   60
amazon.com    176.32.98.166    60

NS records of amazon.com:
domain        NS                    TTL-auth  TTL-com
amazon.com    ns1.p31.dynect.net    3600      172800
amazon.com    pdns6.ultradns.co.uk  3600      172800
amazon.com    ns2.p31.dynect.net    3600      172800
amazon.com    ns4.p31.dynect.net    3600      172800
amazon.com    pdns1.ultradns.net    3600      172800
amazon.com    ns3.p31.dynect.net    3600      172800

A records of the NS records:
domain                IP             TTL
ns1.p31.dynect.net    208.78.70.31   86400
pdns6.ultradns.co.uk  204.74.115.1   3600
ns2.p31.dynect.net    204.13.250.31  86400
ns4.p31.dynect.net    204.13.251.31  86400
pdns1.ultradns.net    204.74.108.1   3600
ns3.p31.dynect.net    208.78.71.31   86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments

We use the same setup as in §3 and configure the TTL values of our name servers (ns[1,2].cachetest.nl) as follows (a zone-file sketch of this split follows the list):

• Referral value (at the nl zone): TTL of 3600 s
• Answer value (at the authoritative servers ns[1,2].cachetest.nl): TTL of 60 s
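A minimal sketch of this split in standard zone-file notation (the records are illustrative of our configuration, with the glue address records omitted):

; in the parent (nl) zone: referral NS records with TTL 3600 s
cachetest.nl.  3600  IN  NS  ns1.cachetest.nl.
cachetest.nl.  3600  IN  NS  ns2.cachetest.nl.

; in the child (cachetest.nl) zone: the same NS records with TTL 60 s
cachetest.nl.    60  IN  NS  ns1.cachetest.nl.
cachetest.nl.    60  IN  NS  ns2.cachetest.nl.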

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of the TTL values of the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. The large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the values provided by the parent nl servers. We conclude from these two experiments that, for A and NS records, most recursives forward to their clients the TTL provided by the authoritatives of a zone.

A.3 Looking At a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritative, and not the one from referrals.

In this section we look into the cache of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives and 172800 s at the parent com servers, as can be seen in Table 6.

To determine which specific record two popular recursive resolver implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send a single query: dig ns amazon.com. We then dump the contents of the cache of each server and see which values they store.
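Both implementations ship administrative tools to export their caches; a sketch of the commands involved (assuming the control channels are already configured):

# BIND: write the cache to the server's dump file (named_dump.db by default)
rndc dumpdb -cache

# Unbound: print the in-memory cache to stdout
unbound-control dump_cache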

Listing 3 and Listing 4 show the contents of the caches of BIND and Unbound after sending the aforementioned query. The values stored in the cache are the ones provided by the authoritative server (decremented by the few seconds that elapsed before we dumped the contents), and not the referral values from the parent:

; authanswer
amazon.com.  3595  NS  ns1.p31.dynect.net.
amazon.com.  3595  NS  ns2.p31.dynect.net.
amazon.com.  3595  NS  ns3.p31.dynect.net.
amazon.com.  3595  NS  ns4.p31.dynect.net.
amazon.com.  3595  NS  pdns1.ultradns.net.
amazon.com.  3595  NS  pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com

;rrset 3594 6 0 7 3
amazon.com.  3594  IN  NS  pdns1.ultradns.net.
amazon.com.  3594  IN  NS  ns1.p31.dynect.net.
amazon.com.  3594  IN  NS  ns3.p31.dynect.net.
amazon.com.  3594  IN  NS  pdns6.ultradns.co.uk.
amazon.com.  3594  IN  NS  ns4.p31.dynect.net.
amazon.com.  3594  IN  NS  ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com

A.4 Discussion and Implications

Understanding how glue records affect TTLs shows that zone operators are, in fact, in charge of how long their NS records remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers: queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. With the exception of the 60 s TTL, where all queries go to the authoritatives (AA), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time. Panels: (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s with probing interval T=10 min.]

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72 Alternate DNS
23.253.163.53 Alternate DNS
205.204.88.60 BlockAid Public DNS (or PeerDNS)
178.21.23.150 BlockAid Public DNS (or PeerDNS)
91.239.100.100 Censurfridns
89.233.43.71 Censurfridns


2001:67c:28a4:: Censurfridns
2002:d596:2a92:1:71:53:: Censurfridns
213.73.91.35 Chaos Computer Club Berlin
209.59.210.167 Christoph Hochstätter
85.214.117.11 Christoph Hochstätter

212.82.225.7 ClaraNet
212.82.226.212 ClaraNet
8.26.56.26 Comodo Secure DNS
8.20.247.20 Comodo Secure DNS
84.200.69.80 DNS.Watch
84.200.70.40 DNS.Watch
2001:1608:10:25::1c04:b12f DNS.Watch
2001:1608:10:25::9249:d69b DNS.Watch
104.236.210.29 DNSReactor
45.55.155.25 DNSReactor
216.146.35.35 Dyn
216.146.36.36 Dyn
80.67.169.12 FDN
2001:910:800::12 FDN
85.214.73.63 FoeBud
87.118.111.215 FoolDNS
213.187.11.62 FoolDNS
37.235.1.174 FreeDNS
37.235.1.177 FreeDNS
80.80.80.80 Freenom World
80.80.81.81 Freenom World
87.118.100.175 German Privacy Foundation e.V.
94.75.228.29 German Privacy Foundation e.V.
85.25.251.254 German Privacy Foundation e.V.
62.141.58.13 German Privacy Foundation e.V.
8.8.8.8 Google Public DNS
8.8.4.4 Google Public DNS
2001:4860:4860::8888 Google Public DNS
2001:4860:4860::8844 Google Public DNS
81.218.119.11 GreenTeamDNS
209.88.198.133 GreenTeamDNS
74.82.42.42 Hurricane Electric
2001:470:20::2 Hurricane Electric
209.244.0.3 Level3
209.244.0.4 Level3
156.154.70.1 Neustar DNS Advantage
156.154.71.1 Neustar DNS Advantage
5.45.96.220 New Nations
185.82.22.133 New Nations
198.153.192.1 Norton DNS
198.153.194.1 Norton DNS
208.67.222.222 OpenDNS
208.67.220.220 OpenDNS
2620:0:ccc::2 OpenDNS
2620:0:ccd::2 OpenDNS
58.6.115.42 OpenNIC
58.6.115.43 OpenNIC
119.31.230.42 OpenNIC
200.252.98.162 OpenNIC
217.79.186.148 OpenNIC
81.89.98.6 OpenNIC
78.159.101.37 OpenNIC
203.167.220.153 OpenNIC
82.229.244.191 OpenNIC
216.87.84.211 OpenNIC
66.244.95.20 OpenNIC
207.192.69.155 OpenNIC
72.14.189.120 OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9 OpenNIC
2001:470:1f07:38b::1 OpenNIC
2001:470:1f10:c6::2001 OpenNIC
194.145.226.26 PowerNS


77.220.232.44 PowerNS
9.9.9.9 Quad9
2620:fe::fe Quad9
195.46.39.39 SafeDNS
195.46.39.40 SafeDNS
193.58.251.251 SkyDNS
208.76.50.50 SmartViper Public DNS
208.76.51.51 SmartViper Public DNS
78.46.89.147 ValiDOM
88.198.75.145 ValiDOM
64.6.64.6 Verisign
64.6.65.6 Verisign
2620:74:1b::1:1 Verisign
2620:74:1c::2:2 Verisign
77.109.148.136 Xiala.net
77.109.148.137 Xiala.net
2001:1620:2078::136 Xiala.net
2001:1620:2078::137 Xiala.net
77.88.8.88 Yandex.DNS
77.88.8.2 Yandex.DNS
2a02:6b8::feed:bad Yandex.DNS
2a02:6b8:0:1::feed:bad Yandex.DNS
109.69.8.51 puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for Experiments D and G.

Figure 14 shows the number of answers for Experiments D and G, while Figure 15 shows the latency results. In Figure 14a we see that, when one NS fails (50% loss on one server), there is no significant change in the number of answered queries. Moreover, most users will not notice any difference in latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (half of the probing interval). Similarly to Experiment I, caching should not be active in this experiment, given the short TTL. Moreover, it has a packet loss rate of 75%. At this rate, the large majority (72%) of users still obtain an answer, with a certain latency increase (Figure 15b).

[Figure 14: Answers received during DDoS attacks (OK, SERVFAIL, No answer), extra graphs. Panels: (a) Experiment D, 1800-50p-10min-ns1; (b) Experiment G, 300-75p-10min-2NSes. Arrows indicate DDoS start and recovery.]

[Figure 15: Latency results (median, mean, 75%ile, and 90%ile RTT); the shaded area indicates the interval of an ongoing DDoS attack (extra graphs). Panels: (a) Experiment D, 50% packet loss on one NS only (1800 s TTL); (b) Experiment G, 75% packet loss (300 s TTL).]

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur in current DNS software. These experiments confirm that prior work by Yu et al. [56] still holds. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic first with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) up, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from .net and querying one or both based on their internal algorithms. All queries are made with a cold cache.



[Figure 16: Histogram of queries sent by the recursive resolvers, broken down by destination zone (root, net, cachetest.net), for BIND and Unbound under normal operation and during the emulated DDoS.]

In total, we send 100 queries per experiment and present the results in this section.

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the .net authoritative servers, and the authoritative servers of cachetest.net.
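A sketch of this per-zone accounting, assuming the recursive's outgoing traffic is stored in a capture file (recursive.pcap and the address sets are hypothetical placeholders, not our real configuration):

# Count outgoing DNS queries by destination: Root, .net, or cachetest.net servers.
from collections import Counter
from scapy.all import rdpcap, DNS, IP

ROOT_IPS, NET_IPS, CACHETEST_IPS = set(), set(), set()  # fill with server addresses

counts = Counter()
for pkt in rdpcap("recursive.pcap"):
    if not (pkt.haslayer(IP) and pkt.haslayer(DNS)) or pkt[DNS].qr != 0:
        continue  # keep only packets that are DNS queries (qr == 0)
    dst = pkt[IP].dst
    if dst in ROOT_IPS:
        counts["root"] += 1
    elif dst in NET_IPS:
        counts["net"] += 1
    elif dst in CACHETEST_IPS:
        counts["cachetest.net"] += 1
print(counts)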

Figure 16 shows the results of this experiment, with normal behavior in the left two groups and the inaccessible (DDoS) case in the right two. From this figure we see that, in normal operation, only a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the .net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the Root, one to .net, and one to the cachetest.net authoritative servers) to resolve the AAAA record of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist (this difference can be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure. Both BIND and Unbound send far more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps trying to resolve both the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 queries in total out of the 46). If we disregard these queries, Unbound would increase its total number of queries from 6 to 16, a 6.4× increase. (We repeated this experiment 100 times, and all results were consistent.)

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND will ask both parents (the Roots and .net) for its records again (shown by the increase in the root and net portions of BIND DDoS vs. BIND non-DDoS in Figure 16); Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect: recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last-layer" recursives (the ones that ultimately send queries to authoritatives, Rn1–Rn8), as can be seen in Figure 17.

[Figure 17: Probe 28477 in dataset I: the stub resolver (Atlas probe) queries local recursives r1a–r1c; all r1's are connected to all last-layer recursives rn1–rn8, which are connected to both authoritative servers at1 and at2.]

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (RIPE Atlas probes) and the server side (authoritative servers). We summarize the results in Table 7. In this table we mark the probing intervals T during which the simulated DDoS was active (dataset I: 90% packet drop on both authoritative servers; see Table 4).

Client Side: From the client side of this table we can learn the following. During normal operations we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1s used in the answers); by default, each RIPE Atlas probe sends a DNS query to each of its local recursives (R1).

During the DDoS this changes slightly: we see 2 answers from 2 R1s, while the other query either times out or returns SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval).

This is, in fact, a good result given the simulated DDoS.

Authoritative Side: Now let's consider the authoritative side.


      client view (Atlas probe)      authoritative server view
T     queries  answers  #R1         queries  answers  #ATs  #Rn  uniq(Rn-AT)  max (% of total)  queries (top-2 Rn's)
1     3        3        3           3        3        2     2    3            1 (33.3%)         2, 1
2     3        3        3           5        5        2     4    5            1 (20.0%)         2, 1
3     3        3        3           6        6        2     6    6            1 (16.7%)         1, 1
4     3        3        3           5        5        2     2    3            2 (40.0%)         4, 1
5     3        3        3           3        3        2     3    3            1 (33.3%)         1, 1
6     3        3        3           6        6        2     5    5            2 (33.3%)         2, 1
7*    3        2        2           29       3        2     2    4            10 (34.5%)        15, 14
8*    3        2        2           22       6        2     7    9            7 (31.8%)         10, 7
9*    3        2        2           21       1        2     3    6            9 (42.9%)         16, 3
10*   3        2        2           11       4        2     3    4            6 (54.5%)         8, 2
11*   3        2        2           21       3        2     4    6            15 (71.4%)        17, 2
12*   3        2        2           23       1        2     8    8            15 (65.2%)        15, 2
13    3        3        3           3        3        2     3    3            1 (33.3%)         1, 1
14    3        3        3           4        2        2     4    4            1 (25.0%)         1, 1
15    3        3        3           8        8        2     7    7            7 (87.5%)         1, 1
16    3        3        3           9        9        2     3    3            6 (66.7%)         6, 3
17    3        3        3           9        9        2     5    5            5 (55.6%)         5, 1

Table 7: Client and authoritative views of AAAA queries (probe 28477, measurement I). Rows marked * are probing intervals during the simulated DDoS.

In normal operation (before the DDoS), we see that the 3 queries sent by the clients (client view) lead to 3 to 6 queries sent by the Rn recursives, which reach the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn used before the attack ranges from 2 to 6 (#Rn). This choice depends on how the R1s choose their upper-layer recursives (Rn−1), and ultimately on how Rn−1 chooses Rn.

We can also observe how each Rn chooses each authoritative (Rn-AT): this metric shows the number of unique combinations of Rn and authoritatives observed. For example, for T=1 we see that three queries reach the authoritatives, but only 2 Rn were used. Looking at the number of unique recursive-AT combinations, we see that there are three; as a consequence, one of the Rn must have used both authoritatives.
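These per-interval counts can be derived mechanically from a log of (interval, recursive, authoritative) tuples; a small Python sketch with made-up example data:

# Per-interval aggregation behind Table 7: unique Rn, ATs, and Rn-AT pairs.
from collections import defaultdict

log = [(1, "rn1", "at1"), (1, "rn1", "at2"), (1, "rn2", "at1")]  # illustrative

pairs = defaultdict(set)
for t, rn, at in log:
    pairs[t].add((rn, at))

for t in sorted(pairs):
    rns = {rn for rn, _ in pairs[t]}
    ats = {at for _, at in pairs[t]}
    print(t, "uniq(Rn-AT):", len(pairs[t]), "#Rn:", len(rns), "#ATs:", len(ats))

For T=1 this prints 3 unique combinations from 2 recursives, matching the example above: one Rn must have used both authoritatives.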

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11–29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) Increase in the number of Rn used: the number of Rn used increases a little (3.6 on average before the attack, 4.5 during). (However, not all Rn are used all the time during the DDoS. We do not have enough data to say why this happens; it presumably depends on how the lower-layer recursives choose them, or on how they are set up.)

(2) Increase in the number of authoritatives used by each Rn: each Rn in turn contacts an average of 1.13 authoritatives before the attack (the average of uniq(Rn-AT)/uniq(Rn)). During the DDoS this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (last column of Table 7) and see that, compared to the total number of queries, they comprise the large majority.

For this probe, the explanation for so many queries is a combination of factors, with retries by Rn being the most important one. From the client side, there is no indication that the client's queries are what drives such an increase in the number of queries to the authoritatives during a DDoS.



TTL                 60    1800    3600    86400   3600-10m
AC Answers          37    24645   24091   23202   47262
Public R1           0     12000   11359   10869   21955
  Google Public R1  0     9693    9026    8585    17325
  other Public R1   0     2307    2333    2284    4630
Non-Public R1       37    12645   12732   12333   25307
  Google Public Rn  0     1196    1091    248     1708
  other Rn          37    11449   11641   12085   23599

Table 3: AC answers, classified by public resolver use.

…addresses for 96 public recursives (Appendix C), which we obtain from a DuckDuckGo search for "public dns" done on 2018-01-15.

Table 3 reexamines the AC replies from Table 2. With the exception of the measurements with a TTL of 60 s, nearly half of the AC answers (cache misses) are from queries to public R1 recursives, and about three-quarters of these are from Google's Public DNS. The other half of cache misses start at non-public recursives, but 10% of these eventually emerge from Google's DNS.
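This classification amounts to matching each first-hop resolver address against the Appendix C list; a Python sketch (the address sets here are small illustrative stand-ins, not the full list):

# Classify an R1 resolver address as Google, other public, or non-public.
from ipaddress import ip_address, ip_network

GOOGLE_NETS = [ip_network("8.8.8.0/24"), ip_network("8.8.4.0/24")]  # assumed ranges
OTHER_PUBLIC = {"208.67.222.222", "9.9.9.9"}  # subset of Appendix C

def classify_r1(addr: str) -> str:
    if any(ip_address(addr) in net for net in GOOGLE_NETS):
        return "Google Public R1"
    if addr in OTHER_PUBLIC:
        return "other Public R1"
    return "Non-Public R1"

print(classify_r1("8.8.8.8"))     # Google Public R1
print(classify_r1("192.0.2.53"))  # Non-Public R1 (RFC 5737 example address)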

Besides identifying public recursives, we also see evidence of cache fragmentation in answers from caches (CC and CA). Sometimes we see serial numbers in consecutive answers decrease. For example, one VP reports serial numbers 1, 3, 3, 7, 3, 3, suggesting that it is querying different recursives, one with serial 3 and another with serial 7 in its cache. We show these occurrences in Table 2 as CCdec and CAdec. With longer TTLs we see more cache fragmentation, with 4.5% of answers showing fragmentation with day-long TTLs.

From these observations we conclude that cache misses result from several causes: (1) use of load balancers or anycast, where servers lack shared caches; (2) first-level recursives that do not cache and have multiple second-level recursives; and (3) caches that may reset between the somewhat long probing intervals (10 or 20 minutes). Causes (1) and (2) occur in public resolvers (confirmed by Google [12]) and account for about half of the cache misses in our measurements.

4 CACHING PRODUCTION ZONES

In §3 we showed that about one-third of queries do not conform with caching expectations, based on controlled experiments to our test domain. (Results may be better for caches that prioritize popular names.) We next examine this question for specific records in nl, the country-code domain (ccTLD) for the Netherlands, and the Root (.) DNS zone. With traffic from "the wild" and a measurement target used by millions, this section uses domains popular enough to stay in-cache at recursives.

4.1 Requests at nl's Authoritatives

We apply this methodology to data for the nl country-code top-level domain (ccTLD). We look specifically at the A-records for the nameservers of nl: ns[1-5].dns.nl.

Methodology: We use passive observations of traffic to the nl authoritative servers.

For each target name in the zone, and for each source (a recursive server identified by its IP address), we build a timeseries of all requests and compute their interarrival time ∆t. Following the classification from §3.4, we label queries as AC if ∆t < TTL, showing an unnecessary query to the authoritative, or as AA if ∆t ≥ TTL, an expected or delayed cache refresh. (We do not see cache hits, so there are no CC events.)

[Figure 4: ECDF of the median ∆t for recursives with at least 5 queries to ns1–ns5.dns.nl (TTL of 3600 s).]
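A compact sketch of this labeling, given the sorted query timestamps of one (name, recursive) pair and the record's TTL:

# Label each query after the first as AA (interarrival >= TTL) or AC (< TTL).
def classify(timestamps, ttl):
    labels = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        labels.append("AA" if (cur - prev) >= ttl else "AC")
    return labels

# A refresh at the full 3600 s TTL, then an unnecessary early re-query:
print(classify([0, 3600, 5400], 3600))  # ['AA', 'AC']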

Dataset: At the time of our analysis (February 2018), there were 8 authoritative servers for the nl zone. We collect traffic for the 4 unicast and the one anycast authoritative servers, and store the data in ENTRADA [55] for analysis.

Since our data for nl is incomplete, and we know recursives will query all authoritatives over time [27], our analysis represents a conservative estimate of TTL violations: we expect to miss some AC-type queries from resolvers to non-monitored authoritatives.

We collect data for a period of six hours on 2018-02-22, starting at 12:00 UTC. We only evaluate recursives that sent at least five queries for our domains of interest, omitting infrequent recursives (they do not change the results noticeably). We discard duplicate queries, for example a few retransmissions (less than 0.01% of the total queries). In total, we consider more than 485k queries from 7779 different recursives.

Results: Figure 4 shows the distribution of ∆t that we observe in our measurements, reporting the median ∆t for each resolver that sends at least 5 queries.

About 28% of queries are frequent, with an interarrival time of less than 10 s, and 32% of these are sent to multiple authoritatives. We believe these are due to recursives submitting queries in parallel to speed up replies (perhaps the "Happy Eyeballs" algorithm [45]).

Since these closely-timed queries are not related to recursive caching, we exclude them from the analysis. The remaining data is 348k queries from 7703 different recursives.

The largest peak is at 3600 s, which is expected: the name was queried and cached for the full hour-long TTL, and the next request causes the name to be re-fetched. These queries are all of type AA.

The smaller peak around 1800 s, as well as queries with other interarrival times less than 3600 s, corresponds to AC-type queries: queries that could have been supplied from the cache but were not. 22% of resolvers sent most of their queries within a time interval of less than 3600 s, or even more frequently. These AC queries occur because of TTL limiting, cache fragmentation, or other reasons that clear the cache.

4.2 Requests at the DNS Root

In this section we perform a similar analysis as in §4.1, looking at DNS queries received at all Root DNS servers (except G-Root), and building a distribution of the number of queries received per source IP address (i.e., per recursive).

[Figure 5: Distribution of the number of queries for the DS record of nl received from each recursive. Dataset: DNS-OARC DITL, 2017-04-12t00:00Z, for 24 hours. All Root servers with similar distributions are shown as light-gray lines; F-Root and H-Root are highlighted.]

In this analysis we use data from the DITL (Day In The Life) dataset of 2017, available at DNS-OARC [9]. We look at all DNS queries for the DS record of the domain nl received at the Root DNS servers over the entire day of April 12, 2017 (UTC). This dataset consists of queries from more than 703k unique recursives seen across all Root servers. Note that the DS record for nl has a TTL of 86400 seconds (24 hours). That is, in theory one could expect to see just one query per recursive arriving at a given root letter for the DS record of nl within the 24-hour interval.
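The per-recursive tallies behind Figure 5 reduce to counting queries per source address over the day; a Python sketch with illustrative inputs:

# Number of nl DS queries per recursive over 24 hours, and the single-query fraction.
from collections import Counter

sources = ["192.0.2.1", "192.0.2.1", "198.51.100.7"]  # one entry per observed query

per_recursive = Counter(sources)
single = sum(1 for c in per_recursive.values() if c == 1) / len(per_recursive)
print(per_recursive.most_common(3), f"{single:.0%} sent exactly one query")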

Each line in Figure 5 shows the distribution of the total number of queries received at the Root servers from individual recursives asking for the DS record of nl. Except for F- and H-Root, the distribution is similar across all Root servers; these are plotted as light-gray lines. F-Root shows the "most friendly" behavior from recursives, where around 5% of them sent 5 or more queries for nl. Opposite to F, H-Root (dotted red line) shows the "worst" behavior from recursives, where more than 10% of them sent 5 or more queries for nl within the 24-hour period.

The solid black line in Figure 5 shows the distribution for all queries across all Root servers. The majority (around 87%) of recursives send only one query within the 24-hour interval. However, considering all Root servers, we see around 13% of recursives that sent multiple queries. Note that the distributions shown in Figure 5 have (very) long tails, and we see up to more than 21.8k queries from a single recursive within the 24-hour period for the nl DS record, i.e., roughly one query every 4 seconds from the same IP address for the same DS record.

Discussion: We conclude that measurements of popular domains within nl (§4.1) and the Roots (§4.2) show that about 63% and 87% of recursives honor the full TTL, respectively. These results are roughly in line with our observations with RIPE Atlas (§3).

5 THE CLIENT'S VIEW OF AUTHORITATIVES UNDER DDOS

We next use controlled experiments to evaluate how DDoS attacks at authoritative DNS servers impact the client experience. Our studies of caching in controlled experiments (§3) and passive observations (§4) have shown that caching often works, but not always: about 70% of controlled experiments and 30% of passive observations see full cache lifetimes. Since results of specific experiments vary, we sweep the space of attack intensities to understand the range of responses, from complete failure of authoritative servers to partial failures.

5.1 Emulating DDoS

To emulate DDoS attacks, we begin with the same test domain (cachetest.nl) we used for controlled experiments in §3. We run a normal DNS service for some time, querying from RIPE Atlas. After caches are warm, we then simulate a DDoS attack by dropping some fraction, or all, of incoming DNS queries to each authoritative. (We drop incoming traffic randomly with Linux iptables; as such, packet drop is not biased towards any recursive.) After we begin dropping traffic, answers come either from caches at recursives or, for partial attacks, from a lucky query that passes through.
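A sketch of the drop rule one would install on each authoritative (shown for a 75% loss rate; interface and port details may differ from our exact setup):

# drop each incoming DNS-over-UDP packet independently with probability 0.75
iptables -A INPUT -p udp --dport 53 -m statistic --mode random --probability 0.75 -j DROP

The iptables statistic module makes an independent random decision per packet, which matches the unbiased loss model described above.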

This emulation of DDoS captures the traffic loss that occurs in a DDoS attack as router queues overflow. The emulation is not perfect: we simulate loss at the last-hop router, while in real DDoS attacks packets are often lost on access links near the target. Our emulation approximates this effect with one aggregate loss rate.

DDoS attacks are also accompanied by queueing delay, since buffers at and near the target are full. We do not model queueing delay, although we do observe latency increasing due to retries. In modern routers, queueing delay due to full router buffers should be less than the retry interval. In addition, observations during real-world DDoS events show that the few queries that are successful see response times that are not much higher than typical [23], suggesting that loss (and not delay) is the dominant effect of DDoS in practice. However, a study that adds queueing latency to the attack model is interesting future work.

5.2 Clients During Complete Authoritatives Failure

We first evaluate the worst-case scenario for a DNS operator: complete unreachability of all authoritative name servers. Our goal is to understand when, and for how long, caches cover such an outage.

Table 4 shows Experiments A, B, and C, which simulate complete failure. In Experiment A, each VP makes only one query before the DDoS begins. In Experiment B we allow several queries to take place, and Experiment C allows several queries with a shorter TTL.

Caches Protect Some: We first consider Experiment A, with one query that warms the cache, immediately followed by the attack. Figure 6a shows these responses over time, with the onset of the attack at the first downward arrow, between 0 and 10 minutes, and with the cache expired after the second downward arrow, between 60 and 70 minutes. We see that after the DDoS starts, but before the cache has fully expired (between the downward arrows), initially 30% and eventually 65% of queries fail with either no answer or a SERVFAIL error. While not good, this does mean that 35% to 70% of queries during the DDoS are successfully served from the cache. By contrast, shortly after the cache expires, almost all queries fail (only 25 VPs, or 0.2% of the total, seem to provide stale answers).

Caches Fill at Different Times: In a more realistic scenario, VPs have filled their caches at different times. In Experiment A, caches are freshly filled and should last for a full hour after the start of the attack. Experiment B is designed for the opposite, worst case:


Experiment parameters:

exp.  TTL(s)  DDoS start  DDoS dur.  queries before  total dur.  probe interval  failure rate
A     3600    10          60         1               120         10              100% (both NSes)
B     3600    60          60         6               240         10              100% (both NSes)
C     1800    60          60         6               180         10              100% (both NSes)
D     1800    60          60         6               180         10              50% (one NS)
E     1800    60          60         6               180         10              50% (both NSes)
F     1800    60          60         6               180         10              75% (both NSes)
G     300     60          60         6               180         10              75% (both NSes)
H     1800    60          60         6               180         10              90% (both NSes)
I     60      60          60         6               180         10              90% (both NSes)

Results:

exp.  total probes  valid probes  VPs    queries  total answers  valid answers
A     9224          8727          15339  136423   76619          76181
B     9237          8827          15528  357102   293881         292564
C     9261          8847          15578  258695   199185         198197
D     9139          8708          15332  286231   273716         272231
E     9153          8708          15320  285325   270179         268786
F     9141          8727          15325  278741   259009         257740
G     9206          8771          15481  274755   249958         249042
H     9226          8778          15486  269030   242725         241569
I     9224          8735          15388  253228   218831         217979

Table 4: DDoS emulation experiments [38]. DDoS start, durations, and probe intervals are given in minutes.

[Figure 6: Answers received during DDoS attacks (OK, SERVFAIL, No answer vs. minutes after start). Panels: (a) Experiment A (3600-10min-1down), arrows indicate DDoS start and cache expiration; (b) Experiment B (3600-10min-1down-1up), arrows indicate DDoS start and recovery; (c) Experiment C (1800-10min-1down-1up), arrows indicate DDoS start, cache expiration, and recovery.]

[Figure 7: Timeseries of answer types (AA, CC, CA) for Experiment B.]

we begin warming the cache one hour before the attack and query 6 times from each VP. Other parameters are the same, with the attack lasting for 60 minutes (also the cache duration), but afterwards we restore the authoritatives to service.

Figure 6b shows the results of Experiment B. While about 50% of VPs are served from the cache in the first 10-minute round after the DDoS starts, the fraction served drops quickly, and is at only about 3% one hour later. Three factors are in play here. First, most caches were filled 60 minutes before the attack and are timing out in the first round. While the timeout and query rounds are both 60 minutes apart, Atlas intentionally spreads queries out over 5 minutes, so we expect that some queries happen after 59 minutes and others after 61 minutes.

Second, we know some large recursives have fragmented caches (§3.5), so we expect that some of the successes between times 70 and 110 minutes are due to caches that were filled between times 10 and 50 minutes. This can actually be seen in Figure 7, where we show a timeseries of the answers for Experiment B, with CC (correct cache responses) between times 60 and 90.

Third, we see an increase in the number of CA queries, which are answered by the cache with expired TTLs (Figure 7). This increase is due to servers serving stale content [19].

Caches Eventually All Expire: Finally, we carry out a third emulation, but with half the cache lifetime (1800 s, or 30 minutes, rather than the full hour). Figure 6c shows responses over time. These results are similar to Experiment B, with rapid fall-off when the attack starts, as caches age. After the attack has been underway for 30 minutes, all caches must have expired, and we see only a few residual successes (about 26).


5.3 Discussion of Complete Failures

Overall, we see that caching is partially successful in protecting during a DDoS. With full, valid caches, half or more of the VPs get service. However, caches are filled at different times and expire, so an operator cannot count on a full cache duration for any customer, even for popular ("always in the cache") domains. The protection provided by caches depends on their state in the recursive resolver, something outside the operator's control. In addition, our evaluation of caching in §3 showed that caches will end early for some VPs.

Second, we were surprised that a tiny fraction of VPs are successful after all caches should have timed out (after the 80-minute mark in Experiment A, and between 90 and 110 minutes in Experiment C). These successes suggest an early deployment of "serve stale", something currently under review in the IETF [19]: serving a previously known record beyond its TTL if authoritatives are unreachable, with the goal of improving resilience under DDoS. We investigated Experiment A, where we see 1048 such answers among the 1140 successes in the second half of the outage. These successes come from 471 VPs (and 215 recursives), most of them answered by OpenDNS and Google Public DNS servers, suggesting experimentation that is not yet widespread. Out of these 1048 queries, 1031 return a TTL value equal to 0, as specified in the IETF serve-stale draft [19].

5.4 Client Reliability During Partial Authoritative Failure

The previous section examined DDoS attacks that result in complete failure of all authoritatives, but often DDoS attacks result in partial failure, with 50% or 90% packet loss at the authoritatives. (For example, consider the November 2015 DDoS attack on the DNS Root [23].) We next study experiments with partial failures, showing that caching and retries together nearly fully protect against 50%-loss DDoS events, and protect half of the VPs even during 90%-loss events.

We carry out several experiments, D to I in Table 4. We follow the procedure outlined in §5.1, looking at DDoS-driven loss rates of 50%, 75%, and 90%, with TTLs of 1800 s, 300 s, and 60 s. Graphs omitted due to space can be found in Appendix D.

Near-Full Protection from Caches During Moderate Attacks: We first consider Experiment E, a "mild" DDoS with 50% loss, with VP success over time in Figure 8a. In spite of a loss rate that would be crippling to TCP, nearly all VPs are successful in DNS. This success is due to two factors: first, we know that many clients are served from caches, as was shown in Experiment A with full loss (Figure 6a); second, most recursives retry queries, so they recover from the loss of a single packet and are able to provide an answer. Together, these mean that the failure rate during the first 30 minutes of the event is 8.5%, only slightly higher than the 4.8% failure rate before the DDoS. For this experiment the TTL is 1800 s (30 minutes), so we might expect failures to increase halfway through the DDoS. We do not see any increase in failures, because caching and retries are synergistic: a successfully retried query places the answer in a cache for a later query. The importance of this result is that DNS can survive moderate-size attacks when caching is possible. While a positive, retries do increase latency, something we study in §5.5.

Attack Intensity Matters: While clients do quite well with 50% loss at all authoritatives, failures increase with the intensity of the attack.

[Figure 8: Answers received during DDoS attacks (OK, SERVFAIL, No answer vs. minutes after start); the first and second vertical lines show the start and end of the DDoS. Panels: (a) Experiment E (1800-50p-10min), 50% packet loss; (b) Experiment F (1800-75p-10min), 75% packet loss; (c) Experiment H (1800-90p-10min), 90% packet loss; (d) Experiment I (60-90p-10min), 90% packet loss.]


Experiments F and H, shown in Figure 8b and Figure 8c, increase the loss rate to 75% and 90%. We see the number of failures increase, to about 19.0% with 75% loss and 40.3% with 90% loss. It is important to note that roughly 60% of the clients are still served even with 90% loss.

We also see that this level of success is consistent over the entire hour-long DDoS event, even though the cache duration is only 30 minutes. This consistency confirms the importance of caching and retries in combination.

To verify the effects of this interaction, Experiment I changes the caching duration to 60 s, less than one round of probing. Comparing Experiment I in Figure 8d to H in Figure 8c, we see that the failure rate increases from 30% to about 63%. However, even with no caching, about 37% of queries are still answered, due to resolvers that serve stale content and to recursive retries. We investigate retries in §6.

5.5 Client Latency During Partial Authoritative Failure

We showed that client reliability is higher than expected during failures (§5.4), due to a combination of caching and retries. We next consider client latency. Latency will increase during the DDoS because of retries and queueing delay, but we will show that latency increases less than one might expect, thanks to caching.

To examine latency we return to Experiments D through I (Table 4), but look at latency (the time to complete a query) rather than success. For these experiments, clients time out after 5 s.

Figures 9a to 9d show latency during each emulated DDoS scenario (experiments whose figures are omitted here are in Appendix D). Latencies are not evenly distributed, since some requests get through immediately while others must be retried one or more times, so in addition to the mean we show the 50%, 75%, and 90% quantiles to characterize the tail of the distribution.
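Per round, these statistics are a direct percentile computation over the completed queries; a Python sketch with illustrative values:

# Mean and tail quantiles of query latency within one 10-minute round.
import numpy as np

latencies_ms = np.array([35, 40, 42, 55, 900, 2100])  # illustrative round data
p50, p75, p90 = np.percentile(latencies_ms, [50, 75, 90])
print(f"mean={latencies_ms.mean():.0f} median={p50:.0f} 75%ile={p75:.0f} 90%ile={p90:.0f} (ms)")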

We emulate DDoS by dropping requests (§5.1), and hence latencies reflect retries and loss but not queueing delay, underrepresenting latency in real-world attacks. However, their shape (mostly low latency with a few long) is consistent with, and helps explain, what has been seen in the past [23].

Beginning with Experiment E, the moderate attack in Figure 9a, we see no change to median latency. This result is consistent with many queries being handled by the cache, and half of those not handled by the cache getting through anyway. We do see higher latency in the 90%ile tail, reflecting successful retries. This tail also increases the mean somewhat.

This trend increases in Experiment F, in Figure 9b, where 75% of queries are lost. Now we see that the 75%ile tail has increased, as has the number of unanswered queries, and the 90%ile is twice as long as in Experiment E.

We see the same latency in Experiment H, with the DDoS causing 90% loss. We set the timeouts to 5 s, so the larger attack results in more unsuccessful queries, but latency for successful queries is not much worse than with 75% loss. Median latency is still low due to cached replies.

Finally, Experiment I greatly reduces the opportunities for caching by reducing the cache lifetime to one minute. Figure 9d shows that the loss of caching increases the median RTT and significantly increases the tail latency. Compared with Figure 9c (same packet loss ratio, but 1800 s TTL), we can clearly see the benefits of caching in terms of latency (in addition to reliability): a half-hour TTL value reduced the latency from 1300 ms to 390 ms. Longer TTLs also help reduce tail latency relative to shorter TTLs (compare, for example, the 90%ile RTT in Experiments I vs. H in Figure 9).

[Figure 9: Latency results (median, mean, 75%ile, and 90%ile RTT vs. minutes after start); the shaded area indicates the interval of an ongoing DDoS attack. Panels: (a) Experiment E, 50% packet loss (1800 s TTL); (b) Experiment F, 75% packet loss (1800 s TTL); (c) Experiment H, 90% packet loss (1800 s TTL); (d) Experiment I, 90% packet loss (60 s TTL).]

Summary: DDoS effects often increase client latency. For moderate attacks, increased latency is seen only by a few "unlucky" clients who do not see a full cache and whose queries are lost. Caching has an important role in reducing latency during DDoS, but while it can often mitigate most reliability problems, it cannot avoid latency penalties for all VPs. Even when caching is not available, roughly 40% of clients get an answer, either through serve-stale or retries, as we investigate next.

6 THE AUTHORITATIVE'S PERSPECTIVE

Results of partial DDoS events (§5.4) show that DNS is surprisingly reliable: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d). These results are due to a combination of caching and retries. We next examine this from the perspective of the authoritative server.

6.1 Recursive-Authoritative Traffic during a DDoS

We first ask: are retries by recursive resolvers responsible for the success rates observed in §5.4? To investigate this question, we return to the partial DDoS experiments and look at how many queries are sent to the authoritative servers. We measure queries before they are dropped by our simulated DDoS. Recursives must make multiple queries to resolve a name; we break out each type of query: for the nameserver (NS), for the nameserver's IPv4 and IPv6 addresses (A-for-NS and AAAA-for-NS), and finally the desired query (AAAA-for-PID). Note that the authoritative is IPv4-only, so AAAA-for-NS records are non-existent and subject to negative caching, while the other records exist and use regular caching.

We begin with the DDoS causing 75% loss, in Figure 10a. For this experiment, we observe 18407 unique IP addresses of recursives (Rn) querying for AAAA records directly at our authoritatives. During the DDoS, queries increase by about 3.5×. We expect 4 trials, since the expected number of tries until success with loss rate p is (1 − p)^−1. For this scenario, results are cached for up to 30 minutes, so successful queries are reused in recursive caches. This increase occurs both for the target AAAA record and for the non-existent AAAA-for-NS records. Negative caching for our zone is configured to 60 s, making caching of NXDOMAINs for AAAA-for-NS less effective than positive caches.
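This expectation follows from the number of tries being geometrically distributed, with per-try success probability 1 − p:

E[\text{tries}] = \sum_{k=1}^{\infty} k\,p^{k-1}(1-p) = \frac{1}{1-p}

so p = 0.75 gives 4 expected tries, and p = 0.9 (Experiments H and I) gives 10.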

The offered load on the server increases further with more loss (90%), as shown in Experiment H (Figure 10b). The higher loss rate results in a much higher offered load on the server: on average, 8.2× normal.

Finally, in Figure 10c, we reduce the effects of caching, with a 90% DDoS and a TTL of 60 s. Here, too, we see about 8.1× more queries at the server than before the attack. Comparing this case to Experiment H, caching reduces the offered load on the server by about 40%.

[Figure 10: Number of queries received by the authoritative servers (NS, A-for-NS, AAAA-for-NS, AAAA-for-PID vs. minutes after start); the shaded area indicates the interval of an ongoing DDoS attack. Panels: (a) Experiment F, 1800-75p-10min, 75% packet loss; (b) Experiment H, 1800-90p-10min, 90% packet loss; (c) Experiment I, 60-90p-10min, 90% packet loss.]

Implications The implication of this analysis is that legitimateclients ldquohammerrdquo with retries the already-stressed server during aDDoS For clients retries are important to get reliability and eachclient independently chooses to retry

The server is already under stress due to the DDoS so these re-tries add to that stress However the DDoS traffic is almost certainlymuch larger than the retried of legitimate traffic (A server experi-encing a volumetric attack causing 90 loss must be receiving 10timesits capacity Regular traffic is a small fraction of normal capacity soeven 4times regular is still much less than the attack traffic) The multi-plier for retried legitimate traffic depends on the implementationsstub and recursive resolver as well as application-level retries anddefection (users hitting reload in their browser and later giving up)Our experiment omits application-level retries and likely gives a


lower bound. We next examine specific recursive implementations to see their behavior.

6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers "hammer" authoritatives when queries are dropped. We reexamine DNS software (since 2012 [56]), and additionally show that deployments amplify retries.

Recursive Software: prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], with evaluation of BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS and PowerDNS. We studied retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records for all authoritatives and the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7× more requests before the recursives time out. (Exact numbers vary in different runs, but typically each request is made 6 or 7 times.) Such retries are appropriate provided they are paced (both use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.
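The pacing matters: a paced retry loop backs off between attempts, so even 6 or 7 tries spread out over the resolver's timeout window rather than arriving in a burst. The sketch below is a generic illustration of such a loop, with made-up constants (max_tries, base_timeout); it is not BIND's or Unbound's actual algorithm.

import random
import time

def query_with_backoff(send_query, max_tries=7, base_timeout=0.8):
    # send_query(timeout) is assumed to return an answer, or None on timeout.
    timeout = base_timeout
    for _ in range(max_tries):
        answer = send_query(timeout)
        if answer is not None:
            return answer
        # Exponential backoff with jitter paces the retries and avoids
        # synchronized bursts from many clients.
        time.sleep(timeout * random.uniform(0.5, 1.5))
        timeout *= 2
    return None  # give up; the resolver would return SERVFAIL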

Recursive Deployment: another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex, multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify amplification, we count both the number of Rn recursives and AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values represent the amplification in two ways: during stress, more Rn recursives will be used for each probe ID, and these Rn will generate more queries to the already-stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4). The maximum rises to 39. The number of queries for each probe ID grows more than 3×, from 2 to 7. Worse, the 90%ile grows more than 6× (3 queries to 18). The maximum grows 53.5×, reaching up to 286 queries for one single probe ID. This value, however, is a lower bound, given there are a large number of A and AAAA queries that ask for NS records and not the probe ID (AAAA- and A-for-NS in Figure 10).
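The per-probe-ID statistics above can be computed with straightforward bookkeeping over the authoritative-side query log. The sketch below assumes a hypothetical input of (probe_id, rn_address) pairs, one per AAAA-for-PID query; it is an illustration of the metric, not our analysis code.

from collections import defaultdict
from statistics import median

def amplification(queries):
    per_pid_rn = defaultdict(set)     # distinct Rn recursives per probe ID
    per_pid_count = defaultdict(int)  # AAAA-for-PID queries per probe ID
    for pid, rn in queries:
        per_pid_rn[pid].add(rn)
        per_pid_count[pid] += 1
    rn_counts = sorted(len(s) for s in per_pid_rn.values())
    q_counts = sorted(per_pid_count.values())
    p90 = lambda xs: xs[int(0.9 * (len(xs) - 1))]  # simple 90%ile
    return {"rn": (median(rn_counts), p90(rn_counts), max(rn_counts)),
            "queries": (median(q_counts), p90(q_counts), max(q_counts))}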

We can also look at the aggregate effects of retries created by the complex recursive infrastructure. Figure 12 shows the timeseries of unique IP addresses of Rn observed at the authoritatives. Before the DDoS period, for Experiment I with TTL of 60 s, we see a constant number of recursives reaching our authoritatives, i.e., all queries should be answered by authoritatives (no caching at this TTL value). For Experiments F and H, both with TTL of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks are observed when caches expire, as expected.

During the DDoS, we observe a similar behavior for all three experiments in Figure 12: as packets are dropped at the authoritatives (at rates of 75%, 90%, and 90% for F, H, and I, respectively), we see an increase in the number of Rn recursives querying our authoritatives; for Experiments F and H, we see drops when caching

Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs (median, 90%ile, and maximum of Rn-per-PID and AAAA-for-PID, vs. minutes after start).

Figure 12: Unique Rn recursive addresses observed at the authoritatives for Experiments F, H, and I (vs. minutes after start).

is expected, but not for Experiment I. The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which is amplified in the end. (We show this behavior for an individual probe in Appendix F, where we observe the growth in the number of queries received at the authoritatives and the number of recursives used.)

Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to be within a constant factor of single recursives, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and their operators to share information about their behavior.
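One way to make "within a constant factor" operational is a simple budget that ties a resolver pool's aggregate upstream queries to the client demand it serves. This is only a sketch of the idea, with a factor of 4 echoing the suggestion above; it is not a deployed or recommended design.

class RetryBudget:
    def __init__(self, factor=4):
        self.factor = factor
        self.client_queries = 0
        self.upstream_queries = 0

    def on_client_query(self):
        self.client_queries += 1

    def may_send_upstream(self):
        # Allow upstream traffic (including retries) only while it stays
        # within factor x the observed client demand.
        return self.upstream_queries < self.factor * max(1, self.client_queries)

    def on_upstream_query(self):
        self.upstream_queries += 1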

7 RELATED WORK

Caching by Recursives: several groups have shown that DNS caching can be imperfect. Hao and Wang analyzed the impact of nonce domains on DNS recursives' caches [13]. Using two weeks of data from two universities, they showed that filtering one-time domains improves cache hit rates. In two studies, Pang et al. [31, 32] reported that web clients and local recursives do not always honor TTL values provided by authoritatives. Almeida et al. [2] analyzed DNS traces of a mobile operator, and used a mobile application to see TTLs in practice. They find that most domains have short TTLs (less than 60 s), and report evidence of TTL manipulation by recursives. Schomp et al. [48] demonstrate widespread use of


multi-level recursives by large operators, as well as TTL manipulation. Our work builds on this prior work, examining caching and TTL manipulation systematically and considering its effects on resilience.

DNS client behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, and found that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives during DDoS: we investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast and both latency and reachability, as seen from RIPE Atlas. Rather than look at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while this prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study has incomplete data and focuses on specific results of two events. These differences stem from their study of natural experiments from real-world events and our controlled experiments.

8 IMPLICATIONS

We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get to authoritative servers. The key implication of our study is to explain differences in the outcome of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative "server" implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: "There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability..." [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure, and immediately took steps to address service problems, there were reports of user-visible service disruption in the technical and even popular press [34]. Reports describe intermittent failure of prominent websites including "Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times", each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cachable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports

922 anycast instances operating across the 13 authoritative servers. Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 "facilities"), and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components with TTLs that range from 120 to 300 s.

In addition to explaining the effects, our experiments help get to the root causes behind these outcomes. Users of the Root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs, like com, and country codes) are popular and certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) receive performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to allow caching to help and to give DNS operators time to deploy defenses; Experiment H suggests 30 minutes (Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow for re-provisioning during DDoS attacks on web servers, but that leaves DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, with long DNS TTLs, may be preferred to short DNS TTLs, so that both layers can be robust. However, the complexity of interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize geographic dispersion of sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with largest latency [27]. Prior work about resilience to DDoS attack has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic that site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be as resilient as the strongest individual authoritative. The reason for these opposite results is that, in both cases, recursive resolvers will try all authoritatives of a given service. For latency, they will sometimes choose a distant authoritative, but for resilience, they will continue until they find the most available authoritative.

9 CONCLUSIONS

This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole. In


fact, they can largely cover over partial DDoS attacks for many users – even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers (Figure 8d) even with minimal-duration caches.

The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users, with a few getting much greater latency while some see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruption, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of our work is to quantify the importance of caching and retries in recursives to resilience, encouraging use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on paper drafts.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl/).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements number FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES

[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012 Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from Using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188–200.
[9] DNS OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181 Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns/
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary Of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis/?include_text=1
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC BIND. 2018. Chapter 6. BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report | Github Engineering. https://githubengineering.com/ddos-incident-report/
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency-02. Internet Draft. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain names - concepts and facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 2065, 2181, 2308, 2535, 4033, 4034, 4035, 4343, 4592, 5936, 8020.
[21] P.V. Mockapetris. 1987. Domain names - implementation and specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604, 7766.
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS". (May 2018). Web page. https://ant.isi.edu/datasets/dns/#Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366
[28] NL Netlabs. 2018. NL Netlabs Documentation - Unbound - unbound.conf.5. https://nlnetlabs.nl/documentation/unbound/unbound.conf/
[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide/
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An Overview of DNS-based Server Selections in Content Distribution Networks. Computer Networks 43, 6 (2003), 695–711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows/
[36] Alec Peterson. 2017. EC2 Resolver Changing TTL on DNS Answers? Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy in a Few Easy Steps. https://quad9.net/
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671; TTL1800: 10507676; TTL3600: 10536725; TTL86400: 10579327; TTL3600-10min: 10581463; A: 10859822; B: 11102436; C: 11221270; D: 11804500; E: 11831403; F: 11831403; G: 12131707; H: 12177478; I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas - Raw Data Structure Documentation. https://atlas.ripe.net/docs/data_struct/
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters – An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. Internet Request For Comments. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On Measuring the Client-side DNS Infrastructure. In Proceedings of the 2013 ACM Conference on Internet Measurement. ACM, 77–90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl Stats and Data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A Study of Caching Behavior with Respect to Root Server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374/
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] M. Weinberg and D. Wessels. 2016. Review and Analysis of Attack Traffic Against A-root and J-root on November 30 and December 1, 2015. In DNS OARC 24, Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A High-performance Network Traffic Data Streaming Warehouse. In Network Operations and Management Symposium (NOMS), 2016 IEEE/IFIP. IEEE, 913–918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80–86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS?

Cache estimates in §3 and §4 assume each record has a clear TTL. However, glue records bridge zones that are in different authoritative servers, and can result in two different TTL values for a record. This section examines which TTL takes precedence.

A.1 The Problem of Glue Records

DNS employs glue records to identify name servers that are in the served zone itself. For example, the Canadian national domain, ca, is served from ca-servers.ca, but to look up ca-servers.ca one must know where ca is. To break this cycle, records to identify these servers ("glue records") are placed in the parent zone (the root in

this case). With these records in two places, they may have different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net
;; AUTHORITY SECTION:
ca.    172800    IN    NS    any.ca-servers.ca.
ca.    172800    IN    NS    j.ca-servers.ca.
ca.    172800    IN    NS    x.ca-servers.ca.
ca.    172800    IN    NS    c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net

Notice that the answer for the NS records is in the "AUTHORITY SECTION": this signals that this answer is a "referral", a type of answer in which "a server signaling that it is not (completely) authoritative for an answer provides the querying resolver with an alternative place to send its query" [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for ca, a recursive can ask the same query directly to any of the ca servers. Listing 2 shows the result, an authoritative answer:

$ dig ns ca @any.ca-servers.ca
;; ANSWER SECTION:
ca.    86400    IN    NS    c.ca-servers.ca.
ca.    86400    IN    NS    j.ca-servers.ca.
ca.    86400    IN    NS    x.ca-servers.ca.
ca.    86400    IN    NS    any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca

Notice that the records in this response from any.ca-servers.ca are presented in the "ANSWER SECTION" (and not in the "AUTHORITY SECTION"), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTL is different: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).

We refer to this as TTL duplication: the same record is defined both at the authoritative servers and at its respective parent zone's servers, as a glue record.

This duplication poses a challenge for recursive resolvers: which value should they trust, the one from the glue or the one from the authoritative name server?

In theory, the answer from the authoritative should be the one used, since those are the servers that are authoritative for the zone; the Roots, in this example, are not "completely" authoritative [15]. This behavior can be observed in various zones as well, not only at the Roots.

Assuming that the records are the same, as for ca in this section, we investigate in the wild which one resolvers actually use: the values provided by the parent or by the authoritative. According to RFC 2181 [10] (§5.4.1), a recursive should use the authoritative answer.

Given the importance of TTLs in DNS resilience (cache duration), we investigate in this section which values recursives trust: the parent or the authoritative name server.
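The RFC 2181 ranking reduces, for this case, to a one-line rule: data from the zone's own authoritative servers outranks glue learned from the parent. A minimal sketch (our illustration of the rule, not any resolver's implementation):

def pick_ttl(referral_ttl, answer_ttl):
    # Prefer the child's authoritative answer over parent glue (RFC 2181,
    # section 5.4.1); fall back to the referral only if no answer was seen.
    return answer_ttl if answer_ttl is not None else referral_ttl

# The ca example of Listings 1 and 2: glue says 172800 s, child says 86400 s.
assert pick_ttl(172800, 86400) == 86400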


                    NS record    A record    Source
Total Answers       128,382      98,038
TTL > 3600          38           46          Unclear
TTL = 3600          260          233         Parent
60 < TTL < 3600     6,890        4,621       Parent/others
TTL = 60            60,803       46,820      Authoritative
TTL < 60            60,391       46,318      Authoritative

Table 5: Distribution of TTLs of DNS answers for cachetest.nl.

A Records of amazon.com:
domain        IP                TTL
amazon.com    205.251.242.103   60
amazon.com    176.32.103.205    60
amazon.com    176.32.98.166     60

NS Records of amazon.com:
domain        NS                     TTL-auth    TTL-com
amazon.com    ns1.p31.dynect.net     3600        172800
amazon.com    pdns6.ultradns.co.uk   3600        172800
amazon.com    ns2.p31.dynect.net     3600        172800
amazon.com    ns4.p31.dynect.net     3600        172800
amazon.com    pdns1.ultradns.net     3600        172800
amazon.com    ns3.p31.dynect.net     3600        172800

A records of NS records:
domain                  IP              TTL
ns1.p31.dynect.net      208.78.70.31    86400
pdns6.ultradns.co.uk    204.74.115.1    3600
ns2.p31.dynect.net      204.13.250.31   86400
ns4.p31.dynect.net      204.13.251.31   86400
pdns1.ultradns.net      204.74.108.1    3600
ns3.p31.dynect.net      208.78.71.31    86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments

We use the same setup as in §3, and configure the TTL values of our name servers (ns[1,2].cachetest.nl) as follows:

• Referral value (at the nl zone): TTL of 3600 s
• Answer value (at the authoritative servers ns[1,2].cachetest.nl): TTL of 60 s

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of TTL values of the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the one provided by the parent nl server. As such, we can conclude from these two experiments that, for A and NS records, most recursives forward to their clients the TTL provided by the authoritatives of a zone.
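The bucketing in Table 5 can be expressed as a small classifier over the observed answer TTLs. This is our reconstruction of that bucketing, using the configured values of this experiment (parent 3600 s, authoritative 60 s):

def classify(ttl, parent_ttl=3600, auth_ttl=60):
    if ttl > parent_ttl:
        return "unclear"        # larger than either configured value
    if ttl == parent_ttl:
        return "parent"         # referral TTL passed through untouched
    if auth_ttl < ttl < parent_ttl:
        return "parent/others"  # e.g., a decremented parent TTL
    return "authoritative"      # == 60 s (fresh) or < 60 s (decremented)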

A.3 Looking At a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritative, and not the one from referrals.

In this section we look into the caching of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives, and 172800 s at the parent com servers, as can be seen in Table 6.

To determine which specific record two popular recursive resolver implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send only one query: dig ns amazon.com. We then dump the contents of the caches of these servers and see which values they store.
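Both implementations can dump their caches from the command line (rndc dumpdb -cache for BIND, which writes named_dump.db, and unbound-control dump_cache for Unbound, which writes to stdout). A small sketch of the procedure, assuming both control channels are already configured:

import subprocess

subprocess.run(["rndc", "dumpdb", "-cache"], check=True)  # BIND -> named_dump.db
dump = subprocess.run(["unbound-control", "dump_cache"],  # Unbound -> stdout
                      check=True, capture_output=True, text=True).stdout
print([l for l in dump.splitlines() if "amazon.com" in l and "NS" in l])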

Listing 3 and Listing 4 show the contents of the caches of both BIND and Unbound after sending the aforementioned query. As can be seen, the values stored in the cache are the ones provided by the authoritative server (decremented by a few seconds by the time we dumped the contents), and not the ones provided by the parent's referral:

; authanswer
amazon.com.    3595    NS    ns1.p31.dynect.net.
amazon.com.    3595    NS    ns2.p31.dynect.net.
amazon.com.    3595    NS    ns3.p31.dynect.net.
amazon.com.    3595    NS    ns4.p31.dynect.net.
amazon.com.    3595    NS    pdns1.ultradns.net.
amazon.com.    3595    NS    pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com

;rrset 3594 6 0 7 3
amazon.com.    3594    IN    NS    pdns1.ultradns.net.
amazon.com.    3594    IN    NS    ns1.p31.dynect.net.
amazon.com.    3594    IN    NS    ns3.p31.dynect.net.
amazon.com.    3594    IN    NS    pdns6.ultradns.co.uk.
amazon.com.    3594    IN    NS    ns4.p31.dynect.net.
amazon.com.    3594    IN    NS    ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com

A.4 Discussion and Implications

Understanding how glue records affect TTLs is important to show that zone operators are, in fact, in charge of how long their NS records will remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers – queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of 60 s (where all queries go to the authoritative, AA), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72 Alternate DNS
23.253.163.53 Alternate DNS
205.204.88.60 BlockAid Public DNS (or PeerDNS)
178.21.23.150 BlockAid Public DNS (or PeerDNS)
91.239.100.100 Censurfridns
89.233.43.71 Censurfridns


Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time, in answers vs. minutes after start. (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s (1 hour), probing interval T=10 min.

2001:67c:28a4:: Censurfridns
2002:d596:2a92:1:71:53:: Censurfridns
213.73.91.35 Chaos Computer Club Berlin
209.59.210.167 Christoph Hochstätter
85.214.117.11 Christoph Hochstätter

212.82.225.7 ClaraNet
212.82.226.212 ClaraNet
8.26.56.26 Comodo Secure DNS
8.20.247.20 Comodo Secure DNS
84.200.69.80 DNS.Watch
84.200.70.40 DNS.Watch
2001:1608:10:25::1c04:b12f DNS.Watch
2001:1608:10:25::9249:d69b DNS.Watch
104.236.210.29 DNSReactor
45.55.155.25 DNSReactor
216.146.35.35 Dyn
216.146.36.36 Dyn
80.67.169.12 FDN
2001:910:800::12 FDN
85.214.73.63 FoeBud
87.118.111.215 FoolDNS
213.187.11.62 FoolDNS
37.235.1.174 FreeDNS
37.235.1.177 FreeDNS
80.80.80.80 Freenom World
80.80.81.81 Freenom World
87.118.100.175 German Privacy Foundation e.V.
94.75.228.29 German Privacy Foundation e.V.
85.25.251.254 German Privacy Foundation e.V.
62.141.58.13 German Privacy Foundation e.V.
8.8.8.8 Google Public DNS
8.8.4.4 Google Public DNS
2001:4860:4860::8888 Google Public DNS
2001:4860:4860::8844 Google Public DNS
81.218.119.11 GreenTeamDNS
209.88.198.133 GreenTeamDNS
74.82.42.42 Hurricane Electric
2001:470:20::2 Hurricane Electric
209.244.0.3 Level3
209.244.0.4 Level3
156.154.70.1 Neustar DNS Advantage
156.154.71.1 Neustar DNS Advantage
5.45.96.220 New Nations
185.82.22.133 New Nations
198.153.192.1 Norton DNS
198.153.194.1 Norton DNS
208.67.222.222 OpenDNS
208.67.220.220 OpenDNS
2620:0:ccc::2 OpenDNS
2620:0:ccd::2 OpenDNS
58.6.115.42 OpenNIC
58.6.115.43 OpenNIC
119.31.230.42 OpenNIC
200.252.98.162 OpenNIC
217.79.186.148 OpenNIC
81.89.98.6 OpenNIC
78.159.101.37 OpenNIC
203.167.220.153 OpenNIC
82.229.244.191 OpenNIC
216.87.84.211 OpenNIC
66.244.95.20 OpenNIC
207.192.69.155 OpenNIC
72.14.189.120 OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9 OpenNIC
2001:470:1f07:38b::1 OpenNIC
2001:470:1f10:c6::2001 OpenNIC
194.145.226.26 PowerNS


Figure 14: Answers received during DDoS attacks (extra graphs): OK, SERVFAIL, and no answer, vs. minutes after start. (a) Experiment D: 1800-50p-10min-ns1; arrows indicate DDoS start and recovery. (b) Experiment G: 300-75p-10min-2NSes; arrows indicate DDoS start and recovery.

Figure 15: Latency results (extra graphs): median, mean, 75%ile, and 90%ile RTT in ms, vs. minutes after start; the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment D: 50% packet loss on 1 NS only (1800 s TTL). (b) Experiment G: 75% packet loss (300 s TTL).

77.220.232.44 PowerNS
9.9.9.9 Quad9
2620:fe::fe Quad9
195.46.39.39 SafeDNS
195.46.39.40 SafeDNS
193.58.251.251 SkyDNS
208.76.50.50 SmartViper Public DNS
208.76.51.51 SmartViper Public DNS
78.46.89.147 ValiDOM
88.198.75.145 ValiDOM
64.6.64.6 Verisign
64.6.65.6 Verisign
2620:74:1b::1:1 Verisign
2620:74:1c::2:2 Verisign
77.109.148.136 Xiala.net
77.109.148.137 Xiala.net
2001:1620:2078::136 Xiala.net
2001:1620:2078::137 Xiala.net
77.88.8.88 Yandex.DNS
77.88.8.2 Yandex.DNS
2a02:6b8::feed:bad Yandex.DNS
2a02:6b8:0:1::feed:bad Yandex.DNS
109.69.8.51 puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for Experiments D and G.

Figure 14 shows the number of answers for Experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when one NS fails (50% loss), there is no significant change in the number of answered queries. Moreover, most users will not notice it in terms of latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (thus half of the probing interval). Similarly to Experiment I, this experiment should not have caching active, given the TTL is shorter. Moreover, it has a packet loss rate of 75%. At this rate, we can see that the far majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur in current DNS software. These experiments confirm that prior work by Yu et al. [56] still holds. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) up, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from net and querying one or both based on their internal algorithms. All queries are made with a cold cache. In total,


Figure 17: Probe 28477 on dataset I: all R1s (r1a, r1b, r1c) are connected to all Rns (rn1–rn8), which are connected to all ATs (at1, at2); the stub resolver is the Atlas probe.

Figure 16: Histogram of queries from recursive resolvers to the Root, net, and cachetest.net authoritatives, for BIND and Unbound, under normal operation and during DDoS.

we send 100 queries per experiment, and present the results in this section.

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the net authoritative servers, and the authoritative servers for cachetest.net.

Figure 16 shows the results of this experiment, with normal behavior in the left two groups and the inaccessible (DDoS) attempts in the right two. From this figure we see that, in normal operation, a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the Root, one to net, and one to the cachetest.net authoritative servers) to resolve the AAAA records of sub.cachetest.net. Unbound, in turn, takes 5 queries – it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist (thus this difference can be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure. Both BIND and Unbound send far more DNS queries to all authoritative servers available. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps on trying to resolve the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 queries in total, out of 46). If we disregard these queries, Unbound would then increase the total number of queries from 6 to 16, about a 2.7× increase. (We repeated this experiment 100 times, and all results were consistent.)
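Classifying which server group each query went to is easiest by destination address. The sketch below reads a capture with scapy and counts queries per group; the address sets are illustrative (two real root letters, one real .net server, and hypothetical cachetest.net addresses), and our actual pipeline differed.

from collections import Counter
from scapy.all import rdpcap, IP, DNS  # assumes scapy is installed

ROOTS = {"198.41.0.4", "199.9.14.201"}         # a- and b-root (subset)
NET_NS = {"192.5.6.30"}                        # a.gtld-servers.net
CACHETEST_NS = {"203.0.113.1", "203.0.113.2"}  # hypothetical cachetest.net NSes

def queries_by_target(pcap_path):
    counts = Counter()
    for pkt in rdpcap(pcap_path):
        if not (pkt.haslayer(IP) and pkt.haslayer(DNS) and pkt[DNS].qr == 0):
            continue  # keep only DNS queries (qr == 0), not responses
        dst = pkt[IP].dst
        for label, group in (("root", ROOTS), ("net", NET_NS),
                             ("cachetest.net", CACHETEST_NS)):
            if dst in group:
                counts[label] += 1
    return counts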

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND will ask both parents (the Roots and net) for its records again (shown by the increase in the Root and net portions of BIND DDoS vs. BIND non-DDoS in Figure 16); Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect – recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last layer" recursives (the ones that ultimately send queries to authoritatives, Rn1–Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (RIPE Atlas probes) and the server side (authoritative servers). We show the summary of the results in Table 7. In this table, the probing intervals T during which the simulated DDoS was active (rows T=7–12) are marked in the caption (dataset I: 90% packet drop on both authoritative servers – Table 4).

Client Side: analyzing this table, we can learn the following from the client side.

During normal operations, we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1s used in the answers). By default, each RIPE Atlas probe sends a DNS query to each of its R1s.

During the DDoS this changes slightly: we see 2 answers from 2 R1s – the other query either times out or returns SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval).

This is, in fact, a good result, given the simulated DDoS.

Authoritative side: now let us consider the authoritative side.


T    Client View (Atlas probe)         Authoritative Server View
     Queries  Answers  #R1             Queries  Answers  #ATs  #Rn  #(Rn–AT)  Max (% of total)  Queries (top-2 Rns)
1    3        3        3               3        3        2     2    3         1 (33.3%)         2, 1
2    3        3        3               5        5        2     4    5         1 (20.0%)         2, 1
3    3        3        3               6        6        2     6    6         1 (16.7%)         1, 1
4    3        3        3               5        5        2     2    3         2 (40.0%)         4, 1
5    3        3        3               3        3        2     3    3         1 (33.3%)         1, 1
6    3        3        3               6        6        2     5    5         2 (33.3%)         2, 1
7    3        2        2               29       3        2     2    4         10 (34.5%)        15, 14
8    3        2        2               22       6        2     7    9         7 (31.8%)         10, 7
9    3        2        2               21       1        2     3    6         9 (42.9%)         16, 3
10   3        2        2               11       4        2     3    4         6 (54.5%)         8, 2
11   3        2        2               21       3        2     4    6         15 (71.4%)        17, 2
12   3        2        2               23       1        2     8    8         15 (65.2%)        15, 2
13   3        3        3               3        3        2     3    3         1 (33.3%)         1, 1
14   3        3        3               4        2        2     4    4         1 (25.0%)         1, 1
15   3        3        3               8        8        2     7    7         7 (87.5%)         1, 1
16   3        3        3               9        9        2     3    3         6 (66.7%)         6, 3
17   3        3        3               9        9        2     5    5         5 (55.6%)         5, 1

Table 7: Client and authoritative view of AAAA queries (probe 28477, measurement I). Rows T=7–12 cover the simulated DDoS; Max is the largest number of queries from a single Rn–AT combination.

In normal operation (before the DDoS), we see that the 3 queries sent by the clients (client view) lead to 3 to 6 queries being sent by the Rns, which reach the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rns used before the attack varies from 2 to 6 (#Rn). This choice depends on how the R1s choose their upper-layer recursives (Rn−1), and ultimately on how Rn−1 chooses Rn.

We can also observe how each Rn chooses each authoritative (#(Rn–AT)): this metric shows the number of unique combinations of Rn and authoritatives observed. For example, for T = 1, we see that three queries reach the authoritatives, but only 2 Rns were used. By looking at the number of unique recursive–AT combinations, we see that there are three; as a consequence, one of the Rns must have used both authoritatives.

During the DDoS, however, things change: we see that the same 3 queries sent by probe 28477 (client side) now turn into 11–29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) Increase in the number of used Rns: we see that the number of used Rns increases a little (3.6 on average before, 4.5 during). (However, not all Rns are used all the time during the DDoS. We do not have enough data to explain why this happens – it surely depends on how the lower-layer recursives choose these ones, or on how they are set up.)

(2) Increase in the number of authoritatives used by each Rn: each Rn in turn contacts an average of 1.13 authoritatives before the attack (average of #(Rn–AT)/#Rn). During the DDoS, this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top 2 recursives (last column of Table 7) and see that, compared to the total number of queries, they comprise the far majority.

For this probe, the answer to why there are so many queries is thus a combination of factors, with retries by the Rns the most important one. From the client side, there is no strong indication that the client's queries are leading to such an increase in the number of queries to authoritatives during a DDoS.
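The per-interval columns of Table 7 follow mechanically from the list of (Rn, authoritative) pairs seen for this probe's name; the sketch below recomputes them (our illustration, not the original analysis script).

from collections import Counter

def probe_metrics(rows):  # rows: list of (rn, at) pairs, one per query
    pairs = Counter(rows)                  # unique Rn-AT combinations
    per_rn = Counter(rn for rn, _ in rows)
    return {"queries": len(rows),
            "n_rn": len(per_rn), "n_at": len({at for _, at in rows}),
            "n_rn_at": len(pairs), "max_pair": max(pairs.values()),
            "top2": [q for _, q in per_rn.most_common(2)]}

# A T=7-like interval: two Rns hammer both authoritatives with 29 queries.
print(probe_metrics([("rn1", "at1")] * 8 + [("rn1", "at2")] * 7 +
                    [("rn2", "at1")] * 10 + [("rn2", "at2")] * 4))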



Figure 5: CDF of the number of queries for the DS record of nl received from each recursive, per Root letter (F-Root, H-Root, and all roots; all other Root servers, with similar distributions, are shown in light-gray lines). Dataset: DNS-OARC DITL, 2017-04-12t00:00Z, for 24 hours.

G-Root), and create a distribution of the number of queries received per source IP address (i.e., per recursive).

In this analysis we use data from the DITL (Day In The Life) dataset of 2017, available at DNS-OARC [9]. We look at all DNS queries for the DS record of the domain nl received at the Root DNS servers along the entire day of April 12, 2017 (UTC). This dataset consists of queries from more than 703k unique recursives, seen across all Root servers. Note that the DS record for nl has a TTL of 86400 seconds (24 hours). That is, in theory, one could expect to see just one query per recursive arriving at a given Root letter for the DS record of nl within the 24-hour interval.

Each line in Figure 5 shows the distribution of the total number of queries received at the Root servers from individual recursives asking for the DS record of nl. Besides F- and H-Root, the distribution is similar across all Root servers; these are plotted in light-gray lines. F-Root shows the "most friendly" behavior from recursives, where around 5% of them sent 5 or more queries for nl. As opposed to F, H-Root (dotted red line) shows the "worst" behavior from recursives, where more than 10% of them sent 5 or more queries for nl within the 24-hour period.

The solid black line in Figure 5 shows the distribution for all the queries across all Root servers. The majority (around 87%) of recursives send only one query within the 24-hour interval. However, considering all Root servers, we see around 13% of recursives that sent multiple queries. Note that the distributions shown in Figure 5 have (very) long tails, and we see up to more than 218k queries from a single recursive within the 24-hour period for the nl DS record, i.e., roughly one query every 4 seconds from the same IP address for the same DS record.
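The distribution behind Figure 5 is a simple group-by over the trace: count nl DS queries per source address, then look at how many sources ask once versus repeatedly. A sketch, assuming a hypothetical record format of (src_ip, qname, qtype) tuples:

from collections import Counter

def queries_per_recursive(records):
    counts = Counter(src for src, qname, qtype in records
                     if qname == "nl." and qtype == "DS")
    return sorted(counts.values())

counts = queries_per_recursive([("192.0.2.1", "nl.", "DS")] * 3 +
                               [("192.0.2.2", "nl.", "DS")])
frac_single = sum(1 for c in counts if c == 1) / len(counts)
print(counts, frac_single)  # [1, 3] 0.5: half honor the 24-hour TTL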

Discussion: we conclude that measurements of popular domains within nl (§4.1) and the Roots (§4.2) show that about 63% and 87% of recursives honor the full TTL, respectively. These results are roughly in line with our observations with RIPE Atlas (§3).

5 THE CLIENT'S VIEW OF AUTHORITATIVES UNDER DDOS

We next use controlled experiments to evaluate how DDoS attacks at authoritative DNS servers impact the client experience. Our studies of caching in controlled experiments (§3) and passive observations (§4) have shown that caching often works, but not always – about

70% of controlled experiments and 30% of passive observations see full cache lifetimes. Since results of specific experiments vary, we sweep the space of attack intensities to understand the range of responses, from complete failure of authoritative servers to partial failures.

5.1 Emulating DDoS

To emulate DDoS attacks, we begin with the same test domain (cachetest.nl) we used for controlled experiments in §3. We run a normal DNS service for some time, querying from RIPE Atlas. After caches are warm, we then simulate a DDoS attack by dropping some fraction, or all, of incoming DNS queries to each authoritative. (We drop incoming traffic randomly with Linux iptables; as such, packet drop is not biased towards any recursive.) After we begin dropping traffic, answers come either from caches at recursives or, for partial attacks, from a lucky query that passes through.
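For concreteness, the following sketch shows how such a random drop can be installed with iptables' statistic match (the mechanism named above; the exact rule is our reconstruction, and it must run as root on the authoritative):

import subprocess

def set_loss(probability):
    # Drop each incoming DNS query independently with the given probability.
    subprocess.run(
        ["iptables", "-A", "INPUT", "-p", "udp", "--dport", "53",
         "-m", "statistic", "--mode", "random",
         "--probability", str(probability), "-j", "DROP"],
        check=True)

# set_loss(0.75)  # e.g., emulate Experiment F's 75% loss at this server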

This emulation of DDoS captures the traffic loss that occurs in a DDoS attack as router queues overflow. The emulation is not perfect, since we simulate loss at the last-hop router, while in real DDoS attacks packets are often lost on access links near the target. Our emulation approximates this effect with one aggregate loss rate.

DDoS attacks are also accompanied by queueing delay, since buffers at and near the target are full. We do not model queueing delay, although we do observe latency increasing due to retries. In modern routers, queueing delay due to full router buffers should be less than the retry interval. In addition, observations during real-world DDoS events show that the few queries that are successful see response times that are not much higher than typical [23], suggesting that loss (and not delay) is the dominant effect of DDoS in practice. However, a study that adds queueing latency to the attack model is interesting future work.

5.2 Clients During Complete Authoritatives Failure

We first evaluate the worst-case scenario for a DNS operator: complete unreachability of all authoritative name servers. Our goal is to understand when, and for how long, caches cover such an outage.

Table 4 shows Experiments A, B and C, which simulate complete failure. In Experiment A, each VP makes only one query before the DDoS begins. In Experiment B we allow several queries to take place, and Experiment C allows several queries with a shorter TTL.

Caches Protect Some: we first consider Experiment A, with one query that warms the cache, immediately followed by the attack. Figure 6a shows these responses over time, with the onset of the attack at the first downward arrow (between 0 and 10 minutes) and with the cache expired after the second downward arrow (between 60 and 70 minutes). We see that after the DDoS starts, but before the cache has fully expired (between the downward arrows), initially 30% and eventually 65% of queries fail with either no answer or a SERVFAIL error. While not good, this does mean that 35% to 70% of queries during the DDoS are successfully served from the cache. By contrast, shortly after the cache expires, almost all queries fail (only 25 VPs, or 0.2% of the total, seem to provide stale answers).

Caches Fill at Different Times: in a more realistic scenario, VPs have filled their caches at different times. In Experiment A, caches are freshly filled and should last for a full hour after the start of the attack. Experiment B is designed for the opposite and worst


     Experiment Parameters
     TTL    DDoS   DDoS  queries  total  probe     failure
     (s)    start  dur.  before   dur.   interval
A    3600   10     60    1        120    10        100% (both NSes)
B    3600   60     60    6        240    10        100% (both NSes)
C    1800   60     60    6        180    10        100% (both NSes)
D    1800   60     60    6        180    10        50% (one NS)
E    1800   60     60    6        180    10        50% (both NSes)
F    1800   60     60    6        180    10        75% (both NSes)
G    300    60     60    6        180    10        75% (both NSes)
H    1800   60     60    6        180    10        90% (both NSes)
I    60     60     60    6        180    10        90% (both NSes)

     Results
     Total   Valid   VPs     Queries  Total    Valid
     probes  probes                   answers  answers
A    9224    8727    15339   136423   76619    76181
B    9237    8827    15528   357102   293881   292564
C    9261    8847    15578   258695   199185   198197
D    9139    8708    15332   286231   273716   272231
E    9153    8708    15320   285325   270179   268786
F    9141    8727    15325   278741   259009   257740
G    9206    8771    15481   274755   249958   249042
H    9226    8778    15486   269030   242725   241569
I    9224    8735    15388   253228   218831   217979

Table 4: DDoS emulation experiments [38]; DDoS start, durations, and probe interval are given in minutes.

Figure 6: Answers received during DDoS attacks (OK, SERVFAIL, and no answer, vs. minutes after start). (a) Experiment A: 3600-10min-1down; arrows indicate DDoS start and cache expiration. (b) Experiment B: 3600-10min-1down-1up; arrows indicate DDoS start and recovery. (c) Experiment C: 1800-10min-1down-1up; arrows indicate DDoS start, cache expiration, and recovery.

[Figure 7: Timeseries of answers by type (AA, CC, CA) for Experiment B, vs. minutes after start.]

Experiment B is designed for the opposite and worst case: we begin warming the cache one hour before the attack and query 6 times from each VP. Other parameters are the same, with the attack lasting for 60 minutes (also the cache duration), but then we restore the authoritatives to service.

Figure 6b shows the results of Experiment B. While about 50% of VPs are served from the cache in the first 10-minute round after the DDoS starts, the fraction served drops quickly and is at only about 3% one hour later. Three factors are in play here. First, most caches were filled 60 minutes before the attack and are timing out in the first round: while the timeout and query rounds are both 60 minutes apart, Atlas intentionally spreads queries out over 5 minutes, so we expect that some queries happen after 59 minutes and others after 61 minutes.

Second, we know some large recursives have fragmented caches (§3.5), so we expect that some of the successes between times 70 and 110 minutes are due to caches that were filled between times 10 and 50 minutes. This can actually be seen in Figure 7, a timeseries of the answers for Experiment B, where we see CC (correct cache responses) between times 60 and 90.

Third, we see an increase in the number of CA queries, which are answered by the cache with expired TTLs (Figure 7). This increase is due to servers serving stale content [19].

Caches Eventually All Expire: Finally, we carry out a third emulation, but with half the cache lifetime (1800 s, or 30 minutes, rather than the full hour). Figure 6c shows responses over time. These results are similar to Experiment B, with rapid fall-off when the attack starts as caches age. After the attack has been underway for 30 minutes, all caches must have expired, and we see only a few (about 2.6%) residual successes.


5.3 Discussion of Complete Failures

Overall, we see that caching is partially successful in protecting during a DDoS. With full, valid caches, half or more VPs get service. However, caches are filled at different times and expire, so an operator cannot count on a full cache duration for any customers, even for popular (“always in the cache”) domains. The protection provided by caches depends on their state in the recursive resolver, something outside the operator's control. In addition, our evaluation of caching in §3 showed that caches will end early for some VPs.

Second, we were surprised that a tiny fraction of VPs are successful after all caches should have timed out (after the 80-minute period in Experiment A, and between 90 and 110 minutes in Experiment C). These successes suggest an early deployment of “serve stale”, something currently under review in the IETF [19]: a recursive serves a previously known record beyond its TTL if authoritatives are unreachable, with the goal of improving resilience under DDoS. We investigated Experiment A, where we see that 1048 of the 1140 successes in the second half of the outage are such answers. These successes come from 471 VPs (and 215 recursives), most of them answered by OpenDNS and Google Public DNS servers, suggesting experimentation that is not yet widespread. Out of these 1048 queries, 1031 return a TTL value equal to 0, as specified in the IETF serve-stale draft [19].
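Serve-stale behavior has since become an explicit configuration option in mainstream resolver software. As a minimal illustration only (the option names assume a recent Unbound release; the versions measured in this paper predate widespread support), an unbound.conf excerpt enabling this behavior might look like:

    server:
        # answer from expired (stale) cache entries when the
        # authoritatives cannot be reached
        serve-expired: yes

BIND offers comparable stale-answer options in recent releases.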

5.4 Client Reliability During Partial Authoritative Failure

The previous section examined DDoS attacks that result in complete failure of all authoritatives, but often DDoS attacks result in partial failure, with 50% or 90% packet loss at the authoritatives. (For example, consider the November 2015 DDoS attack on the DNS Root [23].) We next study experiments with partial failures, showing that caching and retries together nearly fully protect 50% DDoS events, and protect half of VPs even during 90% events.

We carry out several experiments, D to I in Table 4. We follow the procedure outlined in §5.1, looking at DDoS-driven loss rates of 50%, 75%, and 90%, with TTLs of 1800 s, 300 s, and 60 s. Graphs omitted due to space can be found in Appendix D.

Near-Full Protection from Caches During Moderate Attacks: We first consider Experiment E, a “mild” DDoS with 50% loss, with VP success over time in Figure 8a. In spite of a loss rate that would be crippling to TCP, nearly all VPs are successful in DNS. This success is due to two factors: first, we know that many clients are served from caches, as was shown in Experiment A with full loss (Figure 6a). Second, most recursives retry queries, so they recover from loss of a single packet and are able to provide an answer. Together, these mean that the failure rate during the first 30 minutes of the event is 8.5%, only slightly higher than the 4.8% fraction of failures before the DDoS. For this experiment the TTL is 1800 s (30 minutes), so we might expect failures to increase halfway through the DDoS. We do not see any increase in failures because caching and retries are synergistic: a successful retried query will place the answer in a cache for a later query. The importance of this result is that DNS can survive moderate-size attacks when caching is possible. While a positive, retries do increase latency, something we study in §5.5.
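A back-of-envelope calculation (ours, independent of the measurements) shows why retries alone are so effective: if each try is lost independently with probability p, then with up to k tries the chance of getting at least one response is

    P(success) = 1 − p^k.

With 50% loss (p = 0.5) and three tries, 87.5% of resolutions succeed; even with 90% loss (p = 0.9), three tries succeed 27% of the time, and caching raises the observed success rates well above these floors.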

Attack Intensity Matters: While clients do quite well with 50% loss at all authoritatives, failures increase with the intensity of the attack.

[Figure 8: Answers received during DDoS attacks; each plot shows answers per 10-minute round (OK, SERVFAIL, no answer) vs. minutes after start; the first and second vertical lines show the start and end of the DDoS. (a) Experiment E (1800-50p-10min): 50% packet loss. (b) Experiment F (1800-75p-10min): 75% packet loss. (c) Experiment H (1800-90p-10min): 90% packet loss. (d) Experiment I (60-90p-10min): 90% packet loss.]


Experiments F and H, shown in Figure 8b and Figure 8c, increase the loss rate to 75% and 90%. We see the number of failures increases, to about 19.0% with 75% loss and 40.3% with 90% loss. It is important to note that roughly 60% of the clients are still served even with 90% loss.

We also see that this level of success is consistent over the entire hour-long DDoS event, even though the cache duration is only 30 minutes. This consistency confirms the importance of caching and retries in combination.

To verify the effects of this interaction, Experiment I changes the caching duration to 60 s, less than one round of probing. Comparing Experiment I in Figure 8d to H in Figure 8c, we see that the failure rate increases from 30% to about 63%. However, even with no caching, about 37% of queries are still answered, due to resolvers that serve stale content and to recursives' retries. We investigate retries in §6.

5.5 Client Latency During Partial Authoritative Failure

We showed that client reliability is higher than expected during failures (§5.4), due to a combination of caching and retries. We next consider client latency. Latency will increase during the DDoS because of retries and queueing delay, but we will show that latency increases less than one might expect, due to caching.

To examine latency, we return to Experiments D through I (Table 4), but look at latency (time to complete a query) rather than success. For these experiments, clients time out after 5 s.

Figures 9a to 9d show latency during each emulated DDoS scenario (experiments with figures omitted here are in Appendix D). Latencies are not evenly distributed, since some requests get through immediately while others must be retried one or more times, so in addition to the mean we show the 50%, 75%, and 90% quantiles to characterize the tail of the distribution.

We emulate DDoS by dropping requests (§5.1), and hence latencies reflect retries and loss but not queueing delay, underrepresenting latency in real-world attacks. However, their shape (many low latencies and a few long ones) is consistent with, and helps explain, what has been seen in the past [23].
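For readers who wish to emulate similar controlled loss, one possible mechanism (a sketch; not necessarily the exact mechanism of our testbed) is the iptables statistic module on a Linux authoritative:

    # drop 90% of incoming DNS queries at random, as in Experiments H and I
    iptables -A INPUT -p udp --dport 53 -m statistic --mode random --probability 0.9 -j DROP

Deleting the rule (iptables -D with the same arguments) restores normal service, emulating the end of the attack.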

Beginning with Experiment E, the moderate attack, in Figure 9a, we see no change to median latency. This result is consistent with many queries being handled by the cache, and with half of those not handled by the cache getting through anyway. We do see higher latency in the 90%ile tail, reflecting successful retries. This tail also increases the mean somewhat.

This trend increases in Experiment F (Figure 9b), where 75% of queries are lost. Now the 75%ile tail has increased, as has the number of unanswered queries, and the 90%ile is twice as long as in Experiment E.

We see similar latency in Experiment H, with the DDoS causing 90% loss. We set the timeouts to 5 s, so the larger attack results in more unsuccessful queries, but latency for successful queries is not much worse than with 75% loss. Median latency is still low due to cached replies.

Finally, Experiment I greatly reduces opportunities for caching by reducing the cache lifetime to one minute. Figure 9d shows that loss of caching increases median RTT and significantly increases the tail latency. Compared with Figure 9c (same packet loss ratio but 1800 s TTL), we can clearly see the benefits of caching in terms of latency (in addition to reliability): a half-hour TTL value reduced the latency from 1300 ms to 390 ms.

[Figure 9: Latency results (median, mean, 75%ile, and 90%ile RTT in ms vs. minutes after start); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment E: 50% packet loss (1800 s TTL). (b) Experiment F: 75% packet loss (1800 s TTL). (c) Experiment H: 90% packet loss (1800 s TTL). (d) Experiment I: 90% packet loss (60 s TTL).]


Longer TTLs also help reduce tail latency relative to shorter TTLs (compare, for example, the 90%ile RTT in Experiments I and H in Figure 9).

Summary: DDoS effects often increase client latency. For moderate attacks, increased latency is seen only by a few “unlucky” clients who do not see a full cache and whose queries are lost. Caching has an important role in reducing latency during DDoS, but while it can often mitigate most reliability problems, it cannot avoid latency penalties for all VPs. Even when caching is not available, roughly 40% of clients get an answer, either via serve-stale or via retries, as we investigate next.

6 THE AUTHORITATIVE'S PERSPECTIVE

Results of partial DDoS events (§5.4) show that DNS is surprisingly reliable: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers (Figure 8d) even with minimal-duration caches. These results are due to a combination of caching and retries. We next examine this from the perspective of the authoritative server.

6.1 Recursive-Authoritative Traffic during a DDoS

We first ask: are retries by recursive resolvers responsible for the success rates observed in §5.4? To investigate this question, we return to the partial DDoS experiments and look at how many queries are sent to the authoritative servers. We measure queries before they are dropped by our simulated DDoS. Recursives must make multiple queries to resolve a name. We break out each type of query: for the nameserver (NS), for the nameserver's IPv4 and IPv6 addresses (A-for-NS and AAAA-for-NS), and finally for the desired query (AAAA-for-PID). Note that the authoritative is IPv4 only, so AAAA-for-NS is non-existent and subject to negative caching, while the other records exist and use regular caching.

We begin with the DDoS causing 75% loss, in Figure 10a. For this experiment we observe 18407 unique IP addresses of recursives (Rn) querying for AAAA records directly at our authoritatives. During the DDoS, queries increase by about 3.5×. We expect 4 tries, since the expected number of tries until success with loss rate p is (1 − p)^−1. For this scenario, results are cached for up to 30 minutes, so successful queries are reused from recursive caches. This increase occurs both for the target AAAA record and also for the non-existent AAAA-for-NS records. Negative caching for our zone is configured to 60 s, making caching of NXDOMAINs for AAAA-for-NS less effective than positive caches.
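The expected-tries figure follows from a short, standard derivation (included here for completeness): if each try independently fails with probability p, the number of tries T until the first success is geometrically distributed, so

    E[T] = Σ_{k≥1} k · p^(k−1) · (1 − p) = 1 / (1 − p).

With p = 0.75 this gives E[T] = 4, and with p = 0.9 it gives E[T] = 10; the observed traffic growth (3.5× and 8.2×) is slightly lower because answers served from caches require no retries at all.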

The offered load on the server increases further with more loss (90%), as shown in Experiment H (Figure 10b). The higher loss rate results in a much higher offered load on the server, on average 8.2× normal.

Finally, in Figure 10c, we reduce the effects of caching with a 90% DDoS and a TTL of 60 s. Here we also see about 8.1× more queries at the server than before the attack. Comparing this case to Experiment H, caching reduces the offered load on the server by about 40%: both experiments grow by about 8× relative to their own baselines, but with a 60 s TTL the baseline itself is higher, since almost nothing can be served from caches.

[Figure 10: Number of queries received by the authoritative servers, broken out by type (NS, A-for-NS, AAAA-for-NS, AAAA-for-PID), vs. minutes after start; the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment F (1800-75p-10min): 75% packet loss. (b) Experiment H (1800-90p-10min): 90% packet loss. (c) Experiment I (60-90p-10min): 90% packet loss.]

Implications: The implication of this analysis is that legitimate clients “hammer” the already-stressed server with retries during a DDoS. For clients, retries are important to get reliability, and each client independently chooses to retry.

The server is already under stress due to the DDoS, so these retries add to that stress. However, the DDoS traffic is almost certainly much larger than the retries of legitimate traffic. (A server experiencing a volumetric attack causing 90% loss must be receiving 10× its capacity. Regular traffic is a small fraction of normal capacity, so even 4× regular traffic is still much less than the attack traffic.) The multiplier for retried legitimate traffic depends on the implementations of the stub and recursive resolvers, as well as on application-level retries and defection (users hitting reload in their browsers and later giving up). Our experiment omits application-level retries and thus likely gives a lower bound. We next examine specific recursive implementations to see their behavior.

6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers “hammer” authoritatives when queries are dropped. We reexamine DNS software (updating prior work from 2012 [56]), and additionally show that deployments amplify retries.

Recursive Software: Prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], with evaluation of BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS, and PowerDNS. We studied retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records of all authoritatives and the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7× more requests before the recursives time out. (Exact numbers vary in different runs, but typically each request is made 6 or 7 times.) Such retries are appropriate provided they are paced (both use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.
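These counts can be reproduced with standard tools; as a sketch only (assuming a local recursive listening on 127.0.0.1, with the authoritative-facing interface eth0 and our test zone used for illustration):

    # cold cache: flush the test zone, then issue a single client query
    unbound-control flush_zone cachetest.net
    dig @127.0.0.1 AAAA sub.cachetest.net

    # meanwhile, count what actually reaches an (unresponsive) authoritative
    tcpdump -nn -i eth0 udp port 53 and dst host <authoritative-IP>

With the authoritatives blocked, each retried query appears in the tcpdump output and can simply be counted.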

Recursive Deployment: Another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify amplification, we count both the number of Rn recursives and the AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values represent the amplification in two ways: during stress, more Rn recursives are used for each probe ID, and these Rn generate more queries to the already-stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4). The maximum rises to 39. The median number of queries for each probe ID grows more than 3× (from 2 to 7). Worse, the 90%ile grows more than 6× (3 queries to 18). The maximum grows 53.5×, reaching up to 286 queries for one single probe ID. This value, however, is a lower bound, given that there are a large number of A and AAAA queries that ask for NS records and not for the probe ID (AAAA-for-NS and A-for-NS in Figure 10).

We can also look at the aggregate effects of retries created by the complex recursive infrastructure. Figure 12 shows the timeseries of unique IP addresses of Rn observed at the authoritatives. Before the DDoS period, for Experiment I with its TTL of 60 s, we see a constant number of recursives reaching our authoritatives; i.e., all queries should be answered by authoritatives (no caching at this TTL value). For Experiments F and H, both with TTLs of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks are observed when caches expire, as expected.

During the DDoS we observe similar behavior for all three experiments in Figure 12: as packets are dropped at the authoritatives (at rates of 75%, 90%, and 90% for F, H, and I, respectively), we see an increase in the number of Rn recursives querying our authoritatives. For Experiments F and H we see drops when caching is expected, but not for Experiment I.

[Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs; lines show the median, 90%ile, and maximum of Rn-per-PID and of AAAA-for-PID queries, vs. minutes after start.]

[Figure 12: Unique Rn recursive addresses observed at the authoritatives vs. minutes after start, for Experiments F, H, and I.]

The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which amplifies the traffic in the end. (We show this behavior for an individual probe in Appendix F, where we observe the growth in the number of queries received at the authoritatives and in the number of recursives used.)

Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to be within a constant factor of single recursives, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and their operators to share information about their behavior.

7 RELATED WORK

Caching by Recursives: Several groups have shown that DNS caching can be imperfect. Hao and Wang analyzed the impact of nonce domains on DNS recursives' caches [13]. Using two weeks of data from two universities, they showed that filtering one-time domains improves cache hit rates. In two studies, Pang et al. [31, 32] reported that web clients and local recursives do not always honor TTL values provided by authoritatives. Almeida et al. [2] analyzed DNS traces of a mobile operator, and used a mobile application to see TTLs in practice. They find that most domains have short TTLs (less than 60 s), and report evidence of TTL manipulation by recursives. Schomp et al. [48] demonstrate widespread use of multi-level recursives by large operators, as well as TTL manipulation.


Our work builds on this prior work, examining caching and TTL manipulation systematically and considering their effects on resilience.

DNS Client Behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, and found that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives During DDoS: We investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast and both latency and reachability, as seen from RIPE Atlas. Rather than looking at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while the prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study has incomplete data and focuses on specific results of two events. These differences stem from their study of natural experiments from real-world events and our controlled experiments.

8 IMPLICATIONS

We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get to authoritative servers. The key implication of our study is to explain differences in the outcomes of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative “server” implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: “There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability...” [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure and immediately took steps to address service problems, there were reports of user-visible service disruption in the technical and even popular press [34]. Reports describe intermittent failure of prominent websites including “Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times”, each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cachable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports 922 anycast instances operating across the 13 authoritative servers. Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 “facilities”), and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components with TTLs that range from 120 to 300 s.

In addition to explaining the effects, our experiments help get to the root causes behind these outcomes. Users of the Root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs like .com and country codes) are popular and certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) receive performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to allow caching to help and to give DNS operators time to deploy defenses; Experiment H suggests 30 minutes (Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow for re-provisioning during DDoS attacks on web servers, but that leaves DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, combined with long DNS TTLs, may be preferable to short DNS TTLs, so that both layers can be robust. However, the complexity of interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize geographic dispersion of sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with the largest latency [27]. Prior work about resilience to DDoS attacks has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic that site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be as resilient as the strongest individual authoritative. The reason for these opposite results is that, in both cases, recursive resolvers will try all authoritatives of a given service. For latency, they will sometimes choose a distant authoritative, but for resilience, they will continue until they find the most available authoritative.

9 CONCLUSIONS

This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole.


In fact, they can largely cover over partial DDoS attacks for many users: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers (Figure 8d) even with minimal-duration caches.

The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users, with a few getting much greater latency while others see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruption, and that this effect is magnified by complex multi-layer recursive resolver systems. The key outcome of this work is to quantify the importance of caching and retries in recursives to resilience, encouraging the use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on paper drafts.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements number FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES
[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report, 2012 Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from Using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188–200.
[9] DNS OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181 Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary Of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC BIND. 2018. Chapter 6: BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report | Github Engineering. https://githubengineering.com/ddos-incident-report/
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency-02. Internet Draft. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain names - concepts and facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 2065, 2181, 2308, 2535, 4033, 4034, 4035, 4343, 4035, 4592, 5936, 8020.
[21] P.V. Mockapetris. 1987. Domain names - implementation and specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604, 7766.
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS". (May 2018). Web page. https://ant.isi.edu/datasets/dns/Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366
[28] NLnet Labs. 2018. Unbound Documentation - unbound.conf(5). https://nlnetlabs.nl/documentation/unbound/unbound.conf
[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An overview of DNS-based server selections in content distribution networks. Computer Networks 43, 6 (2003), 695–711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows
[36] Alec Peterson. 2017. EC2 resolver changing TTL on DNS answers. Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy in a Few Easy Steps. https://quad9.net
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671; TTL1800: 10507676; TTL3600: 10536725; TTL86400: 10579327; TTL3600-10min: 10581463; A: 10859822; B: 11102436; C: 11221270; D: 11804500; E: 11831403; F: 11831403; G: 12131707; H: 12177478; I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas: Raw data structure documentation. https://atlas.ripe.net/docs/data_struct
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters: An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. Internet Request For Comments. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On Measuring the Client-side DNS Infrastructure. In Proceedings of the 2013 ACM Internet Measurement Conference. ACM, 77–90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl stats and data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A study of caching behavior with respect to root server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] M. Weinberg and D. Wessels. 2016. Review and analysis of attack traffic against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24, Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A High-performance Network Traffic Data Streaming Warehouse. In Network Operations and Management Symposium (NOMS), 2016 IEEE/IFIP. IEEE, 913–918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80–86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS?

A1 The Problem of Glue RecordsDNS employs glue records to identify name servers that are in theserved zone itself For example the Canadian national domain cais served from ca-serversca but to look up ca-serversca one mustknow where ca is To break this cycle records to identify thesesevers (ldquoglue recordsrdquo) are placed in the parent zone (the root in

this case) With these records in two places they may have differentTTLs We next evaluate which TTL takes priority in this case

If we ask the Root DNS servers for the NS records of ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net
;; AUTHORITY SECTION:
ca.    172800  IN  NS  any.ca-servers.ca.
ca.    172800  IN  NS  j.ca-servers.ca.
ca.    172800  IN  NS  x.ca-servers.ca.
ca.    172800  IN  NS  c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net.

Notice that the answer for the NS records is in the “AUTHORITY SECTION”; this signals that this answer is a “referral”, a type of answer in which “a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query” [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for ca, a recursive can ask the same query directly to any of the ca servers. Listing 2 shows the result, an authoritative answer:

$ dig ns ca @any.ca-servers.ca
;; ANSWER SECTION:
ca.    86400  IN  NS  c.ca-servers.ca.
ca.    86400  IN  NS  j.ca-servers.ca.
ca.    86400  IN  NS  x.ca-servers.ca.
ca.    86400  IN  NS  any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca.

Notice that the records in this response from any.ca-servers.ca are presented in the “ANSWER SECTION” (and not in the “AUTHORITY SECTION”), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTLs are different: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).

We refer to this as TTL duplication: the same record is defined both at the authoritative servers and at the respective parent zone's servers, as a glue record.

This duplication poses a challenge for recursive resolvers: which value should they trust? The one from the glue, or the one from the authoritative name server?

In theory, the answer from the authoritative should be the one used, since those are the servers that are authoritative for the zone; the Roots in this example are not “completely” authoritative [15]. This behavior can be observed in various zones as well, not only at the Roots.

Assuming that the records are the same, as for ca, in this section we investigate in the wild which values resolvers actually use: those provided by the parent or those provided by the authoritative. According to RFC 2181 [10] (§5.4.1), a recursive should use the authoritative answer.

Given the importance of TTLs to DNS resilience (cache duration), we investigate in this section which values recursives trust: the parent or the authoritative name server.


                  NS record   A record   Source
Total answers     128382      98038
TTL > 3600        38          46         Unclear
TTL = 3600        260         233        Parent
60 < TTL < 3600   6890        4621       Parent/others
TTL = 60          60803       46820      Authoritative
TTL < 60          60391       46318      Authoritative

Table 5: Distribution of TTLs of DNS answers for cachetest.nl.

A records of amazon.com:
domain        IP               TTL
amazon.com    205.251.242.103  60
amazon.com    176.32.103.205   60
amazon.com    176.32.98.166    60

NS records of amazon.com:
domain        NS                     TTL-auth  TTL-.com
amazon.com    ns1.p31.dynect.net     3600      172800
amazon.com    pdns6.ultradns.co.uk   3600      172800
amazon.com    ns2.p31.dynect.net     3600      172800
amazon.com    ns4.p31.dynect.net     3600      172800
amazon.com    pdns1.ultradns.net     3600      172800
amazon.com    ns3.p31.dynect.net     3600      172800

A records of the NS records:
domain                 IP             TTL
ns1.p31.dynect.net     208.78.70.31   86400
pdns6.ultradns.co.uk   204.74.115.1   3600
ns2.p31.dynect.net     204.13.250.31  86400
ns4.p31.dynect.net     204.13.251.31  86400
pdns1.ultradns.net     204.74.108.1   3600
ns3.p31.dynect.net     208.78.71.31   86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments

We use the same setup as in §3 and configure the TTL values of our name servers (ns[1-2].cachetest.nl) as follows:

• Referral value (at the nl zone): TTL of 3600 s.
• Answer value (at the authoritative servers, ns[1-2].cachetest.nl): TTL of 60 s.

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of TTL values of the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the ones provided by the parent nl server. As such, we can conclude from these two experiments that, for A and NS records, most recursives will forward to their clients the TTL provided by the authoritatives of a zone.
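This distinction is easy to check from any client; a brief sketch (output abbreviated and illustrative, using our zone name):

$ dig ns cachetest.nl
;; ANSWER SECTION:
cachetest.nl.  60  IN  NS  ns1.cachetest.nl.

A recursive honoring the parent's referral TTL instead would report a value near 3600 s rather than 60 s.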

A.3 Looking at a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives will serve their clients with the TTL provided by the authoritative and not the one from referrals.

In this section we look into the caching of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives and 172800 s at the parent com servers, as can be seen in Table 6.

To determine which specific record two popular recursive implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send only one query: dig ns amazon.com. We then dump the contents of the caches of these servers and see which values they store.

Listing 3 and Listing 4 show the contents of the caches of both BIND and Unbound after sending the aforementioned query. As can be seen, the values stored in the cache are the ones provided by the authoritative server (decremented by the few seconds that passed before we dumped the contents), and not the ones provided by the parent's referral.

; authanswer
amazon.com.  3595  NS  ns1.p31.dynect.net.
amazon.com.  3595  NS  ns2.p31.dynect.net.
amazon.com.  3595  NS  ns3.p31.dynect.net.
amazon.com.  3595  NS  ns4.p31.dynect.net.
amazon.com.  3595  NS  pdns1.ultradns.net.
amazon.com.  3595  NS  pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com.

;rrset 3594 6 0 7 3
amazon.com.  3594  IN  NS  pdns1.ultradns.net.
amazon.com.  3594  IN  NS  ns1.p31.dynect.net.
amazon.com.  3594  IN  NS  ns3.p31.dynect.net.
amazon.com.  3594  IN  NS  pdns6.ultradns.co.uk.
amazon.com.  3594  IN  NS  ns4.p31.dynect.net.
amazon.com.  3594  IN  NS  ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com.
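Such dumps can be produced with each implementation's standard control utilities; for example:

    rndc dumpdb -cache            # BIND: writes the cache to its dump file
    unbound-control dump_cache    # Unbound: prints the cache to stdout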

A.4 Discussion and Implications

Understanding how glue records affect TTLs is important: it shows that zone operators are in fact in charge of how long their NS records will remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers: queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of the 60 s TTL (where all queries go to the authoritative), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for “public” and “dns” on 2018-01-06.

198.101.242.72 Alternate DNS
23.253.163.53 Alternate DNS
205.204.88.60 BlockAid Public DNS (or PeerDNS)
178.21.23.150 BlockAid Public DNS (or PeerDNS)
91.239.100.100 Censurfridns
89.233.43.71 Censurfridns


[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time, in answers vs. minutes after start. (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s (1 hour), probing interval T=10 min.]

2001:67c:28a4:: Censurfridns
2002:d596:2a92:1:71:53:: Censurfridns
213.73.91.35 Chaos Computer Club Berlin
209.59.210.167 Christoph Hochstätter
85.214.117.11 Christoph Hochstätter

212.82.225.7 ClaraNet
212.82.226.212 ClaraNet
8.26.56.26 Comodo Secure DNS
8.20.247.20 Comodo Secure DNS
84.200.69.80 DNS.Watch
84.200.70.40 DNS.Watch
2001:1608:10:25::1c04:b12f DNS.Watch
2001:1608:10:25::9249:d69b DNS.Watch
104.236.210.29 DNSReactor
45.55.155.25 DNSReactor
216.146.35.35 Dyn
216.146.36.36 Dyn
80.67.169.12 FDN
2001:910:800::12 FDN
85.214.73.63 FoeBud
87.118.111.215 FoolDNS
213.187.11.62 FoolDNS
37.235.1.174 FreeDNS
37.235.1.177 FreeDNS
80.80.80.80 Freenom World
80.80.81.81 Freenom World
87.118.100.175 German Privacy Foundation e.V.
94.75.228.29 German Privacy Foundation e.V.
85.25.251.254 German Privacy Foundation e.V.
62.141.58.13 German Privacy Foundation e.V.
8.8.8.8 Google Public DNS
8.8.4.4 Google Public DNS
2001:4860:4860::8888 Google Public DNS
2001:4860:4860::8844 Google Public DNS
81.218.119.11 GreenTeamDNS
209.88.198.133 GreenTeamDNS
74.82.42.42 Hurricane Electric
2001:470:20::2 Hurricane Electric
209.244.0.3 Level3
209.244.0.4 Level3
156.154.70.1 Neustar DNS Advantage
156.154.71.1 Neustar DNS Advantage
5.45.96.220 New Nations
185.82.22.133 New Nations
198.153.192.1 Norton DNS
198.153.194.1 Norton DNS
208.67.222.222 OpenDNS
208.67.220.220 OpenDNS
2620:0:ccc::2 OpenDNS
2620:0:ccd::2 OpenDNS
58.6.115.42 OpenNIC
58.6.115.43 OpenNIC
119.31.230.42 OpenNIC
200.252.98.162 OpenNIC
217.79.186.148 OpenNIC
81.89.98.6 OpenNIC
78.159.101.37 OpenNIC
203.167.220.153 OpenNIC
82.229.244.191 OpenNIC
216.87.84.211 OpenNIC
66.244.95.20 OpenNIC
207.192.69.155 OpenNIC
72.14.189.120 OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9 OpenNIC
2001:470:1f07:38b::1 OpenNIC
2001:470:1f10:c6::2001 OpenNIC
194.145.226.26 PowerNS


[Figure 14: Answers received during DDoS attacks (extra graphs); answers (OK, SERVFAIL, no answer) vs. minutes after start. (a) Experiment D (1800-50p-10min-ns1); arrows indicate DDoS start and recovery. (b) Experiment G (300-75p-10min-2NSes); arrows indicate DDoS start and recovery.]

[Figure 15: Latency results (extra graphs; median, mean, 75%ile, and 90%ile RTT in ms vs. minutes after start); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment D: 50% packet loss on one NS only (1800 s TTL). (b) Experiment G: 75% packet loss (300 s TTL).]

77.220.232.44 PowerNS
9.9.9.9 Quad9
2620:fe::fe Quad9
195.46.39.39 SafeDNS
195.46.39.40 SafeDNS
193.58.251.251 SkyDNS
208.76.50.50 SmartViper Public DNS
208.76.51.51 SmartViper Public DNS
78.46.89.147 ValiDOM
88.198.75.145 ValiDOM
64.6.64.6 Verisign
64.6.65.6 Verisign
2620:74:1b::1:1 Verisign
2620:74:1c::2:2 Verisign
77.109.148.136 Xiala.net
77.109.148.137 Xiala.net
2001:1620:2078::136 Xiala.net
2001:1620:2078::137 Xiala.net
77.88.8.8 Yandex.DNS
77.88.8.2 Yandex.DNS
2a02:6b8::feed:bad Yandex.DNS
2a02:6b8:0:1::feed:bad Yandex.DNS
109.69.8.51 puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all the experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for Experiments D and G.

Figure 14 shows the number of answers for Experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when one NS suffers 50% loss, there is no significant change in the number of answered queries. Moreover, most users will not notice it in terms of latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (thus half of the probing interval). Similarly to Experiment I, this experiment should not have caching active, given the short TTL. Moreover, it has a packet loss rate of 75%. At this rate, we can see that the far majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur in current DNS software. These experiments confirm that prior work by Yu et al. [56] still holds. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) up, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from .net and querying one or both based on their internal algorithms. All queries are made with a cold cache. In total, we send 100 queries per experiment; we present the results in this section.


[Figure 17: Resolution infrastructure of probe 28477 in dataset I: the probe's stub resolver queries three local recursives (r1a, r1b, r1c); each r1 is connected to all of the last-layer recursives rn1–rn8, which are connected to both authoritative servers at1 and at2.]

[Figure 16: Histogram of queries from recursive resolvers (BIND and Unbound, under normal operation and during DDoS), broken out by queries to the root, .net, and cachetest.net authoritatives.]


We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the .net authoritative servers, and the authoritative servers for cachetest.net.

Figure 16 shows the results of this experiment, with normal behavior in the left two groups and the inaccessible (DDoS) case in the right two. From this figure we see that in normal operation a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the .net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the Root, one to .net, and one to the cachetest.net authoritative servers) to resolve the AAAA record of sub.cachetest.net.

Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for the AAAA records of their NS names, which do not exist (this difference can be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure: both BIND and Unbound send far more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (9.2× its 5 queries under normal operations). Unbound keeps on trying to resolve the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 of the 46 queries). If we disregard these queries, Unbound increases the total number of queries from 6 to 16, roughly a 2.7× increase. (We repeated this experiment 100 times and all results were consistent.)

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND will ask both parents (the Root and .net) for its records again (shown by the increase in the root and .net portions of BIND DDoS vs. BIND non-DDoS in Figure 16). Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect—recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last layer" recursives (the ones that ultimately send queries to authoritatives, Rn1–Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We analyze data collected at both the client side (RIPE Atlas probe) and the server side (authoritative servers). We show a summary of the results in Table 7. In this table, the probing intervals T in which the simulated DDoS was active are those of dataset I (90% packet drop on both authoritative servers; see Table 4).

Client Side: Analyzing this table, we can learn the following from the client side.

During normal operations we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1's used in the answers). By default, each RIPE Atlas probe sends a DNS query to each R1.

During the DDoS this changes slightly: we see 2 answers from 2 R1's—the other either times out or returns SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval).

This is, in fact, a good result given the simulated DDoS.

Authoritative Side: Now let us consider the authoritative side.


     Client view           Authoritative server view
T    Qry  Ans  R1s    Qry  Ans  ATs  Rn  Rn-AT  Max Que(Rn-AT) (% total)  Queries (top-2 Rn's)
1    3    3    3      3    3    2    2   3      1  (33.3%)                2, 1
2    3    3    3      5    5    2    4   5      1  (20.0%)                2, 1
3    3    3    3      6    6    2    6   6      1  (16.7%)                1, 1
4    3    3    3      5    5    2    2   3      2  (40.0%)                4, 1
5    3    3    3      3    3    2    3   3      1  (33.3%)                1, 1
6    3    3    3      6    6    2    5   5      2  (33.3%)                2, 1
7    3    2    2      29   3    2    2   4      10 (34.5%)                15, 14
8    3    2    2      22   6    2    7   9      7  (31.8%)                10, 7
9    3    2    2      21   1    2    3   6      9  (42.9%)                16, 3
10   3    2    2      11   4    2    3   4      6  (54.5%)                8, 2
11   3    2    2      21   3    2    4   6      15 (71.4%)                17, 2
12   3    2    2      23   1    2    8   8      15 (65.2%)                15, 2
13   3    3    3      3    3    2    3   3      1  (33.3%)                1, 1
14   3    3    3      4    2    2    4   4      1  (25.0%)                1, 1
15   3    3    3      8    8    2    7   7      7  (87.5%)                1, 1
16   3    3    3      9    9    2    3   3      6  (66.7%)                6, 3
17   3    3    3      9    9    2    5   5      5  (55.6%)                5, 1

Table 7: Client and authoritative view of AAAA queries (probe 28477, measurement I).

In normal operation (before the DDoS), we see that the 3 queries sent by the client (client view) lead to 3 to 6 queries being sent by the Rn, which reach the authoritatives. Every query is answered, and both authoritatives are used (column ATs).

We also see that the number of Rn used before the attack varies from 2 to 6 (column Rn). This choice depends on how the R1's choose their upper-layer recursives (Rn−1) and, ultimately, on how Rn−1 chooses Rn.

We can also observe how each Rn chooses each authoritative (column Rn-AT): this metric shows the number of unique combinations of Rn and authoritatives observed. For example, for T = 1 we see that three queries reach the authoritatives, but only 2 Rn were used. By looking at the number of unique recursive–authoritative combinations, we see that there are three. As a consequence, one of the Rn must have used both authoritatives.

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11–29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) Increase in the number of Rn used: the number of Rn used increases slightly (from 3.6 on average before the DDoS to 4.5 during it). (However, not all Rn are used all the time during the DDoS. We do not have enough data to explain why, but it surely depends on how the lower-layer recursives choose them, or on how they are configured.)

(2) Increase in the number of authoritatives used by each Rn: each Rn, in turn, contacts an average of 1.13 authoritatives before the attack (the average of Uniq(Rn-AT)/Uniq(Rn)). During the DDoS, this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top 2 recursives (column Queries top-2 Rn's) and see that, compared to the total number of queries, they comprise the large majority.

For this probe, the explanation for the many queries is thus a combination of factors, with retries by the Rn being the most important one. From the client side, there is no strong indication that the client's queries are causing such an increase in the number of queries to authoritatives during a DDoS.
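To make these three metrics concrete, the sketch below (our illustration in Python, not the analysis code behind Table 7; the sample tuples are hypothetical) computes the unique Rn, the unique Rn–AT combinations, and the top-2 recursives' query counts from one (recursive, authoritative) pair per observed query:

from collections import Counter

# Hypothetical queries seen at the authoritatives in one probing
# interval T: one (recursive, authoritative) pair per query.
queries = [("rn1", "at1"), ("rn1", "at2"), ("rn1", "at1"),
           ("rn2", "at1"), ("rn2", "at2"), ("rn1", "at2")]

per_rn = Counter(rn for rn, _ in queries)   # queries sent by each Rn
per_combo = Counter(queries)                # queries per (Rn, AT) pair

max_combo = max(per_combo.values())
top2 = [count for _, count in per_rn.most_common(2)]

print("Rn:", len(per_rn), "Rn-AT:", len(per_combo))
print("max: %d (%.1f%% of total)" % (max_combo, 100.0 * max_combo / len(queries)))
print("top-2 Rn queries:", top2)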



Experiment parameters:

Exp  TTL (s)  DDoS start  DDoS dur.  queries before  total dur.  probe interval  failure
A    3600     10          60         1               120         10              100% (both NSes)
B    3600     60          60         6               240         10              100% (both NSes)
C    1800     60          60         6               180         10              100% (both NSes)
D    1800     60          60         6               180         10              50% (one NS)
E    1800     60          60         6               180         10              50% (both NSes)
F    1800     60          60         6               180         10              75% (both NSes)
G    300      60          60         6               180         10              75% (both NSes)
H    1800     60          60         6               180         10              90% (both NSes)
I    60       60          60         6               180         10              90% (both NSes)

Results:

Exp  Total probes  Valid probes  VPs    Queries  Total answers  Valid answers
A    9224          8727          15339  136423   76619          76181
B    9237          8827          15528  357102   293881         292564
C    9261          8847          15578  258695   199185         198197
D    9139          8708          15332  286231   273716         272231
E    9153          8708          15320  285325   270179         268786
F    9141          8727          15325  278741   259009         257740
G    9206          8771          15481  274755   249958         249042
H    9226          8778          15486  269030   242725         241569
I    9224          8735          15388  253228   218831         217979

Table 4: DDoS emulation experiments [38]. DDoS start, durations, and probe interval are given in minutes.

[Figure 6: Answers received during DDoS attacks; each panel plots answers (OK, SERVFAIL, no answer) vs. minutes after start. (a) Experiment A (3600-10min-1down); arrows indicate DDoS start and cache expiration. (b) Experiment B (3600-10min-1down-1up); arrows indicate DDoS start and recovery. (c) Experiment C (1800-10min-1down-1up); arrows indicate DDoS start, cache expiration, and recovery.]

[Figure 7: Timeseries of answer types (AA, CC, CA) for Experiment B.]

case, we begin warming the cache one hour before the attack and query 6 times from each VP. Other parameters are the same, with the attack lasting for 60 minutes (also the cache duration), but then we restore the authoritatives to service.

Figure 6b shows the results of Experiment B. While about 50% of VPs are served from the cache in the first 10-minute round after the DDoS starts, the fraction served drops quickly and is at only about 3% one hour later. Three factors are at play here: first, most caches were filled 60 minutes before the attack and are timing out in the first round. While the timeout and query rounds are both 60 minutes apart, Atlas intentionally spreads queries out over 5 minutes, so we expect some queries to happen after 59 minutes and others after 61 minutes.

Second, we know some large recursives have fragmented caches (§3.5), so we expect that some of the successes between times 70 and 110 minutes are due to caches that were filled between times 10 and 50 minutes. This can actually be seen in Figure 7, where we show a timeseries of the answers for Experiment B; we see CC (correct cache responses) between times 60 and 90.

Third, we see an increase in the number of CA queries, which are answered by the cache with expired TTLs (Figure 7). This increase is due to servers serving stale content [19].

Caches Eventually All Expire: Finally, we carry out a third emulation, but with half the cache lifetime (1800 s, or 30 minutes, rather than the full hour). Figure 6c shows responses over time. These results are similar to Experiment B, with a rapid fall-off when the attack starts as caches age. After the attack has been underway for 30 minutes, all caches must have expired, and we see only a few (about 2.6%) residual successes.


5.3 Discussion of Complete Failures

Overall, we see that caching is partially successful in protecting during a DDoS. With full, valid caches, half or more of the VPs get service. However, caches are filled at different times and expire, so an operator cannot count on a full cache duration for any customer, even for popular ("always in the cache") domains. The protection provided by caches depends on their state in the recursive resolver, something outside the operator's control. In addition, our evaluation of caching in §3 showed that caches will end early for some VPs.

Second, we were surprised that a tiny fraction of VPs are successful after all caches should have timed out (after the 80-minute period in Experiment A, and between 90 and 110 minutes in Experiment C). These successes suggest an early deployment of "serve stale", something currently under review in the IETF [19]: serving a previously known record beyond its TTL if the authoritatives are unreachable, with the goal of improving resilience under DDoS. We investigated Experiment A, where we see that 1048 answers account for most of the 1140 successes in the second half of the outage. These successes are from 471 VPs (and 215 recursives), most of them answered by OpenDNS and Google public DNS servers, suggesting experimentation that is not yet widespread. Of these 1048 queries, 1031 return a TTL value equal to 0, as specified in the IETF serve-stale draft [19].
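The observed behavior matches the core idea of serve-stale, which can be sketched as follows. This is our illustration of the mechanism in the draft [19], not an actual resolver implementation; the cache layout and the failure test are assumptions.

import time

cache = {}  # qname -> (record, absolute expiry time)

def lookup(qname, query_authoritative):
    """Resolve qname, serving stale data if the authoritatives fail."""
    now = time.time()
    entry = cache.get(qname)
    if entry is not None and now < entry[1]:      # fresh cache hit
        return entry[0], int(entry[1] - now)
    try:
        record, ttl = query_authoritative(qname)  # normal resolution
        cache[qname] = (record, now + ttl)
        return record, ttl
    except TimeoutError:
        if entry is not None:                     # authoritatives down:
            return entry[0], 0                    # serve stale with TTL=0 [19]
        raise                                     # nothing cached: fail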

5.4 Client Reliability During Partial Authoritative Failure

The previous section examined DDoS attacks that result in complete failure of all authoritatives, but DDoS attacks often result in partial failure, with 50% or 90% packet loss at the authoritatives. (For example, consider the November 2015 DDoS attack on the DNS Root [23].) We next study experiments with partial failures, showing that caching and retries together nearly fully protect against 50% DDoS events, and protect half of the VPs even during 90% events.

We carry out several experiments, D to I in Table 4. We follow the procedure outlined in §5.1, looking at DDoS-driven loss rates of 50%, 75%, and 90%, with TTLs of 1800 s, 300 s, and 60 s. Graphs omitted due to space can be found in Appendix D.

Near-Full Protection from Caches During Moderate Attacks: We first consider Experiment E, a "mild" DDoS with 50% loss, with VP success over time in Figure 8a. In spite of a loss rate that would be crippling to TCP, nearly all VPs are successful in DNS. This success is due to two factors: first, we know that many clients are served from caches, as was shown in Experiment A with full loss (Figure 6a). Second, most recursives retry queries, so they recover from the loss of a single packet and are able to provide an answer. Together, these mean that the failure rate during the first 30 minutes of the event is 8.5%, only slightly higher than the 4.8% failure rate before the DDoS. For this experiment the TTL is 1800 s (30 minutes), so we might expect failures to increase halfway through the DDoS. We do not see any increase in failures because caching and retries are synergistic: a successful retried query will place the answer in a cache for a later query. The importance of this result is that DNS can survive moderate-size attacks when caching is possible. While a positive, retries do increase latency, something we study in §5.5.
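This synergy can be illustrated with a small simulation (ours, with assumed parameters: three tries per query round and a cache that holds an answer for three probing rounds; it ignores multi-level recursives and timeouts):

import random

def simulate(loss, tries=3, rounds=6, ttl_rounds=3, runs=10000):
    """Fraction of query rounds answered when any successful retry
    refills a cache that lasts ttl_rounds rounds."""
    answered = total = 0
    for _ in range(runs):
        cached_until = 0
        for r in range(rounds):
            total += 1
            if r < cached_until:                  # served from cache
                answered += 1
            elif any(random.random() > loss for _ in range(tries)):
                cached_until = r + ttl_rounds     # a retry got through: refill
                answered += 1
    return answered / total

for loss in (0.5, 0.75, 0.9):
    print("%.0f%% loss: %.1f%% answered" % (100 * loss, 100 * simulate(loss)))

Even this toy model shows the effect qualitatively: retries alone let a round fail with probability loss^tries, and every success additionally shields later rounds through the cache.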

Attack Intensity Matters: While clients do quite well with 50% loss at all authoritatives, failures increase with the intensity of the attack.

[Figure 8: Answers received during DDoS attacks (OK, SERVFAIL, no answer); the first and second vertical lines show the start and end of the DDoS. (a) Experiment E (1800-50p-10min), 50% packet loss. (b) Experiment F (1800-75p-10min), 75% packet loss. (c) Experiment H (1800-90p-10min), 90% packet loss. (d) Experiment I (60-90p-10min), 90% packet loss.]


Experiments F and H, shown in Figure 8b and Figure 8c, increase the loss rate to 75% and 90%. We see that the number of failures increases to about 19.0% with 75% loss and 40.3% with 90% loss. It is important to note that roughly 60% of the clients are still served even with 90% loss.

We also see that this level of success is consistent over the entire hour-long DDoS event, even though the cache duration is only 30 minutes. This consistency confirms the importance of caching and retries in combination.

To verify the effects of this interaction, Experiment I changes the caching duration to 60 s, less than one round of probing. Comparing Experiment I in Figure 8d to H in Figure 8c, we see that the failure rate increases from 30% to about 63%. However, even with no caching, about 37% of queries are still answered, due to resolvers that serve stale content and to recursive retries. We investigate retries in §6.

5.5 Client Latency During Partial Authoritative Failure

We showed that client reliability is higher than expected during failures (§5.4), due to a combination of caching and retries. We next consider client latency. Latency will increase during the DDoS because of retries and queueing delay, but we will show that latency increases less than one might expect, due to caching.

To examine latency we return to Experiments D through I (Table 4), but look at latency (time to complete a query) rather than success. For these experiments, clients time out after 5 s.

Figures 9a to 9d show latency during each emulated DDoS scenario (experiments whose figures are omitted here are in Appendix D). Latencies are not evenly distributed, since some requests get through immediately while others must be retried one or more times, so in addition to the mean we show the 50%, 75%, and 90% quantiles to characterize the tail of the distribution.
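For reference, the per-bin statistics plotted in Figure 9 can be computed from raw query latencies as below (a sketch with made-up sample values; statistics.quantiles requires Python 3.8 or later):

import statistics

# Hypothetical latencies (ms) of answered queries in one 10-minute bin;
# unanswered queries (5 s timeout) are excluded beforehand.
latencies = [25, 30, 32, 35, 40, 220, 480, 900, 1800, 3100]

pct = statistics.quantiles(latencies, n=100)  # 1st..99th percentile cuts
print("median:", statistics.median(latencies))
print("mean: %.0f" % statistics.mean(latencies))
print("75th: %.0f  90th: %.0f" % (pct[74], pct[89]))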

We emulate DDoS by dropping requests (§5.1), and hence latencies reflect retries and loss but not queueing delay, underrepresenting latency in real-world attacks. However, their shape (mostly low latencies plus a few long ones) is consistent with, and helps explain, what has been seen in the past [23].

Beginning with Experiment E, the moderate attack in Figure 9a, we see no change to median latency. This result is consistent with many queries being handled by the cache, and half of those not handled by the cache getting through anyway. We do see higher latency in the 90%ile tail, reflecting successful retries. This tail also increases the mean somewhat.

This trend grows in Experiment F (Figure 9b), where 75% of queries are lost. Now the 75%ile tail has increased, as has the number of unanswered queries, and the 90%ile is twice as long as in Experiment E.

We see similar latency in Experiment H, with the DDoS causing 90% loss. We set the timeouts to 5 s, so the larger attack results in more unsuccessful queries, but latency for successful queries is not much worse than with 75% loss. Median latency is still low, due to cached replies.

Finally, Experiment I greatly reduces the opportunities for caching by reducing the cache lifetime to one minute. Figure 9d shows that the loss of caching increases the median RTT and significantly increases the tail latency.

[Figure 9: Latency results (median, mean, 75%ile, and 90%ile RTT vs. minutes after start); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment E, 50% packet loss (1800 s TTL). (b) Experiment F, 75% packet loss (1800 s TTL). (c) Experiment H, 90% packet loss (1800 s TTL). (d) Experiment I, 90% packet loss (60 s TTL).]


Compared with Figure 9c (same packet loss ratio, but 1800 s TTL), we can clearly see the benefits of caching in terms of latency (in addition to reliability): a half-hour TTL reduced the latency from 1300 ms to 390 ms. Longer TTLs also help reduce tail latency relative to shorter TTLs (compare, for example, the 90%ile RTT of Experiments I and H in Figure 9).

Summary: DDoS events often increase client latency. For moderate attacks, increased latency is seen only by a few "unlucky" clients who do not see a full cache and whose queries are lost. Caching plays an important role in reducing latency during a DDoS, but while it can often mitigate most reliability problems, it cannot avoid latency penalties for all VPs. Even when caching is not available, roughly 40% of clients get an answer, either through serve-stale or through retries, as we investigate next.

6 THE AUTHORITATIVE'S PERSPECTIVE

Results of partial DDoS events (§5.4) show that DNS is surprisingly reliable—even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of the VPs get answers with 30-minute caches (Figure 8c), and about 40% of the VPs get answers even with minimal-duration caches (Figure 8d). These results are due to a combination of caching and retries. We next examine this from the perspective of the authoritative server.

6.1 Recursive-Authoritative Traffic During a DDoS

We first ask: are retries by recursive resolvers responsible for the success rates observed in §5.4? To investigate this question, we return to the partial DDoS experiments and look at how many queries are sent to the authoritative servers. We measure queries before they are dropped by our simulated DDoS. Recursives must make multiple queries to resolve a name, so we break out each type of query: for the nameservers (NS), for the nameservers' IPv4 and IPv6 addresses (A-for-NS and AAAA-for-NS), and finally the desired query (AAAA-for-PID). Note that the authoritative is IPv4-only, so AAAA-for-NS records are non-existent and subject to negative caching, while the other records exist and use regular caching.

We begin with the DDoS causing 75% loss, in Figure 10a. For this experiment we observe 18407 unique IP addresses of recursives (Rn) querying for AAAA records directly at our authoritatives. During the DDoS, queries increase by about 3.5×. We expect about 4 trials, since the expected number of tries until success with loss rate p is (1 − p)^−1. For this scenario, results are cached for up to 30 minutes, so successful queries are reused from recursive caches. This increase occurs both for the target AAAA record and for the non-existent AAAA-for-NS records. Negative caching for our zone is configured to 60 s, making caching of NXDOMAINs for AAAA-for-NS less effective than positive caches.
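The expected-tries arithmetic is simple enough to check directly; the snippet below just evaluates (1 − p)^−1 for the loss rates we emulate:

# Tries until first success are geometrically distributed with success
# probability (1 - p), so the expected count is 1 / (1 - p).
for p in (0.50, 0.75, 0.90):
    print("loss %.0f%%: expected tries = %.1f" % (100 * p, 1 / (1 - p)))
# loss 50%: expected tries = 2.0
# loss 75%: expected tries = 4.0
# loss 90%: expected tries = 10.0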

The offered load on the server increases further with more loss (90%), as shown in Experiment H (Figure 10b). The higher loss rate results in a much higher offered load on the server: on average, 8.2× normal.

Finally, in Figure 10c, we reduce the effects of caching with a 90% DDoS and a TTL of 60 s. Here we also see about 8.1× more queries at the server than before the attack. Comparing this case to Experiment H, caching reduces the offered load on the server by about 40%.

[Figure 10: Number of queries received by the authoritative servers (NS, A-for-NS, AAAA-for-NS, AAAA-for-PID); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment F (1800-75p-10min), 75% packet loss. (b) Experiment H (1800-90p-10min), 90% packet loss. (c) Experiment I (60-90p-10min), 90% packet loss.]

Implications: The implication of this analysis is that legitimate clients "hammer" the already-stressed server with retries during a DDoS. For clients, retries are important for reliability, and each client independently chooses to retry.

The server is already under stress due to the DDoS, so these retries add to that stress. However, the DDoS traffic is almost certainly much larger than the retries of legitimate traffic. (A server experiencing a volumetric attack causing 90% loss must be receiving 10× its capacity. Regular traffic is a small fraction of normal capacity, so even 4× regular traffic is still much less than the attack traffic.) The multiplier for retried legitimate traffic depends on the stub and recursive resolver implementations, as well as on application-level retries and defection (users hitting reload in their browsers, and later giving up). Our experiment omits application-level retries and likely gives a lower bound.


We next examine specific recursive implementations to see their behavior.
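A back-of-the-envelope comparison makes this concrete; the fraction of server capacity used by legitimate traffic below is an assumption for illustration only:

capacity = 1.0                        # server capacity (normalized)
attack_load = 10.0 * capacity         # offered load implying 90% loss
legit_normal = 0.1 * capacity         # assumed normal legitimate load
legit_retrying = 8.0 * legit_normal   # largest retry multiplier we observed

print("attack: %.1f, legitimate with retries: %.1f" % (attack_load, legit_retrying))
print("retried legitimate traffic is %.0f%% of attack volume" %
      (100 * legit_retrying / attack_load))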

6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers "hammer" authoritatives when queries are dropped. We reexamine DNS software (updating work from 2012 [56]), and additionally show that deployments amplify retries.

Recursive Software: Prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], with an evaluation of BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS, and PowerDNS. We studied retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records of all authoritatives and of the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7× more requests before the recursives time out. (Exact numbers vary across runs, but typically each request is made 6 or 7 times.) Such retries are appropriate provided they are paced (both use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.
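Pacing is what makes such retries acceptable. The sketch below shows the general shape of a paced retry loop with exponential backoff; it is our illustration, not BIND's or Unbound's actual algorithm, and the constants are assumptions:

import random

def query_with_backoff(send_query, max_tries=7, base_timeout=0.8):
    """Retry send_query(timeout) with exponentially growing timeouts;
    send_query is assumed to return an answer or raise TimeoutError."""
    for attempt in range(max_tries):
        timeout = base_timeout * (2 ** attempt)
        timeout *= random.uniform(0.9, 1.1)  # jitter desynchronizes clients
        try:
            return send_query(timeout=timeout)
        except TimeoutError:
            continue                         # next try waits ~2x longer
    raise TimeoutError("all tries exhausted")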

Recursive Deployment: Another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex, multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify this amplification, we count both the number of Rn recursives and the number of AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values capture the amplification in two ways: during stress, more Rn recursives are used for each probe ID, and these Rn generate more queries to the already-stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4). The maximum rises to 39. The number of queries for each probe ID grows more than 3×, from 2 to 7. Worse, the 90%ile grows more than 6× (from 3 queries to 18). The maximum grows 53.5×, reaching up to 286 queries for one single probe ID. This value, however, is a lower bound, given that a large number of A and AAAA queries ask for NS records and not for the probe ID (AAAA- and A-for-NS in Figure 10).

We can also look at the aggregate effects of retries created by the complex recursive infrastructure. Figure 12 shows a timeseries of unique IP addresses of Rn observed at the authoritatives. Before the DDoS period, for Experiment I with its TTL of 60 s, we see a constant number of recursives reaching our authoritatives; i.e., all queries should be answered by authoritatives (there is no caching at this TTL value). For Experiments F and H, both with a TTL of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks occur when caches expire, as expected.

During the DDoS we observe similar behavior for all three experiments in Figure 12: as packets are dropped at the authoritatives (at rates of 75%, 90%, and 90% for F, H, and I, respectively), the number of Rn recursives querying our authoritatives increases. For Experiments F and H we see drops when caching is expected, but not for Experiment I.

[Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs (median, 90%ile, and maximum of Rn-per-PID and AAAA-for-PID over minutes after start).]

[Figure 12: Unique Rn recursive addresses observed at the authoritatives over time (Experiments F, H, and I).]

The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which amplifies the effect in the end. (We show this behavior for an individual probe in Appendix F, where we observe the growth both in the number of queries received at the authoritatives and in the number of recursives used.)

Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to stay within a constant factor of a single recursive, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and we encourage their operators to share information about their behavior.

7 RELATED WORK

Caching by Recursives: Several groups have shown that DNS caching can be imperfect. Hao and Wang analyzed the impact of nonce domains on DNS recursives' caches [13]. Using two weeks of data from two universities, they showed that filtering one-time domains improves cache hit rates. In two studies, Pang et al. [31, 32] reported that web clients and local recursives do not always honor the TTL values provided by authoritatives. Almeida et al. [2] analyzed DNS traces of a mobile operator, and used a mobile application to observe TTLs in practice. They find that most domains have short TTLs (less than 60 s), and report evidence of TTL manipulation by recursives. Schomp et al. [48] demonstrate widespread use of multi-level recursives by large operators, as well as TTL manipulation.


Our work builds on this prior work, examining caching and TTL manipulation systematically and considering their effects on resilience.

DNS Client Behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, finding that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work, and focus on authoritative selection during DDoS from several perspectives.

Authoritatives During DDoS: We previously investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast with both latency and reachability, as seen from RIPE Atlas. Rather than looking at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while the prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study had incomplete data and focused on specific results of two events. These differences stem from its study of natural experiments from real-world events versus our controlled experiments.

8 IMPLICATIONS

We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get through to authoritative servers. The key implication of our study is to explain differences in the outcomes of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative "server" implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some letters, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: "There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability..." [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure and immediately took steps to address the service problems, there were reports of user-visible service disruption in the technical and even the popular press [34]. Reports describe intermittent failure of prominent websites including "Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times", each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cacheable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports

922 anycast instances operating across the 13 authoritative servers. Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 "facilities") and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components whose TTLs range from 120 to 300 s.

In addition to explaining the effects, our experiments help get at the root causes behind these outcomes. Users of the Root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs like .com and country codes) are popular and certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or through a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) receive performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to let caching help and to give DNS operators time to deploy defenses; Experiment H suggests 30 minutes (Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow for re-provisioning during DDoS attacks on web servers, but that leaves the DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, combined with long DNS TTLs, may be preferable to short DNS TTLs, so that both layers can be robust. However, the complexity of the interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize the geographic dispersion of its sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with the largest latency [27]. Prior work on resilience to DDoS attacks has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic each site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be as resilient as the strongest individual authoritative. The reason for these opposite results is that, in both cases, recursive resolvers will try all authoritatives of a given service. For latency, they will sometimes choose a distant authoritative, but for resilience, they will continue until they find the most available authoritative.

9 CONCLUSIONS

This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole.


In fact, they can largely cover over partial DDoS attacks for many users—even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of the VPs get answers with 30-minute caches (Figure 8c), and about 40% of the VPs get answers even with minimal-duration caches (Figure 8d).

The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users, with a few seeing much greater latency while others see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruption, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of this work is to quantify the importance of caching and retries in recursives for resilience, encouraging the use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on drafts of this paper.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES
[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110.
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012, Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from Using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42.
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188–200.
[9] DNS-OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). https://doi.org/10.17487/RFC2181
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC. 2018. Chapter 6: BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report. GitHub Engineering. https://githubengineering.com/ddos-incident-report
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency-02. Internet Draft. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain Names—Concepts and Facilities. RFC 1034 (Internet Standard). https://doi.org/10.17487/RFC1034
[21] P.V. Mockapetris. 1987. Domain Names—Implementation and Specification. RFC 1035 (Internet Standard). https://doi.org/10.17487/RFC1035
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS" (May 2018). Web page. https://ant.isi.edu/datasets/dns/Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute (updated Sept. 2018). https://www.isi.edu/~johnh/PAPERS/Moura18a.html
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS. In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366
[28] NLnet Labs. 2018. NLnet Labs Documentation—Unbound—unbound.conf(5). https://nlnetlabs.nl/documentation/unbound/unbound.conf
[29] OpenDNS. 2018. Setup Guide. https://www.opendns.com/setupguide
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An Overview of DNS-based Server Selections in Content Distribution Networks. Computer Networks 43, 6 (2003), 695–711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows
[36] Alec Peterson. 2017. EC2 Resolver Changing TTL on DNS Answers? Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy In a Few Easy Steps. https://quad9.net
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671; TTL1800: 10507676; TTL3600: 10536725; TTL86400: 10579327; TTL3600-10min: 10581463; A: 10859822; B: 11102436; C: 11221270; D: 11804500; E: 11831403; F: 11831403; G: 12131707; H: 12177478; I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas—Raw Data Structure Documentation. https://atlas.ripe.net/docs/data_struct
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters—An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On Measuring the Client-Side DNS Infrastructure. In Proceedings of the ACM Internet Measurement Conference. ACM, 77–90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl Stats and Data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A Study of Caching Behavior with Respect to Root Server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang Up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] M. Weinberg and D. Wessels. 2016. Review and Analysis of Attack Traffic Against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24. Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A High-Performance Network Traffic Data Streaming Warehouse. In Network Operations and Management Symposium (NOMS) 2016. IEEE/IFIP, 913–918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80–86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS

Cache estimations in §3 and §4 assume each record has a single clear TTL. However, glue records bridge zones that are served by different authoritative servers and can result in two different TTL values for a record. This section examines which TTL takes precedence.

A.1 The Problem of Glue Records

DNS employs glue records to identify name servers that are in the served zone itself. For example, the Canadian national domain ca is served from ca-servers.ca, but to look up ca-servers.ca one must know where ca is. To break this cycle, records identifying these servers ("glue records") are placed in the parent zone (the root in this case).

With these records in two places, they may have different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net

;; AUTHORITY SECTION:
ca.    172800    IN    NS    any.ca-servers.ca.
ca.    172800    IN    NS    j.ca-servers.ca.
ca.    172800    IN    NS    x.ca-servers.ca.
ca.    172800    IN    NS    c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net.

Notice that the answer for the NS records is in the "AUTHORITY SECTION": this signals that this answer is a "referral", a type of answer "in which a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query" [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for ca, a recursive can ask the same query directly to any of the ca servers. Listing 2 shows the result, an authoritative answer:

$ dig ns ca @any.ca-servers.ca

;; ANSWER SECTION:
ca.    86400    IN    NS    c.ca-servers.ca.
ca.    86400    IN    NS    j.ca-servers.ca.
ca.    86400    IN    NS    x.ca-servers.ca.
ca.    86400    IN    NS    any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca.

Notice that the records in this response from any.ca-servers.ca are presented in the "ANSWER SECTION" (not in the "AUTHORITY SECTION"), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTLs differ: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).

We refer to this as TTL duplication: the same record is defined both at the zone's authoritative servers and at its parent, as a glue record.

This duplication poses a challenge for recursive resolvers: which value should they trust, the one from the glue or the one from the authoritative name server?

In theory, the answer from the authoritative should be used, since those servers are the ones authoritative for the zone; the Roots, in this example, are not "completely" authoritative [15]. This behavior can be observed in various other zones as well, not only at the Roots.

Assuming the records are the same, as for ca, we investigate in the wild which ones resolvers actually use: the values provided by the parent or by the authoritative. According to RFC 2181 [10], a recursive should use the authoritative answer (§5.4.1).

Given the importance of TTLs to DNS resilience (cache duration), we investigate in this section which values recursives trust: those from the parent or those from the authoritative name server.


                  NS record   A record   Source
Total answers     128382      98038
TTL > 3600        38          46         Unclear
TTL = 3600        260         233        Parent
60 < TTL < 3600   6890        4621       Parent/others
TTL = 60          60803       46820      Authoritative
TTL < 60          60391       46318      Authoritative

Table 5: Distribution of TTLs of DNS answers for cachetest.nl.

A records of amazon.com:
domain        IP               TTL
amazon.com    205.251.242.103  60
amazon.com    176.32.103.205   60
amazon.com    176.32.98.166    60

NS records of amazon.com:
domain        NS                     TTL-auth  TTL-.com
amazon.com    ns1.p31.dynect.net     3600      172800
amazon.com    pdns6.ultradns.co.uk   3600      172800
amazon.com    ns2.p31.dynect.net     3600      172800
amazon.com    ns4.p31.dynect.net     3600      172800
amazon.com    pdns1.ultradns.net     3600      172800
amazon.com    ns3.p31.dynect.net     3600      172800

A records of the NS records:
domain                 IP             TTL
ns1.p31.dynect.net     208.78.70.31   86400
pdns6.ultradns.co.uk   204.74.115.1   3600
ns2.p31.dynect.net     204.13.250.31  86400
ns4.p31.dynect.net     204.13.251.31  86400
pdns1.ultradns.net     204.74.108.1   3600
ns3.p31.dynect.net     208.78.71.31   86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments

We use the same setup as in §3 and configure the TTL values of our name servers (ns[1-2].cachetest.nl) as follows:

• Referral value (at the parent .nl zone): TTL of 3600 s.
• Answer value (at the authoritative servers ns[1-2].cachetest.nl): TTL of 60 s.

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of the TTL values in the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the one provided by the parent .nl server. From these two experiments we can therefore conclude that, for A and NS records, most recursives forward to their clients the TTL provided by the authoritatives of a zone.
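This comparison can be reproduced for any zone with a few lines of dnspython (a sketch; it assumes dnspython 2.0 or later, and uses b.root-servers.net's address as of this writing):

import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

def ns_ttl(zone, server_ip):
    """TTL of zone's NS records as reported by server_ip, whether they
    arrive as an authoritative answer or as a referral."""
    query = dns.message.make_query(zone, dns.rdatatype.NS)
    response = dns.query.udp(query, server_ip, timeout=5)
    for rrset in response.answer + response.authority:
        if rrset.rdtype == dns.rdatatype.NS:
            return rrset.ttl
    return None

auth_ip = dns.resolver.resolve("any.ca-servers.ca", "A")[0].address
print("TTL from parent (referral):", ns_ttl("ca.", "199.9.14.201"))
print("TTL from authoritative:", ns_ttl("ca.", auth_ip))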

A.3 Looking at a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritative, not the one from referrals.

In this section we look into the caching of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives and 172800 s at the parent .com servers, as can be seen in Table 6.

To determine which specific record two popular recursive implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send only one query: dig ns amazon.com. We then dump the contents of the caches of these servers and see which values they store.
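Both dumps come from the servers' standard control utilities; a minimal sketch of the procedure (file paths and control-channel permissions vary by installation):

import subprocess

# BIND writes its cache to named_dump.db via rndc; Unbound prints
# its cache on stdout via unbound-control.
subprocess.run(["rndc", "dumpdb", "-cache"], check=True)
dump = subprocess.run(["unbound-control", "dump_cache"],
                      capture_output=True, text=True, check=True).stdout

for line in dump.splitlines():
    if "amazon.com" in line and " NS " in line:
        print(line)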

Listing 3 and Listing 4 show the contents of the caches of BIND and Unbound, respectively, after sending the aforementioned query. As can be seen, the values stored in the cache are the ones provided by the authoritative server (decremented by the few seconds that elapsed before we dumped the contents), and not the referral values provided by the parent's authoritatives:

; authanswer
amazon.com. 3595 NS ns1.p31.dynect.net.
amazon.com. 3595 NS ns2.p31.dynect.net.
amazon.com. 3595 NS ns3.p31.dynect.net.
amazon.com. 3595 NS ns4.p31.dynect.net.
amazon.com. 3595 NS pdns1.ultradns.net.
amazon.com. 3595 NS pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com.

;rrset 3594 6 0 7 3
amazon.com. 3594 IN NS pdns1.ultradns.net.
amazon.com. 3594 IN NS ns1.p31.dynect.net.
amazon.com. 3594 IN NS ns3.p31.dynect.net.
amazon.com. 3594 IN NS pdns6.ultradns.co.uk.
amazon.com. 3594 IN NS ns4.p31.dynect.net.
amazon.com. 3594 IN NS ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com.

A.4 Discussion and Implications
Understanding how glue records affect TTLs matters because it shows that zone operators are in fact in charge of how long their NS records remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS
In §3.4 we saw that there are a number of AC-type answers—queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we see that, with the exception of the 60 s TTL, where all queries go to the authoritative (AA), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time (answers vs. minutes after start). Panels: (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s (1 hour), probing interval T = 10 min.]

C LIST OF PUBLIC RESOLVERS
We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72    Alternate DNS
23.253.163.53     Alternate DNS
205.204.88.60     BlockAid Public DNS (or PeerDNS)
178.21.23.150     BlockAid Public DNS (or PeerDNS)
91.239.100.100    Censurfridns
89.233.43.71      Censurfridns
2001:67c:28a4::             Censurfridns
2002:d596:2a92:1:71:53::    Censurfridns
213.73.91.35      Chaos Computer Club Berlin
209.59.210.167    Christoph Hochstätter
85.214.117.11     Christoph Hochstätter

212.82.225.7      ClaraNet
212.82.226.212    ClaraNet
8.26.56.26        Comodo Secure DNS
8.20.247.20       Comodo Secure DNS
84.200.69.80      DNS.Watch
84.200.70.40      DNS.Watch
2001:1608:10:25::1c04:b12f   DNS.Watch
2001:1608:10:25::9249:d69b   DNS.Watch
104.236.210.29    DNSReactor
45.55.155.25      DNSReactor
216.146.35.35     Dyn
216.146.36.36     Dyn
80.67.169.12      FDN
2001:910:800::12  FDN
85.214.73.63      FoeBud
87.118.111.215    FoolDNS
213.187.116.2     FoolDNS
37.235.1.174      FreeDNS
37.235.1.177      FreeDNS
80.80.80.80       Freenom World
80.80.81.81       Freenom World
87.118.100.175    German Privacy Foundation e.V.
94.75.228.29      German Privacy Foundation e.V.
85.25.251.254     German Privacy Foundation e.V.
62.141.58.13      German Privacy Foundation e.V.
8.8.8.8           Google Public DNS
8.8.4.4           Google Public DNS
2001:4860:4860::8888   Google Public DNS
2001:4860:4860::8844   Google Public DNS
81.218.119.11     GreenTeamDNS
209.88.198.133    GreenTeamDNS
74.82.42.42       Hurricane Electric
2001:470:20::2    Hurricane Electric
209.244.0.3       Level3
209.244.0.4       Level3
156.154.70.1      Neustar DNS Advantage
156.154.71.1      Neustar DNS Advantage
5.45.96.220       New Nations
185.82.22.133     New Nations
198.153.192.1     Norton DNS
198.153.194.1     Norton DNS
208.67.222.222    OpenDNS
208.67.220.220    OpenDNS
2620:0:ccc::2     OpenDNS
2620:0:ccd::2     OpenDNS
58.6.115.42       OpenNIC
58.6.115.43       OpenNIC
119.31.230.42     OpenNIC
200.252.98.162    OpenNIC
217.79.186.148    OpenNIC
81.89.98.6        OpenNIC
78.159.101.37     OpenNIC
203.167.220.153   OpenNIC
82.229.244.191    OpenNIC
216.87.84.211     OpenNIC
66.244.95.20      OpenNIC
207.192.69.155    OpenNIC
72.14.189.120     OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9   OpenNIC
2001:470:1f07:38b::1                 OpenNIC
2001:470:1f10:c6::2001               OpenNIC
194.145.226.26    PowerNS
77.220.232.44     PowerNS
9.9.9.9           Quad9
2620:fe::fe       Quad9
195.46.39.39      SafeDNS
195.46.39.40      SafeDNS
193.58.251.251    SkyDNS
208.76.50.50      SmartViper Public DNS
208.76.51.51      SmartViper Public DNS
78.46.89.147      ValiDOM
88.198.75.145     ValiDOM
64.6.64.6         Verisign
64.6.65.6         Verisign
2620:74:1b::1:1   Verisign
2620:74:1c::2:2   Verisign
77.109.148.136    Xiala.net
77.109.148.137    Xiala.net
2001:1620:2078::136   Xiala.net
2001:1620:2078::137   Xiala.net
77.88.8.88        Yandex.DNS
77.88.8.2         Yandex.DNS
2a02:6b8::feed:bad       Yandex.DNS
2a02:6b8:0:1::feed:bad   Yandex.DNS
109.69.8.51       puntCAT

[Figure 14: Answers received during DDoS attacks (extra graphs; answers vs. minutes after start; series: OK, SERVFAIL, No answer); arrows indicate DDoS start and recovery. (a) Experiment D: 1800-50p-10min-ns1; (b) Experiment G: 300-75p-10min-2NSes.]

[Figure 15: Latency results (extra graphs; median, mean, 75%ile, and 90%ile RTT in ms vs. minutes after start); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment D: 50% packet loss on 1 NS only (1800 s TTL); (b) Experiment G: 75% packet loss (300 s TTL).]

D EXTRA GRAPHS FROM PARTIAL DDOS
Although we carried out all the experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for Experiments D and G.

Figure 14 shows the number of answers for Experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when one NS fails (50% packet loss), there is no significant change in the number of answered queries. Moreover, most users will not notice a difference in latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (thus half of the probing interval). Similarly to Experiment I, this experiment should not have caching active, given that the TTL is shorter than the probing interval. Moreover, it has a packet loss rate of 75%. Even at this rate, the far majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES
In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur in current DNS software. These experiments confirm that the findings of prior work by Yu et al. [56] still hold. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) first reachable and then inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from .net and querying one or both based on their internal algorithms. All queries are made with a cold cache. In total, we send 100 queries per experiment and present the results in this section.
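Counting these retries requires only a packet capture on the recursive's uplink. A sketch of one way to reproduce the tally (interface and file names are placeholders, not our actual setup):

# capture all DNS traffic between the recursive and the authoritatives
tcpdump -i eth0 -w retries.pcap 'port 53'

# count queries (not responses) per destination server and query name
tshark -r retries.pcap -Y 'dns.flags.response == 0' \
       -T fields -e ip.dst -e dns.qry.name | sort | uniq -c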

[Figure 17: Probe 28477 in dataset I: the stub resolver (Atlas probe) queries three local recursives (R1a, R1b, R1c); all R1's are connected to all last-layer recursives (Rn1–Rn8), which are connected to both authoritative servers (AT1, AT2).]

[Figure 16: Histogram of queries from recursive resolvers (BIND and Unbound, normal operation vs. DDoS), broken down into queries to the Root, .net, and cachetest.net.]

We start by analyzing the traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the .net authoritative servers, and the authoritative servers of cachetest.net.

Figure 16 shows the results of this experiment, with normal behavior in the left two groups and the inaccessible (DDoS) case in the right two. From this figure we see that, in normal operation, only a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the .net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the Root, one to .net, and one to the cachetest.net authoritative servers) to resolve the AAAA record of sub.cachetest.net. Unbound, in turn, takes 5 queries—it sends 2 more to the authoritatives of cachetest.net, asking for the AAAA records of their NS records, which do not exist (this difference can be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure: both BIND and Unbound send far more DNS queries to all available authoritative servers. BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total. Unbound keeps trying to resolve both the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 of the 46 queries). If we disregard these queries, Unbound still increases its total number of queries from 6 to 16. (We repeated this experiment 100 times, and the results were consistent.)

Another difference between BIND and Unbound is that when cachetest.net is unreachable, BIND asks both parents (the Root and .net) for its records again (shown by the increase in the Root and .net portions of BIND DDoS vs. BIND non-DDoS in Figure 16); Unbound does not retry these queries.

This experiment is consistent with our observation that we see about 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect—recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE
In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last-layer" recursives (the ones that ultimately send queries to the authoritatives: Rn1–Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (the RIPE Atlas probe) and the server side (the authoritative servers). We show a summary of the results in Table 7. The probing intervals T during which the emulated DDoS was active (dataset I: 90% packet drop on both authoritative servers; Table 4) are marked with an asterisk in the table.

Client Side: Analyzing this table, we can learn the following from the client side.

During normal operations we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1's used in the answers); by default, each RIPE Atlas probe sends a DNS query to each of its R1 resolvers.

During the DDoS this changes slightly: we see 2 answers from 2 R1's—the other query either times out or returns SERVFAIL. Thus, at 90% packet loss on both authoritatives, 2 of 3 queries are still answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval). This is in fact a good result, given the emulated DDoS.

Authoritative Side: Now let us consider the authoritative side.


      Client view (Atlas probe)   Authoritative server view
T     Queries Answers #R1         Queries Answers #ATs #Rn #uniq(Rn,AT)  Max per (Rn,AT) (% of total)  Queries of top-2 Rn's
 1      3       3      3            3       3      2    2      3          1 (33.3%)                     2, 1
 2      3       3      3            5       5      2    4      5          1 (20.0%)                     2, 1
 3      3       3      3            6       6      2    6      6          1 (16.7%)                     1, 1
 4      3       3      3            5       5      2    2      3          2 (40.0%)                     4, 1
 5      3       3      3            3       3      2    3      3          1 (33.3%)                     1, 1
 6      3       3      3            6       6      2    5      5          2 (33.3%)                     2, 1
 7*     3       2      2           29       3      2    2      4         10 (34.5%)                    15, 14
 8*     3       2      2           22       6      2    7      9          7 (31.8%)                    10, 7
 9*     3       2      2           21       1      2    3      6          9 (42.9%)                    16, 3
10*     3       2      2           11       4      2    3      4          6 (54.5%)                     8, 2
11*     3       2      2           21       3      2    4      6         15 (71.4%)                    17, 2
12*     3       2      2           23       1      2    8      8         15 (65.2%)                    15, 2
13      3       3      3            3       3      2    3      3          1 (33.3%)                     1, 1
14      3       3      3            4       2      2    4      4          1 (25.0%)                     1, 1
15      3       3      3            8       8      2    7      7          7 (87.5%)                     1, 1
16      3       3      3            9       9      2    3      3          6 (66.7%)                     6, 3
17      3       3      3            9       9      2    5      5          5 (55.6%)                     5, 1

Table 7: Client and authoritative views of AAAA queries (probe 28477, measurement I); rows marked * fall within the emulated DDoS.

In normal operation (before the DDoS), the 3 queries sent by the client (client view) lead to 3 to 6 queries sent by the Rn recursives that reach the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn used before the attack varies from 2 to 6 (#Rn). This choice depends on how the R1's choose their upper-layer recursives (Rn−1), and ultimately on how each Rn−1 chooses its Rn.

We can also observe how each Rn chooses an authoritative (#uniq(Rn,AT)): this metric shows the number of unique combinations of Rn and authoritative observed. For example, for T = 1 we see that three queries reach the authoritatives but only 2 Rn were used; since there are three unique recursive-authoritative combinations, one of the Rn must have used both authoritatives.

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11–29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) An increase in the number of Rn used: the number of Rn used increases somewhat (3.6 on average before the attack, 4.5 during it). (Not all Rn are used all the time during the DDoS, however. We do not have enough data to explain why; it surely depends on how the lower-layer recursives choose them, or on how they are set up.)

(2) An increase in the number of authoritatives used by each Rn: each Rn contacts an average of 1.13 authoritatives before the attack (the average of #uniq(Rn,AT)/#Rn). During the DDoS this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (the last column of Table 7): compared to the total number of queries, they comprise the far majority.

For this probe, the explanation for the large number of queries is thus a combination of factors, retries by the Rn being the most important one. From the client side, there is no strong indication that the client's queries are what drives the increase in the number of queries to the authoritatives during a DDoS.



5.3 Discussion of Complete Failures
Overall, we see that caching is partially successful in protecting clients during a DDoS. With full, valid caches, half or more of the VPs get service. However, caches are filled at different times and expire, so an operator cannot count on a full cache duration for any customer, even for popular ("always in the cache") domains. The protection provided by caches depends on their state in the recursive resolver, something outside the operator's control. In addition, our evaluation of caching in §3 showed that caches end early for some VPs.

Second, we were surprised that a tiny fraction of VPs are successful after all caches should have timed out (after the 80-minute period in Experiment A, and between 90 and 110 minutes in Experiment C). These successes suggest an early deployment of "serve stale", a mechanism currently under review in the IETF [19]: serving a previously known record beyond its TTL when the authoritatives are unreachable, with the goal of improving resilience under DDoS. We investigated Experiment A, where we see that 1048 of the 1140 successes occur in the second half of the outage. These successes come from 471 VPs (and 215 recursives), most of them answered by OpenDNS and Google Public DNS servers, suggesting experimentation that is not yet widespread. Out of these 1048 queries, 1031 return a TTL value equal to 0, as specified in the IETF serve-stale draft [19].
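Serve-stale has since become a standard knob in recursive software. As a sketch (not a configuration we measured): in Unbound versions newer than the 1.5.8 we test, it can be enabled in unbound.conf:

server:
    # answer with expired records when authoritatives are unreachable
    serve-expired: yes
    # cap how long an expired record may still be handed out (seconds)
    serve-expired-ttl: 3600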

5.4 Client Reliability During Partial Authoritative Failure

The previous section examined DDoS attacks that result in complete failure of all authoritatives, but DDoS attacks often result in partial failure, with 50% or 90% packet loss at the authoritatives. (For example, consider the November 2015 DDoS attack on the DNS Root [23].) We next study experiments with partial failures, showing that caching and retries together nearly fully protect clients during 50%-loss DDoS events, and protect half of the VPs even during 90%-loss events.

We carry out several experiments, D to I in Table 4. We follow the procedure outlined in §5.1, looking at DDoS-driven loss rates of 50%, 75%, and 90%, with TTLs of 1800 s, 300 s, and 60 s. Graphs omitted here due to space can be found in Appendix D.
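We emulate the attack by discarding incoming queries at the authoritatives (§5.1). As a sketch of how such loss rates can be imposed on a Linux authoritative (we are not claiming this exact command is our testbed's mechanism; the rate is one of our loss levels):

# emulate a DDoS by randomly dropping 75% of incoming DNS queries
iptables -A INPUT -p udp --dport 53 \
         -m statistic --mode random --probability 0.75 -j DROP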

Near-Full Protection from Caches During Moderate Attacks: We first consider Experiment E, a "mild" DDoS with 50% loss, with VP success over time in Figure 8a. In spite of a loss rate that would be crippling to TCP, nearly all VPs are successful in DNS. This success is due to two factors: first, we know that many clients are served from caches, as was shown in Experiment A with full loss (Figure 6a); second, most recursives retry queries, so they recover from the loss of a single packet and are able to provide an answer. Together, these mean that the fraction of failures during the first 30 minutes of the event is 8.5%, only slightly higher than the 4.8% fraction of failures before the DDoS. For this experiment the TTL is 1800 s (30 minutes), so we might expect failures to increase halfway through the DDoS. We do not see any increase in failures because caching and retries are synergistic: a successfully retried query places the answer in a cache for a later query. The importance of this result is that DNS can survive moderate-size attacks when caching is possible. While a positive, retries do increase latency, something we study in §5.5.

Attack Intensity Matters: While clients do quite well with 50% loss at all authoritatives, failures increase with the intensity of the attack.

[Figure 8: Answers received during DDoS attacks (answers vs. minutes after start; series: OK, SERVFAIL, No answer); the first and second vertical lines show the start and end of the DDoS. (a) Experiment E (1800-50p-10min): 50% packet loss; (b) Experiment F (1800-75p-10min): 75% packet loss; (c) Experiment H (1800-90p-10min): 90% packet loss; (d) Experiment I (60-90p-10min): 90% packet loss.]


Experiments F and H, shown in Figure 8b and Figure 8c, increase the loss rate to 75% and 90%. We see that the number of failures increases to about 19.0% with 75% loss and 40.3% with 90% loss. It is important to note that roughly 60% of the clients are still served even with 90% loss.

We also see that this level of success is consistent over the entire hour-long DDoS event, even though the cache duration is only 30 minutes. This consistency confirms the importance of caching and retries in combination.

To verify the effects of this interaction, Experiment I reduces the caching duration to 60 s, less than one round of probing. Comparing Experiment I in Figure 8d to H in Figure 8c, we see that the failure rate increases from 30% to about 63%. However, even with no caching, about 37% of queries are still answered, due to resolvers that serve stale content and to recursive retries. We investigate retries in §6.

5.5 Client Latency During Partial Authoritative Failure

We showed that client reliability is higher than expected during failures (§5.4), due to a combination of caching and retries. We next consider client latency. Latency will increase during the DDoS because of retries and queueing delay, but we will show that it increases less than one might expect, thanks to caching.

To examine latency we return to Experiments D through I (Table 4), but look at latency (the time to complete a query) rather than success. For these experiments, clients time out after 5 s.

Figures 9a to 9d show latency during each emulated DDoS scenario (experiments with figures omitted here are in Appendix D). Latencies are not evenly distributed, since some requests get through immediately while others must be retried one or more times, so in addition to the mean we show the 50%, 75%, and 90% quantiles to characterize the tail of the distribution.

We emulate DDoS by dropping requests (§5.1), and hence latencies reflect retries and loss but not queueing delay, underrepresenting latency in real-world attacks. However, their shape (mostly low latencies with a few long ones) is consistent with, and helps explain, what has been seen in the past [23].

Beginning with Experiment E, the moderate attack in Figure 9a, we see no change to median latency. This result is consistent with many queries being handled by the cache, and half of those not handled by the cache getting through anyway. We do see higher latency in the 90%ile tail, reflecting successful retries. This tail also increases the mean somewhat.

This trend increases in Experiment F (Figure 9b), where 75% of queries are lost. Now the 75%ile tail has increased, as has the number of unanswered queries, and the 90%ile is twice as long as in Experiment E.

We see similar latency in Experiment H, with a DDoS causing 90% loss. We set the timeout to 5 s, so the larger attack results in more unsuccessful queries, but latency for successful queries is not much worse than with 75% loss. Median latency is still low due to cached replies.

Finally, Experiment I greatly reduces the opportunity for caching by lowering the cache lifetime to one minute. Figure 9d shows that the loss of caching increases the median RTT and significantly increases the tail latency. Compared with Figure 9c (same packet loss ratio but

[Figure 9: Latency results (median, mean, 75%ile, and 90%ile RTT in ms vs. minutes after start); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment E: 50% packet loss (1800 s TTL); (b) Experiment F: 75% packet loss (1800 s TTL); (c) Experiment H: 90% packet loss (1800 s TTL); (d) Experiment I: 90% packet loss (60 s TTL).]


1800 s TTL), we can clearly see the benefits of caching for latency (in addition to reliability): a half-hour TTL value reduced the latency from 1300 ms to 390 ms. Longer TTLs also help reduce tail latency relative to shorter TTLs (compare, for example, the 90%ile RTT in Experiments I and H in Figure 9).

Summary: DDoS events often increase client latency. For moderate attacks, increased latency is seen only by a few "unlucky" clients whose recursives do not have a full cache and whose queries are lost. Caching plays an important role in reducing latency during a DDoS, but while it can often mitigate most reliability problems, it cannot avoid latency penalties for all VPs. Even when caching is not available, roughly 40% of clients get an answer, either by serve-stale or by retries, as we investigate next.

6 THE AUTHORITATIVE'S PERSPECTIVE
Results of partial DDoS events (§5.4) show that DNS is surprisingly reliable—even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of the VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d). These results are due to a combination of caching and retries. We next examine this from the perspective of the authoritative server.

6.1 Recursive-Authoritative Traffic During a DDoS

We first ask: are retries by recursive resolvers responsible for the success rates observed in §5.4? To investigate this question, we return to the partial DDoS experiments and look at how many queries are sent to the authoritative servers. We measure queries before they are dropped by our emulated DDoS. Recursives must make multiple queries to resolve a name, so we break out each type of query: for the nameservers (NS), for the nameservers' IPv4 and IPv6 addresses (A-for-NS and AAAA-for-NS), and finally the desired query (AAAA-for-PID). Note that the authoritatives are IPv4-only, so the AAAA-for-NS records are non-existent and subject to negative caching, while the other records exist and use regular caching.

We begin with the DDoS causing 75% loss, in Figure 10a. For this experiment we observe 18407 unique IP addresses of recursives (Rn) querying our authoritatives directly for AAAA records. During the DDoS, queries increase by about 3.5×. We expect up to 4 trials, since the expected number of tries until success with loss rate p is (1 − p)^−1. For this scenario, results are cached for up to 30 minutes, so successful queries are reused from recursive caches. This increase occurs both for the target AAAA record and for the non-existent AAAA-for-NS records. Negative caching for our zone is configured at 60 s, making caching of NXDOMAINs for AAAA-for-NS less effective than positive caching.
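This expectation is simply the mean of a geometric distribution: if each try independently succeeds with probability 1 − p, then

\[
E[\text{tries}] \;=\; \sum_{k=1}^{\infty} k \, p^{\,k-1}\,(1-p) \;=\; \frac{1}{1-p},
\qquad p = 0.75 \;\Rightarrow\; E[\text{tries}] = 4 .
\]

The observed 3.5× is slightly below this bound, consistent with some answers being served from caches rather than re-queried.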

The offered load on the server increases further with more loss (90%), as shown in Experiment H (Figure 10b). The higher loss rate results in a much higher offered load on the server: on average, 8.2× normal.

Finally, in Figure 10c, we reduce the effects of caching at a 90% DDoS by using a TTL of 60 s. Here we also see about 8.1× more queries at the server than before the attack. Comparing this case to Experiment H, caching reduces the offered load on the server by about 40%.

[Figure 10: Number of queries received by the authoritative servers (queries vs. minutes after start; series: NS, A-for-NS, AAAA-for-NS, AAAA-for-PID); the shaded area indicates the interval of an ongoing DDoS attack. (a) Experiment F (1800-75p-10min): 75% packet loss; (b) Experiment H (1800-90p-10min): 90% packet loss; (c) Experiment I (60-90p-10min): 90% packet loss.]

Implications: The implication of this analysis is that legitimate clients "hammer" the already-stressed server with retries during a DDoS. For clients, retries are important for reliability, and each client chooses to retry independently.

The server is already under stress due to the DDoS, so these retries add to that stress. However, the DDoS traffic is almost certainly much larger than the retried legitimate traffic. (A server experiencing a volumetric attack causing 90% loss must be receiving 10× its capacity. Regular traffic is a small fraction of normal capacity, so even 4× regular traffic is still much less than the attack traffic.) The multiplier for retried legitimate traffic depends on the implementations of the stub and recursive resolvers, as well as on application-level retries and defection (users hitting reload in their browsers, and later giving up). Our experiment omits application-level retries and likely gives a


lower bound. We next examine specific recursive implementations to see their behavior.

6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers "hammer" authoritatives when queries are dropped. We re-examine DNS software (updating prior work from 2012 [56]) and additionally show that deployments amplify retries.

Recursive Software: Prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], evaluating BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS, and PowerDNS. We studied retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records of all authoritatives and of the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7× more requests before the recursives time out. (Exact numbers vary across runs, but typically each request is made 6 or 7 times.) Such retries are appropriate provided they are paced (both implementations use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.

Recursive Deployment: Another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex, multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify amplification, we count both the number of Rn recursives and the number of AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values capture the amplification in two ways: during stress, more Rn recursives are used for each probe ID, and these Rn generate more queries to the already-stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4); the maximum rises to 39. The number of queries for each probe ID grows more than 3×, from 2 to 7. Worse, the 90%ile grows more than 6× (from 3 queries to 18). The maximum grows 53.5×, reaching up to 286 queries for one single probe ID. These values are lower bounds, given that a large number of A and AAAA queries ask for NS records and cannot be associated with a probe ID (AAAA- and A-for-NS in Figure 10).
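Because each probe encodes its ID in the query name ($PID.cachetest.nl), both per-probe metrics can be recomputed from the authoritative-side capture. A sketch of the Rn-per-probe-ID count (the capture file name is a placeholder):

# count distinct Rn recursives per probe ID: pair each AAAA query's
# probe-ID label with its source address, de-duplicate, then tally
tshark -r auth.pcap -Y 'dns.flags.response == 0 && dns.qry.type == 28' \
       -T fields -e dns.qry.name -e ip.src \
  | awk '{split($1, lbl, "."); print lbl[1], $2}' \
  | sort -u \
  | awk '{cnt[$1]++} END {for (pid in cnt) print pid, cnt[pid]}'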

We can also look at the aggregate effects of the retries created by this complex recursive infrastructure. Figure 12 shows a timeseries of the unique Rn IP addresses observed at the authoritatives. Before the DDoS, for Experiment I with its TTL of 60 s, we see a constant number of recursives reaching our authoritatives; that is, all queries must be answered by the authoritatives (there is no caching at this TTL value). For Experiments F and H, both with TTLs of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks occur when caches expire, as expected.

During the DDoS we observe similar behavior for all three experiments in Figure 12: as packets are dropped at the authoritatives (at rates of 75%, 90%, and 90% for F, H, and I, respectively), the number of Rn recursives querying our authoritatives increases. For Experiments F and H we still see drops when caching

[Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs (log scale vs. minutes after start); series: median, 90%ile, and maximum of Rn-per-probe-ID and of AAAA-for-PID queries.]

[Figure 12: Unique Rn recursive addresses observed at the authoritatives (vs. minutes after start) for Experiments F, H, and I.]

is expected, but not for Experiment I. The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which amplifies the traffic in the end. (We show this behavior for an individual probe in Appendix F, where we observe growth in both the number of queries received at the authoritatives and the number of recursives used.)

Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to stay within a constant factor of that of single recursives, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and we encourage their operators to share information about their behavior.

7 RELATED WORK
Caching by Recursives: Several groups have shown that DNS caching can be imperfect. Hao and Wang analyzed the impact of nonce domains on DNS recursives' caches [13]; using two weeks of data from two universities, they showed that filtering out one-time domains improves cache hit rates. In two studies, Pang et al. [31, 32] reported that web clients and local recursives do not always honor the TTL values provided by authoritatives. Almeida et al. [2] analyzed DNS traces of a mobile operator and used a mobile application to see TTLs in practice; they find that most domains have short TTLs (less than 60 s) and report evidence of TTL manipulation by recursives. Schomp et al. [48] demonstrate widespread use of


multi-level recursives by large operators, as well as TTL manipulation. Our work builds on this prior work, examining caching and TTL manipulation systematically and considering their effects on resilience.

DNS client behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries that occur when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, finding that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives during DDoS: We previously investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast with both latency and reachability, as seen from RIPE Atlas. Rather than looking at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while the prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study has incomplete data and focuses on the specific results of two events. These differences stem from its study of natural experiments from real-world events, versus our controlled experiments.

8 IMPLICATIONS
We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get through to the authoritative servers. The key implication of our study is that it explains the differences in outcome of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative "server" implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. Nevertheless, the Root Operators state: "There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability…" [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure, and immediately took steps to address the service problems, there were reports of user-visible service disruption in the technical and even the popular press [34]. Reports describe intermittent failures of prominent websites including "Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times", each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cacheable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports

922 anycast instances operating across the 13 authoritative servers. Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 "facilities") and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components whose TTLs range from 120 to 300 s.

In addition to explaining the effects, our experiments help get at the root causes behind these outcomes. Users of the Root benefited from caching, and saw performance like Experiment E (Figure 8a), because root contents (TLDs like .com and country codes) are popular and almost certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) received performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to allow caching to help and to give DNS operators time to deploy defenses; Experiment H suggests 30 minutes (Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow re-provisioning during DDoS attacks on web servers, but they leave the DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, combined with long DNS TTLs, may be preferable to short DNS TTLs, so that both layers can be robust. However, the complexity of the interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize the geographic dispersion of its sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the authoritative with the largest latency [27]. Prior work on resilience to DDoS attack has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic each site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be as good as that of the strongest individual authoritative. The reason for these opposite results is that, in both cases, recursive resolvers will try all authoritatives of a given service: for latency, they will sometimes choose a distant authoritative, but for resilience, they will keep trying until they find the most available authoritative.

9 CONCLUSIONS
This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole. In


fact, they can largely cover over partial DDoS attacks for many users—even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of the VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d).

The primary cost of a DDoS for users can be greater latency, but even this penalty is uneven across users, with a few seeing much greater latency while others see little or no change. Finally, we show that one consequence of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruptions, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of this work is to quantify the importance of caching and retries in recursives for resilience, encouraging the use of at least moderate TTLs wherever possible.

Acknowledgments
The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on drafts of this paper.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES
[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012, Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from Using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42.
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurement (PAM). 188–200.
[9] DNS-OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). https://doi.org/10.17487/RFC2181
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft draft-ietf-dnsop-terminology-bis. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis/
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC. 2018. BIND 9 Configuration Reference (Chapter 6). https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report. GitHub Engineering. https://githubengineering.com/ddos-incident-report/
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency. Internet Draft draft-tale-dnsop-serve-stale-02. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P. V. Mockapetris. 1987. Domain Names—Concepts and Facilities. RFC 1034 (Internet Standard). https://doi.org/10.17487/RFC1034
[21] P. V. Mockapetris. 1987. Domain Names—Implementation and Specification. RFC 1035 (Internet Standard). https://doi.org/10.17487/RFC1035
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS". https://ant.isi.edu/datasets/dns/Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS. In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366
[28] NLnet Labs. 2018. Unbound Documentation: unbound.conf(5). https://nlnetlabs.nl/documentation/unbound/unbound.conf
[29] OpenDNS. 2018. Setup Guide. https://www.opendns.com/setupguide/
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An Overview of DNS-based Server Selections in Content Distribution Networks. Computer Networks 43, 6 (2003), 695–711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows/
[36] Alec Peterson. 2017. EC2 Resolver Changing TTL on DNS Answers? Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy in a Few Easy Steps. https://quad9.net
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671; TTL1800: 10507676; TTL3600: 10536725; TTL86400: 10579327; TTL3600-10min: 10581463; A: 10859822; B: 11102436; C: 11221270; D: 11804500; E: 11831403; F: 11831403; G: 12131707; H: 12177478; I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas: Raw Data Structure Documentation. https://atlas.ripe.net/docs/data_struct/
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters—An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons from the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On Measuring the Client-Side DNS Infrastructure. In Proceedings of the ACM Internet Measurement Conference. ACM, 77–90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl Stats and Data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A Study of Caching Behavior with Respect to Root Server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374/
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang Up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] D. Wessels and M. Weinberg. 2016. Review and Analysis of Attack Traffic Against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24. Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A High-Performance Network Traffic Data Streaming Warehouse. In Network Operations and Management Symposium (NOMS) 2016. IEEE/IFIP, 913–918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80–86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS?
The cache estimations in §3 and §4 assume each record has one clear TTL. However, glue records bridge zones that live on different authoritative servers, and can result in two different TTL values for the same record. This section examines which TTL takes precedence.

A.1 The Problem of Glue Records
DNS employs glue records to identify name servers that are within the zone they serve. For example, the Canadian national domain ca is served by ca-servers.ca, but to look up ca-servers.ca one must first know where ca is. To break this cycle, records identifying these servers ("glue records") are placed in the parent zone (the root in this case). With these records in two places, they may carry different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net
;; AUTHORITY SECTION:
ca.   172800   IN   NS   any.ca-servers.ca.
ca.   172800   IN   NS   j.ca-servers.ca.
ca.   172800   IN   NS   x.ca-servers.ca.
ca.   172800   IN   NS   c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net.

Notice that the answer for the NS records appears in the AUTHORITY SECTION; this signals that the answer is a "referral", a type of answer in which "a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query" [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not are not authoritative for ca arecursive can ask the same query directly to any of the ca serversListing 2 shows the results authanswer$ d ig ns ca any caminus s e r v e r s ca ANSWER SECTION ca 86400 IN NS c caminus s e r v e r s ca ca 86400 IN NS j caminus s e r v e r s ca ca 86400 IN NS x caminus s e r v e r s ca ca 86400 IN NS any caminus s e r v e r s ca

Listing 2 Answer to dig ns ca anyca-serversca

Notice that the records on this response from anyca-serverscaare presented in ldquoANSWER SECTIONrdquo (and not on ldquoAUTHORITYSECTIONrdquo) and the AA bit is set

Even though the records are the same as in the case of the Rootstheir TTL is different 2 days and 1 day respectively The samebehavior can be observed for the A records of the ca servers (dig Aca broot-serversnet and dig A ca anyca-serversca)

We refer to this as TTL duplication: the same record is defined both at its authoritative servers and, as a glue record, at the servers of its parent zone.

This duplication poses a question for recursive resolvers: which value should they trust, the one from the glue or the one from the authoritative name server?

In theory, the answer from the authoritative should be the one used, since those are the servers that are authoritative for the zone; the roots in this example are not "completely" authoritative [15]. This behavior can be observed in various zones as well, not only at the roots.
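To check this duplication in another zone, one can compare the referral and the authoritative answer programmatically. The sketch below illustrates the dig comparison above using the dnspython library (it assumes dnspython >= 2.0 and network access; it is not part of our measurement code):

    import dns.flags
    import dns.message
    import dns.query
    import dns.resolver

    def ns_ttl(server_ip, qname="ca."):
        """Ask one server for the NS records of qname; return (TTL, AA bit)."""
        query = dns.message.make_query(qname, "NS")
        resp = dns.query.udp(query, server_ip, timeout=5)
        # A referral carries the NS set in the authority section with AA clear;
        # an authoritative answer carries it in the answer section with AA set.
        rrsets = resp.answer if resp.answer else resp.authority
        return rrsets[0].ttl, bool(resp.flags & dns.flags.AA)

    parent = dns.resolver.resolve("b.root-servers.net", "A")[0].address
    child = dns.resolver.resolve("any.ca-servers.ca", "A")[0].address
    for label, ip in [("parent (root)", parent), ("child (ca)", child)]:
        ttl, aa = ns_ttl(ip)
        print(f"{label}: TTL={ttl}, AA={aa}")

Run against ca as above, the parent should report a TTL of 172800 with AA clear, and the child a TTL of 86400 with AA set.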

Given the importance of TTLs in DNS resilience (they determine cache durations), we investigate in this section which values recursives in the wild actually trust: those provided by the parent (glue) or those provided by the authoritative name servers. According to RFC 2181 [10], a recursive should use the authoritative answer (§5.4.1). Assuming other zones behave like ca, we next measure which value resolvers pass on to their clients.


                   NS record   A record   Source
Total answers      128382      98038
TTL > 3600         38          46         Unclear
TTL = 3600         260         233        Parent
60 < TTL < 3600    6890        4621       Parent/others
TTL = 60           60803       46820      Authoritative
TTL < 60           60391       46318      Authoritative

Table 5: Distribution of the TTLs of DNS answers for cachetest.nl.

A records of amazon.com:
domain       IP               TTL
amazon.com   205.251.242.103  60
amazon.com   176.32.103.205   60
amazon.com   176.32.98.166    60

NS records of amazon.com:
domain       NS                    TTL-auth  TTL-com
amazon.com   ns1.p31.dynect.net    3600      172800
amazon.com   pdns6.ultradns.co.uk  3600      172800
amazon.com   ns2.p31.dynect.net    3600      172800
amazon.com   ns4.p31.dynect.net    3600      172800
amazon.com   pdns1.ultradns.net    3600      172800
amazon.com   ns3.p31.dynect.net    3600      172800

A records of the NS records:
domain                 IP             TTL
ns1.p31.dynect.net     208.78.70.31   86400
pdns6.ultradns.co.uk   204.74.115.1   3600
ns2.p31.dynect.net     204.13.250.31  86400
ns4.p31.dynect.net     204.13.251.31  86400
pdns1.ultradns.net     204.74.108.1   3600
ns3.p31.dynect.net     208.78.71.31   86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments

We use the same setup as in §3 and configure the TTL values of our name servers (ns[1-2].cachetest.nl) as follows:

  • Referral value (at the nl zone): TTL of 3600 s
  • Answer value (at the authoritative servers ns[1-2].cachetest.nl): TTL of 60 s

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of TTL values in the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, not the one provided by the parent (nl) server. We conclude from these two experiments that, for A and NS records, most recursives forward to their clients the TTL provided by the authoritatives of a zone.
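The classification in Table 5 follows mechanically from the two TTLs we configured; the helper below shows the rule applied to each observed answer TTL (a small sketch mirroring the 3600 s referral and 60 s answer values above):

    def ttl_source(observed_ttl):
        """Classify an observed answer TTL against the configured values:
        referral (parent) TTL = 3600 s, answer (authoritative) TTL = 60 s."""
        if observed_ttl > 3600:
            return "unclear"        # larger than anything we configured
        if observed_ttl == 3600:
            return "parent"         # full referral TTL
        if observed_ttl > 60:
            return "parent/others"  # decremented parent TTL (or manipulation)
        return "authoritative"      # 60 s answer TTL, possibly decremented

    assert ttl_source(3600) == "parent"
    assert ttl_source(59) == "authoritative"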

A.3 Looking at a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritative, not the one from referrals. In this section we look into the caching of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives and 172800 s at the parent com servers, as can be seen in Table 6.

To determine which record two popular recursive implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send a single query: dig ns amazon.com. We then dump the contents of each server's cache (e.g., with rndc dumpdb -cache for BIND and unbound-control dump_cache for Unbound) and inspect which values are stored.

Listing 3 and Listing 4 show excerpts of the cache contents of BIND and Unbound after sending the aforementioned query. As can be seen, the values stored in the cache (decremented by the few seconds that passed before we dumped the contents) are the ones provided by the authoritative server, not the ones provided by the parent as referrals:

; authanswer
amazon.com.  3595  NS  ns1.p31.dynect.net.
             3595  NS  ns2.p31.dynect.net.
             3595  NS  ns3.p31.dynect.net.
             3595  NS  ns4.p31.dynect.net.
             3595  NS  pdns1.ultradns.net.
             3595  NS  pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com.

;rrset 3594 6 0 7 3
amazon.com.  3594  IN  NS  pdns1.ultradns.net.
amazon.com.  3594  IN  NS  ns1.p31.dynect.net.
amazon.com.  3594  IN  NS  ns3.p31.dynect.net.
amazon.com.  3594  IN  NS  pdns6.ultradns.co.uk.
amazon.com.  3594  IN  NS  ns4.p31.dynect.net.
amazon.com.  3594  IN  NS  ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com.

A.4 Discussion and Implications

Understanding how glue records affect TTLs matters because it shows that zone operators are, in fact, in charge of how long their NS records remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw a number of AC-type answers: queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of the 60 s TTL (where all queries go to the authoritative), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. The list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72    Alternate DNS
23.253.163.53     Alternate DNS
205.204.88.60     BlockAid Public DNS (or PeerDNS)
178.21.23.150     BlockAid Public DNS (or PeerDNS)
91.239.100.100    Censurfridns
89.233.43.71      Censurfridns


[Figure 13 panels: (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s (1 hour), probing interval T=10 min. Each panel plots answers (0-20000) against minutes after start (0-100), broken out by answer type: AA, AC, CC, CA.]

Figure 13: Fraction of query answer types over time.

2001:67c:28a4::             Censurfridns
2002:d596:2a92:1:71:53::    Censurfridns
213.73.91.35                Chaos Computer Club Berlin
209.59.210.167              Christoph Hochstätter
85.214.117.11               Christoph Hochstätter
212.82.225.7                ClaraNet
212.82.226.212              ClaraNet
8.26.56.26                  Comodo Secure DNS
8.20.247.20                 Comodo Secure DNS
84.200.69.80                DNS.Watch
84.200.70.40                DNS.Watch
2001:1608:10:25::1c04:b12f  DNS.Watch
2001:1608:10:25::9249:d69b  DNS.Watch
104.236.210.29              DNSReactor
45.55.155.25                DNSReactor
216.146.35.35               Dyn
216.146.36.36               Dyn
80.67.169.12                FDN
2001:910:800::12            FDN
85.214.73.63                FoeBud
87.118.111.215              FoolDNS
213.187.11.62               FoolDNS
37.235.1.174                FreeDNS
37.235.1.177                FreeDNS
80.80.80.80                 Freenom World
80.80.81.81                 Freenom World
87.118.100.175              German Privacy Foundation e.V.
94.75.228.29                German Privacy Foundation e.V.
85.25.251.254               German Privacy Foundation e.V.
62.141.58.13                German Privacy Foundation e.V.
8.8.8.8                     Google Public DNS
8.8.4.4                     Google Public DNS
2001:4860:4860::8888        Google Public DNS
2001:4860:4860::8844        Google Public DNS
81.218.119.11               GreenTeamDNS
209.88.198.133              GreenTeamDNS
74.82.42.42                 Hurricane Electric
2001:470:20::2              Hurricane Electric
209.244.0.3                 Level3
209.244.0.4                 Level3
156.154.70.1                Neustar DNS Advantage
156.154.71.1                Neustar DNS Advantage
5.45.96.220                 New Nations
185.82.22.133               New Nations
198.153.192.1               Norton DNS
198.153.194.1               Norton DNS
208.67.222.222              OpenDNS
208.67.220.220              OpenDNS
2620:0:ccc::2               OpenDNS
2620:0:ccd::2               OpenDNS
58.6.115.42                 OpenNIC
58.6.115.43                 OpenNIC
119.31.230.42               OpenNIC
200.252.98.162              OpenNIC
217.79.186.148              OpenNIC
81.89.98.6                  OpenNIC
78.159.101.37               OpenNIC
203.167.220.153             OpenNIC
82.229.244.191              OpenNIC
216.87.84.211               OpenNIC
66.244.95.20                OpenNIC
207.192.69.155              OpenNIC
72.14.189.120               OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9  OpenNIC
2001:470:1f07:38b::1        OpenNIC
2001:470:1f10:c6::2001      OpenNIC
194.145.226.26              PowerNS


[Figure 14: answers (0-25000) vs. minutes after start (0-170), broken out as OK, SERVFAIL, and No answer, with cache-only and normal periods marked. (a) Experiment D: 1800-50p-10min-ns1; (b) Experiment G: 300-75p-10min-2NSes. Arrows indicate DDoS start and recovery.]

Figure 14: Answers received during DDoS attacks (extra graphs).

[Figure 15: latency (ms, 0-4000) vs. minutes after start (0-160), showing median, mean, 75%ile, and 90%ile RTT. (a) Experiment D: 50% packet loss on 1 NS only (1800 s TTL); (b) Experiment G: 75% packet loss (300 s TTL).]

Figure 15: Latency results. Shaded area indicates the interval of an ongoing DDoS attack (extra graphs).

77.220.232.44     PowerNS
9.9.9.9           Quad9
2620:fe::fe       Quad9
195.46.39.39      SafeDNS
195.46.39.40      SafeDNS
193.58.251.251    SkyDNS
208.76.50.50      SmartViper Public DNS
208.76.51.51      SmartViper Public DNS
78.46.89.147      ValiDOM
88.198.75.145     ValiDOM
64.6.64.6         Verisign
64.6.65.6         Verisign
2620:74:1b::1:1   Verisign
2620:74:1c::2:2   Verisign
77.109.148.136    Xiala.net
77.109.148.137    Xiala.net
2001:1620:2078::136  Xiala.net
2001:1620:2078::137  Xiala.net
77.88.8.88        Yandex.DNS
77.88.8.2         Yandex.DNS
2a02:6b8::feed:bad      Yandex.DNS
2a02:6b8:0:1::feed:bad  Yandex.DNS
109.69.8.51       puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for Experiments D and G.

Figure 14 shows the number of answers for Experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when one NS fails (50% loss), there is no significant change in the number of answered queries. Moreover, most users will not notice in terms of latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (half of the probing interval). Similarly to Experiment I, this experiment should not have caching active, given the short TTL. Moreover, it has a packet loss rate of 75%. At this rate, we can see that the large majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur with current DNS software. These experiments confirm that the findings of prior work by Yu et al. [56] still hold. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) first reachable and then inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from net and querying one or both based on their internal algorithms. All queries are made with a cold cache.
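The client side of each run is a single lookup against our own recursive; a minimal sketch with dnspython (assuming the recursive under test listens on 127.0.0.1, as in our setup; the retry counting itself is done by capturing traffic at the authoritatives, not here):

    import dns.exception
    import dns.resolver

    # Point at the recursive under test (BIND or Unbound), not the system one.
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["127.0.0.1"]

    try:
        # One AAAA lookup per run, always against a cold cache.
        answer = resolver.resolve("sub.cachetest.net", "AAAA", lifetime=10)
        print([rr.to_text() for rr in answer])
    except (dns.resolver.NoNameservers, dns.exception.Timeout) as exc:
        # With the authoritatives unreachable, the recursive retries
        # internally and eventually returns SERVFAIL or times out.
        print("lookup failed:", exc)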


[Figure 17: resolution topology of probe 28477 in dataset I: the stub resolver (Atlas probe) sends queries to three local recursives (R1a, R1b, R1c); all R1's are connected to all of the eight last-layer recursives (Rn1-Rn8), which in turn are connected to both authoritative servers (AT1, AT2).]

Figure 17: Probe 28477 in dataset I: all R1's are connected to all Rn's, which are connected to all ATs.

[Figure 16: histogram of queries (0-50) sent by the recursives, for BIND, Unbound, BIND (DDoS), and Unbound (DDoS), broken out by target: root, net, and cachetest.net.]

Figure 16: Histogram of queries from recursive resolvers.

In total, we send 100 queries per experiment and present the results in this section.

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the root DNS servers, the net authoritative servers, and the authoritative servers for cachetest.net.

Figure 16 shows the results of this experiment, with normal behavior in the two left groups and the inaccessible (DDoS) condition in the two right groups. From this figure we see that in normal operation only a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the root, one to net, and one to the cachetest.net authoritative servers) to resolve the AAAA record of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist (this difference can be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure: both BIND and Unbound send far more DNS queries to all available authoritative servers. BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps trying to resolve both the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 of the 46 queries in total). If we disregard these queries, Unbound increases its total number of queries from 6 to 16. (We repeated this experiment 100 times and all results were consistent.)

Another difference between BIND and Unbound is that when cachetest.net is unreachable, BIND asks both parents (the roots and net) for its records again (shown by the increase in the root and net portions of BIND (DDoS) vs. BIND in Figure 16). Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect: recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last-layer" recursives (the ones that ultimately send queries to the authoritatives: Rn1-Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We analyze data collected at both the client side (RIPE Atlas probes) and the server side (authoritative servers). We show a summary of the results in Table 7. In this table we mark the probing intervals T during which the emulated DDoS was active (dataset I: 90% packet drop on both authoritative servers; Table 4).

Client side: analyzing this table, we can learn the following from the client side.

During normal operations we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1's used in the answers). By default, each RIPE Atlas probe sends a DNS query to each of its R1's.

During the DDoS this changes slightly: we see 2 answers from 2 R1's; the third query either times out or results in SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval). This is in fact a good result, given the emulated DDoS.

Authoritative side: now let us consider the authoritative side.


     Client view (Atlas probe)   Authoritative server view
 T   Queries  Answers  #R1       Queries  Answers  #ATs  #Rn  #(Rn-AT)  Max (% total)  Queries (top-2 Rn's)
 1   3        3        3         3        3        2     2    3         1 (33.3%)      2, 1
 2   3        3        3         5        5        2     4    5         1 (20.0%)      2, 1
 3   3        3        3         6        6        2     6    6         1 (16.7%)      1, 1
 4   3        3        3         5        5        2     2    3         2 (40.0%)      4, 1
 5   3        3        3         3        3        2     3    3         1 (33.3%)      1, 1
 6   3        3        3         6        6        2     5    5         2 (33.3%)      2, 1
 7   3        2        2         29       3        2     2    4         10 (34.5%)     15, 14
 8   3        2        2         22       6        2     7    9         7 (31.8%)      10, 7
 9   3        2        2         21       1        2     3    6         9 (42.9%)      16, 3
10   3        2        2         11       4        2     3    4         6 (54.5%)      8, 2
11   3        2        2         21       3        2     4    6         15 (71.4%)     17, 2
12   3        2        2         23       1        2     8    8         15 (65.2%)     15, 2
13   3        3        3         3        3        2     3    3         1 (33.3%)      1, 1
14   3        3        3         4        2        2     4    4         1 (25.0%)      1, 1
15   3        3        3         8        8        2     7    7         7 (87.5%)      1, 1
16   3        3        3         9        9        2     3    3         6 (66.7%)      6, 3
17   3        3        3         9        9        2     5    5         5 (55.6%)      5, 1

Table 7: Client and authoritative views of AAAA queries (probe 28477, measurement I). #(Rn-AT) counts unique recursive-authoritative combinations; Max gives the largest number of queries over a single combination (share of total). Intervals T=7-12, where the client sees only 2 of 3 queries answered, span the emulated DDoS (90% packet drop on both authoritative servers; Table 4).

In normal operation (before the DDoS), we see that the 3 queries sent by the clients (client view) lead to 3 to 6 queries being sent by the Rn's and reaching the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn's used before the attack varies from 2 to 6 (#Rn). This choice depends on how the R1's choose their upper-layer recursives and, ultimately, how each of those chooses its Rn.

We can also observe how each Rn chooses an authoritative (Rn-AT): this metric shows the number of unique combinations of Rn and authoritative observed. For example, for T=1 we see that three queries reach the authoritatives but only 2 Rn's were used. The number of unique recursive-authoritative combinations, however, is three; as a consequence, one of the Rn's must have used both authoritatives.
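Computing these metrics from the authoritative-side traces amounts to set and counter operations over (recursive, authoritative) pairs; a small sketch, using a hypothetical parsed log with one pair per query (the T=1 row of Table 7 as the example):

    from collections import Counter

    # Hypothetical parsed log: (recursive_ip, authoritative_ip) per query.
    queries = [("rn1", "at1"), ("rn1", "at2"), ("rn2", "at1")]

    unique_rn = {rn for rn, _ in queries}              # 2 recursives used
    unique_pairs = set(queries)                        # 3 (Rn, AT) combinations
    per_pair = Counter(queries)                        # queries per combination
    max_share = max(per_pair.values()) / len(queries)  # 1/3 = 33.3%

    print(len(unique_rn), len(unique_pairs), f"{max_share:.1%}")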

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11-29 queries received at the authoritatives, up from a maximum of 6 before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) An increase in the number of Rn's used: the number of Rn's used increases somewhat (3.6 on average before the DDoS, 4.5 during). (However, not all Rn's are used all the time during the DDoS. We do not have enough data to say why; it surely depends on how the lower-layer recursives choose them, or on how they are set up.)

(2) An increase in the number of authoritatives used by each Rn: each Rn contacts an average of 1.13 authoritatives before the attack (the average of Uniq(Rn-AT)/Uniq(Rn)). During the DDoS, this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (the last column of Table 7) and see that, compared to the total number of queries, they comprise the vast majority.

For this probe, the explanation for so many queries is thus a combination of factors, with retries by the Rn's being the most important one. From the client side, there is no strong indication that the client's queries are what drives the increase in queries to the authoritatives during a DDoS.



Experiments F and H, shown in Figure 8b and Figure 8c, increase the loss rate to 75% and 90%. We see that the number of failures increases to about 19% with 75% loss and 40% with 90% loss. It is important to note that roughly 60% of the clients are still served even with 90% loss.

We also see that this level of success is consistent over the entire hour-long DDoS event, even though the cache duration is only 30 minutes. This consistency confirms the importance of caching and retries in combination.

To verify the effects of this interaction, Experiment I changes the cache duration to 60 s, less than one round of probing. Comparing Experiment I (Figure 8d) to H (Figure 8c), we see that the failure rate increases from 30% to about 63%. However, even with no caching, about 37% of queries are still answered, thanks to resolvers that serve stale content and to recursives' retries. We investigate retries in §6.
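A back-of-the-envelope model helps frame these numbers: if each query is dropped independently with probability p and a client's resolvers make k attempts in total, the chance of getting an answer is 1 - p^k. The sketch below is an illustrative model, not our measurement code, and k is a free parameter:

    def answered_fraction(loss, tries):
        """Probability that at least one of `tries` independent attempts
        survives a link dropping each query with probability `loss`."""
        return 1 - loss ** tries

    for loss in (0.5, 0.75, 0.9):
        # e.g., 4 attempts: one query plus a few retries across recursives
        print(f"loss={loss:.0%}: {answered_fraction(loss, 4):.0%} answered")
    # loss=50%: 94% answered; loss=75%: 68%; loss=90%: 34%

With four attempts, 90% loss still leaves about a third of queries answered, in the same ballpark as the roughly 37% we observe once serve-stale is also considered.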

5.5 Client Latency During Partial Authoritative Failure

We showed that client reliability is higher than expected during failures (§5.4), due to a combination of caching and retries. We next consider client latency. Latency will increase during the DDoS because of retries and queueing delay, but we will show that it increases less than one might expect, thanks to caching.

To examine latency, we return to Experiments D through I (Table 4), but look at latency (the time to complete a query) rather than success. For these experiments, clients time out after 5 s.

Figures 9a to 9d show latency during each emulated DDoS scenario (experiments whose figures are omitted here appear in Appendix D). Latencies are not evenly distributed, since some requests get through immediately while others must be retried one or more times, so in addition to the mean we show the 50th, 75th, and 90th percentiles to characterize the tail of the distribution.

We emulate DDoS by dropping requests (§5.1), so latencies reflect retries and loss but not queueing delay, underrepresenting latency in real-world attacks. However, their shape (mostly low latencies with a few long ones) is consistent with, and helps explain, what has been seen in the past [23].

Beginning with Experiment E, the moderate attack in Figure 9a, we see no change to median latency. This result is consistent with many queries being handled by the cache, and with half of those not handled by the cache getting through anyway. We do see higher latency in the 90%ile tail, reflecting successful retries. This tail also increases the mean somewhat.

This trend grows in Experiment F (Figure 9b), where 75% of queries are lost: now the 75%ile tail has increased, as has the number of unanswered queries, and the 90%ile is twice as long as in Experiment E.

We see similar latency in Experiment H, with the DDoS causing 90% loss. Since timeouts are set to 5 s, the larger attack results in more unsuccessful queries, but latency for successful queries is not much worse than with 75% loss. Median latency is still low due to cached replies.

Finally, Experiment I greatly reduces the opportunity for caching by shortening the cache lifetime to one minute. Figure 9d shows that the loss of caching increases median RTT and significantly increases tail latency.

[Figure 9: latency (ms, 0-4000) vs. minutes after start (0-160), showing median, mean, 75%ile, and 90%ile RTT. (a) Experiment E: 50% packet loss (1800 s TTL); (b) Experiment F: 75% packet loss (1800 s TTL); (c) Experiment H: 90% packet loss (1800 s TTL); (d) Experiment I: 90% packet loss (60 s TTL).]

Figure 9: Latency results. Shaded area indicates the interval of an ongoing DDoS attack.


Compared with Figure 9c (the same packet loss ratio, but a 1800 s TTL), we can clearly see the benefits of caching in terms of latency (in addition to reliability): a half-hour TTL reduced the latency from 1300 ms to 390 ms. Longer TTLs also help reduce tail latency relative to shorter TTLs (compare, for example, the 90%ile RTT of Experiments I and H in Figure 9).

Summary: DDoS events often increase client latency. For moderate attacks, increased latency is seen only by a few "unlucky" clients that do not see a full cache and whose queries are lost. Caching plays an important role in reducing latency during a DDoS, but while it can often mitigate most reliability problems, it cannot avoid latency penalties for all VPs. Even when caching is not available, roughly 40% of clients get an answer, either through serve-stale or through retries, as we investigate next.

6 THE AUTHORITATIVE'S PERSPECTIVE

Results of partial DDoS events (§5.4) show that DNS is surprisingly reliable: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d). These results are due to a combination of caching and retries. We next examine this behavior from the perspective of the authoritative server.

6.1 Recursive-Authoritative Traffic during a DDoS

We first ask: are retries by recursive resolvers responsible for the success rates observed in §5.4? To investigate this question, we return to the partial-DDoS experiments and look at how many queries are sent to the authoritative servers. We measure queries before they are dropped by our emulated DDoS. Recursives must make multiple queries to resolve a name, so we break out each type of query: for the nameserver (NS), for the nameserver's IPv4 and IPv6 addresses (A-for-NS and AAAA-for-NS), and finally the desired query (AAAA-for-PID). Note that the authoritative is IPv4-only, so AAAA-for-NS records are non-existent and subject to negative caching, while the other records exist and use regular caching.

We begin with the DDoS causing 75% loss (Figure 10a). For this experiment we observe 18,407 unique IP addresses of recursives (Rn) querying for AAAA records directly at our authoritatives. During the DDoS, queries increase by about 3.5×. We expect about 4 tries per query, since the expected number of tries until success with loss rate p is (1 - p)^-1. For this scenario, results are cached for up to 30 minutes, so successful queries are reused from recursive caches. This increase occurs both for the target AAAA record and for the non-existent AAAA-for-NS records. Negative caching for our zone is configured at 60 s, making caching of NXDOMAINs for AAAA-for-NS less effective than positive caching.
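The expected-tries figure follows from the geometric distribution; a quick check of the loss rates emulated in our experiments (simple arithmetic, not measurement data):

    # Expected number of independent tries until one succeeds, E = 1/(1-p).
    for p in (0.5, 0.75, 0.9):
        print(f"loss={p:.0%}: expected tries = {1 / (1 - p):.0f}")
    # loss=50%: 2; loss=75%: 4; loss=90%: 10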

The offered load on the server increases further with more loss (90%), as shown in Experiment H (Figure 10b): the higher loss rate results in a much higher offered load on the server, on average 8.2× normal.

Finally, in Figure 10c, we reduce the effects of caching with a 90% DDoS and a TTL of 60 s. Here we see about 8.1× more queries at the server than before the attack. Comparing this case to Experiment H, caching reduces the offered load on the server by about 40%.

[Figure 10: queries (0-200000) vs. minutes after start (0-170), broken out as NS, A-for-NS, AAAA-for-NS, and AAAA-for-PID, with the packet-loss period on both NSes marked between normal periods. (a) Experiment F: 1800-75p-10min, 75% packet loss; (b) Experiment H: 1800-90p-10min, 90% packet loss; (c) Experiment I: 60-90p-10min, 90% packet loss.]

Figure 10: Number of queries received by the authoritative servers. Shaded area indicates the interval of an ongoing DDoS attack.

Implications: the implication of this analysis is that legitimate clients "hammer" the already-stressed server with retries during a DDoS. For clients, retries are important for reliability, and each client independently chooses to retry.

The server is already under stress due to the DDoS, so these retries add to that stress. However, the DDoS traffic is almost certainly much larger than the retries of legitimate traffic. (A server experiencing a volumetric attack causing 90% loss must be receiving 10× its capacity. Regular traffic is a small fraction of normal capacity, so even 4× regular traffic is still much less than the attack traffic.) The multiplier for retried legitimate traffic depends on the stub and recursive resolver implementations, as well as on application-level retries and defection (users hitting reload in their browsers and later giving up). Our experiment omits application-level retries and thus likely gives a lower bound.


We next examine specific recursive implementations to see their behavior.

6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers "hammer" authoritatives when queries are dropped. We reexamine DNS software (updating prior work from 2012 [56]) and additionally show that deployments amplify retries.

Recursive software: prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], evaluating BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS, and PowerDNS. We study retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records of all authoritatives and of the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7× more requests before the recursives time out. (Exact numbers vary across runs, but typically each request is made 6 or 7 times.) Such retries are appropriate provided they are paced (both implementations use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.

Recursive deployment: another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify this amplification, we count both the number of Rn recursives and the number of AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values capture the amplification in two ways: during stress, more Rn recursives are used for each probe ID, and these Rn's generate more queries to the already-stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4); the maximum rises to 39. The number of queries for each probe ID grows more than 3×, from 2 to 7. Worse, the 90%ile grows more than 6× (from 3 queries to 18). The maximum grows 53.5×, reaching up to 286 queries for one single probe ID. This value, however, is a lower bound, given that a large number of A and AAAA queries ask for NS records rather than for the probe ID (A-for-NS and AAAA-for-NS in Figure 10).
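To build intuition for how layering multiplies retries, consider a toy model (ours, not derived from the measurements): each client query fans out to a set of last-layer recursives, and each recursive independently retries until an answer gets through or its retry budget is exhausted:

    import random

    def queries_at_authoritatives(n_rn, retry_budget, loss):
        """Toy model: each of n_rn last-layer recursives forwards one client
        query, retrying (up to retry_budget tries) until a try survives the
        emulated loss. Returns total queries arriving at the authoritatives."""
        total = 0
        for _ in range(n_rn):
            for _ in range(retry_budget):
                total += 1
                if random.random() > loss:  # this try got through
                    break
        return total

    random.seed(1)
    print(queries_at_authoritatives(2, 7, 0.0))  # no loss: 2 queries
    print(queries_at_authoritatives(2, 7, 0.9))  # 90% loss: typically 10+

More recursives per probe and larger retry budgets multiply together, which is why a single client query can appear dozens of times at a stressed authoritative.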

We can also look at the aggregate effects of retries created by this complex recursive infrastructure. Figure 12 shows a timeseries of the unique Rn IP addresses observed at the authoritatives. Before the DDoS period of Experiment I (TTL of 60 s), we see a constant number of recursives reaching our authoritatives: with no caching at this TTL value, all queries must be answered by the authoritatives. For Experiments F and H, both with TTLs of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks occur when caches expire, as expected.

During the DDoS we observe similar behavior in all three experiments in Figure 12: as packets are dropped at the authoritatives (at rates of 75%, 90%, and 90% for F, H, and I, respectively), the number of Rn recursives querying our authoritatives increases.

[Figure 11: log-scale counts (1-1000) vs. minutes after start (0-180): Rn-per-PID and AAAA-for-PID, each shown as median, 90%ile, and max.]

Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs.

[Figure 12: unique Rn addresses reaching the authoritatives (0-10000) vs. minutes after start (0-180), for Experiments F, H, and I.]

Figure 12: Unique Rn recursive addresses observed at authoritatives.

For Experiments F and H we still see drops when caching would be expected to take over, but not for Experiment I. The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which amplifies traffic in the end. (We show this behavior for an individual probe in Appendix F, where we observe growth in both the number of queries received at the authoritatives and the number of recursives used.)

Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to stay within a constant factor of that of a single recursive, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and their operators to share information about their behavior.

7 RELATED WORK

Caching by recursives: several groups have shown that DNS caching can be imperfect. Hao and Wang analyzed the impact of nonce domains on DNS recursives' caches [13]; using two weeks of data from two universities, they showed that filtering one-time domains improves cache hit rates. In two studies, Pang et al. [31, 32] reported that web clients and local recursives do not always honor the TTL values provided by authoritatives. Almeida et al. [2] analyzed DNS traces of a mobile operator and used a mobile application to see TTLs in practice; they find that most domains have short TTLs (less than 60 s) and report evidence of TTL manipulation by recursives.


Schomp et al. [48] demonstrate widespread use of multi-level recursives by large operators, as well as TTL manipulation. Our work builds on this prior work, examining caching and TTL manipulation systematically and considering their effects on resilience.

DNS client behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, and found that recursives tend to prefer authoritatives with shorter latency but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives during DDoS: we previously investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast with both latency and reachability, as seen from RIPE Atlas. Rather than looking at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while the prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates; the prior study has incomplete data and focuses on the specific results of two events. These differences stem from their study of natural experiments from real-world events versus our controlled experiments.

8 IMPLICATIONS

We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get through to authoritative servers. The key implication of our study is to explain the differences in the outcomes of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative "server" implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: "There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability..." [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure and immediately took steps to address the service problems, there were reports of user-visible service disruption in the technical and even the popular press [34]. Reports describe intermittent failures of prominent websites including "Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times", each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cacheable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports 922 anycast instances operating across the 13 authoritative servers. Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 "facilities") and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components whose TTLs range from 120 to 300 s.

In addition to explaining the effects, our experiments help get at the root causes behind these outcomes. Users of the root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs like com and country codes) are popular and almost certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) receive performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to let caching help and to give DNS operators time to deploy defenses; Experiment H suggests 30 minutes (Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow re-provisioning during DDoS attacks on web servers, but they leave the DNS servers vulnerable. This tension suggests that traffic scrubbing via routing changes, combined with long DNS TTLs, may be preferable to short DNS TTLs, so that both layers can be robust. However, the complexity of the interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize the geographic dispersion of its sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with the largest latency [27]. Prior work on resilience to DDoS has shown that individual IP anycast sites suffer under DDoS as a function of the attack traffic each site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be that of the strongest individual authoritative. The reason for these opposite results is that, in both cases, recursive resolvers try all authoritatives of a given service: for latency, they will sometimes choose a distant authoritative, but for resilience, they will continue until they find the most available authoritative.

9 CONCLUSIONS

This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole.


In fact, they can largely cover over partial DDoS attacks for many users: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d).

The primary cost of a DDoS for users can be greater latency, but even this penalty is uneven across users, with a few seeing much greater latency while others see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruption, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of this work is to quantify the importance of caching and retries at recursives for resilience, encouraging the use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on paper drafts.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES

[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012 Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from Using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurement (PAM). 188–200.
[9] DNS-OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181 Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary Of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis/?include_text=1
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC BIND. 2018. Chapter 6. BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report. Github Engineering. https://githubengineering.com/ddos-incident-report/
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency-02. Internet Draft. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain names - concepts and facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 2065, 2181, 2308, 2535, 4033, 4034, 4035, 4343, 4592, 5936, 8020.
[21] P.V. Mockapetris. 1987. Domain names - implementation and specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604, 7766.
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS" (May 2018). Web page. https://ant.isi.edu/datasets/dns/Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366
[28] NLnet Labs. 2018. NLnet Labs Documentation - Unbound - unbound.conf(5). https://nlnetlabs.nl/documentation/unbound/unbound.conf
[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide/
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An overview of DNS-based server selections in content distribution networks. Computer Networks 43, 6 (2003), 695–711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows/
[36] Alec Peterson. 2017. EC2 resolver changing TTL on DNS answers? Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy in a Few Easy Steps. https://quad9.net
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671; TTL1800: 10507676; TTL3600: 10536725; TTL86400: 10579327; TTL3600-10min: 10581463; A: 10859822; B: 11102436; C: 11221270; D: 11804500; E: 11831403; F: 11831403; G: 12131707; H: 12177478; I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas - Raw data structure documentation. https://atlas.ripe.net/docs/data_struct/
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters – An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. Internet Request For Comments. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On Measuring the Client-side DNS Infrastructure. In Proceedings of the 2013 ACM Conference on Internet Measurement. ACM, 77–90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl stats and data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A study of caching behavior with respect to root server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374/
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] M. Weinberg and D. Wessels. 2016. Review and analysis of attack traffic against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24, Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A High-Performance Network Traffic Data Streaming Warehouse. In Network Operations and Management Symposium (NOMS), 2016 IEEE/IFIP. IEEE, 913–918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80–86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHEREFERRALS OR ANSWERS

Cache estimations in sect3 and sect4 assumes each record has a clearTTL However glue records bridge zones that are in different au-thoritative servers and can result in two different TTL values for arecord This section examines which TTL takes precedence

A.1 The Problem of Glue Records

DNS employs glue records to identify name servers that are in the served zone itself. For example, the Canadian national domain .ca is served from ca-servers.ca, but to look up ca-servers.ca one must first know where .ca is. To break this cycle, records that identify these servers ("glue records") are placed in the parent zone (the root, in this case). With these records in two places, they may have different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of .ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net
;; AUTHORITY SECTION:
ca.   172800  IN  NS  any.ca-servers.ca.
ca.   172800  IN  NS  j.ca-servers.ca.
ca.   172800  IN  NS  x.ca-servers.ca.
ca.   172800  IN  NS  c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net.

Notice that the NS records appear in the "AUTHORITY SECTION". This signals that the answer is a "referral": a type of answer "in which a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query" [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for .ca, a recursive can send the same query directly to any of the .ca servers. Listing 2 shows the result, an authoritative answer:

$ dig ns ca @any.ca-servers.ca
;; ANSWER SECTION:
ca.   86400  IN  NS  c.ca-servers.ca.
ca.   86400  IN  NS  j.ca-servers.ca.
ca.   86400  IN  NS  x.ca-servers.ca.
ca.   86400  IN  NS  any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca.

Notice that the records in this response from any.ca-servers.ca appear in the "ANSWER SECTION" (and not in the "AUTHORITY SECTION"), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTLs differ: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the .ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).
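The same comparison can be scripted. The sketch below is ours, not part of the measurement toolchain; it assumes the dnspython package, and the two server addresses are illustrative placeholders for b.root-servers.net and any.ca-servers.ca. It queries both servers for the ca NS records and reports the AA bit and the TTLs seen:

import dns.flags
import dns.message
import dns.query
import dns.rdatatype

def ns_report(server_ip, qname="ca."):
    """Ask server_ip for the NS records of qname; return the AA bit and TTLs."""
    query = dns.message.make_query(qname, dns.rdatatype.NS)
    resp = dns.query.udp(query, server_ip, timeout=5)
    aa = bool(resp.flags & dns.flags.AA)
    # Referrals carry the NS records in the authority section;
    # authoritative answers carry them in the answer section.
    section = resp.answer if aa else resp.authority
    ttls = {rrset.ttl for rrset in section if rrset.rdtype == dns.rdatatype.NS}
    return aa, ttls

# Placeholder addresses: substitute the current addresses of
# b.root-servers.net and any.ca-servers.ca.
print(ns_report("199.9.14.201"))   # expected: (False, {172800}) -- glue/referral TTL
print(ns_report("185.159.196.2"))  # expected: (True, {86400}) -- authoritative TTL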

We refer to this as TTL duplication: the same record is defined both at the zone's authoritative servers and, as a glue record, at its parent.

This duplication poses a challenge for recursive resolvers: which value should they trust, the one from the glue or the one from the authoritative name server?

In theory, the answer from the authoritative should be the one used, since those are the servers that are authoritative for the zone; the Roots in this example are not "completely" authoritative [15]. This situation arises in various zones as well, not only at the Roots.

Assuming that the records are the same in both places, as for .ca, we investigate in this section which values recursives in the wild actually use: those provided by the parent or those provided by the authoritative. According to RFC 2181 [10] (§5.4.1), a recursive should use the authoritative answer. Given the importance of TTLs to DNS resilience (they determine cache durations), we verify whether recursives follow this rule: do they trust the parent or the authoritative name server?


Table 5: Distribution of TTLs of DNS answers for cachetest.nl.

                  NS record   A record   Source
Total answers        128382      98038
TTL > 3600               38         46   Unclear
TTL = 3600              260        233   Parent
60 < TTL < 3600        6890       4621   Parent/others
TTL = 60              60803      46820   Authoritative
TTL < 60              60391      46318   Authoritative

A records of amazon.com:
domain        IP               TTL
amazon.com    205.251.242.103  60
amazon.com    176.32.103.205   60
amazon.com    176.32.98.166    60

NS records of amazon.com:
domain        NS                    TTL-auth  TTL-com
amazon.com    ns1.p31.dynect.net    3600      172800
amazon.com    pdns6.ultradns.co.uk  3600      172800
amazon.com    ns2.p31.dynect.net    3600      172800
amazon.com    ns4.p31.dynect.net    3600      172800
amazon.com    pdns1.ultradns.net    3600      172800
amazon.com    ns3.p31.dynect.net    3600      172800

A records of the NS records:
domain                IP             TTL
ns1.p31.dynect.net    208.78.70.31   86400
pdns6.ultradns.co.uk  204.74.115.1   3600
ns2.p31.dynect.net    204.13.250.31  86400
ns4.p31.dynect.net    204.13.251.31  86400
pdns1.ultradns.net    204.74.108.1   3600
ns3.p31.dynect.net    208.78.71.31   86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments

We use the same setup as in §3 and configure the TTL values of our name servers (ns[1,2].cachetest.nl) as follows:

• Referral value (at the .nl zone): TTL of 3600 s
• Answer value (at the authoritative servers, ns[1,2].cachetest.nl): TTL of 60 s

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of TTL values in the answers obtained.
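For reference, a rough sketch of this tally (assuming the requests and dnspython packages; the RIPE Atlas results API is the public one, and error handling is minimal) fetches the raw results of one measurement and builds the TTL histogram summarized in Table 5:

import base64
from collections import Counter

import dns.message
import requests

def ttl_histogram(measurement_id):
    """Tally answer-section TTLs across all results of an Atlas DNS measurement."""
    url = f"https://atlas.ripe.net/api/v2/measurements/{measurement_id}/results/"
    hist = Counter()
    for result in requests.get(url, timeout=60).json():
        # Measurements that use the probes' local resolvers return one entry
        # per resolver in "resultset"; others carry "result" directly.
        for entry in result.get("resultset", [result]):
            abuf = entry.get("result", {}).get("abuf")
            if not abuf:
                continue  # timeouts and errors have no answer buffer
            msg = dns.message.from_wire(base64.b64decode(abuf))
            for rrset in msg.answer:
                hist[rrset.ttl] += 1
    return hist

# e.g., for the NS measurement described above: ttl_histogram(15642129)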

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, not the one provided by the parent .nl server. We can therefore conclude that, at least for these two experiments with A and NS records, most recursives forward to their clients the TTL provided by the authoritatives of a zone.

A.3 Looking At a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritative, not the one from referrals.

In this section we look into the caching of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the .ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives and 172800 s at the parent .com servers, as can be seen in Table 6.

To determine which specific record two popular recursive resolver implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send only one query: dig ns amazon.com. We then dump the contents of the caches of these servers and see which values they store.
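The cold-cache test is easy to repeat. A small driver (a sketch under assumptions: a local BIND managed via rndc and a local Unbound managed via unbound-control, one tested at a time, each listening on 127.0.0.1) could look like:

import subprocess

# One query through the local recursive, filling the cold cache.
subprocess.run(["dig", "ns", "amazon.com", "@127.0.0.1"], check=True)

# BIND: rndc writes the cache to its dump file (named_dump.db by default).
subprocess.run(["rndc", "dumpdb", "-cache"], check=True)

# Unbound: dump_cache prints the cache to stdout; save it for inspection.
with open("unbound_cache.txt", "w") as out:
    subprocess.run(["unbound-control", "dump_cache"], check=True, stdout=out)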

Listing 3 and Listing 4 show the contents of the caches of BIND and Unbound, respectively, after sending the aforementioned query. As can be seen, the values stored in the cache are the ones provided by the authoritative server (decremented by a few seconds by the time we dumped the contents), and not the referral values provided by the parent:

; authanswer
amazon.com.  3595  NS  ns1.p31.dynect.net.
             3595  NS  ns2.p31.dynect.net.
             3595  NS  ns3.p31.dynect.net.
             3595  NS  ns4.p31.dynect.net.
             3595  NS  pdns1.ultradns.net.
             3595  NS  pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com.

;rrset 3594 6 0 7 3
amazon.com.  3594  IN  NS  pdns1.ultradns.net.
amazon.com.  3594  IN  NS  ns1.p31.dynect.net.
amazon.com.  3594  IN  NS  ns3.p31.dynect.net.
amazon.com.  3594  IN  NS  pdns6.ultradns.co.uk.
amazon.com.  3594  IN  NS  ns4.p31.dynect.net.
amazon.com.  3594  IN  NS  ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com.

A.4 Discussion and Implications

Understanding how glue records affect TTLs is important: it shows that zone operators are in fact in charge of how long their NS records remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers—queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of TTL 60 s (where all queries go to the authoritative, AA), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72       Alternate DNS
23.253.163.53        Alternate DNS
205.204.88.60        BlockAid Public DNS (or PeerDNS)
178.21.23.150        BlockAid Public DNS (or PeerDNS)
91.239.100.100       Censurfridns
89.233.43.71         Censurfridns


[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time, in minutes after start, for (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s (1 hour), probing interval T=10 min.]

2001:67c:28a4::                     Censurfridns
2002:d596:2a92:1:71:53::            Censurfridns
213.73.91.35                        Chaos Computer Club Berlin
209.59.210.167                      Christoph Hochstätter
85.214.117.11                       Christoph Hochstätter
212.82.225.7                        ClaraNet
212.82.226.212                      ClaraNet
8.26.56.26                          Comodo Secure DNS
8.20.247.20                         Comodo Secure DNS
84.200.69.80                        DNS.Watch
84.200.70.40                        DNS.Watch
2001:1608:10:25::1c04:b12f          DNS.Watch
2001:1608:10:25::9249:d69b          DNS.Watch
104.236.210.29                      DNSReactor
45.55.155.25                        DNSReactor
216.146.35.35                       Dyn
216.146.36.36                       Dyn
80.67.169.12                        FDN
2001:910:800::12                    FDN
85.214.73.63                        FoeBud
87.118.111.215                      FoolDNS
213.187.11.62                       FoolDNS
37.235.1.174                        FreeDNS
37.235.1.177                        FreeDNS
80.80.80.80                         Freenom World
80.80.81.81                         Freenom World
87.118.100.175                      German Privacy Foundation e.V.
94.75.228.29                        German Privacy Foundation e.V.
85.25.251.254                       German Privacy Foundation e.V.
62.141.58.13                        German Privacy Foundation e.V.
8.8.8.8                             Google Public DNS
8.8.4.4                             Google Public DNS
2001:4860:4860::8888                Google Public DNS
2001:4860:4860::8844                Google Public DNS
81.218.119.11                       GreenTeamDNS
209.88.198.133                      GreenTeamDNS
74.82.42.42                         Hurricane Electric
2001:470:20::2                      Hurricane Electric
209.244.0.3                         Level3
209.244.0.4                         Level3
156.154.70.1                        Neustar DNS Advantage
156.154.71.1                        Neustar DNS Advantage
5.45.96.220                         New Nations
185.82.22.133                       New Nations
198.153.192.1                       Norton DNS
198.153.194.1                       Norton DNS
208.67.222.222                      OpenDNS
208.67.220.220                      OpenDNS
2620:0:ccc::2                       OpenDNS
2620:0:ccd::2                       OpenDNS
58.6.115.42                         OpenNIC
58.6.115.43                         OpenNIC
119.31.230.42                       OpenNIC
200.252.98.162                      OpenNIC
217.79.186.148                      OpenNIC
81.89.98.6                          OpenNIC
78.159.101.37                       OpenNIC
203.167.220.153                     OpenNIC
82.229.244.191                      OpenNIC
216.87.84.211                       OpenNIC
66.244.95.20                        OpenNIC
207.192.69.155                      OpenNIC
72.14.189.120                       OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9  OpenNIC
2001:470:1f07:38b::1                OpenNIC
2001:470:1f10:c6::2001              OpenNIC
194.145.226.26                      PowerNS


77.220.232.44           PowerNS
9.9.9.9                 Quad9
2620:fe::fe             Quad9
195.46.39.39            SafeDNS
195.46.39.40            SafeDNS
193.58.251.251          SkyDNS
208.76.50.50            SmartViper Public DNS
208.76.51.51            SmartViper Public DNS
78.46.89.147            ValiDOM
88.198.75.145           ValiDOM
64.6.64.6               Verisign
64.6.65.6               Verisign
2620:74:1b::1:1         Verisign
2620:74:1c::2:2         Verisign
77.109.148.136          Xiala.net
77.109.148.137          Xiala.net
2001:1620:2078::136     Xiala.net
2001:1620:2078::137     Xiala.net
77.88.8.88              Yandex.DNS
77.88.8.2               Yandex.DNS
2a02:6b8::feed:bad      Yandex.DNS
2a02:6b8:0:1::feed:bad  Yandex.DNS
109.69.8.51             puntCAT

[Figure 14: Answers received during DDoS attacks (extra graphs): (a) Experiment D, 1800-50p-10min-ns1; (b) Experiment G, 300-75p-10min-2NSes. Answers per minute after start, broken out as OK, SERVFAIL, and no answer; arrows indicate DDoS start and recovery.]

[Figure 15: Latency results (extra graphs): (a) Experiment D, 50% packet loss on 1 NS only (1800 s TTL); (b) Experiment G, 75% packet loss (300 s TTL). Median, mean, 75%ile, and 90%ile RTT (ms) by minutes after start; the shaded area indicates the interval of an ongoing DDoS attack.]

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for Experiments D and G.

Figure 14 shows the number of answers for Experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when one NS fails (50% packet loss), there is no significant change in the number of answered queries. Moreover, most users will not notice it in terms of latency either (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (thus half of the probing interval). Similarly to Experiment I, this experiment should not have active caching, given that the TTL is shorter than the probing interval. Moreover, it has a packet loss rate of 75%. Even at this rate, we can see that the vast majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur with current DNS software. These experiments confirm that the prior work of Yu et al. [56] still holds. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) first up and then inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from .net and querying one or both based on their internal algorithms. All queries are made with a cold cache. In total, we send 100 queries per experiment and present the results in this section.

[Figure 16: Histogram of queries from recursive resolvers: BIND and Unbound, under normal conditions and during DDoS, broken out by target zone (root, .net, and cachetest.net).]

[Figure 17: Probe 28477 in dataset I: the stub resolver (Atlas probe) sends queries to local recursives R1a–R1c; all R1's are connected to all last-layer recursives Rn1–Rn8, which in turn are connected to both authoritative servers AT1 and AT2.]

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the .net authoritative servers, and the authoritative servers for cachetest.net.
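A sketch of this classification (assuming the scapy package and a packet capture of the recursive's outbound traffic; the address sets and the capture filename are hypothetical placeholders for the real servers) could look like:

from collections import Counter

from scapy.all import DNS, IP, rdpcap

# Placeholder address sets; fill in the real root, .net, and cachetest.net
# server addresses for the capture at hand.
ROOTS = {"199.9.14.201"}                    # ...all root-server addresses
NET_SERVERS = {"192.5.6.30"}                # a.gtld-servers.net, etc.
OUR_AUTHS = {"203.0.113.1", "203.0.113.2"}  # ns[1,2].cachetest.net (examples)

def bucket(dst_ip):
    """Map a destination address to the zone whose servers it belongs to."""
    if dst_ip in ROOTS:
        return "root"
    if dst_ip in NET_SERVERS:
        return "net"
    if dst_ip in OUR_AUTHS:
        return "cachetest.net"
    return "other"

counts = Counter()
for pkt in rdpcap("recursive-out.pcap"):
    # Count outgoing queries only (qr == 0 marks a DNS query).
    if pkt.haslayer(DNS) and pkt[DNS].qr == 0 and pkt.haslayer(IP):
        counts[bucket(pkt[IP].dst)] += 1
print(counts)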

Figure 16 shows the results of this experiment, with normal behavior in the two groups on the left and the inaccessible (DDoS) case in the two on the right. From this figure we see that, in normal operation, a few queries suffice to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the .net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the Root, one to .net, and one to the cachetest.net authoritative servers) to resolve the AAAA records of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist (this difference can be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure. Both BIND and Unbound send far more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps trying to resolve both the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 of the 46 queries in total). If we disregard these queries, Unbound still increases its total number of queries from 6 to 16, a 2.6× increase. (We repeated this experiment 100 times and all results were consistent.)

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND will ask both parents (the Roots and .net) for its records again (shown by the increase in the root and .net portions of BIND DDoS vs. BIND non-DDoS in Figure 16). Unbound does not retry these queries.

This experiment is consistent with our observation that we see about 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect—recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last layer" recursives (the ones that ultimately send queries to the authoritatives; Rn1–Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (RIPE Atlas probe) and the server side (authoritative servers). We summarize the results in Table 7, in which we highlight the probing intervals T during which the simulated DDoS was active (dataset I: 90% packet drop on both authoritative servers; Table 4).

Client side: From the client side of this table we can learn the following. During normal operations we see 3/3/3: 3 queries, 3 answers, and 3 R1s used in the answers. By default, each RIPE Atlas probe sends a DNS query to each of its R1s.

During the DDoS this changes slightly: we see 2 answers from 2 R1s, while the third query either times out or results in SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval).
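A back-of-the-envelope check makes this plausible. Assuming independent loss per try (our simplification; the per-try success probability 1 − p matches the expected-tries formula (1 − p)⁻¹ used in §6.1), a handful of retries per recursive pushes the per-query success rate well above one half even at p = 0.9:

loss = 0.9  # simulated drop rate at both authoritatives
print("expected tries until success:", 1 / (1 - loss))  # (1 - p)^-1 = 10

# Probability that at least one of n tries gets through: 1 - p^n.
for tries in (1, 4, 10, 20):
    print(f"{tries:2d} tries -> P(answer) = {1 - loss ** tries:.2f}")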

This is, in fact, a good result given the simulated DDoS.

Authoritative side: Now let us consider the authoritative side.


Table 7: Client and authoritative views of AAAA queries (probe 28477, measurement I). Rows T=7–12 are the probing intervals in which the simulated DDoS was active.

     Client view (Atlas probe)   Authoritative server view
 T   Queries  Answers  #R1       Queries  Answers  #ATs  #Rn  #(Rn-AT)  Max per (Rn-AT) (% of total)  Queries (top-2 Rn's)
 1      3        3      3           3        3      2     2      3        1 (33.3%)                     2/1
 2      3        3      3           5        5      2     4      5        1 (20.0%)                     2/1
 3      3        3      3           6        6      2     6      6        1 (16.7%)                     1/1
 4      3        3      3           5        5      2     2      3        2 (40.0%)                     4/1
 5      3        3      3           3        3      2     3      3        1 (33.3%)                     1/1
 6      3        3      3           6        6      2     5      5        2 (33.3%)                     2/1
 7      3        2      2          29        3      2     2      4       10 (34.5%)                    15/14
 8      3        2      2          22        6      2     7      9        7 (31.8%)                    10/7
 9      3        2      2          21        1      2     3      6        9 (42.9%)                    16/3
10      3        2      2          11        4      2     3      4        6 (54.5%)                     8/2
11      3        2      2          21        3      2     4      6       15 (71.4%)                    17/2
12      3        2      2          23        1      2     8      8       15 (65.2%)                    15/2
13      3        3      3           3        3      2     3      3        1 (33.3%)                     1/1
14      3        3      3           4        2      2     4      4        1 (25.0%)                     1/1
15      3        3      3           8        8      2     7      7        7 (87.5%)                     1/1
16      3        3      3           9        9      2     3      3        6 (66.7%)                     6/3
17      3        3      3           9        9      2     5      5        5 (55.6%)                     5/1

In normal operation (before the DDoS), we see that the 3 queries sent by the client (client view) lead to 3 to 6 queries from the Rn reaching the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn used before the attack varies between 2 and 6 (#Rn). This choice depends on how the R1's choose their upper-layer recursives (Rn−1), and ultimately on how the Rn−1 choose the Rn.

We can also observe how each Rn chooses each authoritative (#(Rn-AT)): this metric shows the number of unique combinations of Rn and authoritatives observed. For example, for T=1 we see that three queries reach the authoritatives but only 2 Rn were used; since there are three unique recursive-AT combinations, one of the Rn must have used both authoritatives.
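The bookkeeping behind these columns is simple. The sketch below (our own helper, not part of the paper's toolchain) computes the Table 7 metrics for one probing interval from the list of (recursive, authoritative) pairs seen at the authoritatives:

from collections import Counter

def interval_stats(pairs):
    """pairs: one (rn, at) tuple per query observed in the interval."""
    per_pair = Counter(pairs)                # queries per unique (Rn, AT) pair
    per_rn = Counter(rn for rn, _ in pairs)  # queries per recursive
    return {
        "queries": len(pairs),
        "n_rn": len(per_rn),                               # distinct Rn used
        "n_at": len({at for _, at in pairs}),              # distinct authoritatives
        "n_rn_at": len(per_pair),                          # unique Rn-AT combinations
        "max_pair": max(per_pair.values()),                # heaviest Rn-AT pair
        "top2_rn": [n for _, n in per_rn.most_common(2)],  # retries concentrate here
    }

# T=1 in Table 7: three queries, two recursives, three unique Rn-AT pairs,
# so one recursive must have used both authoritatives.
print(interval_stats([("rn1", "at1"), ("rn1", "at2"), ("rn2", "at1")]))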

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11–29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) An increase in the number of Rn used: the number of Rn used increases somewhat (from 3.6 on average before the attack to 4.5 during it). However, not all Rn are used all the time during the DDoS; we do not have enough data to explain why, but it is surely a function of how the lower-layer recursives choose them or of how they are set up.

(2) An increase in the number of authoritatives used by each Rn: each Rn contacts an average of 1.13 authoritatives before the attack (the average of Uniq(Rn-AT)/Uniq(Rn)). During the DDoS this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (the last column of Table 7) and see that they comprise the vast majority of the total number of queries.

For this probe, the large number of queries is thus explained by a combination of factors, with retries by the Rn being the most important one. From the client side there is no strong indication that the client's own queries are driving this increase in the number of queries to the authoritatives during a DDoS.



[37] Quad9 2018 Quad9 | Internet Security amp Privacy In a Few Easy Steps httpsquad9net

[38] RIPE NCC 2017 RIPE Atlas Measurement IDS httpsatlasripenetmeasurementsID ID is the experiment ID TTL60 10443671 TTL1800 10507676TTL3600 10536725 TTL86400 10579327 TTL3600-10min 10581463 A10859822B 11102436 C 11221270 D11804500 E 11831403 F 11831403 G 12131707H12177478 I 12209843

[39] RIPE NCC Staff 2015 RIPE Atlas A Global Internet Measurement NetworkInternet Protocol Journal (IPJ) 18 3 (Sep 2015) 2ndash26

[40] RIPE Network Coordination Centre 2018 RIPE Atlas - Raw data structuredocumentationshttpsatlasripenetdocsdata_struct

[41] Root Server Operators 2015 Events of 2015-11-30 httproot-serversorgnewsevents-of-20151130txt

[42] Root Server Operators 2016 Events of 2016-06-25 Technical Report Root ServerOperators httpwwwroot-serversorgnewsevents-of-20160625txt

[43] Root Server Operators 2017 Root DNS httproot-serversorg[44] Joseacute Jair Santanna Roland van Rijswijk-Deij Rick Hofstede Anna Sperotto Mark

Wierbosch Lisandro Zambenedetti Granville and Aiko Pras 2015 BootersmdashAn Analysis of DDoS-as-a-Service Attacks In Proceedings of the 14th IFIPIEEEInteratinoal Symposium on Integrated NetworkManagement IFIP Ottowa Canada

[45] D Schinazi and T Pauly 2017 Happy Eyeballs Version 2Better Connectivity UsingConcurrency RFC 8305 Internet Request For Comments httpsdoiorg1017487RFC8305

[46] Ricardo de O Schmidt John Heidemann and Jan Harm Kuipers 2017 AnycastLatency How Many Sites Are Enough In Proceedings of the Passive and ActiveMeasurement Workshop Springer Sydney Australia 188ndash200 httpwwwisiedu7ejohnhPAPERSSchmidt17ahtml

[47] Bruce Schneier 2016 Lessons From the Dyn DDoS Attack blog httpswwwschneiercomessaysarchives201611lessons_from_the_dynhtml

[48] Kyle Schomp Tom Callahan Michael Rabinovich and Mark Allman 2013 Onmeasuring the client-side DNS infrastructure In Proceedings of the 2015 ACMConference on Internet Measurement Conference ACM 77ndash90

[49] Somini Sengupta 2012 After Threats No Signs of Attack by Hackers New YorkTimes (Apr 1 2012) A1 httpwwwnytimescom20120401technologyno-signs-of-attack-on-internethtml

[50] SIDN Labs 2017 nl stats and data httpstatssidnlabsnl[51] Matthew Thomas and Duane Wessels 2015 A study of caching behavior with

respect to root server TTLs DNS-OARC httpsindicodns-oarcnetevent24contributions374

[52] Unbound 2018 Unbound Documentation httpswwwunboundnetdocumentationunboundconfhtml

[53] Lan Wei and John Heidemann 2017 Does Anycast Hang up on You In IEEEInternational Workshop on Traffic Monitoring and Analysis (Dublin Ireland

[54] Weinberg M Wessels D 2016 Review and analysis of attack traffic against A-root and J-root onNovember 30 andDecember 1 2015 In DNSOARC 24 ndash BuenosAires Argentina httpsindicodns-oarcnetevent22session4contribution7

[55] Maarten Wullink Giovane CM Moura Moritz Muumlller and Cristian Hesselman2016 ENTRADA A high-performance network traffic data streaming warehouseIn Network Operations and Management Symposium (NOMS) 2016 IEEEIFIP IEEE913ndash918

[56] Yingdi Yu Duane Wessels Matt Larson and Lixia Zhang 2012 Authority ServerSelection in DNS Caching Resolvers SIGCOMM Comput Commun Rev 42 2(March 2012) 80ndash86 httpsdoiorg10114521853762185387

A WHICH TTL AFFECTS THE CACHEREFERRALS OR ANSWERS

Cache estimations in sect3 and sect4 assumes each record has a clearTTL However glue records bridge zones that are in different au-thoritative servers and can result in two different TTL values for arecord This section examines which TTL takes precedence

A1 The Problem of Glue RecordsDNS employs glue records to identify name servers that are in theserved zone itself For example the Canadian national domain cais served from ca-serversca but to look up ca-serversca one mustknow where ca is To break this cycle records to identify thesesevers (ldquoglue recordsrdquo) are placed in the parent zone (the root in

this case) With these records in two places they may have differentTTLs We next evaluate which TTL takes priority in this case

If we ask the Root DNS servers for the NS records of ca weobtain the answer shown in Listing 1$ d ig ns ca b root minus s e r v e r s ne t AUTHORITY SECTION ca 172800 IN NS any caminus s e r v e r s ca ca 172800 IN NS j caminus s e r v e r s ca ca 172800 IN NS x caminus s e r v e r s ca ca 172800 IN NS c caminus s e r v e r s ca Listing 1 Referral answer to dig ns ca anyca-serversca

Notice that the answer for the NS records is in the ldquoAuthoritySectionrdquo this signals that this answer is known as ldquoreferralrdquo atype of answer in which ldquoin which a server signaling that it is not(completely) authoritative for an answer provides the queryingresolver with an alternative place to send its queryrdquo [15] In thiscase the AA bit in the answer is clear

Given that the Root servers are not are not authoritative for ca arecursive can ask the same query directly to any of the ca serversListing 2 shows the results authanswer$ d ig ns ca any caminus s e r v e r s ca ANSWER SECTION ca 86400 IN NS c caminus s e r v e r s ca ca 86400 IN NS j caminus s e r v e r s ca ca 86400 IN NS x caminus s e r v e r s ca ca 86400 IN NS any caminus s e r v e r s ca

Listing 2 Answer to dig ns ca anyca-serversca

Notice that the records on this response from anyca-serverscaare presented in ldquoANSWER SECTIONrdquo (and not on ldquoAUTHORITYSECTIONrdquo) and the AA bit is set

Even though the records are the same as in the case of the Rootstheir TTL is different 2 days and 1 day respectively The samebehavior can be observed for the A records of the ca servers (dig Aca broot-serversnet and dig A ca anyca-serversca)

We refer to this as TTL duplication the same record is defined inboth authoritative servers and on its respective parent node serveras a glue record

This duplication poses a challenge for recursive resolvers whichvalue should the trust The one from the glue or the other from theauthoritative name server

In theory the answer from the authoritative should be the oneused since they are the servers who are authoritative for the zoneThe roots in this example are not ldquocompletelyrdquo authoritative [15]This behavior can be observed in various zones as well not only atthe roots

Assuming that the records are the same as for ca in this sec-tion we investigate in the wild which one resolvers get to usethe values provided by the parent or the authoritative Accordingto RFC2181 [10] a recursive should use the authoritative answerinstead (sect541)

Given the importance of TTLs to DNS resilience (cache duration), we investigate which values recursives trust: the parent's or the authoritative name server's.


                 NS record   A record   Source
Total Answers    128382      98038
TTL > 3600       38          46         Unclear
TTL = 3600       260         233        Parent
60 < TTL < 3600  6890        4621       Parent/others
TTL = 60         60803       46820      Authoritative
TTL < 60         60391       46318      Authoritative

Table 5: Distribution of TTLs of DNS answers for cachetest.nl

A records of amazon.com:
domain       IP               TTL
amazon.com   205.251.242.103  60
amazon.com   176.32.103.205   60
amazon.com   176.32.98.166    60

NS records of amazon.com:
domain       NS                    TTL-auth  TTL-com
amazon.com   ns1.p31.dynect.net    3600      172800
amazon.com   pdns6.ultradns.co.uk  3600      172800
amazon.com   ns2.p31.dynect.net    3600      172800
amazon.com   ns4.p31.dynect.net    3600      172800
amazon.com   pdns1.ultradns.net    3600      172800
amazon.com   ns3.p31.dynect.net    3600      172800

A records of the NS records:
domain                IP             TTL
ns1.p31.dynect.net    208.78.70.31   86400
pdns6.ultradns.co.uk  204.74.115.1   3600
ns2.p31.dynect.net    204.13.250.31  86400
ns4.p31.dynect.net    204.13.251.31  86400
pdns1.ultradns.net    204.74.108.1   3600
ns3.p31.dynect.net    208.78.71.31   86400

Table 6: amazon.com NS records and TTLs on 2018-08-03

A.2 Experiments

We use the same setup as in §3 and configure the TTL values of our name servers (ns[1,2].cachetest.nl) as follows:

• Referral value (at the nl zone): TTL of 3600 s
• Answer value (at the authoritative servers, ns[1,2].cachetest.nl): TTL of 60 s

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE Atlas measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of TTL values in the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the values provided by the parent nl server. We can therefore conclude from these two experiments that, for A and NS records, most recursives forward to their clients the TTL provided by the authoritatives of a zone.
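The mapping from observed TTL to likely source used in Table 5 can be expressed directly. The sketch below is a hypothetical reimplementation of that bucketing (the paper's analysis code is not shown here), assuming the configured values above: 3600 s at the parent and 60 s at the authoritatives.

# Classify an observed answer TTL by its likely origin, as in Table 5.
# Hypothetical post-processing sketch, not the paper's analysis code.
from collections import Counter

PARENT_TTL = 3600  # configured at the nl zone (referral)
AUTH_TTL = 60      # configured at ns[1,2].cachetest.nl (answer)

def ttl_source(ttl: int) -> str:
    if ttl > PARENT_TTL:
        return "unclear"        # larger than anything we configured
    if ttl == PARENT_TTL:
        return "parent"         # full referral TTL
    if AUTH_TTL < ttl < PARENT_TTL:
        return "parent/others"  # decremented referral TTL (or manipulation)
    return "authoritative"      # ttl <= 60: full or decremented answer TTL

# Example: tally TTLs extracted from the Atlas answers.
observed = [60, 59, 3600, 1799, 60, 12, 86400]
print(Counter(ttl_source(t) for t in observed))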

A.3 Looking At a Recursive's Cache

In §A.2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritative, and not the one provided by referrals.

In this section we look into the cache of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives and 172800 s at the parent com servers, as can be seen in Table 6.

To determine which specific record two popular recursive resolver implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send only one query: dig ns amazon.com. We then dump the contents of the caches of these servers and see which values they store.

Listing 3 and Listing 4 show the contents of the caches of BIND and Unbound after sending the aforementioned query. As can be seen, the values stored in the cache are the ones provided by the authoritative server (decremented by a few seconds by the time we dumped the contents), and not the referral values from the parent's authoritatives.

; authanswer
amazon.com.  3595  NS  ns1.p31.dynect.net.
amazon.com.  3595  NS  ns2.p31.dynect.net.
amazon.com.  3595  NS  ns3.p31.dynect.net.
amazon.com.  3595  NS  ns4.p31.dynect.net.
amazon.com.  3595  NS  pdns1.ultradns.net.
amazon.com.  3595  NS  pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com

;rrset 3594 6 0 7 3
amazon.com.  3594  IN  NS  pdns1.ultradns.net.
amazon.com.  3594  IN  NS  ns1.p31.dynect.net.
amazon.com.  3594  IN  NS  ns3.p31.dynect.net.
amazon.com.  3594  IN  NS  pdns6.ultradns.co.uk.
amazon.com.  3594  IN  NS  ns4.p31.dynect.net.
amazon.com.  3594  IN  NS  ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com
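For reference, both implementations can dump a running cache from the command line. The snippet below shows one plausible way to produce dumps like Listings 3 and 4; the control-channel setup and file locations are assumptions, not the authors' exact configuration.

# Dump the caches of locally running BIND and Unbound instances.
# Assumes rndc and unbound-control have working local control channels;
# paths and setup vary per installation. Sketch only.
import subprocess

# BIND: writes its cache to the dump file (commonly named_dump.db).
subprocess.run(["rndc", "dumpdb", "-cache"], check=True)

# Unbound: prints the cache to stdout; filter for the zone of interest.
dump = subprocess.run(["unbound-control", "dump_cache"],
                      check=True, capture_output=True, text=True)
for line in dump.stdout.splitlines():
    if "amazon.com" in line:
        print(line)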

A.4 Discussion and Implications

Understanding how glue records affect TTLs is important to show that zone operators are, in fact, in charge of how long their NS records will remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers: queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of the 60 s TTL, where all queries go to the authoritative (AA), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over minutes after start. Panels: (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s (1 hour) with probing interval T=10 min.]
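The response types can be derived from TTLs and timing alone. The sketch below is a hypothetical classifier consistent with the description above; the label convention (first letter: where the answer actually came from; second letter: where we expected it to come from) is our reading of the text, not code from the paper.

# Classify an answer as AA, AC, CC, or CA. First letter: observed origin
# (A = authoritative, full configured TTL; C = cache, decremented TTL).
# Second letter: expected origin, given how long ago the record was cached.
# A hypothetical reading of the paper's labels; illustrative only.
def classify(observed_ttl: int, configured_ttl: int,
             seconds_since_cached: float) -> str:
    observed = "A" if observed_ttl == configured_ttl else "C"
    expected = "C" if seconds_since_cached < configured_ttl else "A"
    return observed + expected

print(classify(3600, 3600, 7200))  # AA: cache expired, authoritative answers
print(classify(3000, 3600, 600))   # CC: fresh entry, served from cache
print(classify(3600, 3600, 600))   # AC: expected cache, authoritative answered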

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72 Alternate DNS
23.253.163.53 Alternate DNS
205.204.88.60 BlockAid Public DNS (or PeerDNS)
1782123150 BlockAid Public DNS (or PeerDNS)
91.239.100.100 Censurfridns
89.233.43.71 Censurfridns



2001:67c:28a4:: Censurfridns
2002:d596:2a92:1:71:53:: Censurfridns
213.73.91.35 Chaos Computer Club Berlin
209.59.210.167 Christoph Hochstätter
85.214.117.11 Christoph Hochstätter

212822257 ClaraNet
21282226212 ClaraNet
8.26.56.26 Comodo Secure DNS
8.20.247.20 Comodo Secure DNS
84.200.69.80 DNS.Watch
84.200.70.40 DNS.Watch
2001:1608:10:25::1c04:b12f DNS.Watch
2001:1608:10:25::9249:d69b DNS.Watch
104.236.210.29 DNSReactor
45.55.155.25 DNSReactor
216.146.35.35 Dyn
216.146.36.36 Dyn
80.67.169.12 FDN
2001:910:800::12 FDN
85.214.73.63 FoeBud
87.118.111.215 FoolDNS
2131871162 FoolDNS
37.235.1.174 FreeDNS
37.235.1.177 FreeDNS
80.80.80.80 Freenom World
80.80.81.81 Freenom World
87.118.100.175 German Privacy Foundation e.V.
94.75.228.29 German Privacy Foundation e.V.
8525251254 German Privacy Foundation e.V.
621415813 German Privacy Foundation e.V.
8.8.8.8 Google Public DNS
8.8.4.4 Google Public DNS
2001:4860:4860::8888 Google Public DNS
2001:4860:4860::8844 Google Public DNS
81.218.119.11 GreenTeamDNS
209.88.198.133 GreenTeamDNS
74.82.42.42 Hurricane Electric
2001:470:20::2 Hurricane Electric
209.244.0.3 Level3
209.244.0.4 Level3
156.154.70.1 Neustar DNS Advantage
156.154.71.1 Neustar DNS Advantage
54596220 New Nations
1858222133 New Nations
198.153.192.1 Norton DNS
198.153.194.1 Norton DNS
208.67.222.222 OpenDNS
208.67.220.220 OpenDNS
2620:0:ccc::2 OpenDNS
2620:0:ccd::2 OpenDNS
58611542 OpenNIC
58611543 OpenNIC
119.31.230.42 OpenNIC
200.252.98.162 OpenNIC
217.79.186.148 OpenNIC
8189986 OpenNIC
7815910137 OpenNIC
203.167.220.153 OpenNIC
82.229.244.191 OpenNIC
216.87.84.211 OpenNIC
66.244.95.20 OpenNIC
207.192.69.155 OpenNIC
7214189120 OpenNIC
20014708388220e2efffe63d4a9 OpenNIC
20014701f0738b1 OpenNIC
20014701f10c62001 OpenNIC
194.145.226.26 PowerNS



7722023244 PowerNS
9.9.9.9 Quad9
2620:fe::fe Quad9
195.46.39.39 SafeDNS
195.46.39.40 SafeDNS
193.58.251.251 SkyDNS
208.76.50.50 SmartViper Public DNS
208.76.51.51 SmartViper Public DNS
78.46.89.147 ValiDOM
88.198.75.145 ValiDOM
64.6.64.6 Verisign
64.6.65.6 Verisign
2620:74:1b::1:1 Verisign
2620:74:1c::2:2 Verisign
77.109.148.136 Xiala.net
77.109.148.137 Xiala.net
2001:1620:2078::136 Xiala.net
2001:1620:2078::137 Xiala.net
77.88.8.88 Yandex.DNS
77.88.8.2 Yandex.DNS
2a02:6b8::feed:bad Yandex.DNS
2a02:6b8:0:1::feed:bad Yandex.DNS
109.69.8.51 puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all the experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for experiments D and G.

[Figure 14: Answers received during DDoS attacks (extra graphs). Panels: (a) Experiment D, 1800-50p-10min-ns1; (b) Experiment G, 300-75p-10min-2NSes. Arrows indicate DDoS start and recovery; series: OK, SERVFAIL, No answer.]

[Figure 15: Latency results (extra graphs); the shaded area indicates the interval of an ongoing DDoS attack. Panels: (a) Experiment D, 50% packet loss on 1 NS only (1800 s TTL); (b) Experiment G, 75% packet loss (300 s TTL). Series: median, mean, 75%ile, and 90%ile RTT (ms).]

Figure 14 shows the number of answers for experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when one NS fails (50% packet loss), there is no significant change in the number of answered queries. Moreover, most users will not notice any difference in latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (thus half of the probing interval). Similarly to experiment I, this experiment should not have caching active, given that the TTL is shorter than the probing interval. Moreover, it has a packet loss rate of 75%. At this rate, we can see that the far majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur in current DNS software. These experiments confirm that the prior work by Yu et al. [56] still holds. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic, first with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) up, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from net and querying one or both based on their internal algorithms. All queries are made with a cold cache.


[Figure 17: Probe 28477 on dataset I: the stub resolver (Atlas probe) is connected to three local recursives (r1a, r1b, r1c); all r1's are connected to all last-layer recursives (rn1-rn8), which are connected to both authoritative servers (at1, at2).]

[Figure 16: Histogram of queries from recursive resolvers (BIND and Unbound, normal and DDoS), split by target: root/net vs. cachetest.net; y-axis 0-50 queries.]

In total, we send 100 queries per experiment and present the results in this section.

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the net authoritative servers, and the authoritative servers for cachetest.net.
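Counting these queries from a capture is mechanical. The sketch below is a hypothetical tally (not the paper's processing pipeline), assuming the captured queries have been exported as text lines of the form "timestamp qname qtype dst", and using placeholder addresses for the server groups.

# Tally queries from the recursive toward each level of the hierarchy.
# Hypothetical post-processing; addresses below are examples/placeholders.
from collections import Counter

ROOTS = {"198.41.0.4"}               # e.g., a.root-servers.net
NET   = {"192.5.6.30"}               # e.g., a.gtld-servers.net
OURS  = {"192.0.2.1", "192.0.2.2"}   # ns[1,2].cachetest.net (placeholders)

def group(dst: str) -> str:
    if dst in ROOTS: return "root"
    if dst in NET:   return "net"
    if dst in OURS:  return "cachetest.net"
    return "other"

counts = Counter()
with open("queries.txt") as f:       # assumed export: "timestamp qname qtype dst"
    for line in f:
        counts[group(line.split()[3])] += 1
print(counts)   # e.g. Counter({'cachetest.net': 12, 'net': 2, 'root': 1})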

Figure 16 shows the results of this experiment, with normal behavior in the two left groups and the inaccessible (DDoS) scenario in the two right groups. From this figure we see that, in normal operation, only a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the Root, one to net, and one to the cachetest.net authoritative servers) to resolve the AAAA records of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist (this difference can be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure. Both BIND and Unbound send far more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps trying to resolve the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 in total out of 46). If we disregard these queries, Unbound would then increase the total number of queries from 6 to 16, a 6.4× increase. (We repeated this experiment 100 times, and all results were consistent.)

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND asks both parents (the Roots and net) for its records again (shown by the increase in the root and net portions of BIND DDoS vs. BIND non-DDoS in Figure 16). Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect: recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last layer" recursives (the ones that ultimately send queries to authoritatives, Rn1-Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (RIPE Atlas probe) and the server side (authoritative servers). We show a summary of the results in Table 7; the probing intervals T during which the simulated DDoS was active (dataset I: 90% packet drop on both authoritative servers; see Table 4) are marked with *.

Client side: Analyzing this table, we can learn the following from the client side.

During normal operations, we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1's used in the answers). By default, each RIPE Atlas probe sends a DNS query to each of its R1 recursives.

During the DDoS this changes slightly: we see 2 answers from 2 R1's; the third either times out or returns SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval). This is, in fact, a good result given the simulated DDoS.

Authoritative side: Now let's consider the authoritative side.


      Client view (Atlas probe)    Authoritative server view
T     Qry   Ans   #R1              Qry   Ans   #ATs  #Rn  #(Rn-AT)  Max (% of total)  Qry (top-2 Rn's)
1     3     3     3                3     3     2     2    3         1 (33.3%)         2, 1
2     3     3     3                5     5     2     4    5         1 (20.0%)         2, 1
3     3     3     3                6     6     2     6    6         1 (16.7%)         1, 1
4     3     3     3                5     5     2     2    3         2 (40.0%)         4, 1
5     3     3     3                3     3     2     3    3         1 (33.3%)         1, 1
6     3     3     3                6     6     2     5    5         2 (33.3%)         2, 1
7*    3     2     2                29    3     2     2    4         10 (34.5%)        15, 14
8*    3     2     2                22    6     2     7    9         7 (31.8%)         10, 7
9*    3     2     2                21    1     2     3    6         9 (42.9%)         16, 3
10*   3     2     2                11    4     2     3    4         6 (54.5%)         8, 2
11*   3     2     2                21    3     2     4    6         15 (71.4%)        17, 2
12*   3     2     2                23    1     2     8    8         15 (65.2%)        15, 2
13    3     3     3                3     3     2     3    3         1 (33.3%)         1, 1
14    3     3     3                4     2     2     4    4         1 (25.0%)         1, 1
15    3     3     3                8     8     2     7    7         7 (87.5%)         1, 1
16    3     3     3                9     9     2     3    3         6 (66.7%)         6, 3
17    3     3     3                9     9     2     5    5         5 (55.6%)         5, 1

Table 7: Client and authoritative view of AAAA queries (probe 28477, measurement I). Rows marked * are probing intervals during the simulated DDoS.

In normal operation (before the DDoS), we see that the 3 queries sent by the client (client view) lead to 3 to 6 queries being sent by the Rn's and reaching the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn's used before the attack ranges from 2 to 6 (#Rn). This choice depends on how the R1's choose their upper-layer recursives (Rn-1) and, ultimately, how Rn-1 chooses Rn.

We can also observe how each Rn chooses an authoritative (#(Rn-AT)): this metric shows the number of unique combinations of Rn and authoritative observed. For example, for T = 1 we see that three queries reach the authoritatives but only 2 Rn's were used. By looking at the number of unique recursive-authoritative combinations, we see that there are three; as a consequence, one of the Rn's must have used both authoritatives.
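These per-interval metrics can be computed mechanically from the server-side query log. The sketch below is a hypothetical reimplementation (not the paper's code), assuming each logged query within one probing interval is reduced to a (recursive, authoritative) pair.

# Compute the authoritative-side metrics of Table 7 for one probing interval.
# Hypothetical post-processing; input is a list of (Rn, AT) pairs.
from collections import Counter

def interval_metrics(queries: list[tuple[str, str]]) -> dict:
    per_pair = Counter(queries)                # queries per (Rn, AT) combination
    per_rn = Counter(rn for rn, _ in queries)  # queries per Rn, across both ATs
    total = len(queries)
    max_pair = max(per_pair.values())
    return {
        "queries": total,
        "ATs": len({at for _, at in queries}),
        "Rn": len(per_rn),
        "Rn-AT combinations": len(per_pair),
        "max (% of total)": f"{max_pair} ({100 * max_pair / total:.1f}%)",
        "top-2 Rn's": [n for _, n in per_rn.most_common(2)],
    }

# T=1 from Table 7: 3 queries, 2 Rn's, one of which used both authoritatives.
t1 = [("rn1", "at1"), ("rn1", "at2"), ("rn2", "at1")]
print(interval_metrics(t1))
# -> 3 queries, 2 ATs, 2 Rn, 3 combinations, max 1 (33.3%), top-2 [2, 1]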

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11 to 29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) An increase in the number of Rn's used: the number of Rn's used increases a little (3.6 on average before the DDoS, 4.5 during). (However, not all Rn's are used all the time during the DDoS. We do not have enough data to tell why; it is surely a product of how the lower-layer recursives choose them, or of how they are set up.)

(2) An increase in the number of authoritatives used by each Rn: each Rn, in turn, contacts an average of 1.13 authoritatives before the attack (the average of #(Rn-AT)/#Rn). During the DDoS, this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (last column of Table 7); compared to the total number of queries, they comprise the far majority.

For this probe, the answer to why there are so many queries is a combination of factors, with retries by the Rn's being the most important one. From the client side, there is no strong indication that the client's queries are what drives such an increase in the number of queries to the authoritatives during a DDoS.



lower bound. We next examine specific recursive implementations to see their behavior.

6.2 Sources of Retries: Software and Multi-level Recursives

Experiments in the prior section showed that recursive resolvers "hammer" authoritatives when queries are dropped. We reexamine DNS software (updating results from 2012 [56]), and additionally show that deployments amplify retries.

Recursive software: Prior work showed that recursive servers retry many times when an authoritative is unresponsive [56], with an evaluation of BIND 9.7 and 9.8, DNSCache, Unbound, Windows DNS, and PowerDNS. We studied retries in BIND 9.10.3 and Unbound 1.5.8 to quantify the number of retries. Examining only requests for AAAA records, we see that normal requests with a responsive authoritative ask for the AAAA records of all authoritatives and of the target name (3 total requests when there are 2 authoritatives). When all authoritatives are unavailable, we see about 7× more requests before the recursives time out. (Exact numbers vary in different runs, but typically each request is made 6 or 7 times.) Such retries are appropriate provided they are paced (both implementations use exponential backoff); they explain part of the increase in legitimate traffic during DDoS events. Full data is in Appendix E.
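To see why paced retries multiply traffic under loss, consider a simple model: if each lookup is retried up to k times until an answer gets through, a loss rate p inflates the expected number of queries per lookup to (1 - p^k)/(1 - p). The sketch below simulates this; it is a toy model we add for illustration, not the paper's methodology.

# Toy model: expected queries per lookup when a recursive retries up to
# k times against an authoritative dropping packets with probability p.
# Illustrative only; the paper measures real resolvers instead.
import random

def sends_per_lookup(p: float, k: int, trials: int = 100_000) -> float:
    total = 0
    for _ in range(trials):
        for _ in range(k):
            total += 1
            if random.random() > p:  # query (and reply) got through
                break
    return total / trials

for p in (0.0, 0.5, 0.75, 0.9):
    # closed form: (1 - p**k) / (1 - p) for p < 1
    print(f"loss {p:.2f}: {sends_per_lookup(p, k=7):.2f} queries per lookup")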

Recursive deployment: Another source of extra retries is complex recursive deployments. We showed that operators of large recursives often use complex, multi-level resolution infrastructure (§3.5). This infrastructure can amplify the number of retries during reachability problems at authoritatives.

To quantify amplification, we count both the number of Rn recursives and the number of AAAA queries for each probe ID reaching our authoritatives. Figure 11 shows the results for Experiment I. These values represent the amplification in two ways: during stress, more Rn recursives will be used for each probe ID, and these Rn's will generate more queries to the already stressed authoritatives. As the figure shows, the median number of Rn recursives employed doubles (from 1 to 2) during the DDoS event, as does the 90%ile (from 2 to 4). The maximum rises to 39. The number of queries for each probe ID grows more than 3×, from 2 to 7. Worse, the 90%ile grows more than 6× (3 queries to 18). The maximum grows 53.5×, reaching up to 286 queries for one single probe ID. This value, however, is a lower bound, given that there are a large number of A and AAAA queries that ask for NS records and not for the probe ID (AAAA and A-for-NS in Figure 10).

We can also look at the aggregate effects of retries created by the complex recursive infrastructure. Figure 12 shows a timeseries of the unique IP addresses of Rn's observed at the authoritatives. Before the DDoS period, for Experiment I with its TTL of 60 s, we see a constant number of recursives reaching our authoritatives; i.e., all queries are answered by the authoritatives (no caching at this TTL value). For experiments F and H, both with a TTL of 1800 s, the number of recursives reaching our authoritatives oscillates before the DDoS: peaks are observed when caches expire, as expected.

During the DDoS, we observe similar behavior for all three experiments in Figure 12: as packets are dropped at the authoritatives (at rates of 75%, 90%, and 90% for F, H, and I, respectively), we see an increase in the number of Rn recursives querying our authoritatives. For experiments F and H we see drops when caching is expected, but not for experiment I.

[Figure 11: Rn recursives and AAAA queries used in Experiment I, normalized by the number of probe IDs: median, 90%ile, and maximum of Rn-per-probe-ID and AAAA-queries-per-probe-ID over minutes after start.]

[Figure 12: Unique Rn recursive addresses observed at the authoritatives over minutes after start, for Experiments F, H, and I.]

The reason for this behavior is that the underlying layer of recursives starts forwarding queries to other recursives, which is amplified in the end. (We show this behavior for an individual probe in Appendix F, where we observe the growth in the number of queries received at the authoritatives and in the number of recursives used.)

Most complex resolution infrastructures are proprietary (as far as we know, only one study has examined them [48]), so we cannot make recommendations about how large recursive resolvers ought to behave. We suggest that the aggregate traffic of large recursive resolvers should strive to be within a constant factor of single recursives, perhaps a factor of 4. We also encourage additional study of large recursive resolvers, and their operators to share information about their behavior.

7 RELATEDWORKCaching by Recursives Several groups have shown that DNScaching can be imperfect Hao and Wang analyzed the impact ofnonce domains on DNS recursiversquos caches [13] Using two weeksof data from two universities they showed that filtering one-timedomains improves cache hit rates In two studies Pang et al [31 32]reported that web clients and local recursives do not always honorTTL values provided by authoritatives Almeida et al [2] analyzedDNS traces of a mobile operator and used a mobile applicationto see TTLS in practice They find that most domains have shortTTLs (less than 60 s) and report and evidence of TTL manipulationby recursives Schomp et al [48] demonstrate widespread use of

When the Dike Breaks Dissecting DNS Defenses During DDoS (extended) ISI-TR-724b May 2018 (updated September 2018)

multi-level recursives by large operators, as well as TTL manipulation. Our work builds on this prior work, examining caching and TTL manipulation systematically and considering their effects on resilience.

DNS client behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, and found that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives during DDoS: We investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast with both latency and reachability, as seen from RIPE Atlas. Rather than looking at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while that prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study has incomplete data and focuses on specific results of two events. These differences stem from their study of natural experiments from real-world events and our controlled experiments.

8 IMPLICATIONS

We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get to the authoritative servers. The key implication of our study is to explain differences in the outcomes of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative "server" implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: "There are no known reports of end-user visible error conditions during and as a result of this incident. Because the DNS protocol is designed to cope with partial reachability..." [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure and immediately took steps to address the service problems, there were reports of user-visible service disruption in the technical and even popular press [34]. Reports describe intermittent failure of prominent websites including "Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times", each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cachable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports 922 anycast instances operating across the 13 authoritative servers. Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 "facilities") and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components whose TTLs range from 120 to 300 s.

In addition to explaining the effects, our experiments help get at the root causes behind these outcomes. Users of the Root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs like com and country codes) are popular and certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) receive performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, to allow caching to help and to give DNS operators time to deploy defenses; Experiment H suggests 30 minutes (Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow for re-provisioning during DDoS attacks on web servers, but they leave DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, combined with long DNS TTLs, may be preferable to short DNS TTLs, so that both layers can be robust. However, the complexity of the interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize the geographic dispersion of its sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with the largest latency [27]. Prior work about resilience to DDoS attack has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic each site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be as good as that of the strongest individual authoritative. The reason for these opposite results is that, in both cases, recursive resolvers will try all authoritatives of a given service. For latency, they will sometimes choose a distant authoritative, but for resilience, they will continue until they find the most available authoritative.

9 CONCLUSIONS

This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole. In

ISI-TR-724b May 2018 (updated September 2018) G C M Moura et al

fact, they can largely cover over partial DDoS attacks for many users, even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout: more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers (Figure 8d) even with minimal-duration caches.

The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users, with a few seeing much greater latency while some see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruption, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of this work is to quantify the importance of caching and retries in recursives for resilience, encouraging the use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on paper drafts.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements number FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES

[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28-34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093-1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012, Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35-42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188-200.
[9] DNS OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181 Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns/
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36-42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary Of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis/?include_text=1
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC BIND. 2018. Chapter 6: BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report | GitHub Engineering. https://githubengineering.com/ddos-incident-report/
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency-02. Internet Draft. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain names - concepts and facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 2065, 2181, 2308, 2535, 4033, 4034, 4035, 4343, 4592, 5936, 8020.
[21] P.V. Mockapetris. 1987. Domain names - implementation and specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604, 7766.
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS". (May 2018). Web page. https://ant.isi.edu/datasets/dns/Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS. In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489-495. https://doi.org/10.1145/3131365.3131366
[28] NLnet Labs. 2018. NLnet Labs Documentation - Unbound - unbound.conf.5. https://nlnetlabs.nl/documentation/unbound/unbound.conf
[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide/
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An overview of DNS-based server selections in content distribution networks. Computer Networks 43, 6 (2003), 695-711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21-26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1-14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows
[36] Alec Peterson. 2017. EC2 resolver changing TTL on DNS answers. Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy In a Few Easy Steps. https://quad9.net
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671; TTL1800: 10507676; TTL3600: 10536725; TTL86400: 10579327; TTL3600-10min: 10581463; A: 10859822; B: 11102436; C: 11221270; D: 11804500; E: 11831403; F: 11831403; G: 12131707; H: 12177478; I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2-26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas - Raw data structure documentation. https://atlas.ripe.net/docs/data_struct
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters - An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305, Internet Request For Comments. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188-200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On measuring the client-side DNS infrastructure. In Proceedings of the 2013 ACM Internet Measurement Conference. ACM, 77-90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl stats and data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A study of caching behavior with respect to root server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang up on You? In IEEE International Workshop on Traffic Monitoring and Analysis (Dublin, Ireland).
[54] M. Weinberg and D. Wessels. 2016. Review and analysis of attack traffic against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24, Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A high-performance network traffic data streaming warehouse. In Network Operations and Management Symposium (NOMS), 2016 IEEE/IFIP. IEEE, 913-918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80-86. https://doi.org/10.1145/2185376.2185387


Figure 16 shows the results of this experiment with normalbehavior in the left two groups and the inaccessible DDoS attemptsin the right two From this figure we see that in normal operationa few queries are required to look up the target domain the serverlooks up AAAA records for both nameservers and then queries theAAAA record for the target (Unbound also queries for the AAAArecords of the authoritative name serves of cachetestnet whichare not in the net glue neither in the cachetestnet zone whichexplains the difference with BIND which does not)

Under normal operations it takes BIND 3 DNS queries (1 to theroot one to net and one to cachetestnet authoritative servers) toresolve the AAAA records of subcachetestnet Unbound in turn

takes 5 queriesmdashit sends 2more to the authoritatives of cachetestnetasking for AAAA records of their NS records which do not exist(thus this difference can be neglected since it may be specific to oursetup)

The two groups on the right of Figure 16 show much moreaggressive queries during server failure Both BIND and Unboundsend far more DNS queries to all authoritative servers available Wesee that BIND sends 12 queries instead of 3 it does under normaloperations (4times increase) while Unbound sends 46 queries in total(146times increase) Unbound keeps on trying to resolve the domainsand the AAAA records of the authoritative name servers (the latterdoes not exist which it is specific to our domain 30 in total out of46) If we disregard these queries Unbound would then increase thetotal number of queries from 6 to 16 a 64times increase (We repeatedthis experiment 100 times and all results were consistent)

Another difference between BIND and Unbound is that whencachetestnet is unreachable BIND will ask both parents (the Rootsand net) for its records again (shown by increase in the root and netportions of bind DDoS vs bind non-DDoS in Figure 16) Unbounddoes not retry these queries

This experiment is consistent with our observation that we see7times more traffic than usual during near-complete failure We do notmean to suggest that these retries are incorrectmdashrecursives need toretry to account for packet lossF ANALYSIS OF RETRIES FROM A SPECIFIC

PROBEIn sect6 we examined changes in legitimate traffic during a DDoSevent showing traffic greatly increases because of retries by eachof possibly several recursive resolvers That analysis looked ataggregate statistics We next examine a specific probe to confirmour evaluation

We consider RIPE Atlas Probe 28477 in Experiment I This probeis located in Australia It has three ldquolocal recursivesrdquo (R1a R1b R1c )and 8 ldquolast layerrdquo recursives (the ones that ultimately send queriesto authoritatives Rn1ndashRn8) as can be seen in Figure 17

In this section we analyze all the incoming queries arriving onour authoritatives for AAAA records of 28477cachetestnl We donot analyze NS queries since we cannot associate them to thisprobe

We then analyze data collected at both the client side (Ripe Atlasprobes) and the server side (Authoritative servers) We show thesummary of the results in Table 7 In this table we highlight in lightcolor the probing intervals T in which the simulated DDoS started(dataset I 90 packet drop on both authoritative servers ndash Table 4)

Client Side Analyzing this table we can learn the followingfrom the Client side

During normal operations we see 333 at the client side (3 queries3 answers and 3 R1n used in the answers) By default each RipeAtlas probe sends a DNS query to each R1n

During the DDoS this slightly change we see 2 answers from 2R1s ndash the other either times out or SERVFAIL Thus at 90 packetloss on both authoritatives still 2 of 3 queries are answered at theclient (and there is no caching since the TTL of records is shorterthan the probing interval)

This is in fact a good result given the simulated DDoSAuthoritative side Now letrsquos consider the authoritative side

ISI-TR-724b May 2018 (updated September 2018) G C M Moura et al

TClient View (Atlas probe) Authoritative Server View

Queries Answers R1 Queries Answers AT s Rn Max ( total) Queries

(Rn minus AT ) (Que(Rn minus AT )) (top-2 Rnrsquos)1 3 3 3 3 3 2 2 3 1 (333) 212 3 3 3 5 5 2 4 5 1 (200) 213 3 3 3 6 6 2 6 6 1 (167) 114 3 3 3 5 5 2 2 3 2 (400) 415 3 3 3 3 3 2 3 3 1 (333) 116 3 3 3 6 6 2 5 5 2 (333) 217 3 2 2 29 3 2 2 4 10 (345) 15148 3 2 2 22 6 2 7 9 7 (318) 1079 3 2 2 21 1 2 3 6 9 (429) 16310 3 2 2 11 4 2 3 4 6 (545) 8211 3 2 2 21 3 2 4 6 15 (714) 17212 3 2 2 23 1 2 8 8 15 (652) 15213 3 3 3 3 3 2 3 3 1 (333) 1114 3 3 3 4 2 2 4 4 1 (250) 1115 3 3 3 8 8 2 7 7 7 (875) 1116 3 3 3 9 9 2 3 3 6 (667) 6317 3 3 3 9 9 2 5 5 5 (556) 51

Table 7 Client and Authoritative view of AAAA queries (probe 28477 measurement I)

In normal operation (before the DDoS) we see that the 3 queriessent by the clients (client view) lead to 3 to 6 queries to be send bythe Rn which reach the authoritatives Every query is answeredand both authoritatives are used ( ATs)

We also see that the number of Rn used before the attack changesfrom 2 to 6 ( Rn) This choice depends on how R1lowast choose theirupper layer recursives (Rnminus1) and ultimately how Rnminus1 chooses Rn

We also can observe how each Rn chooses each authoritative (Rn-AT) this metric show the number of unique combinations ofRn and authoritatives observed For example for T = 1 wee seethat three queries reach the authoritatives but only 2 Rn were usedBy looking at the number of unique recursive-AT combination wesee that there are three As a consequence one of the Rn must haveused both authoritatives

During the DDoS however things change we see that the still3 queries send by probe 28477 (client side) now turn into 11ndash29queries received at the authoritatives from the maximum of 6queries before the DDoS The client remains unaware of this

We see three factors at play here

(1) Increase in number of used Rn we see that the numberof number of used Rn also increases a little ( 36 averagebefore to 45 during) (However not all Rn are used all thetime during the DDoS We have not enough data of whythis happens ndash but surely is how the lower layer recursiveschooses these ones or how they are set up)

(2) Increase in the number of authoritatives used by eachRn Each Rn in turns contacts an average of 113 author-itatives before the attack (average of UniqRn-ATUniqRn)During the DDoS this number increases to 137

(3) Retrials by each recursive either by switching authorita-tives or not we found this to be the most important factorfor this probe We show the number of queries sent by thetop 2 recursives (QueriesTop2Rn) and see that compared tothe total number of queries they comprise the far majority

For this probe the answer of why so many queries is a combi-nation of factors being the retrials by Rn the most important oneFrom the client side there is no strong indication that the clientrsquosqueries are leading to such an increase in the number of queries toauthoritatives during a DDoS

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 DNS Resolvers Stubs Recursives and Authoritatives
    • 22 Authoritative Replication and IP Anycast
    • 23 DNS Caching with Time-to-Live (TTLs)
      • 3 DNS Caching In Controlled Experiments
        • 31 Potential Impediments to Caching
        • 32 Measurement Design
        • 33 Datasets
        • 34 TTL distribution expected vs observed
        • 35 Public Recursives and Cache Fragmentation
          • 4 Caching Production Zones
            • 41 Requests at nls Authoritatives
            • 42 Requests at the DNS Root
              • 5 The Clients View of Authoritatives Under DDoS
                • 51 Emulating DDoS
                • 52 Clients During Complete Authoritatives Failure
                • 53 Discussion of Complete Failures
                • 54 Client Reliability During Partial Authoritative Failure
                • 55 Client Latency During Partial Authoritative Failure
                  • 6 The Authoritatives Perspective
                    • 61 Recursive-Authoritative Traffic during a DDoS
                    • 62 Sources of Retries Software and Multi-level Recursives
                      • 7 Related Work
                      • 8 Implications
                      • 9 Conclusions
                      • References
                      • A Which TTL affects the cache referrals or answers
                        • A1 The Problem of Glue Records
                        • A2 Experiments
                        • A3 Looking At a Recursives Cache
                        • A4 Discussion and Implications
                          • B TTL Manipulations across datasets
                          • C List of public resolvers
                          • D Extra graphs from Partial DDoS
                          • E Additional Information about Sources of Retries
                          • F Analysis of Retries from A Specific Probe
Page 13: When the Dike Breaks: Dissecting DNS Defenses During DDoS ...johnh/PAPERS/Moura18a.pdf · Each of these records may have different cache lifetimes (TTLs), by choice of the operator


multi-level recursives used by large operators, as well as TTL manipulation. Our work builds on this prior work, examining caching and TTL manipulation systematically and considering their effects on resilience.

DNS client behavior: Yu et al. investigated how stubs and recursives select authoritative servers, and were the first to demonstrate the large number of retries when all authoritatives are unavailable [56]. We also investigated how recursives select authoritative servers in the wild, and found that recursives tend to prefer authoritatives with shorter latency, but query all authoritatives for diversity [27]. We confirm Yu's work and focus on authoritative selection during DDoS from several perspectives.

Authoritatives during DDoS: We investigated how the Root DNS service behaved during the Nov. 2015 DDoS attacks [23]. That report focuses on the interactions of IP anycast and both latency and reachability, as seen from RIPE Atlas. Rather than looking at aggregate behavior and anycast, our methodology here examines how clients interact with their recursive resolvers, while this prior work focused on authoritatives only, bypassing recursives. In addition, here we have full access to client and authoritative traffic during our experiments, and we evaluate DDoS with controlled loss rates. The prior study has incomplete data and focuses on specific results of two events. These differences stem from their study of natural experiments from real-world events and our controlled experiments.

8 IMPLICATIONS

We evaluated DNS resilience, showing that caches and retries can mitigate much of the harm from a DDoS attack, provided the cache is full and some requests can get to authoritative servers. The key implication of our study is to explain differences in the outcome of recent DDoS attacks.

Recent DDoS attacks on DNS services have seen very different outcomes for users. The Root Server System was a target in Nov. 2015 [41] and June 2016 [42]. The DNS Root has 13 letters, each an authoritative "server" implemented with some or many IP anycast instances. Analysis of these DDoS events showed that their effects were uneven across letters: for some, most or all anycast instances showed high loss, while other letters showed little or no loss [23]. However, the Root Operators state: "There are no known reports of end-user visible error conditions during, and as a result of, this incident. Because the DNS protocol is designed to cope with partial reachability..." [41].

In Oct. 2016, a much larger attack was directed at Dyn, a provider of DNS service for many second-level domains [14]. Although Dyn has a capable infrastructure, and immediately took steps to address service problems, there were reports of user-visible service disruption in the technical and even popular press [34]. Reports describe intermittent failure of prominent websites including "Twitter, Netflix, Spotify, Airbnb, Reddit, Etsy, SoundCloud and The New York Times", each a direct or indirect customer of Dyn at the time.

Our work helps explain these very different outcomes. The Root DNS saw few or no user-visible problems because data in the root zone is cacheable for a day or more, and because multiple letters and many anycast instances were continuously available. (All measurements in this paragraph are as of 2018-05-22.) Records in the root zone have TTLs of 1 to 6 days, and www.root-servers.org reports

922 anycast instances operating across the 13 authoritative servers. Dyn also operates a large infrastructure (https://dyn.com/dns/network-map/ reports 20 "facilities") and faced a larger attack (reports of 1.2 Tb/s [47], compared to estimates of 35 Gb/s for the Nov. 2015 root attack [23]). But a key difference is that all of the Dyn customers listed above use DNS-based CDNs (for a description, see [7]), with multiple Dyn-hosted DNS components whose TTLs range from 120 to 300 s.

In addition to explaining the effects, our experiments help get to the root causes behind these outcomes. Users of the Root benefited from caching and saw performance like Experiment E (Figure 8a), because root contents (TLDs like .com and country codes) are popular and certainly cached in recursives, and because some root letters were always available to refresh caches (either through a successful normal query or a retry). By contrast, users requiring domains with very short TTLs (like the websites that had problems) receive performance more like Experiment I (Figure 8d) or Experiment C (Figure 6c). Even when some requests succeed and cache a popular name, short TTLs cause caches to clear quickly.

This example shows the importance of DNS's multiple methods of resilience (caching, retries, and at least some availability at one authoritative). It suggests that CDN operators may wish to consider longer timeouts, both to let caching help and to give DNS operators time to deploy defenses (Experiment H suggests 30 minutes; Figure 8c).

Configuring short TTLs serves a role in CDNs that use DNS to direct clients to different application-level servers. Short TTLs allow for re-provisioning during DDoS attacks on web servers, but that leaves DNS servers vulnerable. This tension suggests that traffic scrubbing by routing changes, combined with long DNS TTLs, may be preferable to short DNS TTLs, so that both layers can be robust. However, the complexity of interactions between DNS at multiple levels and CDNs suggests that more study is needed before recommending specific settings.

Finally, this evaluation helps complete our picture of DNS latency and reliability for DNS services that may consist of multiple authoritatives, some or all using IP anycast with multiple sites. To minimize latency, prior work has shown that a single authoritative using IP anycast should maximize geographic dispersion of sites [46]. The latency of an overall DNS service with multiple authoritatives can be limited by the one with the largest latency [27]. Prior work about resilience to DDoS attack has shown that individual IP anycast sites will suffer under DDoS as a function of the attack traffic that site receives relative to its capacity [23]. We show that the overall resilience of a DNS service composed of multiple authoritatives using IP anycast tends to be as resilient as the strongest individual authoritative. The reason for these opposite results is that, in both cases, recursive resolvers will try all authoritatives of a given service. For latency, they will sometimes choose a distant authoritative, but for resilience, they will continue until they find the most available authoritative.

9 CONCLUSIONS

This paper represents the first study of how the DNS resolution system behaves when authoritative servers are under DDoS attack. Caching and retries at recursive resolvers are key factors in this behavior. We show that, together, caching and retries by recursive resolvers greatly improve the resilience of the DNS as a whole.


In fact, they can largely cover over partial DDoS attacks for many users: even with a DDoS resulting in 90% packet loss and lasting longer than the cache timeout, more than half of VPs get answers with 30-minute caches (Figure 8c), and about 40% of VPs get answers even with minimal-duration caches (Figure 8d).

The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users, with a few getting much greater latency while some see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruption, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of this work is to quantify the importance of caching and retries in recursives to resilience, encouraging the use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on paper drafts.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES
[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012, Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188–200.
[9] DNS-OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181 Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary Of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis?include_text=1
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC BIND. 2018. Chapter 6: BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report | GitHub Engineering. https://githubengineering.com/ddos-incident-report/
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency. Internet Draft draft-tale-dnsop-serve-stale-02. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain names - concepts and facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 2065, 2181, 2308, 2535, 4033, 4034, 4035, 4343, 4592, 5936, 8020.
[21] P.V. Mockapetris. 1987. Domain names - implementation and specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604, 7766.
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS". (May 2018). Web page. https://ant.isi.edu/datasets/dns/Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366
[28] NLnet Labs. 2018. NLnet Labs Documentation - Unbound - unbound.conf.5. https://nlnetlabs.nl/documentation/unbound/unbound.conf
[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide/
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An overview of DNS-based server selections in content distribution networks. Computer Networks 43, 6 (2003), 695–711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows/
[36] Alec Peterson. 2017. EC2 resolver changing TTL on DNS answers? Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy In a Few Easy Steps. https://quad9.net
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671; TTL1800: 10507676; TTL3600: 10536725; TTL86400: 10579327; TTL3600-10min: 10581463; A: 10859822; B: 11102436; C: 11221270; D: 11804500; E: 11831403; F: 11831403; G: 12131707; H: 12177478; I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas - Raw data structure documentation. https://atlas.ripe.net/docs/data_struct/
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters: An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. Internet Request For Comments. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On measuring the client-side DNS infrastructure. In Proceedings of the 2013 ACM Internet Measurement Conference. ACM, 77–90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl stats and data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A study of caching behavior with respect to root server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] M. Weinberg and D. Wessels. 2016. Review and analysis of attack traffic against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24, Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A high-performance network traffic data streaming warehouse. In Network Operations and Management Symposium (NOMS), 2016 IEEE/IFIP. IEEE, 913–918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80–86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS

Cache estimations in §3 and §4 assume that each record has a single, clear TTL. However, glue records bridge zones that live on different authoritative servers, and can result in two different TTL values for the same record. This section examines which TTL takes precedence.

A.1 The Problem of Glue Records
DNS employs glue records to identify name servers that are in the served zone itself. For example, the Canadian national domain, ca, is served from ca-servers.ca; but to look up ca-servers.ca, one must first know where ca is. To break this cycle, records that identify these servers ("glue records") are placed in the parent zone (the root, in this case). With these records in two places, they may have different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of ca, we obtain the answer shown in Listing 1:

    $ dig ns ca @b.root-servers.net
    ;; AUTHORITY SECTION:
    ca.    172800    IN    NS    any.ca-servers.ca.
    ca.    172800    IN    NS    j.ca-servers.ca.
    ca.    172800    IN    NS    x.ca-servers.ca.
    ca.    172800    IN    NS    c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net.

Notice that the answer for the NS records is in the "AUTHORITY SECTION": this signals that this answer is a "referral", a type of answer "in which a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query" [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for ca, a recursive can ask the same query directly to any of the ca servers. Listing 2 shows the result, an authoritative answer:

    $ dig ns ca @any.ca-servers.ca
    ;; ANSWER SECTION:
    ca.    86400    IN    NS    c.ca-servers.ca.
    ca.    86400    IN    NS    j.ca-servers.ca.
    ca.    86400    IN    NS    x.ca-servers.ca.
    ca.    86400    IN    NS    any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca.

Notice that the records in this response from any.ca-servers.ca are presented in the "ANSWER SECTION" (and not in the "AUTHORITY SECTION"), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTLs differ: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).

We refer to this as TTL duplication: the same record is defined both at the zone's own authoritative servers and at its parent, where it appears as a glue record.

This duplication poses a challenge for recursive resolvers: which value should they trust, the one from the glue or the one from the authoritative name server?

In theory, the answer from the authoritative should be the one used, since those are the servers that are authoritative for the zone; the Roots, in this example, are not "completely" authoritative [15]. According to RFC 2181 [10] (§5.4.1), a recursive should use the authoritative answer. This behavior can be observed in various zones as well, not only at the Roots.

Given the importance of TTLs to DNS resilience (they determine cache durations), we investigate next which values recursives trust in the wild: those provided by the parent or those provided by the authoritative name server.
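A quick way to observe which value a given recursive trusts is to send it the same query and inspect the TTL it returns; a minimal sketch (the resolver address is only an example, and any recursive can be substituted):

    $ dig ns ca @8.8.8.8 +noall +answer

A returned TTL counting down from 86400 (the child's value) indicates that the recursive adopted the authoritative answer; one counting down from 172800 (the parent's value) indicates that it kept the referral TTL from the glue.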


                     NS record   A record   Source
    Total answers    128,382     98,038
    TTL > 3600       38          46         Unclear
    TTL = 3600       260         233        Parent
    60 < TTL < 3600  6,890       4,621      Parent/others
    TTL = 60         60,803      46,820     Authoritative
    TTL < 60         60,391      46,318     Authoritative

Table 5: Distribution of the TTLs (in seconds) of DNS answers for cachetest.nl.

A records of amazon.com:
    domain        IP                TTL
    amazon.com    205.251.242.103   60
    amazon.com    176.32.103.205    60
    amazon.com    176.32.98.166     60

NS records of amazon.com:
    domain        NS                     TTL-auth   TTL-com
    amazon.com    ns1.p31.dynect.net     3600       172800
    amazon.com    pdns6.ultradns.co.uk   3600       172800
    amazon.com    ns2.p31.dynect.net     3600       172800
    amazon.com    ns4.p31.dynect.net     3600       172800
    amazon.com    pdns1.ultradns.net     3600       172800
    amazon.com    ns3.p31.dynect.net     3600       172800

A records of the NS records:
    domain                  IP              TTL
    ns1.p31.dynect.net      208.78.70.31    86400
    pdns6.ultradns.co.uk    204.74.115.1    3600
    ns2.p31.dynect.net      204.13.250.31   86400
    ns4.p31.dynect.net      204.13.251.31   86400
    pdns1.ultradns.net      204.74.108.1    3600
    ns3.p31.dynect.net      208.78.71.31    86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A.2 Experiments
We use the same setup as in §3 and configure the TTL values of our name servers (ns[1,2].cachetest.nl) as follows:

• Referral value (at the nl zone): TTL of 3600 s
• Answer value (at the authoritative servers, ns[1,2].cachetest.nl): TTL of 60 s

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of the TTL values of the answers obtained.
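The duplication this setup creates can be checked directly with dig, as in §A.1, by querying the parent and the child for the same delegation; a minimal sketch (ns1.dns.nl is one of the nl authoritative servers; the server names here are illustrative):

    $ dig ns cachetest.nl @ns1.dns.nl          # referral from the parent: TTL 3600
    $ dig ns cachetest.nl @ns1.cachetest.nl    # authoritative answer: TTL 60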

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL value provided by the authoritative servers, and not the one provided by the parent (nl) server. We can thus conclude, from these two experiments for A and NS records, that most recursives forward to their clients the TTL provided by the authoritatives of a zone.

A.3 Looking At a Recursive's Cache
In §A.2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritative, not the one from the referral.

In this section we look into the cache of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritative, and 172800 s at the parent com servers, as can be seen in Table 6.

To determine which record two popular recursive implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send a single query: dig ns amazon.com. We then dump the contents of the caches of these servers and see which values they store.
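The cache dumps themselves can be produced with each server's standard control utility; a minimal sketch, assuming both servers are configured for local control access:

    # BIND: write the cache to the configured dump file (named_dump.db by default)
    $ rndc dumpdb -cache

    # Unbound: print the in-memory cache to stdout (requires remote-control to be enabled)
    $ unbound-control dump_cache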

Listing 3 and Listing 4 show the contents of the caches of BIND and Unbound after sending the aforementioned query. As can be seen, the values stored in the cache (decremented by the few seconds that passed before we dumped the contents) are the ones provided by the authoritative server, and not the referral values from the parent:

    ; authanswer
    amazon.com.    3595    NS    ns1.p31.dynect.net.
                   3595    NS    ns2.p31.dynect.net.
                   3595    NS    ns3.p31.dynect.net.
                   3595    NS    ns4.p31.dynect.net.
                   3595    NS    pdns1.ultradns.net.
                   3595    NS    pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com.

    ;rrset 3594 6 0 7 3
    amazon.com.    3594    IN    NS    pdns1.ultradns.net.
    amazon.com.    3594    IN    NS    ns1.p31.dynect.net.
    amazon.com.    3594    IN    NS    ns3.p31.dynect.net.
    amazon.com.    3594    IN    NS    pdns6.ultradns.co.uk.
    amazon.com.    3594    IN    NS    ns4.p31.dynect.net.
    amazon.com.    3594    IN    NS    ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com.

A.4 Discussion and Implications
Understanding how glue records affect TTLs is important: it shows that zone operators are, in fact, in charge of how long their NS records remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers: queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of the 60 s TTL (where all queries go to the authoritative, AA), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.
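This AA/CC alternation is what one observes when repeatedly querying a recursive for one of our records and watching the cached TTL count down to expiry; a small sketch (the record and resolver names are placeholders for those in §3's setup):

    # first query: cache miss, answered by the authoritative (AA), full TTL
    $ dig A cachetest.nl @recursive +noall +answer    # TTL = 3600
    # repeat within the TTL: cache hit (CC), TTL counting down
    $ dig A cachetest.nl @recursive +noall +answer    # TTL = 3590
    # repeat after expiry: back to the authoritative (AA), full TTL again
    $ dig A cachetest.nl @recursive +noall +answer    # TTL = 3600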

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72 Alternate DNS
23.253.163.53 Alternate DNS
205.204.88.60 BlockAid Public DNS (or PeerDNS)
178.21.23.150 BlockAid Public DNS (or PeerDNS)
91.239.100.100 Censurfridns
89.233.43.71 Censurfridns


[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time; y-axis: answers (0-20,000), x-axis: minutes after start. Panels: (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s (1 hour), probing interval T=10 min.]

2001:67c:28a4:: Censurfridns
2002:d596:2a92:1:71:53:: Censurfridns
213.73.91.35 Chaos Computer Club Berlin
209.59.210.167 Christoph Hochstätter
85.214.117.11 Christoph Hochstätter

212.82.225.7 ClaraNet
212.82.226.212 ClaraNet
8.26.56.26 Comodo Secure DNS
8.20.247.20 Comodo Secure DNS
84.200.69.80 DNS.Watch
84.200.70.40 DNS.Watch
2001:1608:10:25::1c04:b12f DNS.Watch
2001:1608:10:25::9249:d69b DNS.Watch
104.236.210.29 DNSReactor
45.55.155.25 DNSReactor
216.146.35.35 Dyn
216.146.36.36 Dyn
80.67.169.12 FDN
2001:910:800::12 FDN
85.214.73.63 FoeBud
87.118.111.215 FoolDNS
213.187.11.62 FoolDNS
37.235.1.174 FreeDNS
37.235.1.177 FreeDNS
80.80.80.80 Freenom World
80.80.81.81 Freenom World
87.118.100.175 German Privacy Foundation e.V.
94.75.228.29 German Privacy Foundation e.V.
85.25.251.254 German Privacy Foundation e.V.
62.141.58.13 German Privacy Foundation e.V.
8.8.8.8 Google Public DNS
8.8.4.4 Google Public DNS
2001:4860:4860::8888 Google Public DNS
2001:4860:4860::8844 Google Public DNS
81.218.119.11 GreenTeamDNS
209.88.198.133 GreenTeamDNS
74.82.42.42 Hurricane Electric
2001:470:20::2 Hurricane Electric
209.244.0.3 Level3
209.244.0.4 Level3
156.154.70.1 Neustar DNS Advantage
156.154.71.1 Neustar DNS Advantage
5.45.96.220 New Nations
185.82.22.133 New Nations
198.153.192.1 Norton DNS
198.153.194.1 Norton DNS
208.67.222.222 OpenDNS
208.67.220.220 OpenDNS
2620:0:ccc::2 OpenDNS
2620:0:ccd::2 OpenDNS
58.6.115.42 OpenNIC
58.6.115.43 OpenNIC
119.31.230.42 OpenNIC
200.252.98.162 OpenNIC
217.79.186.148 OpenNIC
81.89.98.6 OpenNIC
78.159.101.37 OpenNIC
203.167.220.153 OpenNIC
82.229.244.191 OpenNIC
216.87.84.211 OpenNIC
66.244.95.20 OpenNIC
207.192.69.155 OpenNIC
72.14.189.120 OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9 OpenNIC
2001:470:1f07:38b::1 OpenNIC
2001:470:1f10:c6::2001 OpenNIC
194.145.226.26 PowerNS


[Figure 14: Answers received during DDoS attacks (extra graphs): OK, SERVFAIL, and no answer; y-axis: answers (0-25,000), x-axis: minutes after start (0-170); arrows indicate DDoS start and recovery. Panels: (a) Experiment D: 1800-50p-10min-ns1; (b) Experiment G: 300-75p-10min-2NSes.]

[Figure 15: Latency results (extra graphs): median, mean, 75%ile, and 90%ile RTT (ms) over minutes after start; the shaded area indicates the interval of an ongoing DDoS attack. Panels: (a) Experiment D: 50% packet loss on 1 NS only (1800 s TTL); (b) Experiment G: 75% packet loss (300 s TTL).]

77.220.232.44 PowerNS
9.9.9.9 Quad9
2620:fe::fe Quad9
195.46.39.39 SafeDNS
195.46.39.40 SafeDNS
193.58.251.251 SkyDNS
208.76.50.50 SmartViper Public DNS
208.76.51.51 SmartViper Public DNS
78.46.89.147 ValiDOM
88.198.75.145 ValiDOM
64.6.64.6 Verisign
64.6.65.6 Verisign
2620:74:1b::1:1 Verisign
2620:74:1c::2:2 Verisign
77.109.148.136 Xiala.net
77.109.148.137 Xiala.net
2001:1620:2078::136 Xiala.net
2001:1620:2078::137 Xiala.net
77.88.8.88 Yandex.DNS
77.88.8.2 Yandex.DNS
2a02:6b8::feed:bad Yandex.DNS
2a02:6b8:0:1::feed:bad Yandex.DNS
109.69.8.51 puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all the experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for experiments D and G.

Figure 14 shows the number of answers for experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when 1 NS fails (50% packet loss), there is no significant change in the number of answered queries. Moreover, most users will not notice any change in latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (thus half of the probing interval). Similarly to experiment I, this experiment should have no active caching, given that the TTL is shorter than the probing interval. Moreover, it has a packet loss rate of 75%. Even at this rate, we can see that the vast majority (72%) of users obtain an answer, with a certain latency increase (Figure 15b).

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur in current DNS software. These experiments confirm that prior work by Yu et al. [56] still holds. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic first with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) up, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from .net and querying one or both based on their internal algorithms. All queries are made with a cold cache.


[Figure 17: Probe 28477 on dataset I: a stub resolver (the Atlas probe) queries three recursives (R1a, R1b, R1c); all R1's are connected to all last-layer recursives (Rn1-Rn8), which are connected to both authoritative servers (AT1, AT2).]

[Figure 16: Histogram of queries from recursive resolvers (0-50), broken down by target (root, .net, cachetest.net), for BIND and Unbound under normal operation and during DDoS.]

In total, we send 100 queries per experiment; this section presents the results.

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the .net authoritative servers, and the authoritative servers for cachetest.net.
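This kind of per-zone accounting can be reproduced by capturing the resolver's outgoing DNS traffic and counting queries per destination; a rough sketch (the interface and file names here are placeholders):

    # capture all DNS traffic leaving the recursive resolver
    $ tcpdump -n -i eth0 -w recursive.pcap 'port 53'

    # count captured queries that mention the test domain
    $ tcpdump -n -r recursive.pcap | grep -c 'cachetest.net'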

Figure 16 shows the results of this experiment, with normal behavior in the two groups on the left and the inaccessible (DDoS) attempts in the two on the right. From this figure we see that, in normal operation, only a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the .net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations, it takes BIND 3 DNS queries (one to the root, one to .net, and one to the cachetest.net authoritative servers) to resolve the AAAA records of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist (this difference can thus be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure. Both BIND and Unbound send far more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps trying to resolve both the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 queries in total, out of 46). If we disregard these queries, Unbound would then increase the total number of queries from 6 to 16, a 6.4× increase. (We repeated this experiment 100 times, and all results were consistent.)

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND will ask both parents (the Roots and .net) for its records again (shown by the increase in the root and .net portions of BIND (DDoS) vs. BIND non-DDoS in Figure 16). Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect: recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas Probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last layer" recursives (the ones that ultimately send queries to authoritatives, Rn1-Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (RIPE Atlas probes) and the server side (authoritative servers). We show a summary of the results in Table 7. In this table, we mark the probing intervals T during which the simulated DDoS was active (dataset I: 90% packet drop on both authoritative servers; see Table 4).

Client side: Analyzing this table, we can learn the following from the client side. During normal operations, we see 3/3/3 at the client side: 3 queries, 3 answers, and 3 R1's used in the answers. By default, each RIPE Atlas probe sends a DNS query to each of its R1's.

During the DDoS, this changes slightly: we see 2 answers, from 2 R1's; the third query either times out or results in SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval).

This is, in fact, a good result, given the simulated DDoS.

Authoritative side: Now let's consider the authoritative side.


     Client view          Authoritative server view
     (Atlas probe)
 T   Qrs  Ans  #R1        Qrs  Ans  #ATs  #Rn  #(Rn-AT)  Max Que(Rn-AT) (% total)  Qrs top-2 Rn's
 1   3    3    3          3    3    2     2    3         1  (33.3%)                2, 1
 2   3    3    3          5    5    2     4    5         1  (20.0%)                2, 1
 3   3    3    3          6    6    2     6    6         1  (16.7%)                1, 1
 4   3    3    3          5    5    2     2    3         2  (40.0%)                4, 1
 5   3    3    3          3    3    2     3    3         1  (33.3%)                1, 1
 6   3    3    3          6    6    2     5    5         2  (33.3%)                2, 1
 7   3    2    2          29   3    2     2    4         10 (34.5%)                15, 14
 8   3    2    2          22   6    2     7    9         7  (31.8%)                10, 7
 9   3    2    2          21   1    2     3    6         9  (42.9%)                16, 3
10   3    2    2          11   4    2     3    4         6  (54.5%)                8, 2
11   3    2    2          21   3    2     4    6         15 (71.4%)                17, 2
12   3    2    2          23   1    2     8    8         15 (65.2%)                15, 2
13   3    3    3          3    3    2     3    3         1  (33.3%)                1, 1
14   3    3    3          4    2    2     4    4         1  (25.0%)                1, 1
15   3    3    3          8    8    2     7    7         7  (87.5%)                1, 1
16   3    3    3          9    9    2     3    3         6  (66.7%)                6, 3
17   3    3    3          9    9    2     5    5         5  (55.6%)                5, 1

Table 7: Client and authoritative view of AAAA queries (probe 28477, measurement I). Rows T=7-12 are during the simulated DDoS.

In normal operation (before the DDoS), we see that the 3 queries sent by the client (client view) lead to 3 to 6 queries being sent by the Rn's, which reach the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn's used before the attack varies from 2 to 6 (#Rn). This choice depends on how the R1's choose their upper-layer recursives (Rn-1), and ultimately on how Rn-1 chooses Rn.

We can also observe how each Rn chooses each authoritative (#(Rn-AT)): this metric shows the number of unique combinations of Rn and authoritative observed. For example, for T=1, we see that three queries reach the authoritatives, but only 2 Rn's were used. By looking at the number of unique recursive-AT combinations, we see that there are three; as a consequence, one of the Rn's must have used both authoritatives.

During the DDoS, however, things change: we see that the same 3 queries sent by probe 28477 (client side) now turn into 11-29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) Increase in the number of Rn's used: we see that the number of Rn's used also increases a little (3.6 on average before, 4.5 during). (However, not all Rn's are used all the time during the DDoS. We do not have enough data to explain why this happens, but it is presumably determined by how the lower-layer recursives choose them, or by how they are configured.)

(2) Increase in the number of authoritatives used by each Rn: each Rn, in turn, contacts an average of 1.13 authoritatives before the attack (the ratio of unique Rn-AT combinations to unique Rn's). During the DDoS, this number increases to 1.37. Both averages can be recomputed from Table 7, as sketched after this list.

(3) Retries by each recursive (whether switching authoritatives or not): we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (the last column of Table 7) and see that, compared to the total number of queries, they comprise the vast majority.
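For reference, the per-phase averages used in items (1) and (2) can be recomputed from a whitespace-separated export of Table 7; a small sketch (table7.txt and the column positions are assumptions about how the table is exported, with #Rn in column 8 and #(Rn-AT) in column 9):

    $ awk '$1 <= 6             { rn_pre  += $8; cmb_pre  += $9 }
           $1 >= 7 && $1 <= 12 { rn_ddos += $8; cmb_ddos += $9 }
           END { print rn_pre/6, rn_ddos/6, cmb_pre/rn_pre, cmb_ddos/rn_ddos }' table7.txt

On the data above this prints roughly 3.7 and 4.5 for the average #Rn, and 1.14 and 1.37 for the average authoritatives contacted per Rn, matching the numbers in the text up to rounding.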

For this probe, the answer to why there are so many queries is a combination of factors, with retries by the Rn's being the most important one. From the client side, there is no strong indication that the client's queries are responsible for such an increase in the number of queries to the authoritatives during a DDoS.

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 DNS Resolvers Stubs Recursives and Authoritatives
    • 22 Authoritative Replication and IP Anycast
    • 23 DNS Caching with Time-to-Live (TTLs)
      • 3 DNS Caching In Controlled Experiments
        • 31 Potential Impediments to Caching
        • 32 Measurement Design
        • 33 Datasets
        • 34 TTL distribution expected vs observed
        • 35 Public Recursives and Cache Fragmentation
          • 4 Caching Production Zones
            • 41 Requests at nls Authoritatives
            • 42 Requests at the DNS Root
              • 5 The Clients View of Authoritatives Under DDoS
                • 51 Emulating DDoS
                • 52 Clients During Complete Authoritatives Failure
                • 53 Discussion of Complete Failures
                • 54 Client Reliability During Partial Authoritative Failure
                • 55 Client Latency During Partial Authoritative Failure
                  • 6 The Authoritatives Perspective
                    • 61 Recursive-Authoritative Traffic during a DDoS
                    • 62 Sources of Retries Software and Multi-level Recursives
                      • 7 Related Work
                      • 8 Implications
                      • 9 Conclusions
                      • References
                      • A Which TTL affects the cache referrals or answers
                        • A1 The Problem of Glue Records
                        • A2 Experiments
                        • A3 Looking At a Recursives Cache
                        • A4 Discussion and Implications
                          • B TTL Manipulations across datasets
                          • C List of public resolvers
                          • D Extra graphs from Partial DDoS
                          • E Additional Information about Sources of Retries
                          • F Analysis of Retries from A Specific Probe
Page 14: When the Dike Breaks: Dissecting DNS Defenses During DDoS ...johnh/PAPERS/Moura18a.pdf · Each of these records may have different cache lifetimes (TTLs), by choice of the operator

ISI-TR-724b May 2018 (updated September 2018) G C M Moura et al

fact they can largely cover over partial DDoS attacks for manyusersmdasheven with a DDoS resulting in 90 packet loss and lastinglonger than the cache timeout more than half of VPs get answerswith 30 minute caches (Figure 8c) and about 40 of VPs get answers(Figure 8d) even with minimal duration caches

The primary cost of DDoS for users can be greater latency, but even this penalty is uneven across users, with a few seeing much greater latency while others see little or no change. Finally, we show that one result of retries is that traffic from legitimate users to authoritatives greatly increases (up to 8×) during service interruption, and that this effect is magnified by complex, multi-layer recursive resolver systems. The key outcome of our work is to quantify the importance of caching and retries in recursives for resilience, encouraging use of at least moderate TTLs wherever possible.

Acknowledgments

The authors would like to thank Jelte Jansen, Benno Overeinder, Marc Groeneweg, Wes Hardaker, Duane Wessels, Warren Kumari, Stéphane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers for their valuable comments on paper drafts.

This research has been partially supported by measurements obtained from RIPE Atlas, an open measurement platform operated by RIPE NCC, as well as by the DITL measurement data made available by DNS-OARC.

Giovane C. M. Moura, Moritz Müller, and Marco Davids developed this work as part of the SAND project (http://www.sand-project.nl).

John Heidemann's research is partially sponsored by the Air Force Research Laboratory and the Department of Homeland Security under agreements number FA8750-17-2-0280 and FA8750-17-2-0096. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES
[1] 1.1.1.1. 2018. The Internet's Fastest, Privacy-First DNS Resolver. https://1.1.1.1/
[2] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. 2017. Dissecting DNS Stakeholders in Mobile Networks. In Proceedings of the 13th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT '17). ACM, New York, NY, USA, 28–34. https://doi.org/10.1145/3143361.3143375
[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas, and Yi Zhou. 2017. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium. USENIX, Vancouver, BC, Canada, 1093–1110. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-antonakakis.pdf
[4] Arbor Networks. 2012. Worldwide Infrastructure Security Report. Technical Report 2012 Volume VIII. Arbor Networks. http://www.arbornetworks.com/resources/infrastructure-security-report
[5] Vaibhav Bajpai, Steffie Eravuchira, Jürgen Schönwälder, Robert Kisteleki, and Emile Aben. 2017. Vantage Point Selection for IPv6 Measurements: Benefits and Limitations of RIPE Atlas Tags. In IFIP/IEEE International Symposium on Integrated Network Management (IM 2017). Lisbon, Portugal.
[6] Vaibhav Bajpai, Steffie Jacob Eravuchira, and Jürgen Schönwälder. 2015. Lessons Learned from Using the RIPE Atlas Platform for Measurement Research. SIGCOMM Comput. Commun. Rev. 45, 3 (July 2015), 35–42. http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000005.pdf
[7] Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, and Jitendra Padhye. 2015. Analyzing the Performance of an Anycast CDN. In Proceedings of the ACM Internet Measurement Conference. ACM, Tokyo, Japan. https://doi.org/10.1145/2815675.2815717
[8] Ricardo de Oliveira Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Passive and Active Measurements (PAM). 188–200.
[9] DNS OARC. 2018. DITL Traces and Analysis. https://www.dns-oarc.net/index.php/oarc/data/ditl/2018
[10] R. Elz and R. Bush. 1997. Clarifications to the DNS Specification. RFC 2181 (Proposed Standard). 14 pages. https://doi.org/10.17487/RFC2181 Updated by RFCs 4035, 2535, 4343, 4033, 4034, 5452.
[11] R. Elz, R. Bush, S. Bradner, and M. Patton. 1997. Selection and Operation of Secondary DNS Servers. RFC 2182 (Best Current Practice). 11 pages. https://doi.org/10.17487/RFC2182
[12] Google. 2018. Public DNS. https://developers.google.com/speed/public-dns
[13] Shuai Hao and Haining Wang. 2017. Exploring Domain Name Based Features on the Effectiveness of DNS Caching. SIGCOMM Comput. Commun. Rev. 47, 1 (Jan. 2017), 36–42. https://doi.org/10.1145/3041027.3041032
[14] Scott Hilton. 2016. Dyn Analysis Summary of Friday October 21 Attack. Dyn blog. https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[15] Paul Hoffman, Andrew Sullivan, and K. Fujiwara. 2018. DNS Terminology. Internet Draft. https://datatracker.ietf.org/doc/draft-ietf-dnsop-terminology-bis/?include_text=1
[16] ICANN. 2014. RSSAC002: RSSAC Advisory on Measurements of the Root Server System. https://www.icann.org/en/system/files/files/rssac-002-measurements-root-20nov14-en.pdf
[17] ISC BIND. 2018. Chapter 6: BIND 9 Configuration Reference. https://ftp.isc.org/isc/bind9/cur/9.10/doc/arm/Bv9ARM.ch06.html
[18] Sam Kottler. 2018. February 28th DDoS Incident Report. GitHub Engineering. https://githubengineering.com/ddos-incident-report/
[19] D. Lawrence and W. Kumari. 2017. Serving Stale Data to Improve DNS Resiliency. Internet Draft draft-tale-dnsop-serve-stale-02. https://www.ietf.org/archive/id/draft-tale-dnsop-serve-stale-02.txt
[20] P.V. Mockapetris. 1987. Domain Names - Concepts and Facilities. RFC 1034 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1034 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 2065, 2181, 2308, 2535, 4033, 4034, 4035, 4343, 4592, 5936, 8020.
[21] P.V. Mockapetris. 1987. Domain Names - Implementation and Specification. RFC 1035 (Internet Standard). 55 pages. https://doi.org/10.17487/RFC1035 Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604, 7766.
[22] Carlos Morales. 2018. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-era-upon-us/
[23] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. 2016. Anycast vs. DDoS: Evaluating the November 2015 Root DNS Event. In Proceedings of the ACM Internet Measurement Conference. https://doi.org/10.1145/2987443.2987446
[24] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. Datasets from "When the Dike Breaks: Dissecting DNS Defenses During DDoS". (May 2018). Web page. https://ant.isi.edu/datasets/dns/Moura18a_data
[25] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). Technical Report ISI-TR-725b. USC/Information Sciences Institute. https://www.isi.edu/~johnh/PAPERS/Moura18a.html (updated Sept. 2018).
[26] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. 2018. When the Dike Breaks: Dissecting DNS Defenses During DDoS (extended). In Proceedings of the ACM Internet Measurement Conference. https://www.isi.edu/~johnh/PAPERS/Moura18b.html
[27] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. 2017. Recursives in the Wild: Engineering Authoritative DNS Servers. In Proceedings of the ACM Internet Measurement Conference. London, UK, 489–495. https://doi.org/10.1145/3131365.3131366
[28] NLnet Labs. 2018. NLnet Labs Documentation - Unbound - unbound.conf.5. https://nlnetlabs.nl/documentation/unbound/unbound.conf/
[29] OpenDNS. 2018. Setup Guide: OpenDNS. https://www.opendns.com/setupguide/
[30] Jianping Pan, Y. Thomas Hou, and Bo Li. 2003. An Overview of DNS-based Server Selections in Content Distribution Networks. Computer Networks 43, 6 (2003), 695–711.
[31] Jeffrey Pang, Aditya Akella, Anees Shaikh, Balachander Krishnamurthy, and Srinivasan Seshan. 2004. On the Responsiveness of DNS-based Network Control. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 21–26. https://doi.org/10.1145/1028788.1028792
[32] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs, and Srinivasan Seshan. 2004. Availability, Usage, and Deployment Characteristics of the Domain Name System. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/1028788.1028790
[33] Paul Vixie, Gerry Sneeringer, and Mark Schleifer. 2002. Events of 21-Oct-2002. http://c.root-servers.org/october21.txt
[34] Nicole Perlroth. 2016. Hackers Used New Weapons to Disrupt Major Websites Across U.S. New York Times (Oct. 22, 2016), A1. http://www.nytimes.com/2016/10/22/business/internet-problems-attack.html
[35] Nicole Perlroth. 2016. Tally of Cyber Extortion Attacks on Tech Companies Grows. New York Times Bits Blog. http://bits.blogs.nytimes.com/2014/06/19/tally-of-cyber-extortion-attacks-on-tech-companies-grows/
[36] Alec Peterson. 2017. EC2 Resolver Changing TTL on DNS Answers. Post on the DNS-OARC dns-operations mailing list. https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html
[37] Quad9. 2018. Quad9 | Internet Security & Privacy in a Few Easy Steps. https://quad9.net
[38] RIPE NCC. 2017. RIPE Atlas Measurement IDs. https://atlas.ripe.net/measurements/ID, where ID is the experiment ID: TTL60: 10443671, TTL1800: 10507676, TTL3600: 10536725, TTL86400: 10579327, TTL3600-10min: 10581463, A: 10859822, B: 11102436, C: 11221270, D: 11804500, E: 11831403, F: 11831403, G: 12131707, H: 12177478, I: 12209843.
[39] RIPE NCC Staff. 2015. RIPE Atlas: A Global Internet Measurement Network. Internet Protocol Journal (IPJ) 18, 3 (Sep. 2015), 2–26.
[40] RIPE Network Coordination Centre. 2018. RIPE Atlas - Raw Data Structure Documentation. https://atlas.ripe.net/docs/data_struct/
[41] Root Server Operators. 2015. Events of 2015-11-30. http://root-servers.org/news/events-of-20151130.txt
[42] Root Server Operators. 2016. Events of 2016-06-25. Technical Report. Root Server Operators. http://www.root-servers.org/news/events-of-20160625.txt
[43] Root Server Operators. 2017. Root DNS. http://root-servers.org/
[44] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. 2015. Booters - An Analysis of DDoS-as-a-Service Attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management. IFIP, Ottawa, Canada.
[45] D. Schinazi and T. Pauly. 2017. Happy Eyeballs Version 2: Better Connectivity Using Concurrency. RFC 8305. Internet Request For Comments. https://doi.org/10.17487/RFC8305
[46] Ricardo de O. Schmidt, John Heidemann, and Jan Harm Kuipers. 2017. Anycast Latency: How Many Sites Are Enough? In Proceedings of the Passive and Active Measurement Workshop. Springer, Sydney, Australia, 188–200. http://www.isi.edu/~johnh/PAPERS/Schmidt17a.html
[47] Bruce Schneier. 2016. Lessons From the Dyn DDoS Attack. Blog post. https://www.schneier.com/essays/archives/2016/11/lessons_from_the_dyn.html
[48] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. 2013. On Measuring the Client-side DNS Infrastructure. In Proceedings of the 2013 ACM Conference on Internet Measurement Conference. ACM, 77–90.
[49] Somini Sengupta. 2012. After Threats, No Signs of Attack by Hackers. New York Times (Apr. 1, 2012), A1. http://www.nytimes.com/2012/04/01/technology/no-signs-of-attack-on-internet.html
[50] SIDN Labs. 2017. .nl Stats and Data. http://stats.sidnlabs.nl
[51] Matthew Thomas and Duane Wessels. 2015. A Study of Caching Behavior with Respect to Root Server TTLs. DNS-OARC. https://indico.dns-oarc.net/event/24/contributions/374/
[52] Unbound. 2018. Unbound Documentation. https://www.unbound.net/documentation/unbound.conf.html
[53] Lan Wei and John Heidemann. 2017. Does Anycast Hang Up on You? In IEEE International Workshop on Traffic Monitoring and Analysis. Dublin, Ireland.
[54] M. Weinberg and D. Wessels. 2016. Review and Analysis of Attack Traffic Against A-root and J-root on November 30 and December 1, 2015. In DNS-OARC 24, Buenos Aires, Argentina. https://indico.dns-oarc.net/event/22/session/4/contribution/7
[55] Maarten Wullink, Giovane C. M. Moura, Moritz Müller, and Cristian Hesselman. 2016. ENTRADA: A High-Performance Network Traffic Data Streaming Warehouse. In Network Operations and Management Symposium (NOMS), 2016 IEEE/IFIP. IEEE, 913–918.
[56] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. 2012. Authority Server Selection in DNS Caching Resolvers. SIGCOMM Comput. Commun. Rev. 42, 2 (March 2012), 80–86. https://doi.org/10.1145/2185376.2185387

A WHICH TTL AFFECTS THE CACHE: REFERRALS OR ANSWERS?

Cache estimations in §3 and §4 assume each record has a single, clear TTL. However, glue records bridge zones that are served by different authoritative servers and can result in two different TTL values for a record. This section examines which TTL takes precedence.

A1 The Problem of Glue Records

DNS employs glue records to identify name servers that are in the served zone itself. For example, the Canadian national domain ca is served from ca-servers.ca, but to look up ca-servers.ca one must know where ca is. To break this cycle, records that identify these servers ("glue records") are placed in the parent zone (the root in this case). With these records in two places, they may have different TTLs. We next evaluate which TTL takes priority in this case.

If we ask the Root DNS servers for the NS records of ca, we obtain the answer shown in Listing 1:

$ dig ns ca @b.root-servers.net
;; AUTHORITY SECTION:
ca.    172800  IN  NS  any.ca-servers.ca.
ca.    172800  IN  NS  j.ca-servers.ca.
ca.    172800  IN  NS  x.ca-servers.ca.
ca.    172800  IN  NS  c.ca-servers.ca.

Listing 1: Referral answer to dig ns ca @b.root-servers.net

Notice that the answer for the NS records is in the "Authority Section"; this signals that the answer is a "referral", a type of answer in which "a server, signaling that it is not (completely) authoritative for an answer, provides the querying resolver with an alternative place to send its query" [15]. In this case, the AA bit in the answer is clear.

Given that the Root servers are not authoritative for ca, a recursive can ask the same query directly to any of the ca servers. Listing 2 shows the result, an authoritative answer:

$ dig ns ca @any.ca-servers.ca
;; ANSWER SECTION:
ca.    86400  IN  NS  c.ca-servers.ca.
ca.    86400  IN  NS  j.ca-servers.ca.
ca.    86400  IN  NS  x.ca-servers.ca.
ca.    86400  IN  NS  any.ca-servers.ca.

Listing 2: Answer to dig ns ca @any.ca-servers.ca

Notice that the records in this response from any.ca-servers.ca are presented in the "ANSWER SECTION" (and not in the "AUTHORITY SECTION"), and the AA bit is set.

Even though the records are the same as in the case of the Roots, their TTLs are different: 2 days and 1 day, respectively. The same behavior can be observed for the A records of the ca servers (dig A ca @b.root-servers.net and dig A ca @any.ca-servers.ca).

We refer to this as TTL duplication: the same record is defined both at the authoritative servers and at its respective parent zone, as a glue record.

This duplication poses a challenge for recursive resolvers: which value should they trust, the one from the glue or the one from the authoritative name server?

In theory, the answer from the authoritative should be the one used, since those are the servers that are authoritative for the zone; the Roots in this example are not "completely" authoritative [15]. This behavior can be observed in various zones as well, not only at the Roots.

Assuming that the records are the same, as for ca in this section, we investigate which values resolvers in the wild actually use: those provided by the parent or those provided by the authoritative. According to RFC 2181 [10] (§5.4.1), a recursive should use the authoritative answer. Given the importance of TTLs to DNS resilience (cache duration), we verify in this section which values recursives trust: the parent's or the authoritative name server's.
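This comparison can be reproduced for any zone with dig; a quick sketch (the recursive at 8.8.8.8 is just an example, and the TTL it reports also decreases as its cached copy ages):

$ dig +norecurse ns ca @b.root-servers.net   # parent view: referral TTL (172800 s)
$ dig +norecurse ns ca @any.ca-servers.ca    # child view: authoritative TTL (86400 s)
$ dig ns ca @8.8.8.8                         # which of the two does a recursive pass on?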


                 NS record   A record   Source
Total Answers    128,382     98,038
TTL > 3600       38          46         Unclear
TTL = 3600       260         233        Parent
60 < TTL < 3600  6,890       4,621      Parent/others
TTL = 60         60,803      46,820     Authoritative
TTL < 60         60,391      46,318     Authoritative

Table 5: Distribution of TTLs of DNS answers for cachetest.nl

A records of amazon.com:
domain        IP               TTL
amazon.com    205.251.242.103  60
amazon.com    176.32.103.205   60
amazon.com    176.32.98.166    60

NS records of amazon.com:
domain        NS                    TTL-auth  TTL-com
amazon.com    ns1.p31.dynect.net    3600      172800
amazon.com    pdns6.ultradns.co.uk  3600      172800
amazon.com    ns2.p31.dynect.net    3600      172800
amazon.com    ns4.p31.dynect.net    3600      172800
amazon.com    pdns1.ultradns.net    3600      172800
amazon.com    ns3.p31.dynect.net    3600      172800

A records of the NS records:
domain                IP             TTL
ns1.p31.dynect.net    208.78.70.31   86400
pdns6.ultradns.co.uk  204.74.115.1   3600
ns2.p31.dynect.net    204.13.250.31  86400
ns4.p31.dynect.net    204.13.251.31  86400
pdns1.ultradns.net    204.74.108.1   3600
ns3.p31.dynect.net    208.78.71.31   86400

Table 6: amazon.com NS records and TTLs on 2018-08-03

A2 Experiments

We use the same setup used in §3 and configured the TTL values of our name servers (ns[1,2].cachetest.nl) as follows:

• Referral value (at the nl zone): TTL of 3600 s
• Answer value (at the authoritative servers ns[1,2].cachetest.nl): TTL of 60 s

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of the TTL values of the answers obtained.
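Such a measurement can be scheduled through the RIPE Atlas API; the sketch below uses the public v2 endpoint with an illustrative probe count and a placeholder API key, and does not reproduce the exact parameters of measurements 15642129/15628702. Setting use_probe_resolver makes each probe query its local recursives rather than an authoritative directly:

$ curl -s -X POST "https://atlas.ripe.net/api/v2/measurements/?key=$ATLAS_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "definitions": [{
        "type": "dns", "af": 4,
        "query_class": "IN", "query_type": "NS",
        "query_argument": "cachetest.nl",
        "use_probe_resolver": true,
        "description": "NS TTL: parent vs authoritative"
      }],
      "probes": [{"requested": 10000, "type": "area", "value": "WW"}],
      "is_oneoff": true
    }'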

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the one provided by the parent nl server. We can therefore conclude, from these two experiments for A and NS records, that most recursives forward to their clients the TTL provided by the authoritatives of a zone.

A3 Looking at a Recursive's Cache

In §A2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritative, not the one in the referral.

In this section we look into the caching of the Unbound DNS server when we query for the NS records of amazon.com. Similarly to the ca example of the previous section, these records have two TTL values: 3600 s at their authoritatives and 172,800 s at the parent com servers, as can be seen in Table 6.

To determine which specific record two popular recursive resolver implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send only one query: dig ns amazon.com. We then dump the contents of the caches of these servers and inspect which values they store.
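Both servers can dump their caches from the command line; a sketch, assuming rndc and unbound-control are already configured for the local server (file locations vary by installation):

# BIND: write the cache to named_dump.db in BIND's working directory
$ rndc dumpdb -cache
$ grep amazon.com named_dump.db

# Unbound: print the in-memory cache on stdout
$ unbound-control dump_cache | grep amazon.com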

Listing 3 and Listing 4 show the contents of the caches of both BIND and Unbound after sending the aforementioned query. As can be seen, the values stored in the cache are the ones provided by the authoritative server (decremented by a few seconds by the time we dumped the contents), and not the ones provided by the parent's referral.

; authanswer
amazon.com.  3595  NS  ns1.p31.dynect.net.
amazon.com.  3595  NS  ns2.p31.dynect.net.
amazon.com.  3595  NS  ns3.p31.dynect.net.
amazon.com.  3595  NS  ns4.p31.dynect.net.
amazon.com.  3595  NS  pdns1.ultradns.net.
amazon.com.  3595  NS  pdns6.ultradns.co.uk.

Listing 3: BIND cache dump excerpt for amazon.com

;rrset 3594 6 0 7 3
amazon.com.  3594  IN  NS  pdns1.ultradns.net.
amazon.com.  3594  IN  NS  ns1.p31.dynect.net.
amazon.com.  3594  IN  NS  ns3.p31.dynect.net.
amazon.com.  3594  IN  NS  pdns6.ultradns.co.uk.
amazon.com.  3594  IN  NS  ns4.p31.dynect.net.
amazon.com.  3594  IN  NS  ns2.p31.dynect.net.

Listing 4: Unbound cache dump excerpt for amazon.com

A4 Discussion and Implications

Understanding how glue records affect TTLs matters because it shows that zone operators are, in fact, in charge of how long their NS records remain in their clients' caches. They should choose these values carefully, as discussed in §8.
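Concretely, these values are set in the child zone file; a hypothetical excerpt for cachetest.nl with a one-hour NS TTL (the addresses are documentation-prefix placeholders, not our servers' actual addresses):

$ORIGIN cachetest.nl.
@     3600  IN  NS  ns1.cachetest.nl.
@     3600  IN  NS  ns2.cachetest.nl.
ns1   3600  IN  A   192.0.2.1   ; RFC 5737 example address
ns2   3600  IN  A   192.0.2.2   ; RFC 5737 example address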

B TTL MANIPULATIONS ACROSS DATASETS

In §3.4 we saw that there are a number of AC-type answers: queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of the 60 s TTL (where all queries go to the authoritative, AA), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.

[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time, for TTLs of (a) 60 s, (b) 1800 s, (c) 3600 s, (d) 86400 s (1 day), and (e) 3600 s with probing interval T=10 min.]
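A client-side way to distinguish a cached answer from an authoritative one, assuming shell access to a test client (the resolver address and record name below are placeholders), is to query twice and compare TTLs:

$ dig +noall +answer A test.cachetest.nl @192.0.2.53
$ sleep 10
$ dig +noall +answer A test.cachetest.nl @192.0.2.53
# A second TTL roughly 10 s lower than the first indicates a cache hit (CC);
# a TTL reset to the zone's configured value indicates the query went back
# to the authoritative (AC/AA).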

C LIST OF PUBLIC RESOLVERS

We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06.

198.101.242.72    Alternate DNS
23.253.163.53     Alternate DNS
205.204.88.60     BlockAid Public DNS (or PeerDNS)
178.21.23.150     BlockAid Public DNS (or PeerDNS)
91.239.100.100    Censurfridns
89.233.43.71      Censurfridns



2001:67c:28a4::           Censurfridns
2002:d596:2a92:1:71:53::  Censurfridns
213.73.91.35              Chaos Computer Club Berlin
209.59.210.167            Christoph Hochstätter
85.214.117.11             Christoph Hochstätter

212.82.225.7              ClaraNet
212.82.226.212            ClaraNet
8.26.56.26                Comodo Secure DNS
8.20.247.20               Comodo Secure DNS
84.200.69.80              DNS.Watch
84.200.70.40              DNS.Watch
2001:1608:10:25::1c04:b12f  DNS.Watch
2001:1608:10:25::9249:d69b  DNS.Watch
104.236.210.29            DNSReactor
45.55.155.25              DNSReactor
216.146.35.35             Dyn
216.146.36.36             Dyn
80.67.169.12              FDN
2001:910:800::12          FDN
85.214.73.63              FoeBud
87.118.111.215            FoolDNS
213.187.11.62             FoolDNS
37.235.1.174              FreeDNS
37.235.1.177              FreeDNS
80.80.80.80               Freenom World
80.80.81.81               Freenom World
87.118.100.175            German Privacy Foundation e.V.
94.75.228.29              German Privacy Foundation e.V.
85.25.251.254             German Privacy Foundation e.V.
62.141.58.13              German Privacy Foundation e.V.
8.8.8.8                   Google Public DNS
8.8.4.4                   Google Public DNS
2001:4860:4860::8888      Google Public DNS
2001:4860:4860::8844      Google Public DNS
81.218.119.11             GreenTeamDNS
209.88.198.133            GreenTeamDNS
74.82.42.42               Hurricane Electric
2001:470:20::2            Hurricane Electric
209.244.0.3               Level3
209.244.0.4               Level3
156.154.70.1              Neustar DNS Advantage
156.154.71.1              Neustar DNS Advantage
5.45.96.220               New Nations
185.82.22.133             New Nations
198.153.192.1             Norton DNS
198.153.194.1             Norton DNS
208.67.222.222            OpenDNS
208.67.220.220            OpenDNS
2620:0:ccc::2             OpenDNS
2620:0:ccd::2             OpenDNS
58.6.115.42               OpenNIC
58.6.115.43               OpenNIC
119.31.230.42             OpenNIC
200.252.98.162            OpenNIC
217.79.186.148            OpenNIC
81.89.98.6                OpenNIC
78.159.101.37             OpenNIC
203.167.220.153           OpenNIC
82.229.244.191            OpenNIC
216.87.84.211             OpenNIC
66.244.95.20              OpenNIC
207.192.69.155            OpenNIC
72.14.189.120             OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9  OpenNIC
2001:470:1f07:38b::1      OpenNIC
2001:470:1f10:c6::2001    OpenNIC
194.145.226.26            PowerNS



77.220.232.44             PowerNS
9.9.9.9                   Quad9
2620:fe::fe               Quad9
195.46.39.39              SafeDNS
195.46.39.40              SafeDNS
193.58.251.251            SkyDNS
208.76.50.50              SmartViper Public DNS
208.76.51.51              SmartViper Public DNS
78.46.89.147              ValiDOM
88.198.75.145             ValiDOM
64.6.64.6                 Verisign
64.6.65.6                 Verisign
2620:74:1b::1:1           Verisign
2620:74:1c::2:2           Verisign
77.109.148.136            Xiala.net
77.109.148.137            Xiala.net
2001:1620:2078::136       Xiala.net
2001:1620:2078::137       Xiala.net
77.88.8.88                Yandex.DNS
77.88.8.2                 Yandex.DNS
2a02:6b8::feed:bad        Yandex.DNS
2a02:6b8:0:1::feed:bad    Yandex.DNS
109.69.8.51               puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOS

Although we carried out all experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for experiments D and G.

Figure 14 shows the number of answers for experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when one NS fails (50% packet loss), there is no significant change in the number of answered queries. Moreover, most users will not notice a difference in latency either (Figure 15a).

[Figure 14: Answers received during DDoS attacks (extra graphs): (a) Experiment D, 1800-50p-10min-ns1; (b) Experiment G, 300-75p-10min-2NSes. Arrows indicate DDoS start and recovery; answers are classified as OK, SERVFAIL, or no answer.]

Figure 14b shows the results for Experiment G, in which the TTL is 300 s (thus half of the probing interval). Similarly to experiment I, this experiment should not have caching active, given that the TTL is shorter than the probing interval. Moreover, it has a packet loss rate of 75%. At this rate, we can see that the large majority (72%) of users still obtain an answer, with a certain latency increase (Figure 15b).

[Figure 15: Latency results (extra graphs): (a) Experiment D, 50% packet loss on one NS only (1800 s TTL); (b) Experiment G, 75% packet loss (300 s TTL). Curves show median, mean, 75%ile, and 90%ile RTT; the shaded area indicates the interval of an ongoing DDoS attack.]

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized sources of retries. Here we provide additional experiments to understand how many retries occur in current DNS software. These experiments confirm that prior work by Yu et al. [56] still holds. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations, we deploy our own recursive resolvers and record traffic, first with the authoritatives (for cachetest.net, not cachetest.nl as in the rest of the paper) up, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from net and querying one or both based on their internal algorithms. All queries are made with a cold cache.


In total, we send 100 queries per experiment and present the results in this section.
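The failure mode can be emulated with standard tools; a minimal sketch, assuming the resolver under test listens on 127.0.0.1 and 192.0.2.53 stands in for one of the cachetest.net authoritative addresses (both placeholders):

# capture the recursive's outgoing queries to the authoritative
$ tcpdump -i any -w retries.pcap 'host 192.0.2.53 and port 53' &
# make the authoritative unreachable by dropping packets to it
$ iptables -A OUTPUT -d 192.0.2.53 -j DROP
# query through the recursive with a cold cache
$ dig AAAA sub.cachetest.net @127.0.0.1
# count the retries the recursive generated
$ tcpdump -r retries.pcap 2>/dev/null | wc -l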

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the net authoritative servers, and the authoritative servers for cachetest.net.

Figure 16 shows the results of this experiment, with normal behavior in the left two groups and the inaccessible (DDoS) attempts in the right two. From this figure we see that in normal operation only a few queries are required to look up the target domain: the server looks up AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

[Figure 16: Histogram of queries from recursive resolvers: BIND and Unbound under normal operation (left two groups) and during DDoS (right two), broken down by target (root, net, cachetest.net).]

Under normal operations it takes BIND 3 DNS queries (one to the Root, one to net, and one to the cachetest.net authoritative servers) to resolve the AAAA records of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist (this difference can thus be neglected, since it may be specific to our setup).

The two groups on the right of Figure 16 show much more aggressive querying during server failure. Both BIND and Unbound send far more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps trying to resolve the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain: 30 queries in total out of 46). If we disregard these queries, Unbound increases its total number of queries from 6 to 16, a 6.4× increase. (We repeated this experiment 100 times, and all results were consistent.)

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND will ask both parents (the Roots and net) for its records again (shown by the increase in the root and net portions of BIND DDoS vs. BIND non-DDoS in Figure 16); Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect; recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last layer" recursives (the ones that ultimately send queries to the authoritatives: Rn1-Rn8), as can be seen in Figure 17.

[Figure 17: Probe 28477 in dataset I: the stub resolver at the Atlas probe queries three local recursives (r1a-r1c), all of which are connected to all last-layer recursives (rn1-rn8), which in turn are connected to both authoritative servers (at1, at2).]

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (RIPE Atlas probes) and the server side (authoritative servers). We show a summary of the results in Table 7. In this table we mark the probing intervals T in which the simulated DDoS was active (dataset I: 90% packet drop on both authoritative servers; see Table 4).
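The per-recursive, per-authoritative counts in Table 7 can be derived from the authoritative-side packet captures; a sketch of one way to do it (the capture file name is a placeholder, and the field names follow Wireshark's DNS dissector, not necessarily our actual tooling):

$ tshark -r authoritatives.pcap \
    -Y 'dns.flags.response == 0 && dns.qry.name contains "28477.cachetest.nl"' \
    -T fields -e ip.src -e ip.dst | sort | uniq -c | sort -rn
# one output line per unique (Rn, AT) combination, with its query count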

Client side: Analyzing this table, we can learn the following from the client side.

During normal operations we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1's used in the answers); by default, each RIPE Atlas probe sends a DNS query to each of its R1's.

During the DDoS this changes slightly: we see 2 answers, from 2 R1's; the third query either times out or results in SERVFAIL. Thus, at 90% packet loss on both authoritatives, still 2 of 3 queries are answered at the client (and there is no caching, since the TTL of the records is shorter than the probing interval).

This is, in fact, a good result given the simulated DDoS.

Authoritative side: Now let us consider the authoritative side.


     Client View (Atlas probe)     Authoritative Server View
T    Queries  Answers  #R1        Queries  Answers  #ATs  #Rn  #(Rn-AT)  Max Que(Rn-AT) (% total)  Queries (top-2 Rn's)
1    3        3        3          3        3        2     2    3         1  (33.3%)                2, 1
2    3        3        3          5        5        2     4    5         1  (20.0%)                2, 1
3    3        3        3          6        6        2     6    6         1  (16.7%)                1, 1
4    3        3        3          5        5        2     2    3         2  (40.0%)                4, 1
5    3        3        3          3        3        2     3    3         1  (33.3%)                1, 1
6    3        3        3          6        6        2     5    5         2  (33.3%)                2, 1
7    3        2        2          29       3        2     2    4         10 (34.5%)                15, 14
8    3        2        2          22       6        2     7    9         7  (31.8%)                10, 7
9    3        2        2          21       1        2     3    6         9  (42.9%)                16, 3
10   3        2        2          11       4        2     3    4         6  (54.5%)                8, 2
11   3        2        2          21       3        2     4    6         15 (71.4%)                17, 2
12   3        2        2          23       1        2     8    8         15 (65.2%)                15, 2
13   3        3        3          3        3        2     3    3         1  (33.3%)                1, 1
14   3        3        3          4        2        2     4    4         1  (25.0%)                1, 1
15   3        3        3          8        8        2     7    7         7  (87.5%)                1, 1
16   3        3        3          9        9        2     3    3         6  (66.7%)                6, 3
17   3        3        3          9        9        2     5    5         5  (55.6%)                5, 1

Table 7: Client and authoritative view of AAAA queries (probe 28477, measurement I). Rows T=7-12 are the probing intervals during which the simulated DDoS was active.

In normal operation (before the DDoS), we see that the 3 queries sent by the clients (client view) lead to 3 to 6 queries sent by the Rn's reaching the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn's used before the attack varies from 2 to 6 (#Rn). This choice depends on how the R1's choose their upper-layer recursives (Rn-1) and, ultimately, how each Rn-1 chooses its Rn.

We can also observe how each Rn chooses each authoritative (#(Rn-AT)): this metric shows the number of unique combinations of Rn and authoritatives observed. For example, for T=1 we see that three queries reach the authoritatives but only 2 Rn's were used. By looking at the number of unique recursive-authoritative combinations, we see that there are three; as a consequence, one of the Rn's must have used both authoritatives.

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11-29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) Increase in the number of Rn's used: we see that the number of Rn's used also increases a little (3.6 on average before the attack, 4.5 during it). (Not all Rn's are used all the time during the DDoS. We do not have enough data to explain why this happens, but it is surely related to how the lower-layer recursives choose their upper layers or how they are set up.)

(2) Increase in the number of authoritatives used by each Rn: each Rn in turn contacts an average of 1.13 authoritatives before the attack (the average of UniqRn-AT/UniqRn). During the DDoS, this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (Queries (top-2 Rn's)) and see that, compared to the total number of queries, they comprise the large majority.

For this probe, the explanation for the large number of queries is a combination of factors, with retries by the Rn's being the most important one. From the client side, there is no strong indication that the client's queries are what drives such an increase in the number of queries to the authoritatives during a DDoS.


During the DDoS however things change we see that the still3 queries send by probe 28477 (client side) now turn into 11ndash29queries received at the authoritatives from the maximum of 6queries before the DDoS The client remains unaware of this

We see three factors at play here

(1) Increase in number of used Rn we see that the numberof number of used Rn also increases a little ( 36 averagebefore to 45 during) (However not all Rn are used all thetime during the DDoS We have not enough data of whythis happens ndash but surely is how the lower layer recursiveschooses these ones or how they are set up)

(2) Increase in the number of authoritatives used by eachRn Each Rn in turns contacts an average of 113 author-itatives before the attack (average of UniqRn-ATUniqRn)During the DDoS this number increases to 137

(3) Retrials by each recursive either by switching authorita-tives or not we found this to be the most important factorfor this probe We show the number of queries sent by thetop 2 recursives (QueriesTop2Rn) and see that compared tothe total number of queries they comprise the far majority

For this probe the answer of why so many queries is a combi-nation of factors being the retrials by Rn the most important oneFrom the client side there is no strong indication that the clientrsquosqueries are leading to such an increase in the number of queries toauthoritatives during a DDoS

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 DNS Resolvers Stubs Recursives and Authoritatives
    • 22 Authoritative Replication and IP Anycast
    • 23 DNS Caching with Time-to-Live (TTLs)
      • 3 DNS Caching In Controlled Experiments
        • 31 Potential Impediments to Caching
        • 32 Measurement Design
        • 33 Datasets
        • 34 TTL distribution expected vs observed
        • 35 Public Recursives and Cache Fragmentation
          • 4 Caching Production Zones
            • 41 Requests at nls Authoritatives
            • 42 Requests at the DNS Root
              • 5 The Clients View of Authoritatives Under DDoS
                • 51 Emulating DDoS
                • 52 Clients During Complete Authoritatives Failure
                • 53 Discussion of Complete Failures
                • 54 Client Reliability During Partial Authoritative Failure
                • 55 Client Latency During Partial Authoritative Failure
                  • 6 The Authoritatives Perspective
                    • 61 Recursive-Authoritative Traffic during a DDoS
                    • 62 Sources of Retries Software and Multi-level Recursives
                      • 7 Related Work
                      • 8 Implications
                      • 9 Conclusions
                      • References
                      • A Which TTL affects the cache referrals or answers
                        • A1 The Problem of Glue Records
                        • A2 Experiments
                        • A3 Looking At a Recursives Cache
                        • A4 Discussion and Implications
                          • B TTL Manipulations across datasets
                          • C List of public resolvers
                          • D Extra graphs from Partial DDoS
                          • E Additional Information about Sources of Retries
                          • F Analysis of Retries from A Specific Probe
Page 16: When the Dike Breaks: Dissecting DNS Defenses During DDoS ...johnh/PAPERS/Moura18a.pdf · Each of these records may have different cache lifetimes (TTLs), by choice of the operator


                   NS record   A record   Source
Total answers      128382      98038
TTL > 3600         38          46         Unclear
TTL = 3600         260         233        Parent
60 < TTL < 3600    6890        4621       Parent/others
TTL = 60           60803       46820      Authoritative
TTL < 60           60391       46318      Authoritative

Table 5: Distribution of TTLs of DNS answers for cachetest.nl.

A records of amazon.com:
domain       IP                TTL
amazon.com   205.251.242.103   60
amazon.com   176.32.103.205    60
amazon.com   176.32.98.166     60

NS records of amazon.com:
domain       NS                     TTL-auth   TTL-com
amazon.com   ns1.p31.dynect.net     3600       172800
amazon.com   pdns6.ultradns.co.uk   3600       172800
amazon.com   ns2.p31.dynect.net     3600       172800
amazon.com   ns4.p31.dynect.net     3600       172800
amazon.com   pdns1.ultradns.net     3600       172800
amazon.com   ns3.p31.dynect.net     3600       172800

A records of the NS records:
domain                 IP              TTL
ns1.p31.dynect.net     208.78.70.31    86400
pdns6.ultradns.co.uk   204.74.115.1    3600
ns2.p31.dynect.net     204.13.250.31   86400
ns4.p31.dynect.net     204.13.251.31   86400
pdns1.ultradns.net     204.74.108.1    3600
ns3.p31.dynect.net     208.78.71.31    86400

Table 6: amazon.com NS records and TTLs on 2018-08-03.

A2 Experiments
We use the same setup as in §3 and configure the TTL values of our name servers (ns[1,2].cachetest.nl) as follows:

• Referral value (at the .nl zone): TTL of 3600 s
• Answer value (at the authoritative servers ns[1,2].cachetest.nl): TTL of 60 s

We then use ~10,000 Atlas probes (yielding ~15,000 vantage points) to query their local recursives for the NS records of cachetest.nl (RIPE measurement ID 15642129). We did the same for A records in another measurement (ID 15628702). We then analyze the distribution of the TTL values of the answers obtained.

Table 5 shows the distribution of TTLs for these two experiments. As can be seen, the large majority of answers (about 95%) carry the TTL values provided by the authoritative servers, and not the value provided by the parent .nl zone. From these two experiments we can therefore conclude that, for A and NS records, most recursives forward to their clients the TTL provided by the authoritatives of a zone.
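To make the bucketing in Table 5 concrete, the following sketch (our illustration, not the measurement code used for the table; it assumes the Python dnspython package and a reachable local recursive) queries a recursive and classifies the returned TTL the same way:

import dns.resolver

PARENT_TTL = 3600   # referral TTL at the parent (.nl) zone
CHILD_TTL = 60      # answer TTL at ns[1,2].cachetest.nl

def classify_ttl(name: str, rdtype: str) -> str:
    # dnspython >= 2.0: resolve() uses the system's configured recursive.
    answer = dns.resolver.resolve(name, rdtype)
    ttl = answer.rrset.ttl
    if ttl > PARENT_TTL:
        return f"TTL={ttl}: unclear (larger than both configured values)"
    if ttl == PARENT_TTL:
        return f"TTL={ttl}: parent (referral) value"
    if ttl > CHILD_TTL:
        return f"TTL={ttl}: parent value aged in cache, or other manipulation"
    return f"TTL={ttl}: authoritative (child) value"

print(classify_ttl("cachetest.nl", "NS"))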

A3 Looking At a Recursive's Cache
In §A2 we showed that, in the wild, most recursives serve their clients the TTL provided by the authoritatives, not the referral TTL. In this section we look into the cache of a recursive resolver after querying for the NS records of amazon.com. Similarly to the example in the previous section, these records have two TTL values: 3600 s at their authoritative servers and 172800 s at the parent .com servers, as can be seen in Table 6.

To determine which specific record two popular recursive resolver implementations use, we analyze the cache contents of both Unbound (version 1.5.8) and BIND (9.10.3). We start the servers with a cold cache and send a single query: dig ns amazon.com. We then dump the contents of each server's cache and inspect which values are stored.
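The cache dumps can be produced with the standard control tools of each server (rndc for BIND, unbound-control for Unbound). A small wrapper along these lines (a sketch; the BIND dump-file path is configuration-dependent and assumed here) extracts the relevant records:

import subprocess

def bind_cache_lines(pattern, dumpfile="/var/cache/bind/named_dump.db"):
    # 'rndc dumpdb -cache' writes the cache to named's configured dump-file.
    subprocess.run(["rndc", "dumpdb", "-cache"], check=True)
    with open(dumpfile) as f:
        return [line for line in f if pattern in line]

def unbound_cache_lines(pattern):
    # 'unbound-control dump_cache' prints the cache contents on stdout.
    out = subprocess.run(["unbound-control", "dump_cache"],
                         check=True, capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if pattern in line]

for line in unbound_cache_lines("amazon.com"):
    print(line)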

Listing 3 and Listing 4 show excerpts of the cache contents of BIND and Unbound, respectively, after sending the aforementioned query. As can be seen, the values stored in the cache (decremented by the few seconds that passed before we dumped the contents) are the ones provided by the authoritative server, and not the referral values provided by the parent.

; authanswer
amazon.com 3595 NS ns1.p31.dynect.net
amazon.com 3595 NS ns2.p31.dynect.net
amazon.com 3595 NS ns3.p31.dynect.net
amazon.com 3595 NS ns4.p31.dynect.net
amazon.com 3595 NS pdns1.ultradns.net
amazon.com 3595 NS pdns6.ultradns.co.uk

Listing 3: BIND cache dump excerpt for amazon.com

;rrset 3594 6 0 7 3
amazon.com 3594 IN NS pdns1.ultradns.net
amazon.com 3594 IN NS ns1.p31.dynect.net
amazon.com 3594 IN NS ns3.p31.dynect.net
amazon.com 3594 IN NS pdns6.ultradns.co.uk
amazon.com 3594 IN NS ns4.p31.dynect.net
amazon.com 3594 IN NS ns2.p31.dynect.net

Listing 4: Unbound cache dump excerpt for amazon.com

A4 Discussion and Implications
Understanding how glue records affect TTLs is important: it shows that zone operators are in fact in charge of how long their NS records remain in their clients' caches. They should choose these values carefully, as discussed in §8.

B TTL MANIPULATIONS ACROSS DATASETS
In §3.4 we saw a number of AC-type answers: queries that are answered by the authoritative even though we expected them to be served from a cache. We now investigate the relation between the number of AC answers and the TTL used for the DNS records.

Figure 13 shows response types over time for our four different TTLs. Analyzing these figures, we can see that, with the exception of the 60 s TTL (where all queries go to the authoritative), the number of AC answers is relatively constant. This consistency suggests cache fragmentation across the datasets. We also see that AA and CC values alternate: as cache entries expire, new queries are forwarded to the authoritatives.
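The classification itself is mechanical. The sketch below (our reconstruction; the names and the expectation rule are assumptions based on the description above) pairs where an answer was observed to come from, A for authoritative (full, un-decremented TTL) or C for cache (decremented TTL), with where we expected it to come from given the record TTL and the time since the cache was last filled:

def classify_answer(observed_ttl: int, record_ttl: int,
                    seconds_since_cache_fill: int) -> str:
    observed = "A" if observed_ttl == record_ttl else "C"
    # We expect a cached answer while the record could still be alive in cache.
    expected = "C" if seconds_since_cache_fill < record_ttl else "A"
    return observed + expected

# Example: TTL 1800 s, probed 600 s after the cache was filled, but the answer
# carries a fresh TTL: answered by the Authoritative, Cache expected.
print(classify_answer(1800, 1800, 600))  # -> "AC"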

[Figure 13: Fraction of query answer types (AA, AC, CC, CA) over time, in minutes after start. Panels: (a) TTL 60 s; (b) TTL 1800 s; (c) TTL 3600 s; (d) TTL 86400 s (1 day); (e) TTL 3600 s (1 hour) with probing interval T=10 min.]

C LIST OF PUBLIC RESOLVERS
We use the following list of public resolvers in §3.4. This list was obtained by searching DuckDuckGo for "public" and "dns" on 2018-01-06:

198.101.242.72 Alternate DNS
23.253.163.53 Alternate DNS
205.204.88.60 BlockAid Public DNS (or PeerDNS)
178.21.23.150 BlockAid Public DNS (or PeerDNS)
91.239.100.100 Censurfridns
89.233.43.71 Censurfridns
2001:67c:28a4:: Censurfridns
2002:d596:2a92:1:71:53:: Censurfridns
213.73.91.35 Chaos Computer Club Berlin
209.59.210.167 Christoph Hochstätter
85.214.117.11 Christoph Hochstätter
212.82.225.7 ClaraNet
212.82.226.212 ClaraNet
8.26.56.26 Comodo Secure DNS
8.20.247.20 Comodo Secure DNS
84.200.69.80 DNS.Watch
84.200.70.40 DNS.Watch
2001:1608:10:25::1c04:b12f DNS.Watch
2001:1608:10:25::9249:d69b DNS.Watch
104.236.210.29 DNSReactor
45.55.155.25 DNSReactor
216.146.35.35 Dyn
216.146.36.36 Dyn
80.67.169.12 FDN
2001:910:800::12 FDN
85.214.73.63 FoeBud
87.118.111.215 FoolDNS
213.187.116.2 FoolDNS
37.235.1.174 FreeDNS
37.235.1.177 FreeDNS
80.80.80.80 Freenom World
80.80.81.81 Freenom World
87.118.100.175 German Privacy Foundation e.V.
94.75.228.29 German Privacy Foundation e.V.
85.25.251.254 German Privacy Foundation e.V.
62.141.58.13 German Privacy Foundation e.V.
8.8.8.8 Google Public DNS
8.8.4.4 Google Public DNS
2001:4860:4860::8888 Google Public DNS
2001:4860:4860::8844 Google Public DNS
81.218.119.11 GreenTeamDNS
209.88.198.133 GreenTeamDNS
74.82.42.42 Hurricane Electric
2001:470:20::2 Hurricane Electric
209.244.0.3 Level3
209.244.0.4 Level3
156.154.70.1 Neustar DNS Advantage
156.154.71.1 Neustar DNS Advantage
5.45.96.220 New Nations
185.82.22.133 New Nations
198.153.192.1 Norton DNS
198.153.194.1 Norton DNS
208.67.222.222 OpenDNS
208.67.220.220 OpenDNS
2620:0:ccc::2 OpenDNS
2620:0:ccd::2 OpenDNS
58.6.115.42 OpenNIC
58.6.115.43 OpenNIC
119.31.230.42 OpenNIC
200.252.98.162 OpenNIC
217.79.186.148 OpenNIC
81.89.98.6 OpenNIC
78.159.101.37 OpenNIC
203.167.220.153 OpenNIC
82.229.244.191 OpenNIC
216.87.84.211 OpenNIC
66.244.95.20 OpenNIC
207.192.69.155 OpenNIC
72.14.189.120 OpenNIC
2001:470:8388:2:20e:2eff:fe63:d4a9 OpenNIC
2001:470:1f07:38b::1 OpenNIC
2001:470:1f10:c6::2001 OpenNIC
194.145.226.26 PowerNS
77.220.232.44 PowerNS
9.9.9.9 Quad9
2620:fe::fe Quad9
195.46.39.39 SafeDNS
195.46.39.40 SafeDNS
193.58.251.251 SkyDNS
208.76.50.50 SmartViper Public DNS
208.76.51.51 SmartViper Public DNS
78.46.89.147 ValiDOM
88.198.75.145 ValiDOM
64.6.64.6 Verisign
64.6.65.6 Verisign
2620:74:1b::1:1 Verisign
2620:74:1c::2:2 Verisign
77.109.148.136 Xiala.net
77.109.148.137 Xiala.net
2001:1620:2078::136 Xiala.net
2001:1620:2078::137 Xiala.net
77.88.8.88 Yandex.DNS
77.88.8.2 Yandex.DNS
2a02:6b8::feed:bad Yandex.DNS
2a02:6b8:0:1::feed:bad Yandex.DNS
109.69.8.51 puntCAT

[Figure 14: Answers received during DDoS attacks (extra graphs): OK, SERVFAIL, or no answer, per minute after start; arrows indicate DDoS start and recovery. (a) Experiment D: 1800-50p-10min-ns1; (b) Experiment G: 300-75p-10min-2NSes.]

[Figure 15: Latency results (extra graphs): median, mean, 75th- and 90th-percentile RTT (ms) over minutes after start; the shaded area indicates an ongoing DDoS attack. (a) Experiment D: 50% packet loss on one NS only (1800 s TTL); (b) Experiment G: 75% packet loss (300 s TTL).]

D EXTRA GRAPHS FROM PARTIAL DDOS
Although we carried out all the experiments listed in Table 4 (§5), the body of the paper provides graphs for key results only. In this section we include the graphs for experiments D and G.

Figure 14 shows the number of answers for experiments D and G, while Figure 15 shows the latency results. We can see in Figure 14a that when one NS fails (50% packet loss) there is no significant change in the number of answered queries, and most users will not notice any difference in latency (Figure 15a).

Figure 14b shows the results for Experiment G, in which the TTL is 300 s, half of the probing interval. Like experiment I, this experiment should have no active caching, given that the TTL is shorter than the probing interval; moreover, it has a packet loss rate of 75%. Even at this rate, we can see that the large majority (72%) of users obtain an answer, at the cost of some increase in latency (Figure 15b).

E ADDITIONAL INFORMATION ABOUT SOURCES OF RETRIES

In §6.2 we summarized the sources of query retries. Here we provide additional experiments to understand how many retries occur with current DNS software; these experiments confirm that the earlier findings of Yu et al. [56] still hold. We examine two popular recursive resolver implementations: BIND 9.10.3 and Unbound 1.5.8.

To evaluate these implementations we deploy our own recursive resolvers and record traffic first with the authoritatives (cachetest.net, not cachetest.nl as in the rest of the paper) reachable, and then with them inaccessible. For consistency, we only retrieve AAAA records for sub.cachetest.net. The recursives operate normally, learning about the authoritatives from .net and querying one or both of them based on their internal algorithms. All queries are made with a cold cache. In total we send 100 queries per experiment; we present the results in this section.

[Figure 16: Histogram of queries from the recursive resolvers to the Root, .net, and cachetest.net authoritatives, for four scenarios: BIND, Unbound, BIND (DDoS), and Unbound (DDoS).]
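A driver for this experiment can be as simple as the sketch below (our reconstruction, not the authors' script; it assumes an Unbound instance at 127.0.0.1 managed by unbound-control, and the standard dig tool):

import subprocess

RECURSIVE = "127.0.0.1"
QNAME = "sub.cachetest.net"

def one_cold_cache_query():
    # Flush everything under cachetest.net so each run starts with a cold
    # cache (for BIND, 'rndc flush' would be the equivalent).
    subprocess.run(["unbound-control", "flush_zone", "cachetest.net"],
                   check=True)
    # No check=True here: during the emulated DDoS, dig may time out.
    return subprocess.run(["dig", "@" + RECURSIVE, "AAAA", QNAME],
                          capture_output=True, text=True).stdout

for _ in range(100):  # 100 queries per experiment, as described above
    one_cold_cache_query()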

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the .net authoritative servers, and the authoritative servers of cachetest.net.
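Counting these queries from a packet capture is straightforward; a rough sketch (assuming the scapy package and a pcap taken at the recursives; the cachetest.net server address below is a placeholder) is:

from collections import Counter
from scapy.all import rdpcap, DNS, IP

LEVEL = {
    "198.41.0.4": "root",            # e.g., a.root-servers.net
    "192.5.6.30": ".net",            # e.g., a.gtld-servers.net
    "203.0.113.1": "cachetest.net",  # placeholder for our authoritative
}

counts = Counter()
for pkt in rdpcap("recursive.pcap"):
    # qr == 0 selects DNS queries (not responses).
    if pkt.haslayer(DNS) and pkt[DNS].qr == 0 and pkt.haslayer(IP):
        level = LEVEL.get(pkt[IP].dst)
        if level is not None:        # ignore traffic to other destinations
            counts[level] += 1
print(counts)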

Figure 16 shows the results of this experiment, with normal behavior in the two groups on the left and the inaccessible (DDoS) scenario in the two groups on the right. From this figure we see that in normal operation only a few queries are required to look up the target domain: the server looks up the AAAA records of both nameservers and then queries the AAAA record of the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the .net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operation it takes BIND 3 DNS queries (one to the Root, one to .net, and one to the cachetest.net authoritative servers) to resolve the AAAA record of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for AAAA records of their NS records, which do not exist. (This difference can be neglected, since it may be specific to our setup.)

The two groups on the right of Figure 16 show far more aggressive querying during server failure: both BIND and Unbound send many more DNS queries to all available authoritative servers. BIND sends 12 queries instead of the 3 it sends under normal operation (a 4× increase), while Unbound sends 46 queries in total (roughly 7.7× its 6 queries under normal operation). Unbound keeps trying to resolve both the target domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain; they account for 30 of the 46 queries). Even if we disregard these queries, Unbound still increases its total number of queries from 6 to 16, a 2.7× increase. (We repeated this experiment 100 times and the results were consistent.)
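The multipliers quoted above follow directly from the query counts (taking 6 as Unbound's normal-operation total, as used in the text):

bind_normal, bind_ddos = 3, 12
unbound_normal, unbound_ddos, setup_specific = 6, 46, 30

print(bind_ddos / bind_normal)                                     # 4.0
print(round(unbound_ddos / unbound_normal, 1))                     # 7.7
print(round((unbound_ddos - setup_specific) / unbound_normal, 1))  # 2.7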

Another difference between BIND and Unbound is that, when cachetest.net is unreachable, BIND asks both parents (the Root and .net) for its records again (shown by the increase in the root and .net portions of the BIND DDoS bars vs. the BIND non-DDoS bars in Figure 16); Unbound does not retry these queries.

This experiment is consistent with our observation that we see about 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect; recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics; we next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local" recursives (R1a, R1b, R1c) and 8 "last-layer" recursives (the ones that ultimately send queries to the authoritatives: Rn1–Rn8), as can be seen in Figure 17.

[Figure 17: Resolver topology of probe 28477 in dataset I: the Atlas probe's stub resolver queries the three local recursives (R1a, R1b, R1c); all R1s are connected to all last-layer recursives (Rn1–Rn8), which in turn are connected to both authoritative servers (AT1, AT2).]

In this section we analyze all incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We analyze data collected at both the client side (RIPE Atlas probes) and the server side (authoritative servers), and summarize the results in Table 7. In this table, the highlighted probing intervals T (rows 7–12) are those in which the simulated DDoS was active (dataset I: 90% packet drop on both authoritative servers; see Table 4).
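Aligning the two views only requires mapping timestamps to probing intervals; a minimal sketch (the field names are our assumptions) is:

PROBE_INTERVAL = 600  # seconds; T is 10 minutes in these experiments

def interval_index(ts: float, experiment_start: float) -> int:
    # Map a timestamp to the 1-based probing interval T used in Table 7.
    return int((ts - experiment_start) // PROBE_INTERVAL) + 1

# Client-side (Atlas) results and server-side (authoritative) log entries
# with the same index T land in the same row of Table 7.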

Client side: from the client side of Table 7 we learn the following. During normal operation we see 3/3/3 at the client (3 queries, 3 answers, and 3 R1s used in the answers); by default, each RIPE Atlas probe sends a DNS query to each of its R1s.

During the DDoS this changes slightly: we see 2 answers from 2 R1s, while the third query either times out or results in SERVFAIL. Thus, at 90% packet loss on both authoritatives, 2 of 3 queries are still answered at the client (and there is no caching involved, since the TTL of the records is shorter than the probing interval).
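A back-of-the-envelope model (ours, not from the paper) shows why retries rescue most client queries: if each client query ultimately triggers n independent tries toward the authoritatives, and each try is dropped with probability p, the query fails only if every try is lost:

def answered_fraction(p: float, n: int) -> float:
    # Probability that at least one of n tries survives loss rate p.
    return 1 - p ** n

for n in (1, 3, 6, 10):
    print(n, round(answered_fraction(0.9, n), 3))
# 1 0.1 / 3 0.271 / 6 0.469 / 10 0.651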

This is in fact a good result, given the severity of the simulated DDoS.

Authoritative side: now let us consider the authoritative side.


     Client view (Atlas probe)    Authoritative server view
T    Queries  Answers  #R1        Queries  Answers  #ATs  #Rn  #(Rn-AT)  Max queries per (Rn,AT) (% of total)  Queries of top-2 Rn's
1    3        3        3          3        3        2     2    3         1 (33.3%)                             2, 1
2    3        3        3          5        5        2     4    5         1 (20.0%)                             2, 1
3    3        3        3          6        6        2     6    6         1 (16.7%)                             1, 1
4    3        3        3          5        5        2     2    3         2 (40.0%)                             4, 1
5    3        3        3          3        3        2     3    3         1 (33.3%)                             1, 1
6    3        3        3          6        6        2     5    5         2 (33.3%)                             2, 1
7    3        2        2          29       3        2     2    4         10 (34.5%)                            15, 14
8    3        2        2          22       6        2     7    9         7 (31.8%)                             10, 7
9    3        2        2          21       1        2     3    6         9 (42.9%)                             16, 3
10   3        2        2          11       4        2     3    4         6 (54.5%)                             8, 2
11   3        2        2          21       3        2     4    6         15 (71.4%)                            17, 2
12   3        2        2          23       1        2     8    8         15 (65.2%)                            15, 2
13   3        3        3          3        3        2     3    3         1 (33.3%)                             1, 1
14   3        3        3          4        2        2     4    4         1 (25.0%)                             1, 1
15   3        3        3          8        8        2     7    7         7 (87.5%)                             1, 1
16   3        3        3          9        9        2     3    3         6 (66.7%)                             6, 3
17   3        3        3          9        9        2     5    5         5 (55.6%)                             5, 1

Table 7: Client and authoritative views of AAAA queries (probe 28477, measurement I). Rows T=7–12, highlighted in the original, cover the simulated DDoS.

In normal operation (before the DDoS), the 3 queries sent by the client (client view) lead to 3 to 6 queries sent by the Rn and reaching the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn used before the attack varies from 2 to 6 (#Rn). This depends on how the R1s choose their upper-layer recursives and, ultimately, on how each recursive in the chain chooses the next.

We can also observe how each Rn chooses an authoritative: #(Rn-AT) gives the number of unique combinations of Rn and authoritative observed. For example, for T=1 we see that three queries reach the authoritatives but only 2 Rn were used; since there are three unique recursive-authoritative combinations, one of the Rn must have used both authoritatives.
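All the per-interval metrics in Table 7 can be derived from the list of (Rn, authoritative) pairs observed for the queries of one interval; a sketch (identifiers are our assumptions) is:

from collections import Counter

def table7_metrics(pairs):
    # pairs: one (rn, at) tuple per query seen at the authoritatives.
    per_pair = Counter(pairs)                # queries per (Rn, AT) combination
    per_rn = Counter(rn for rn, _ in pairs)  # queries per recursive
    total = len(pairs)
    return {
        "queries": total,
        "#Rn": len(per_rn),
        "#(Rn-AT)": len(per_pair),
        "max_pair_share": max(per_pair.values()) / total,
        "top2_Rn_queries": [n for _, n in per_rn.most_common(2)],
    }

# T=1 in Table 7: three queries, two recursives, three unique combinations.
print(table7_metrics([("rn1", "at1"), ("rn1", "at2"), ("rn2", "at1")]))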

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11–29 queries received at the authoritatives, up from a maximum of 6 before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) An increase in the number of Rn used: the number of Rn used increases somewhat (from 3.6 on average before the attack to 4.5 during it). However, not all Rn are used all the time during the DDoS; we do not have enough data to explain why, but it presumably follows from how the lower-layer recursives choose them, or from how they are configured.

(2) An increase in the number of authoritatives used by each Rn: each Rn contacts an average of 1.13 authoritatives before the attack (the average of #(Rn-AT)/#Rn). During the DDoS this number increases to 1.37.

(3) Retries by each recursive, whether switching authoritatives or not: we found this to be the most important factor for this probe. The number of queries sent by the top-2 recursives (last column of Table 7) shows that, compared to the total number of queries, these two recursives account for the large majority.

For this probe, then, the large number of queries results from a combination of factors, retries by the Rn being the most important one. From the client side there is no strong indication that the client's queries are what drives the increase in queries to the authoritatives during a DDoS.


When the Dike Breaks Dissecting DNS Defenses During DDoS (extended) ISI-TR-724b May 2018 (updated September 2018)

0 2500 5000 7500

10000 12500 15000 17500 20000

0 20 40 60 80 100

answ

ers

mintues after start

AA AC CC CA

(a) TTL 60 s

0 2500 5000 7500

10000 12500 15000 17500 20000

0 20 40 60 80 100

answ

ers

mintues after start

AA AC CC CA

(b) TTL 1800 s

0 2500 5000 7500

10000 12500 15000 17500 20000

0 20 40 60 80 100

answ

ers

mintues after start

AA AC CC CA

(c) TTL 3600 s

0 2500 5000 7500

10000 12500 15000 17500 20000

0 20 40 60 80 100

answ

ers

mintues after start

AA AC CC CA

(d) TTL 86400 s (1 day)

0

2500

5000

7500

10000

12500

15000

17500

20000

0 20 40 60 80 100

an

sw

ers

mintues after start

AA AC CC CA

(e) TTL 3600 s (1 hour) Probing Interval T=10min

Figure 13 Fraction of query answer types over time200167c28a4 Censurfridns2002d5962a9217153 Censurfridns213739135 Chaos Computer Club Berlin20959210167 Christoph HochstAtildeďtter8521411711 Christoph HochstAtildeďtter

212822257 ClaraNet21282226212 ClaraNet8265626 Comodo Secure DNS82024720 Comodo Secure DNS842006980 DNSWatch842007040 DNSWatch2001160810251c04b12f DNSWatch2001160810259249d69b DNSWatch10423621029 DNSReactor455515525 DNSReactor2161463535 Dyn2161463636 Dyn806716912 FDN200191080012 FDN852147363 FoeBud87118111215 FoolDNS2131871162 FoolDNS372351174 FreeDNS372351177 FreeDNS80808080 Freenom World80808181 Freenom World87118100175 German Privacy Foundation eV947522829 German Privacy Foundation eV8525251254 German Privacy Foundation eV621415813 German Privacy Foundation eV8888 Google Public DNS8844 Google Public DNS2001486048608888 Google Public DNS2001486048608844 Google Public DNS8121811911 GreenTeamDNS20988198133 GreenTeamDNS74824242 Hurricane Electric2001470202 Hurricane Electric20924403 Level320924404 Level3156154701 Neustar DNS Advantage156154711 Neustar DNS Advantage54596220 New Nations1858222133 New Nations1981531921 Norton DNS1981531941 Norton DNS20867222222 OpenDNS20867220220 OpenDNS26200ccc2 OpenDNS26200ccd2 OpenDNS58611542 OpenNIC58611543 OpenNIC1193123042 OpenNIC20025298162 OpenNIC21779186148 OpenNIC8189986 OpenNIC7815910137 OpenNIC203167220153 OpenNIC82229244191 OpenNIC2168784211 OpenNIC662449520 OpenNIC20719269155 OpenNIC7214189120 OpenNIC20014708388220e2efffe63d4a9 OpenNIC20014701f0738b1 OpenNIC20014701f10c62001 OpenNIC19414522626 PowerNS

ISI-TR-724b May 2018 (updated September 2018) G C M Moura et al

0

5000

10000

15000

20000

25000

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170

cache-onlynormal normal

answ

ers

minutes after start

OK SERVFAIL No answer

(a) Experiment D 1800-50p-10min-ns1 arrows indicate DDoS start andrecovery

0

5000

10000

15000

20000

25000

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170

cache-onlynormal normal

answ

ers

minutes after start

OK SERVFAIL No answer

(b) Experiment G 300-75p-10min-2Nses arrows indicate DDoS start andrecovery

Figure 14 Answers received during DDoS attacks (extragraphs)

0

500

1000

1500

2000

2500

3000

3500

4000

0 20 40 60 80 100 120 140 160

late

ncy

(ms)

minutes after start

Median RTTMean RTT75ile RTT90ile RTT

(a) Experiment D 50 packet loss on 1 NS only (1800 s TTL)

0

500

1000

1500

2000

2500

3000

3500

4000

0 20 40 60 80 100 120 140 160

late

ncy

(ms)

minutes after start

Median RTTMean RTT75ile RTT90ile RTT

(b) Experiment G 75 packet loss (300 s TTL)

Figure 15 Latency results Shaded area indicates the intervalof an ongoing DDoS attack (extra graphs)

7722023244 PowerNS9999 Quad92620fefe Quad9195463939 SafeDNS195463940 SafeDNS19358251251 SkyDNS208765050 SmartViper Public DNS208765151 SmartViper Public DNS784689147 ValiDOM8819875145 ValiDOM646646 Verisign646656 Verisign2620741b11 Verisign2620741c22 Verisign77109148136 Xialanet77109148137 Xialanet200116202078136 Xialanet200116202078137 Xialanet7788888 YandexDNS778882 YandexDNS2a026b8feedbad YandexDNS2a026b801feedbad YandexDNS10969851 puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOSAlthough we carried out all experiments listed in Table 4 (sect5) thebody of the paper provides graphs for key results only In thissection we include the graphs for experiments D and G

Figure 14 shows the number of answers for experiments D and Gwhile Figure 15 shows the latency results We can see at Figure 14athat when 1 NS fails (50) there is no significant changes in thenumber of answered queries Besides most of the users will notnotice in terms of latency (Figure 15a)

Figure 14b shows the results for Experiment G in which the TTLis 300 s (thus half of the probing interval) Similarly to experimentI this experiment should not have caching active given the TTL isshorter Moreover it has a packet loss rate of 75 At this rate wecan see that the far majority (72) of users obtain an answer witha certain latency increase (Figure 15b)

E ADDITIONAL INFORMATION ABOUTSOURCES OF RETRIES

In sect62 we summarized sources of retries Here we provide ad-ditional experiments to understand how many retries occur incurrent DNS software These experiments confirm prior work byYu et al [56] still holds We examine two popular recursive resolverimplementations BIND 9103 and Unbound 158

To evaluate these implementations we deploy our own recursiveresolvers and record traffic with the authoritatives (cachetestnetnot cachetestnl as in the rest of the paper) up and then with theminaccessible For consistency we only retrieve AAAA records forsubcachetestnet The recursives operates normally learning aboutthe authoritatives from net and querying one or both based on itsinternal algorithms All queries are made with a cold cache In total

When the Dike Breaks Dissecting DNS Defenses During DDoS (extended) ISI-TR-724b May 2018 (updated September 2018)

at1 at2

rn1 rn2 rn3 rn4 rn5 rn6 rn7 rn8

r1a r1b r1c

probe

authoritativeservers

recursives

stub resolveratlas probe

Figure 17 probe 284777 on dataset i all r1rsquos are connected toall rnrsquos which are connected to all at rsquos

0

10

20

30

40

50

BIND Unbound BIND (DDoS) Unbound (DDoS)

qu

erie

s

root net

cachetestnet

Figure 16 Histogram of queries from recursives resolvers

we send 100 queries per experiment and present in this section theresults

We start by analyzing traffic monitored at the recursive resolversWe consider only queries sent from the recursives towards any ofthe authoritative servers in the cachetestnet hierarchy the RootDNS servers the net authoritative servers and the authoritativeservers for cachetestnet

Figure 16 shows the results of this experiment with normalbehavior in the left two groups and the inaccessible DDoS attemptsin the right two From this figure we see that in normal operationa few queries are required to look up the target domain the serverlooks up AAAA records for both nameservers and then queries theAAAA record for the target (Unbound also queries for the AAAArecords of the authoritative name serves of cachetestnet whichare not in the net glue neither in the cachetestnet zone whichexplains the difference with BIND which does not)

Under normal operations it takes BIND 3 DNS queries (1 to theroot one to net and one to cachetestnet authoritative servers) toresolve the AAAA records of subcachetestnet Unbound in turn

takes 5 queriesmdashit sends 2more to the authoritatives of cachetestnetasking for AAAA records of their NS records which do not exist(thus this difference can be neglected since it may be specific to oursetup)

The two groups on the right of Figure 16 show much moreaggressive queries during server failure Both BIND and Unboundsend far more DNS queries to all authoritative servers available Wesee that BIND sends 12 queries instead of 3 it does under normaloperations (4times increase) while Unbound sends 46 queries in total(146times increase) Unbound keeps on trying to resolve the domainsand the AAAA records of the authoritative name servers (the latterdoes not exist which it is specific to our domain 30 in total out of46) If we disregard these queries Unbound would then increase thetotal number of queries from 6 to 16 a 64times increase (We repeatedthis experiment 100 times and all results were consistent)

Another difference between BIND and Unbound is that whencachetestnet is unreachable BIND will ask both parents (the Rootsand net) for its records again (shown by increase in the root and netportions of bind DDoS vs bind non-DDoS in Figure 16) Unbounddoes not retry these queries

This experiment is consistent with our observation that we see7times more traffic than usual during near-complete failure We do notmean to suggest that these retries are incorrectmdashrecursives need toretry to account for packet lossF ANALYSIS OF RETRIES FROM A SPECIFIC

PROBEIn sect6 we examined changes in legitimate traffic during a DDoSevent showing traffic greatly increases because of retries by eachof possibly several recursive resolvers That analysis looked ataggregate statistics We next examine a specific probe to confirmour evaluation

We consider RIPE Atlas Probe 28477 in Experiment I This probeis located in Australia It has three ldquolocal recursivesrdquo (R1a R1b R1c )and 8 ldquolast layerrdquo recursives (the ones that ultimately send queriesto authoritatives Rn1ndashRn8) as can be seen in Figure 17

In this section we analyze all the incoming queries arriving onour authoritatives for AAAA records of 28477cachetestnl We donot analyze NS queries since we cannot associate them to thisprobe

We then analyze data collected at both the client side (Ripe Atlasprobes) and the server side (Authoritative servers) We show thesummary of the results in Table 7 In this table we highlight in lightcolor the probing intervals T in which the simulated DDoS started(dataset I 90 packet drop on both authoritative servers ndash Table 4)

Client Side Analyzing this table we can learn the followingfrom the Client side

During normal operations we see 333 at the client side (3 queries3 answers and 3 R1n used in the answers) By default each RipeAtlas probe sends a DNS query to each R1n

During the DDoS this slightly change we see 2 answers from 2R1s ndash the other either times out or SERVFAIL Thus at 90 packetloss on both authoritatives still 2 of 3 queries are answered at theclient (and there is no caching since the TTL of records is shorterthan the probing interval)

This is in fact a good result given the simulated DDoSAuthoritative side Now letrsquos consider the authoritative side

ISI-TR-724b May 2018 (updated September 2018) G C M Moura et al

TClient View (Atlas probe) Authoritative Server View

Queries Answers R1 Queries Answers AT s Rn Max ( total) Queries

(Rn minus AT ) (Que(Rn minus AT )) (top-2 Rnrsquos)1 3 3 3 3 3 2 2 3 1 (333) 212 3 3 3 5 5 2 4 5 1 (200) 213 3 3 3 6 6 2 6 6 1 (167) 114 3 3 3 5 5 2 2 3 2 (400) 415 3 3 3 3 3 2 3 3 1 (333) 116 3 3 3 6 6 2 5 5 2 (333) 217 3 2 2 29 3 2 2 4 10 (345) 15148 3 2 2 22 6 2 7 9 7 (318) 1079 3 2 2 21 1 2 3 6 9 (429) 16310 3 2 2 11 4 2 3 4 6 (545) 8211 3 2 2 21 3 2 4 6 15 (714) 17212 3 2 2 23 1 2 8 8 15 (652) 15213 3 3 3 3 3 2 3 3 1 (333) 1114 3 3 3 4 2 2 4 4 1 (250) 1115 3 3 3 8 8 2 7 7 7 (875) 1116 3 3 3 9 9 2 3 3 6 (667) 6317 3 3 3 9 9 2 5 5 5 (556) 51

Table 7 Client and Authoritative view of AAAA queries (probe 28477 measurement I)

In normal operation (before the DDoS) we see that the 3 queriessent by the clients (client view) lead to 3 to 6 queries to be send bythe Rn which reach the authoritatives Every query is answeredand both authoritatives are used ( ATs)

We also see that the number of Rn used before the attack changesfrom 2 to 6 ( Rn) This choice depends on how R1lowast choose theirupper layer recursives (Rnminus1) and ultimately how Rnminus1 chooses Rn

We also can observe how each Rn chooses each authoritative (Rn-AT) this metric show the number of unique combinations ofRn and authoritatives observed For example for T = 1 wee seethat three queries reach the authoritatives but only 2 Rn were usedBy looking at the number of unique recursive-AT combination wesee that there are three As a consequence one of the Rn must haveused both authoritatives

During the DDoS however things change we see that the still3 queries send by probe 28477 (client side) now turn into 11ndash29queries received at the authoritatives from the maximum of 6queries before the DDoS The client remains unaware of this

We see three factors at play here

(1) Increase in number of used Rn we see that the numberof number of used Rn also increases a little ( 36 averagebefore to 45 during) (However not all Rn are used all thetime during the DDoS We have not enough data of whythis happens ndash but surely is how the lower layer recursiveschooses these ones or how they are set up)

(2) Increase in the number of authoritatives used by eachRn Each Rn in turns contacts an average of 113 author-itatives before the attack (average of UniqRn-ATUniqRn)During the DDoS this number increases to 137

(3) Retrials by each recursive either by switching authorita-tives or not we found this to be the most important factorfor this probe We show the number of queries sent by thetop 2 recursives (QueriesTop2Rn) and see that compared tothe total number of queries they comprise the far majority

For this probe the answer of why so many queries is a combi-nation of factors being the retrials by Rn the most important oneFrom the client side there is no strong indication that the clientrsquosqueries are leading to such an increase in the number of queries toauthoritatives during a DDoS

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 DNS Resolvers Stubs Recursives and Authoritatives
    • 22 Authoritative Replication and IP Anycast
    • 23 DNS Caching with Time-to-Live (TTLs)
      • 3 DNS Caching In Controlled Experiments
        • 31 Potential Impediments to Caching
        • 32 Measurement Design
        • 33 Datasets
        • 34 TTL distribution expected vs observed
        • 35 Public Recursives and Cache Fragmentation
          • 4 Caching Production Zones
            • 41 Requests at nls Authoritatives
            • 42 Requests at the DNS Root
              • 5 The Clients View of Authoritatives Under DDoS
                • 51 Emulating DDoS
                • 52 Clients During Complete Authoritatives Failure
                • 53 Discussion of Complete Failures
                • 54 Client Reliability During Partial Authoritative Failure
                • 55 Client Latency During Partial Authoritative Failure
                  • 6 The Authoritatives Perspective
                    • 61 Recursive-Authoritative Traffic during a DDoS
                    • 62 Sources of Retries Software and Multi-level Recursives
                      • 7 Related Work
                      • 8 Implications
                      • 9 Conclusions
                      • References
                      • A Which TTL affects the cache referrals or answers
                        • A1 The Problem of Glue Records
                        • A2 Experiments
                        • A3 Looking At a Recursives Cache
                        • A4 Discussion and Implications
                          • B TTL Manipulations across datasets
                          • C List of public resolvers
                          • D Extra graphs from Partial DDoS
                          • E Additional Information about Sources of Retries
                          • F Analysis of Retries from A Specific Probe
Page 18: When the Dike Breaks: Dissecting DNS Defenses During DDoS ...johnh/PAPERS/Moura18a.pdf · Each of these records may have different cache lifetimes (TTLs), by choice of the operator

ISI-TR-724b May 2018 (updated September 2018) G C M Moura et al

0

5000

10000

15000

20000

25000

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170

cache-onlynormal normal

answ

ers

minutes after start

OK SERVFAIL No answer

(a) Experiment D 1800-50p-10min-ns1 arrows indicate DDoS start andrecovery

0

5000

10000

15000

20000

25000

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170

cache-onlynormal normal

answ

ers

minutes after start

OK SERVFAIL No answer

(b) Experiment G 300-75p-10min-2Nses arrows indicate DDoS start andrecovery

Figure 14 Answers received during DDoS attacks (extragraphs)

0

500

1000

1500

2000

2500

3000

3500

4000

0 20 40 60 80 100 120 140 160

late

ncy

(ms)

minutes after start

Median RTTMean RTT75ile RTT90ile RTT

(a) Experiment D 50 packet loss on 1 NS only (1800 s TTL)

0

500

1000

1500

2000

2500

3000

3500

4000

0 20 40 60 80 100 120 140 160

late

ncy

(ms)

minutes after start

Median RTTMean RTT75ile RTT90ile RTT

(b) Experiment G 75 packet loss (300 s TTL)

Figure 15 Latency results Shaded area indicates the intervalof an ongoing DDoS attack (extra graphs)

7722023244 PowerNS9999 Quad92620fefe Quad9195463939 SafeDNS195463940 SafeDNS19358251251 SkyDNS208765050 SmartViper Public DNS208765151 SmartViper Public DNS784689147 ValiDOM8819875145 ValiDOM646646 Verisign646656 Verisign2620741b11 Verisign2620741c22 Verisign77109148136 Xialanet77109148137 Xialanet200116202078136 Xialanet200116202078137 Xialanet7788888 YandexDNS778882 YandexDNS2a026b8feedbad YandexDNS2a026b801feedbad YandexDNS10969851 puntCAT

D EXTRA GRAPHS FROM PARTIAL DDOSAlthough we carried out all experiments listed in Table 4 (sect5) thebody of the paper provides graphs for key results only In thissection we include the graphs for experiments D and G

Figure 14 shows the number of answers for experiments D and Gwhile Figure 15 shows the latency results We can see at Figure 14athat when 1 NS fails (50) there is no significant changes in thenumber of answered queries Besides most of the users will notnotice in terms of latency (Figure 15a)

Figure 14b shows the results for Experiment G in which the TTLis 300 s (thus half of the probing interval) Similarly to experimentI this experiment should not have caching active given the TTL isshorter Moreover it has a packet loss rate of 75 At this rate wecan see that the far majority (72) of users obtain an answer witha certain latency increase (Figure 15b)

E ADDITIONAL INFORMATION ABOUTSOURCES OF RETRIES

In sect62 we summarized sources of retries Here we provide ad-ditional experiments to understand how many retries occur incurrent DNS software These experiments confirm prior work byYu et al [56] still holds We examine two popular recursive resolverimplementations BIND 9103 and Unbound 158

To evaluate these implementations we deploy our own recursiveresolvers and record traffic with the authoritatives (cachetestnetnot cachetestnl as in the rest of the paper) up and then with theminaccessible For consistency we only retrieve AAAA records forsubcachetestnet The recursives operates normally learning aboutthe authoritatives from net and querying one or both based on itsinternal algorithms All queries are made with a cold cache In total

When the Dike Breaks Dissecting DNS Defenses During DDoS (extended) ISI-TR-724b May 2018 (updated September 2018)

at1 at2

rn1 rn2 rn3 rn4 rn5 rn6 rn7 rn8

r1a r1b r1c

probe

authoritativeservers

recursives

stub resolveratlas probe

Figure 17 probe 284777 on dataset i all r1rsquos are connected toall rnrsquos which are connected to all at rsquos

0

10

20

30

40

50

BIND Unbound BIND (DDoS) Unbound (DDoS)

qu

erie

s

root net

cachetestnet

Figure 16 Histogram of queries from recursives resolvers

we send 100 queries per experiment and present in this section theresults

We start by analyzing traffic monitored at the recursive resolversWe consider only queries sent from the recursives towards any ofthe authoritative servers in the cachetestnet hierarchy the RootDNS servers the net authoritative servers and the authoritativeservers for cachetestnet

Figure 16 shows the results of this experiment with normalbehavior in the left two groups and the inaccessible DDoS attemptsin the right two From this figure we see that in normal operationa few queries are required to look up the target domain the serverlooks up AAAA records for both nameservers and then queries theAAAA record for the target (Unbound also queries for the AAAArecords of the authoritative name serves of cachetestnet whichare not in the net glue neither in the cachetestnet zone whichexplains the difference with BIND which does not)

Under normal operations it takes BIND 3 DNS queries (1 to theroot one to net and one to cachetestnet authoritative servers) toresolve the AAAA records of subcachetestnet Unbound in turn

takes 5 queriesmdashit sends 2more to the authoritatives of cachetestnetasking for AAAA records of their NS records which do not exist(thus this difference can be neglected since it may be specific to oursetup)

The two groups on the right of Figure 16 show much moreaggressive queries during server failure Both BIND and Unboundsend far more DNS queries to all authoritative servers available Wesee that BIND sends 12 queries instead of 3 it does under normaloperations (4times increase) while Unbound sends 46 queries in total(146times increase) Unbound keeps on trying to resolve the domainsand the AAAA records of the authoritative name servers (the latterdoes not exist which it is specific to our domain 30 in total out of46) If we disregard these queries Unbound would then increase thetotal number of queries from 6 to 16 a 64times increase (We repeatedthis experiment 100 times and all results were consistent)

Another difference between BIND and Unbound is that whencachetestnet is unreachable BIND will ask both parents (the Rootsand net) for its records again (shown by increase in the root and netportions of bind DDoS vs bind non-DDoS in Figure 16) Unbounddoes not retry these queries

This experiment is consistent with our observation that we see7times more traffic than usual during near-complete failure We do notmean to suggest that these retries are incorrectmdashrecursives need toretry to account for packet lossF ANALYSIS OF RETRIES FROM A SPECIFIC

PROBEIn sect6 we examined changes in legitimate traffic during a DDoSevent showing traffic greatly increases because of retries by eachof possibly several recursive resolvers That analysis looked ataggregate statistics We next examine a specific probe to confirmour evaluation

We consider RIPE Atlas Probe 28477 in Experiment I This probeis located in Australia It has three ldquolocal recursivesrdquo (R1a R1b R1c )and 8 ldquolast layerrdquo recursives (the ones that ultimately send queriesto authoritatives Rn1ndashRn8) as can be seen in Figure 17

In this section we analyze all the incoming queries arriving onour authoritatives for AAAA records of 28477cachetestnl We donot analyze NS queries since we cannot associate them to thisprobe

We then analyze data collected at both the client side (Ripe Atlasprobes) and the server side (Authoritative servers) We show thesummary of the results in Table 7 In this table we highlight in lightcolor the probing intervals T in which the simulated DDoS started(dataset I 90 packet drop on both authoritative servers ndash Table 4)

Client Side Analyzing this table we can learn the followingfrom the Client side

During normal operations we see 333 at the client side (3 queries3 answers and 3 R1n used in the answers) By default each RipeAtlas probe sends a DNS query to each R1n

During the DDoS this slightly change we see 2 answers from 2R1s ndash the other either times out or SERVFAIL Thus at 90 packetloss on both authoritatives still 2 of 3 queries are answered at theclient (and there is no caching since the TTL of records is shorterthan the probing interval)

This is in fact a good result given the simulated DDoSAuthoritative side Now letrsquos consider the authoritative side

ISI-TR-724b May 2018 (updated September 2018) G C M Moura et al

TClient View (Atlas probe) Authoritative Server View

Queries Answers R1 Queries Answers AT s Rn Max ( total) Queries

(Rn minus AT ) (Que(Rn minus AT )) (top-2 Rnrsquos)1 3 3 3 3 3 2 2 3 1 (333) 212 3 3 3 5 5 2 4 5 1 (200) 213 3 3 3 6 6 2 6 6 1 (167) 114 3 3 3 5 5 2 2 3 2 (400) 415 3 3 3 3 3 2 3 3 1 (333) 116 3 3 3 6 6 2 5 5 2 (333) 217 3 2 2 29 3 2 2 4 10 (345) 15148 3 2 2 22 6 2 7 9 7 (318) 1079 3 2 2 21 1 2 3 6 9 (429) 16310 3 2 2 11 4 2 3 4 6 (545) 8211 3 2 2 21 3 2 4 6 15 (714) 17212 3 2 2 23 1 2 8 8 15 (652) 15213 3 3 3 3 3 2 3 3 1 (333) 1114 3 3 3 4 2 2 4 4 1 (250) 1115 3 3 3 8 8 2 7 7 7 (875) 1116 3 3 3 9 9 2 3 3 6 (667) 6317 3 3 3 9 9 2 5 5 5 (556) 51

Table 7 Client and Authoritative view of AAAA queries (probe 28477 measurement I)

In normal operation (before the DDoS) we see that the 3 queriessent by the clients (client view) lead to 3 to 6 queries to be send bythe Rn which reach the authoritatives Every query is answeredand both authoritatives are used ( ATs)

We also see that the number of Rn used before the attack changesfrom 2 to 6 ( Rn) This choice depends on how R1lowast choose theirupper layer recursives (Rnminus1) and ultimately how Rnminus1 chooses Rn

We also can observe how each Rn chooses each authoritative (Rn-AT) this metric show the number of unique combinations ofRn and authoritatives observed For example for T = 1 wee seethat three queries reach the authoritatives but only 2 Rn were usedBy looking at the number of unique recursive-AT combination wesee that there are three As a consequence one of the Rn must haveused both authoritatives

During the DDoS however things change we see that the still3 queries send by probe 28477 (client side) now turn into 11ndash29queries received at the authoritatives from the maximum of 6queries before the DDoS The client remains unaware of this

We see three factors at play here

(1) Increase in number of used Rn we see that the numberof number of used Rn also increases a little ( 36 averagebefore to 45 during) (However not all Rn are used all thetime during the DDoS We have not enough data of whythis happens ndash but surely is how the lower layer recursiveschooses these ones or how they are set up)

(2) Increase in the number of authoritatives used by eachRn Each Rn in turns contacts an average of 113 author-itatives before the attack (average of UniqRn-ATUniqRn)During the DDoS this number increases to 137

(3) Retrials by each recursive either by switching authorita-tives or not we found this to be the most important factorfor this probe We show the number of queries sent by thetop 2 recursives (QueriesTop2Rn) and see that compared tothe total number of queries they comprise the far majority

For this probe the answer of why so many queries is a combi-nation of factors being the retrials by Rn the most important oneFrom the client side there is no strong indication that the clientrsquosqueries are leading to such an increase in the number of queries toauthoritatives during a DDoS

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 DNS Resolvers Stubs Recursives and Authoritatives
    • 22 Authoritative Replication and IP Anycast
    • 23 DNS Caching with Time-to-Live (TTLs)
      • 3 DNS Caching In Controlled Experiments
        • 31 Potential Impediments to Caching
        • 32 Measurement Design
        • 33 Datasets
        • 34 TTL distribution expected vs observed
        • 35 Public Recursives and Cache Fragmentation
          • 4 Caching Production Zones
            • 41 Requests at nls Authoritatives
            • 42 Requests at the DNS Root
              • 5 The Clients View of Authoritatives Under DDoS
                • 51 Emulating DDoS
                • 52 Clients During Complete Authoritatives Failure
                • 53 Discussion of Complete Failures
                • 54 Client Reliability During Partial Authoritative Failure
                • 55 Client Latency During Partial Authoritative Failure
                  • 6 The Authoritatives Perspective
                    • 61 Recursive-Authoritative Traffic during a DDoS
                    • 62 Sources of Retries Software and Multi-level Recursives
                      • 7 Related Work
                      • 8 Implications
                      • 9 Conclusions
                      • References
                      • A Which TTL affects the cache referrals or answers
                        • A1 The Problem of Glue Records
                        • A2 Experiments
                        • A3 Looking At a Recursives Cache
                        • A4 Discussion and Implications
                          • B TTL Manipulations across datasets
                          • C List of public resolvers
                          • D Extra graphs from Partial DDoS
                          • E Additional Information about Sources of Retries
                          • F Analysis of Retries from A Specific Probe
Page 19: When the Dike Breaks: Dissecting DNS Defenses During DDoS ...johnh/PAPERS/Moura18a.pdf · Each of these records may have different cache lifetimes (TTLs), by choice of the operator

When the Dike Breaks Dissecting DNS Defenses During DDoS (extended) ISI-TR-724b May 2018 (updated September 2018)

[Figure 17 (diagram): the stub resolver (Atlas probe) queries three recursives (r1a, r1b, r1c); each r1 forwards to eight last-layer recursives (rn1-rn8), which in turn query the two authoritative servers (at1, at2).]

Figure 17: probe 28477 on dataset I: all r1's are connected to all rn's, which are connected to all at's.

[Figure 16 (histogram, y-axis: queries, 0-50): four groups of bars, BIND and Unbound under normal operation and under DDoS, each broken down by destination (root, .net, cachetest.net).]

Figure 16: Histogram of queries from recursive resolvers.

We send 100 queries per experiment and present the results in this section.

We start by analyzing traffic monitored at the recursive resolvers. We consider only queries sent from the recursives towards any of the authoritative servers in the cachetest.net hierarchy: the Root DNS servers, the .net authoritative servers, and the authoritative servers for cachetest.net.

Figure 16 shows the results of this experiment, with normal behavior in the left two groups and behavior during the emulated DDoS, with all authoritatives unreachable, in the right two. From this figure we see that in normal operation only a few queries are required to look up the target domain: the server looks up the AAAA records for both nameservers and then queries the AAAA record for the target. (Unbound also queries for the AAAA records of the authoritative name servers of cachetest.net, which are neither in the .net glue nor in the cachetest.net zone; this explains the difference with BIND, which does not.)

Under normal operations it takes BIND 3 DNS queries (one to the root, one to .net, and one to the cachetest.net authoritative servers) to resolve the AAAA record of sub.cachetest.net. Unbound, in turn, takes 5 queries: it sends 2 more to the authoritatives of cachetest.net, asking for the AAAA records of their NS names, which do not exist (this difference can thus be neglected, since it may be specific to our setup).
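To make the three-query chain concrete, the following is a minimal sketch of that iterative walk, using the dnspython library. The root and .net addresses are the public a.root-servers.net and a.gtld-servers.net; the final address is a placeholder for our testbed's cachetest.net authoritative, not its real IP.

    import dns.message
    import dns.query

    # One iteration step: send an AAAA query for qname directly to server_ip.
    def ask(server_ip, qname):
        query = dns.message.make_query(qname, "AAAA")
        return dns.query.udp(query, server_ip, timeout=3)

    # BIND's three queries for sub.cachetest.net:
    r1 = ask("198.41.0.4", "sub.cachetest.net.")  # 1: a.root-servers.net -> referral to .net
    r2 = ask("192.5.6.30", "sub.cachetest.net.")  # 2: a.gtld-servers.net -> referral to cachetest.net
    r3 = ask("192.0.2.1", "sub.cachetest.net.")   # 3: placeholder authoritative -> AAAA answer
    print(r3.answer)

Unbound's two extra queries are the AAAA lookups for the NS names themselves, which this sketch omits.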

The two groups on the right of Figure 16 show much more aggressive querying during server failure. Both BIND and Unbound send far more DNS queries to all available authoritative servers. We see that BIND sends 12 queries instead of the 3 it sends under normal operations (a 4× increase), while Unbound sends 46 queries in total (a 14.6× increase). Unbound keeps trying to resolve both the domain and the AAAA records of the authoritative name servers (the latter do not exist, which is specific to our domain; these account for 30 of the 46 queries). If we disregard these queries, Unbound still increases its total number of queries from 6 to 16, a 6.4× increase. (We repeated this experiment 100 times and all results were consistent.)
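This amplification is easy to picture as a retry loop over the available servers. The sketch below (again dnspython, with placeholder server IPs and an invented retry budget) illustrates the general mechanism only; it is not BIND's or Unbound's actual retry logic.

    import itertools
    import dns.exception
    import dns.message
    import dns.query

    AUTHORITATIVES = ["192.0.2.1", "192.0.2.2"]  # placeholders for at1 and at2
    MAX_TRIES = 6                                # invented budget, for illustration

    def resolve_with_retries(qname):
        """Cycle over the authoritatives until one answers or the budget is spent."""
        query = dns.message.make_query(qname, "AAAA")
        for tries, server in enumerate(itertools.cycle(AUTHORITATIVES), start=1):
            try:
                return dns.query.udp(query, server, timeout=2), tries
            except dns.exception.Timeout:
                if tries >= MAX_TRIES:
                    return None, tries  # give up (SERVFAIL) after MAX_TRIES queries

When both servers are unresponsive, one incoming query costs MAX_TRIES authoritative queries instead of one, which is the kind of multiplication Figure 16 shows.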

Another difference between BIND and Unbound is that when cachetest.net is unreachable, BIND will ask both parents (the root and .net servers) for its records again (shown by the increase in the root and .net portions of BIND (DDoS) vs. BIND in Figure 16). Unbound does not retry these queries.

This experiment is consistent with our observation that we see 7× more traffic than usual during near-complete failure. We do not mean to suggest that these retries are incorrect; recursives need to retry to account for packet loss.

F ANALYSIS OF RETRIES FROM A SPECIFIC PROBE

In §6 we examined changes in legitimate traffic during a DDoS event, showing that traffic greatly increases because of retries by each of possibly several recursive resolvers. That analysis looked at aggregate statistics. We next examine a specific probe to confirm our evaluation.

We consider RIPE Atlas probe 28477 in Experiment I. This probe is located in Australia. It has three "local recursives" (R1a, R1b, R1c) and 8 "last-layer" recursives (the ones that ultimately send queries to the authoritatives: Rn1-Rn8), as can be seen in Figure 17.

In this section we analyze all the incoming queries arriving at our authoritatives for AAAA records of 28477.cachetest.nl. We do not analyze NS queries, since we cannot associate them with this probe.

We then analyze data collected at both the client side (RIPE Atlas probes) and the server side (authoritative servers). We show a summary of the results in Table 7. In that table we mark with * the probing intervals T in which the simulated DDoS took place (dataset I: 90% packet drop on both authoritative servers; Table 4).

Client side. Analyzing this table, we can learn the following from the client side.

During normal operations we see 3/3/3 at the client side (3 queries, 3 answers, and 3 R1s used in the answers). By default, each RIPE Atlas probe sends a DNS query to each R1.

During the DDoS this changes slightly: we see 2 answers from 2 R1s; the other query either times out or returns SERVFAIL. Thus, at 90% packet loss on both authoritatives, 2 of 3 queries are still answered at the client (and caching plays no role here, since the TTL of the records is shorter than the probing interval).
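A back-of-envelope calculation (ours, not part of the measurement) shows why retries make this plausible. If each attempt towards a 90%-loss authoritative independently succeeds with probability p = 0.1, then n attempts yield an answer with probability 1 - (1 - p)^n:

    # Probability that at least one of n independent attempts is answered
    # when each attempt survives 90% packet loss (success probability 0.1).
    p = 0.1
    for n in (1, 5, 10, 20):
        print(f"{n:2d} attempts -> P(answered) = {1 - (1 - p) ** n:.2f}")
    # 1 -> 0.10, 5 -> 0.41, 10 -> 0.65, 20 -> 0.88

Table 7 shows 11-29 authoritative queries per interval for the 3 client queries, i.e., roughly 4-10 attempts each, which is consistent with 2 of 3 client queries being answered.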

This is, in fact, a good result given the simulated DDoS.

Authoritative side. Now let's consider the authoritative side.


  T  | Client view (Atlas probe) | Authoritative server view
     | Queries Answers #R1       | Queries Answers #ATs #Rn #(Rn-AT)  Max Que(Rn-AT) (% total)  Queries (top-2 Rn)
  1  |    3       3      3       |    3       3      2    2     3        1 (33.3%)                 2, 1
  2  |    3       3      3       |    5       5      2    4     5        1 (20.0%)                 2, 1
  3  |    3       3      3       |    6       6      2    6     6        1 (16.7%)                 1, 1
  4  |    3       3      3       |    5       5      2    2     3        2 (40.0%)                 4, 1
  5  |    3       3      3       |    3       3      2    3     3        1 (33.3%)                 1, 1
  6  |    3       3      3       |    6       6      2    5     5        2 (33.3%)                 2, 1
  7* |    3       2      2       |   29       3      2    2     4       10 (34.5%)                15, 14
  8* |    3       2      2       |   22       6      2    7     9        7 (31.8%)                10, 7
  9* |    3       2      2       |   21       1      2    3     6        9 (42.9%)                16, 3
 10* |    3       2      2       |   11       4      2    3     4        6 (54.5%)                 8, 2
 11* |    3       2      2       |   21       3      2    4     6       15 (71.4%)                17, 2
 12* |    3       2      2       |   23       1      2    8     8       15 (65.2%)                15, 2
 13  |    3       3      3       |    3       3      2    3     3        1 (33.3%)                 1, 1
 14  |    3       3      3       |    4       2      2    4     4        1 (25.0%)                 1, 1
 15  |    3       3      3       |    8       8      2    7     7        7 (87.5%)                 1, 1
 16  |    3       3      3       |    9       9      2    3     3        6 (66.7%)                 6, 3
 17  |    3       3      3       |    9       9      2    5     5        5 (55.6%)                 5, 1

Table 7: Client and authoritative view of AAAA queries (probe 28477, measurement I). Intervals T = 7-12 (marked *) fall within the simulated DDoS.

In normal operation (before the DDoS), we see that the 3 queries sent by the client (client view) lead to 3 to 6 queries being sent by the Rn and reaching the authoritatives. Every query is answered, and both authoritatives are used (#ATs).

We also see that the number of Rn used before the attack varies from 2 to 6 (#Rn). This choice depends on how the R1's choose their upper-layer recursives (Rn-1), and ultimately on how Rn-1 chooses Rn.

We can also observe how each Rn chooses each authoritative (#(Rn-AT)): this metric shows the number of unique combinations of Rn and authoritative observed. For example, for T = 1 we see that three queries reach the authoritatives but only 2 Rn were used; since there are three unique recursive-AT combinations, one of the Rn must have used both authoritatives.
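These per-interval metrics can be computed mechanically from the (Rn, AT) pair of every query seen at the authoritatives. A small sketch, with an invented sample that reproduces the T = 1 row:

    from collections import Counter

    # Invented sample for interval T = 1: three queries, each a (Rn, AT) pair.
    queries = [("rn3", "at1"), ("rn3", "at2"), ("rn7", "at1")]

    uniq_rn = len({rn for rn, _ in queries})             # #Rn            -> 2
    uniq_at = len({at for _, at in queries})             # #ATs           -> 2
    uniq_pairs = len(set(queries))                       # #(Rn-AT)       -> 3
    max_pair = max(Counter(queries).values())            # Max Que(Rn-AT) -> 1 (33.3%)
    top2 = Counter(rn for rn, _ in queries).most_common(2)  # top-2 Rn    -> 2, 1

    print(uniq_rn, uniq_at, uniq_pairs, max_pair, top2)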

During the DDoS, however, things change: the same 3 queries sent by probe 28477 (client side) now turn into 11-29 queries received at the authoritatives, up from a maximum of 6 queries before the DDoS. The client remains unaware of this.

We see three factors at play here:

(1) Increase in the number of Rn used: the number of Rn used increases a little (3.6 on average before the attack, 4.5 during it). However, not all Rn are used all the time during the DDoS; we do not have enough data to explain why, but it is surely a function of how the lower-layer recursives choose them, or of how they are set up.

(2) Increase in the number of authoritatives used by each Rn: each Rn in turn contacts an average of 1.13 authoritatives before the attack (computed as Uniq(Rn-AT)/Uniq(Rn)). During the DDoS this number increases to 1.37. (Both averages are reproduced from Table 7 in the sketch after this list.)

(3) Retries by each recursive, whether or not it switches authoritatives: we found this to be the most important factor for this probe. We show the number of queries sent by the top-2 recursives (Queries, top-2 Rn) and see that, compared to the total number of queries, they comprise the large majority.

For this probe, the answer to why there are so many queries is a combination of factors, with the retries by the Rn being the most important one. From the client side, there is no strong indication that the client's queries are what drives the increase in the number of queries to the authoritatives during a DDoS.
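As a sanity check, the averages quoted in factors (1) and (2) follow directly from the #Rn and #(Rn-AT) columns of Table 7:

    # #Rn and #(Rn-AT) per interval, before (T = 1-6) and during (T = 7-12) the DDoS.
    rn_before, rn_during = [2, 4, 6, 2, 3, 5], [2, 7, 3, 3, 4, 8]
    rnat_before, rnat_during = [3, 5, 6, 3, 3, 5], [4, 9, 6, 4, 6, 8]

    print(sum(rn_before) / 6, sum(rn_during) / 6)  # ~3.6 -> 4.5 Rn used per interval
    print(sum(rnat_before) / sum(rn_before),       # ~1.13 authoritatives per Rn before,
          sum(rnat_during) / sum(rn_during))       # ~1.37 during the DDoS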
