Geoff Huston 0:00 So if you were writing an RFC today, in 2025, and you were trying to talk about this, the IETF and its entire review process wouldn't let you get away with vague hand-waving. They would nail you down and go, right, you've got to talk about seconds, timers, selection algorithms. You've got to get it to the point where anybody who has some slight code capability could write basically the same code, [George: yeah]. But it was a different world. In November 1987, when Paul Mockapetris wrote RFC 1034, "Domain Names: Concepts and Facilities", it was a very different world, [George: yeah]. And his standard said, and I quote this because I think it's great: "The sorting of name servers may involve statistics from past events, such as previous response times and batting averages." [George: Batting averages?] Now, I know Paul. He's not a cricketer, so I'm not exactly sure what he's referring to. Not being cricket, it must be something to do with baseball, I guess, batting averages. And that's the end of the advice. George Michaelson 1:16 You're listening to Ping, a podcast by APNIC discussing all things related to measuring the Internet. I'm your host, George Michaelson. This time I'm talking to Geoff Huston from APNIC Labs again, in his first regular monthly spot on Ping for 2025. Geoff and I discuss the DNS again, but this time looking at a quirk of resolver behavior. Resolvers are the part of the domain name system which perform queries on behalf of users. They're meant to consider the diversity of sources of authoritative information from the delegated name servers they're told about, and perform a sort of heuristic to periodically check that they're using the best one in terms of end-to-end delay. Geoff has been exploring both how this is defined and how it performs in practice, using the APNIC Labs measurement system.
There are a few surprising outcomes from this study, and a view to the future which might be less about IETF standards and code changes, and a lot more about DNS delegates making some wiser choices in who they use to run their authoritative DNS servers. Geoff, welcome to 2025 and welcome back to Ping. What should we talk about this time? Geoff Huston 2:36 George, it's a pleasure to be back. And continuing with an almost persistent theme from 2024, I have spent more time than is healthy for normal humans inside the domain name system, moving slightly away from addresses and addressing infrastructure and routing, across into the other area of common infrastructure, the DNS, the name space. And I must admit, the DNS is deceptively complex. It's been likened, by many folk who have put their brain through the DNS mill, to a game of chess: the rules are deceptively simple, there are very few, but the combinations are mind-bending, and the DNS is indeed a remarkably complex beast. George Michaelson 3:23 It has a lot of moving parts, it's true, and the fundamental playing pieces are not that hard to describe to people, but when you bang them together, the machinery gets weird. Geoff Huston 3:37 It gets weird. There's no common manual on what you must do to play in the DNS. Different people's software does subtly different things. There are different operational practices out there, and so it's kind of a loose consortium of folk who kind of vaguely play by kind of the same rules, sort of, and some of the time we obey a common protocol, and other times we kind of push it into strange places. And in some ways, you go, it's a miracle. It's just a miracle that it works at all. And every time you get an answer back from the DNS, you should be grateful, pathetically grateful, because you know what occurred was unnatural. It's not a tightly bound machine.
And oddly enough, we sit there in the DNS, even the folk who build it, with assumptions and mythology that date back almost 40 years, while other parts of it change before our very eyes. And this mix is weird. So I want to start with a question. Why does the root zone, the top of the DNS hierarchy, have 13 name servers? George Michaelson 4:52 Yes, not a number we're used to seeing in arbitrary choices of counts. We're used to the idea of one, it's everywhere. Three is everywhere. Five is not unusual. 10 is not unusual. 12, duodecimal; you and I are old enough to remember being trained in that weird, mechanistic way of counting. 13? That's not a common number. Geoff Huston 5:14 1, 2, 4, 8, 16, what's 13 doing? A bit like, you know, why did ATM choose 53 octets? It's kind of, wow, what were you smoking? What were you thinking, to come up with a number like 13? And let me explain the thinking at the time, because this number 13 has entered mythology in the DNS, and quite a few folk still use 13 name servers. You sit there and go, why? But the original idea was that what you wanted was both performance, speed, and resilience. Now, how do you get resilience? Have more than one. So two is better than one, three is better than two. Yeah, you could possibly apply this infinitely. George Michaelson 6:00 If you have a single point and you're looking for the natural thing to do to make it more resilient, having two ways of performing the function is the first go-to, but two very quickly becomes stale. So you want to have as many as you can. And in modern behavior, you actually try not to put a constraint on how many instances there are, although the act of getting from the thing I want to the local instance can itself sometimes become a dependency that you don't like. However, that's off in the weeds. But you would not pick 13 as your first choice, would you? Geoff Huston 6:36 You wouldn't. But I suppose part of that thinking was, 13 is better than 12, 14 is better than 13.
You know, you can keep on playing this game. But consider the thought experiment where, instead of having 13 simultaneous outages out there somewhere on the net, you break the wire from your computer to all of the net, and you kind of have the suspicion that there are 100 name servers out there. Your poor blighted machine is going to ask 100 queries, 1, 2, 3, before it comes to the conclusion that nothing's happening. So sometimes there is such a concept as too many, [George: yeah], because no one's that patient. And so 13 was partially a compromise between "two is better than one, let's keep adding" and "no one's that patient". Okay, park that thought. But there was one other criterion that applied when this system was being set up in the 1980s, and that was that these recursive resolvers, these people, these entities, these machines that ask the questions, were meant to examine their own performance. I just asked server number 52 and it took five seconds. I've just asked server number three, and it took a tenth of a second. I'm going to keep on asking server number three, because obviously that's faster. And so the whole idea was you took these 13 servers and you spread them around the world. If you were doing this properly, you'd kind of look at where there's population and put one here, one there, and evenly smear these 13 unique name servers so that no one would be waiting an eon to get an answer. Why is this important? Because we're using UDP, and UDP is a really, really odd protocol. It doesn't say no, it doesn't say, I haven't got anything. You just don't get an answer, because it's a datagram. George Michaelson 8:35 It's the protocol that doesn't have a protocol. It's send it and hope, and up in the application layer, if you receive a response that makes sense in your application state, you like to believe the other end must have got the packet you sent, because otherwise, why did you get the answer?
But there's no formalism of packet counting and tracking and state. Geoff Huston 8:58 When you get an answer, that's great, go out on the street to celebrate. That's fantastic. But how long do you wait before you conclude that no answer is coming? George Michaelson 9:07 That's a problem for another day, said the network engineer, walking away from the keyboard. Geoff Huston 9:12 So performance matters. Because if you really think, oh, I'll wait for 10 seconds, I'm a very forgiving person and I'm enormously, enormously patient, then you send out a DNS query and watch the clock go tick, tick, tick for 10 whole seconds before you ask the next server. How long would it take me to go through six servers? An entire minute. Continents move in that time scale. We're not that patient, [George: yeah]. So the other part of resilience over UDP is actually latching pretty tightly onto the server that's closest to you. So recursive resolvers were designed to be introspective when they ask an authoritative name server: hi, I've got a question, and you're meant to be the entity that's going to answer me. The resolver starts the clock when it sends the query, and when it gets back an answer, yahoo! It says, okay, it took you 50 milliseconds, it took you 100 milliseconds, whatever. And if a zone like the root zone is served by all 13 different name servers, it will run 13 clocks, and it will tend to ask the one that's fastest. But just to be on the safe side, it occasionally asks the others, so that if anything changes, it'll sort of move towards the fastest. George Michaelson 10:40 Now, as described, this is a sort of heuristic, because it isn't totally algorithmic. This isn't such a bad mechanism, is it, Geoff? You're going to not depend on one thing, and you're aware that things are physically distributed in the world and may have variable delay.
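The heuristic Geoff is describing, keep a response-time estimate per server, prefer the fastest, occasionally re-probe the rest, can be sketched roughly as follows. This is a minimal illustration only; the class name, the smoothing factor, and the probe rate are all invented for the sketch and are not taken from any real resolver implementation.

```python
import random

class ServerSelector:
    """Track a smoothed round-trip time per name server, prefer the
    fastest, but occasionally re-probe the others so that a change in
    the network gets noticed (roughly RFC 1034's 'batting average')."""

    def __init__(self, servers, probe_rate=0.05, alpha=0.3):
        self.srtt = {s: None for s in servers}  # smoothed RTT in seconds; None = never measured
        self.probe_rate = probe_rate            # fraction of picks spent re-checking others
        self.alpha = alpha                      # EWMA weight given to each new sample

    def choose(self):
        untried = [s for s, rtt in self.srtt.items() if rtt is None]
        if untried:
            return random.choice(untried)          # measure every server at least once
        if random.random() < self.probe_rate:
            return random.choice(list(self.srtt))  # the occasional honesty check
        return min(self.srtt, key=self.srtt.get)   # otherwise: fastest so far

    def record(self, server, rtt):
        old = self.srtt[server]
        self.srtt[server] = rtt if old is None else (1 - self.alpha) * old + self.alpha * rtt
```

With `probe_rate` at zero the selector latches permanently onto whichever server measured fastest first, which, as the measurements later in this episode show, is close to what some deployed resolvers appear to do.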
And we've arrived at this magic number; by the way, I think we elided over why 13, but we can come back to that. And the thing is, you want to find the one you like best, and so the way to do it is to try all of them periodically. Bit of a question there: how often do you try a new one and decide which of them is fastest, and make that the one you prefer? And you know, I think as a mechanistic view of how to pick the one you prefer, that's not too bad. Geoff Huston 11:28 So let's wander sideways a bit and look a little at the language of RFCs, the canons of Internet technology, the standard specifications that are meant to be so good and so well written that, we tell ourselves, independent implementations, looking solely at the RFCs, can produce interoperating code. The whole idea of the RFC was not to just be some paperware template where, if you did this ideal thing, everything would happen. It's that people have used this specification and they have built code, and it works with itself, it works with other implementations. So you'd think that for something as critical as performance and resilience of the DNS, you'd find a specification in the RFCs that really does get down to the heart of this. How often should you check other name servers? How many name servers do you need? George Michaelson 12:27 Yeah, so these are not the kinds of things that, when you're writing a definitions document about behavior, you just make up your own mind on. You're looking for some commonality of expectation and behavior. You're going to nail down frequency and persistence and choice and algorithmic selection. It's "the left-hand side of the cow, visible from the train window, is brown" material. Nobody just says, there is a brown cow. They go as far as they can go. Geoff Huston 12:56 So if you were writing an RFC today, in 2025, and you were trying to talk about this, the IETF and its entire review process wouldn't let you get away with vague hand-waving. They would nail you down and go, right.
You've got to talk about seconds, timers, selection algorithms. You've got to get it to the point where anybody who has some slight code capability could write basically the same code, yeah. But it was a different world. In November 1987, when Paul Mockapetris wrote RFC 1034, "Domain Names: Concepts and Facilities", it was a very different world. And his standard said, now, I quote this because I think it's great: "The sorting of name servers may involve statistics from past events, such as previous response times and batting averages." George Michaelson 13:50 Batting averages? Geoff Huston 13:53 Now, I know Paul. He's not a cricketer, so I'm not exactly sure what he's referring to. Not being cricket, it must be something to do with baseball, I guess, batting averages. And that's the end of the advice. A different world. George Michaelson 14:08 Today it's, it's got to look like this on the wire. You are incredibly prescriptive about how you decide where it's going and why you picked it. You have a number of ways of choosing that I've written in the margin of this document, and there is not enough space to detail, Geoff Huston 14:22 all that kind of stuff. So I started to get curious about this, and I did a couple of bench tests on one of the most popular recursive resolvers. Quick digression: the DNS has a number of different components, typically produced by different people, and certainly operated by different parties. Inside your machine, whether it's a handheld device, a laptop, whatever, is what we call a stub resolver. It's a library that says, I need to resolve a name, who's going to help me? And normally, when you boot up, you get configured with a set of the addresses of so-called recursive resolvers, provided by your ISP normally, but other people can do it. And your stub resolver on your machine goes and asks a recursive resolver. It's normally given two or more, because when one doesn't answer, you go and ask the other one, resilience, and that's kind of the end of it.
So you get stub resolvers. Recursive resolvers do all the work, normally operated by your ISP, but other folk do it because either they're crazy or because they think it's a good thing, and they actually go out into the DNS and start doing the discovery thing: who is the set of machines that are authoritative for the name that I'm after? We call them name servers, but you don't know them in advance, because there are a lot of names, so you've actually got to discover which name servers to use, and that means a whole bunch of queries. You start at the root, query the root for the name servers for the next level down, and so on. Let's not worry about that. But let's look at the behavior of this recursive resolver when given a domain that has, I don't know, 60 different name servers. How many will it query before it says, I give up? Because I'm really concerned about resilience, I actually provision 60 authoritative name servers for my domain name, because I really, really, really care that this domain has to be always there. George Michaelson 16:22 I have a number of ancillary questions about this, but you know, let's just go with it. If there is an insane number of listed name servers, how far do you go is a really good question. Is it infinite? Because that, actually, Geoff, would be an attack, right? I could make a label and put 200,000 NSes in there. And if I could make every resolver in the world have to go down the list looking for number 200,001, I'd be wasting a lot of people's time. Geoff Huston 16:55 Yes. Look, the best answer is, give up. Just give up. So the most common recursive resolver code out there, we think, and it's kind of hard to do these censuses, but the most common, we think, is bind, originally the Berkeley Internet Name Domain software or something. Bind 9 is the most common one out there. And I tested bind 9. You set up a domain with two name servers, it asks the two and then says, neither of those is actually working, and stops.
Set up five, and it'll ask all five, 1, 2, 3, 4, 5, and it starts to take a bit of time. It asks the five over a period of nine seconds. Set up six name servers, and in 9.6 seconds it says none of those six were answering correctly. And you can watch it query all six, and it'll query each of them a few times just to make sure that they're really dead, not just unresponsive. I set up seven, it only queries six. Eight, it only queries six. George Michaelson 17:52 You performed the experiment, and you demonstrated it does have a built-in limit. Geoff Huston 17:57 It stops after 10 seconds, that's the first thing: it won't try forever. And secondly, this recursive resolver kind of goes, look, six. I did push it once into querying seven, and I thought, wow, success. But I had to configure 13 name servers for it to query seven of them. So if I'd set up 100, it would still go, after 10 seconds, that's your answer. If I can't get six or seven to respond, that's it. So the recursive resolver kind of says, no, I'm not going to do this forever. In essence, what's going on here is that there is an upper limit in some of these recursive resolvers that goes, look, time is not infinite. After 10 seconds we're going to stop, and I'm not going to query like a maniac, because that's a denial-of-service response. I'm going to pace through evenly, query every third of a second or so, and re-query selectively until I get to my time limit, and that's it. If I don't get an answer in 10 seconds, I'm going to say there is no answer. I've done what I can. It's gone. George Michaelson 19:00 So you're saying that the primary driver is the time bound for the complete function to be performed. And if, on average, the timer to give up and move to the next one is of the order of 10 seconds, it will naturally tend to hit six as a limit, because it's got an outer bound of a minute? Geoff Huston 19:21 No, no. You can set up the servers one millisecond away, and it will still be the same limit. It'll just do more queries.
So it's kind of got two things. I'm only going to query six, possibly seven, because after that there's no point, it's you, it's not the net. And secondly, there's an overall time limit for this exercise. And all these queries are measured: it's normally one query every 0.34 to 0.38 seconds, around three queries a second, a measured pace, and once you get to that 10-second time, which is approximately 30 queries, bind 9 goes, nah, that's it, there is no answer. So if you're configuring a name and getting it served on the net, how many name servers should you use? Well, we've gone through the first discussion that says two is better than one, three is better than two, four is better than three, five is better than four, and you keep on going. But once you get to eight, well, eight is better than seven, isn't it? The answer is, well, from bind's perspective, no. I'm not going to ask all eight, dude, you're wasting your time, I'm only going to query a subset of these. So don't bother doing more than six or seven name servers for your name, because literally, the clients aren't going to look. They just don't care. Now, that's bind. There's another one, and it was adopted by default in the FreeBSD distribution, so while it's not as common as bind, you know, it's not everywhere, it's certainly well used. And that's unbound. And unbound comes from a different world. I set up eight name servers, it queries all eight. Nine, it queries all nine. Thirteen, it queries all 13. When does it give up? What if I set up a domain that has 13 unreachable name servers? So I set up this domain and I go, query as much as you like, dude, you're never going to get an answer. So whereas bind says, look, after 10 seconds the world has moved on, not interested, I've got another life to lead, I've got other questions to answer, stop this nonsense, unbound goes, no, no, no, you asked me a question.
And I noticed, in doing these bench-top tests, 500 seconds, [George: no], 580 seconds, and the worst I found was 1,577 seconds. George Michaelson 21:47 It persisted in trying Geoff Huston 21:49 for 30 minutes, and over that period did 149 queries. It's kind of, you gave me a question, I'm a computer, I'm not stopping. I'm just going to do this until, you know, hell freezes over. Wow. Who is interested in an answer 30 minutes late? George Michaelson 22:05 Was it doing this asynchronously, having come back to the user with a failure state earlier than that? Geoff Huston 22:11 Oh, what actually goes on? Remember we said stub resolvers and recursive resolvers? Yeah. So the stub resolver sends a question to the recursive, and don't forget, the recursive can't make things up. If it doesn't get an answer, it can't report back to the stub resolver; it doesn't report failure. You can't make up that the name doesn't exist. That's a lie. [George: Yeah]. So if all of these name servers are unresponsive, then from the stub resolver's perspective the recursive resolver is unresponsive, and so the stub has its own fail-safe. Most implementations go somewhere between six and 10 seconds, and it's normally around eight. It goes, that's it, I'm going to go back to the application, whoever asked me to resolve this name, and say, it doesn't resolve. That's not the same as the name doesn't exist. [George: Yeah]. I didn't get told it doesn't exist, I just can't find the answer. It may be a non-existent name, but that's not something I can tell you definitively. What I can say is, I can't find it. [George: Yeah], a subtly different answer. So if I want resilience, I've got to factor in that stub resolvers and recursives tend to have a finite work capability, and after that they simply go, nah, not going to do this, going to give up. Yep, George Michaelson 23:37 yep. This feels like an interesting quality in its own right, that we have objective tests: how far will you go?
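The bind 9 give-up behavior Geoff measured, roughly three queries a second, at most six or seven distinct servers, a hard stop around ten seconds, amounts to a paced retry loop inside a wall-clock budget. A toy sketch of that shape follows; the function and parameter names are invented, and the constants are the approximate observed values from these bench tests, not configuration that bind actually exposes.

```python
import itertools
import time

def resolve_with_budget(servers, send_query, budget=10.0, pace=0.35, max_servers=6):
    """Sketch of a bind-9-style bounded resolution attempt: pace queries
    at roughly three a second, re-query only a handful of distinct
    servers, and abandon the whole attempt once the budget runs out."""
    deadline = time.monotonic() + budget
    candidates = servers[:max_servers]          # servers beyond the cap are never asked
    for server in itertools.cycle(candidates):  # walk the short list round-robin
        if time.monotonic() >= deadline:
            return None                         # out of time: report failure to the stub
        answer = send_query(server)
        if answer is not None:
            return answer
        time.sleep(pace)                        # measured pace, not a query flood
```

Unbound's observed behavior is, in effect, this loop with the deadline and server cap removed: it keeps cycling until something answers, which is how you arrive at 149 queries over 30 minutes.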
And we have at least two patterns of divergence: a time-driven limit that sets a fairly concrete end, and another suite that implemented the same behavior against an ill-defined "what to do" and made a different choice. Are there other implementations out there? Is there potential for a third way of dealing with this problem? Geoff Huston 24:08 Well, don't forget, the RFC said batting averages. It had no guidance. And so, in essence, every implementation, the implementer or the crew, have been creative. Now, by and large, programmers aren't known as creative types, for a very good reason: we're crap at it. In essence, this is a bit like NATs. As soon as you give folk some degree of latitude, they implement every known thing under the sun and every possible variation. So yes, folk do all kinds of odd things. So then, after these very simple bench-top tests that kind of go, this is not what you originally thought, more is better for resilience only up to a point, and while using batting averages, I still can't get over that, sounds great for performance, select the server that answers the fastest, the next question is, what actually happens? Time for, ta-daa, measurement. George Michaelson 24:20 Oh, good. I'm glad you brought that up, Geoff, given this is a measurement podcast. Geoff Huston 25:19 Time for measurement. And so what we did was actually pretty simple. We set up four name servers. And I'll say right now, because I think it's giving away some of the answer, we're using what we call unicast: each server is a machine at a point in the geography. One is in Atlanta in Georgia, one is in Frankfurt in Germany, one in Mumbai in India, and one in Singapore.
And when you ask for a domain served by these four name servers, it will send you back the IP addresses, or in effect it will send you back the names of all those four servers, in v4 and v6. So you'll translate those names into IP addresses, and you'll end up with eight IP addresses, four in v4, four in v6, right? [George: Yeah]. So what I want to do is test millions of users: how do they pick a name server? Which one do they use? And we've talked before, APNIC runs this ad-based system with the generous assistance of Google. We use Google's ads to enroll users to do some simple web fetches. But to fetch a web object, you've got to resolve the domain name, and to resolve the domain name, ta-daa, you've got to use the DNS. And currently we run about 25 million of these ads a day, and we pull in users from literally all over the planet, because, you know, ads. So we have these four servers, and this time, rather than being unresponsive, because that's kind of a bit of an attack really, they're responsive. They try and answer everything they get. George Michaelson 26:57 So you're not deliberately not answering. These services that have been configured to be asked are going to do best-effort service delivery. Geoff Huston 27:07 Yep, they're always going to answer. So we know which user got that ad, and we know from the query which DNS server, which name server, they're asking. And if batting averages mean performance, then the folk in East Asia, close to Singapore, should ask Singapore in preference to, say, asking Frankfurt or Atlanta. The folk in Europe should ask the server in Frankfurt, and so on and so forth. But also, just to make sure that each recursive resolver is being honest, we would expect each recursive resolver to occasionally query any one of the other three, just to make sure that the one it has selected still has the best batting average. George Michaelson 28:00 Right?
You would see a tendency to weight the traffic in a given geolocation. Okay, good question about BGP-to-geo mapping, but let's take it as read that there is this concept of close, and for the things that you think are close to Singapore, you would expect most of the queries to go to Singapore. That means they've selected a good one. But equally, you would expect to see periodic attempts to validate that it is the best one. They should be asking the others, but the intensity would be at the level of "are you any better?", not every single query. Geoff Huston 28:39 Well, you're trying to be the best for most of the queries. So as a rule of thumb, let's say on a very, very intensively used recursive resolver, with a very intensively used name, you'd expect it to ask the other three about, I don't know, once a minute, once every five minutes, but most of the time it would keep on pounding away at the one name server that appears to be the fastest. It's not routing, it's not geolocation, it's just the wall clock: how long did it take for this query to get answered? Whoever's fastest is the one I'll keep on sending queries to, and very occasionally I'll send queries to the others. [George: yep]. That's certainly logical, isn't it? So here we are with these name servers, and let me repeat this, because it's kind of interesting: Mumbai in India, Frankfurt in Europe, Atlanta in North America, and Singapore in the Asia Pacific. And we look at one of the largest, or most intensively used, recursive resolvers out there, the one run by Bharti Airtel in India. Interesting. They have a lot of people in India, a lot of Internet users. It's cheap, it's effective, and this recursive resolver handles a lot of queries. So when we look at the spread of these queries across our four name servers, what we expect to see is that the server in Mumbai should get hammered. The other three servers should get the occasional query, but certainly not be hammered. Okay.
So the one in Atlanta, over a, geez, 12-hour period, got 87 queries. This is good. The one in Mumbai got 300,000 queries. George Michaelson 30:25 Okay, that seems good. Geoff Huston 30:27 The one in Frankfurt got 581,000 queries, almost double. And the one in Singapore, which is still further away than Mumbai, got 611,000 queries. It's kind of, wow. So this resolver is certainly doing what we expect. It's asking all of them. George Michaelson 30:46 It's doing some of what we expect. Geoff Huston 30:50 But it's not latching onto the one that we know is, for them, the fastest to respond. When we look at timings, the one in Mumbai is the fastest to respond. But this resolver is not picking it up and putting all its queries there and only occasionally querying the other three. It's actually settled on the one in Singapore and the one in Frankfurt as the major preferences, with a much lower intensity, 8% of queries, on the one that really is the fastest, and it's only Atlanta that gets almost none. Okay, interesting. We see a lot of recursive resolvers. Can we take the busiest of those and look at their signature? In other words, whom do they attach to, and is this common? And it's a lot of work. We found around 166,000 unique resolver IP addresses querying in our little form of measurement, and we needed a pretty intensive query rate. So we took the top 1,000 resolver IP addresses, who between them varied from 36,000 queries over this 24-hour period up to 1.5 million queries. So these are the ones that really did query like crazy, where you'd think there'd be a bias towards performance, yeah, [George: yeah]. And we found that of those top 1,000, only 616, only two thirds, had what we called a strong attachment preference. In other words, they hit one of their four servers more than 60% of the time. So that's the first lesson. It's kind of, 40% of these resolvers actually didn't really care about performance.
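The two-step classification Geoff applies here, does a resolver send more than 60% of its queries to one server, and if so, is that server actually its fastest, can be written down directly. The function name is invented; the 60% threshold and the Bharti Airtel query counts are the ones quoted in the conversation, while the RTT figures are illustrative (only their ordering matters, with Mumbai the genuinely fastest).

```python
def attachment_profile(query_counts, rtts, threshold=0.60):
    """Classify one recursive resolver from its per-name-server query
    counts: strong attachment means more than `threshold` of all queries
    went to a single server; picked_fastest asks whether that favourite
    is also the lowest-RTT server measured independently (e.g. by ping)."""
    total = sum(query_counts.values())
    favourite = max(query_counts, key=query_counts.get)
    strong = query_counts[favourite] / total > threshold
    fastest = min(rtts, key=rtts.get)
    return {"favourite": favourite,
            "strong_attachment": strong,
            "picked_fastest": strong and favourite == fastest}

# The Airtel counts quoted above, with illustrative RTTs in milliseconds:
airtel = attachment_profile(
    {"atlanta": 87, "mumbai": 300_000, "frankfurt": 581_000, "singapore": 611_000},
    {"atlanta": 220, "mumbai": 5, "frankfurt": 140, "singapore": 60})
```

Run on the Airtel numbers, this reports no strong attachment at all: the busiest server, Singapore, carries only about 41% of the load, which is exactly the kind of resolver that falls into Geoff's "didn't really care about performance" 40%.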
George Michaelson 32:38 The thing is, it could be that there's a subtle bug in the way this stuff works that hasn't been unpacked. Or it could be that the tuning is just completely adrift from the reality of a network, and so whatever it's doing is maladapted to the behavior of a modern network. Or it could be some other thing. But I absolutely get it: 40% don't show a preference, and that is a huge number. That's huge. Geoff Huston 33:06 Right. They kind of spread their queries all over the place, when everybody should settle down and go, you, resolver, sorry, name server number three, you're the fastest, you get all my queries, and then I'll just occasionally monitor the other three to make sure that my choice is the right choice. That strong preference only happens 60% of the time. Okay, so spreading things around the world to cater for most people most of the time doesn't work the way you think. But then comes the next question, which is also interesting: of the ones that show a strong attachment preference, how many are making the right choice? For example, we have this name server, the authoritative server, in Frankfurt, and we look at a recursive resolver operated by an ISP in France, Free.fr. It's a pretty big ISP, it's got a fair chunk of the market share, [George: yeah]. And we go, well, I see you have a strong attachment preference, you really want to query one of these name servers. Whom are you querying? And it says, I'm using Atlanta. What? Hang on a second. You're using a name server that's 95 milliseconds away from you, because we can measure it the other way with ping, and you're ignoring the one in Frankfurt, which is only 9.8 milliseconds, 10 times faster. George Michaelson 34:27 So you do see the tickling, the testing queries. You see the low level of background where it says, what is the apparent response time? Or at least the belief is that's what it's doing.
But you're not seeing it skew to preferring the local one. Geoff Huston 34:40 70% of the queries from this recursive resolver head to Atlanta, Georgia, and it's kind of, I can see you're making an attachment preference, you're preferring one name server. It's just, you're making a bad choice, dude. And I go and look at all of those ones that show a strong attachment preference, right, and go, well, hang on, how many of those are right, have made the best selection? And again, an interesting answer. Only 40% of those resolvers have actually latched onto the one of the four that ping shows is the closest, is the fastest to respond. The rest, yeah, they kind of go, well, if it's within 150 milliseconds of the others, good enough. It's as if they're not actually counting actual time; they're counting in units of a sixth of a second, [George: yeah], or some other large number. And if two name servers happen to be in that same bucket, it kind of goes, flip a coin, who cares? George Michaelson 35:42 I was going to say, this feels very strongly like a pick-any-and-stay-there, a random selection from some kind of aggregate: you all fit into a set I'm going to call acceptable, simply pick one. And then it never arrives at a circumstance where "pick one" picks a different one from that set. In effect, it's done the initial round robin, or whatever the mechanism is, and whichever one drifted to the top, that's okay, I'm staying with you. Geoff Huston 36:13 And the unit of best, the unit of best, appears to be a granularity of around a sixth of a second, or worse. A sixth of a second, yeah, about 150 milliseconds. George Michaelson 36:24 That is an amazing number. Geoff Huston 36:27 There have been a number of other working groups in the IETF, and one of them was, of course, QUIC.
And indeed, there was a massive effort by Google that went under the code name of SPDY, where they obsessed about the billion-dollar millisecond, and they were trying to shave milliseconds out of the time from click to response, trying to get their entire system to go faster. So QUIC tried to cut down the number of round trip times in setting up the secure session; we tried to up the initial window size in TCP. We've been doing everything we can possibly do to make the Internet faster, and by faster, I'm not talking one second, 10 seconds or anything else. We are really talking milliseconds. The whole idea of CDNs, those content distribution networks, is to move the traffic closer to the user, so close that things happen literally in milliseconds. George Michaelson 37:28 I was going to observe that across the same time the DNS has had this "how do I pick the best", we've actually had exactly the same drive implemented in switching hardware and in routing technology and in proxies and intermediaries all through the Internet. We actually have software systems which are doing things like fronting for a web server, saying, I need to select the best option for you, and if that one is non-responsive, I need an alternative for you, so that I can give you high availability. The DNS is actually not that special anymore in needing to pick the best. But the interesting thing is, it might have been doing it for longer, and it might be using an older model of how do I do this. Geoff Huston 38:14 Oh, it's using horses in a world of very fast cars, or whatever it might be; it's a different generation. And I suppose the next question is, and this is the question that fascinates me: you seem to be measuring something that people don't care about, Geoff. Because if this was truly as bad as you're making out, and it is bad, then surely we would have fixed it. Surely the pressure would have been put back on.
The recursive resolvers would have been told: guys, better code, better code. Do this better. But it's the Internet, and there are 50 ways to solve every problem, and on the Internet, we try all 50. And so then I started looking at, because I can, what do people actually do to set up their name service? And again, that's a fascinating question. You see, the whole idea of 13 root name servers started with this model of 13 machines in 13 points around the world. And what you were trying to do was get at least one of them, as much as you could, close to most of the users. You use the one in America, I'll use the one in Australia. And in essence, everyone will get a fast answer, as long as the recursive resolvers are doing their work. So it's pretty clear that recursive resolvers are doing a pretty crap job. So what do you do instead to make this better? Well, let's look at the root zone. And the root zone is kind of interesting. It contains 1445 top level domain names. If you actually look at the root zone contents, you'll find 5998 different name server names. On average, every name in the root zone has a little over four name servers. George Michaelson 40:01 So the entities that register high in the name space take their obligation to be available seriously, and they've settled on a number around four as the way to say, we expect this to be highly available. Four gets us there. Geoff Huston 40:18 Well, oddly enough, 35% settle on four, 40% settle on six, and a whole bunch of others settle on other numbers, up to a maximum of 13. There's no agreement, but most names are either served by four name servers or six name servers. Interesting, but there's something else going on inside those choices of four or six, and that is how many are actually unicast, or have we all jumped over to anycast? The root itself has 13 different name servers by name. And up there, there are approximately 26 different IP addresses, 13 v4 and 13 v6. But each IP address is anycast.
It's actually injected into the routing system in a whole bunch of locations. I'm not sure I have the exact number, but the root zone, I think, has around 4000 separate machines. George Michaelson 41:14 At this point, we should probably observe anycast is one of those things where it's, in effect, a statement: it's available at these anycast locations. And what it means is, I might be selecting one place as the closest one, in BGP distance terms, to be the place I go and get it from, and you might be selecting another. There's no intermediary who, at the point I ask the question, says you flick left, you flick right. It's just innate in the routing system. It's made an "optimal", in air quotes, selection for me, and if the one that was optimal goes away, and the routing system knows, it finds another optimal one, and routing takes care of this decision. So "best" is sort of nobody's tasting the packet delay. This is just the routing system saying, yeah, that one. Geoff Huston 42:09 Yeah, that one. And on the whole, as long as you have, I suppose, enough variation, enough density of deployment, the routing system will do a pretty good job of getting you to the one that really is the lowest delay. Even though we don't route on delay, with a sufficiently dense anycast deployment, it'll end up sending you to the one that is the quickest to get to. On the whole you need a pretty dense deployment, but you know, it'll kind of work, yeah, George Michaelson 42:39 yeah. It'll both be the fastest and it's reliable. It gets two goals out of this. Geoff Huston 42:46 Interesting. So I don't need to worry about how recursive resolvers perform or not perform if I serve my domain name using an anycast system. Hmm, let's go back to that root zone, which had those 1445 domain names with 5998 name servers, of which 5687 names were dual stacked, 321 were v4 only, and, phenomenally, three of them were IPv6 only. George Michaelson 43:19 Brave, very brave. Geoff Huston 43:21 I thought so too.
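Tallies like the ones Geoff quotes (name servers per TLD, counted out of the root zone) could be reproduced with a sketch like the following. The zone lines here are made-up examples; the real root zone is published as a plain-text zone file by IANA:

```python
# Minimal sketch: parse root-zone NS records and count name servers per TLD.
from collections import Counter, defaultdict

def tally_ns(zone_lines):
    """Map each TLD to its set of name-server names from NS records."""
    ns = defaultdict(set)
    for line in zone_lines:
        fields = line.split()
        # A zone-file NS record looks like: "example. 172800 IN NS a.nic.example."
        if len(fields) == 5 and fields[2] == "IN" and fields[3] == "NS":
            tld = fields[0].rstrip(".")
            if tld:  # skip the root's own NS records ("." becomes empty)
                ns[tld].add(fields[4])
    return ns

# Illustrative zone fragment, not real data:
zone = [
    "example. 172800 IN NS a.nic.example.",
    "example. 172800 IN NS b.nic.example.",
    "test. 172800 IN NS ns1.test.",
]
counts = Counter(len(servers) for servers in tally_ns(zone).values())
print(counts)  # distribution: how many TLDs have 1, 2, ... name servers
```

Run over the real root zone, the same distribution is what yields the "35% settle on four, 40% settle on six" figures above.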
And the question is, and it's an interesting question, can we tell which of those addresses are actually anycast? Because whenever you sort of set up a measurement point, the routing system will get you to the closest one. You can't tell if there are more. George Michaelson 43:38 If you only have one point of measurement. Geoff Huston 43:41 Right. You can't say to the routing system, take me to the alternate, any alternate. The answer is no; routing picks just one. George Michaelson 43:48 We do have multiple points of view into routing, so the collectors worldwide should be able to give you at least some belief, you know, that some things are anycast. Geoff Huston 43:59 Well, we actually ask the DNS, that's one of the other things too, because in the DNS, there's this flag in a query that says, tell me your unique identifier, your name server ID. And if you query in anycast, let's say you are where you are in one part of Australia, I'm in another part, and there's an anycast system in operation in the DNS that has a different server for you than the one that serves me. If both of us ask this IP address for the NSID, the name server ID, we'll get back a different answer. George Michaelson 44:33 If they are well behaved operators and they honor the principle: you shouldn't lie about that value. Geoff Huston 44:40 You shouldn't use a common NSID, because it'll make your debugging a nightmare, as well as everyone else's. So you know, don't do it. You're shooting your own foot off as well. So yes, let's just kind of assume that. So let's do both. Let's take our four points that we're measuring from and ask each of these thousands of name servers what their NSID is.
And also do ping tests, because, you see, if it is just at one point on the globe, and we measure from four points all around the globe, we should see a pretty high variation in ping times. Whereas, if the thing is well anycast, each of those four locations, Atlanta, Mumbai, Singapore and Frankfurt, should go, oh, this is right beside me, 10 milliseconds, dude. It will appear to be close to all four at once. George Michaelson 45:30 Yeah, that seems reasonable. You've got four widely distributed points of measurement. The likelihood of a poorly designed anycast being equally distant from all of them is low. If it was weighted to Asia, the ones in Europe and America would see the difference. So in order to be well designed, it has to have sufficient coverage globally to be approximately the same distance cost to any of these. I mean, I could argue that eight would have been better than four. We're down in the weeds. The principle is sound. You should be able to tell. Geoff Huston 46:01 You should be able to tell. And so I put this together. I've got 9014 different IP addresses, because I just ignore the distinction between v4 and v6 and go, well, they're all just IP addresses. So I test all 9000-odd. Interesting: 587 of the IP addresses for domains in the root system are clearly unicast, same NSID no matter where you ask from, and the variance in round trip time is well over 150 milliseconds. So obviously my packets are traveling to one point, and where I'm far away from that point, no matter where it is, it will take me some time to get there. Now, what if I take this arbitrary number, 150 milliseconds, and go: if the maximum variance between those four pings is less than 150 milliseconds, I'll give your anycast system a big tick. So different NSIDs and an RTT variance of less than 150 milliseconds, I will call you diverse. Excellent.
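The classification rule just described (compare NSID answers and ping RTTs from four vantage points) can be sketched like this. The threshold and category names follow the discussion; the data structures and example NSIDs are assumptions for illustration:

```python
# Illustrative classifier: probe one server IP from four vantage points,
# collect the NSID answer and a ping RTT from each, then label the address.
VARIANCE_MS = 150  # max RTT spread allowed for "diverse" anycast

def classify(probes: list[tuple[str, float]]) -> str:
    """probes: (nsid, rtt_ms) pairs, one per vantage point."""
    nsids = {nsid for nsid, _ in probes}
    rtts = [rtt for _, rtt in probes]
    spread = max(rtts) - min(rtts)
    if len(nsids) == 1:
        return "unicast"          # the same server answers everywhere
    if spread < VARIANCE_MS:
        return "diverse anycast"  # different servers, all nearby
    return "limited anycast"      # different servers, but some far away

# Four vantage points, e.g. Atlanta, Mumbai, Singapore, Frankfurt:
print(classify([("fra1", 9.8), ("atl1", 12.0), ("bom1", 15.0), ("sin1", 11.0)]))
```

A unicast address would also show a large RTT spread, but the identical NSID already gives it away, so the spread check is only needed to split the two anycast grades.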
There are, of course, other folk who think that setting up an anycast system with one node in California and another node in Washington is a fine anycast system. It's not. It's just not. George Michaelson 47:13 Well, it is for your market. It is for your market in America, but I'm less sure for anywhere else. Geoff Huston 47:18 Yeah, the rest of the Internet kind of goes, you know, no. And for those folk who give me different NSIDs, but their round trip time variance is greater than 150 milliseconds, let's call them limited anycast. It's sort of there, but it's sort of not that good. So I go back to my root zone and go, well, obviously the folk who have got themselves into the root zone want to maximize performance and resilience. And these days, if you're using unicast name servers, the recursive resolvers are actually working against you. They're not helping. And so you're really behind the eight ball. It's not working. But for eight TLDs, that's what they're still doing. Eight of them. That's kind of, guys, don't. Stop it. That's crazy. You've invested all this effort in being a top level domain, and your infrastructure for serving them is just, George Michaelson 48:11 I would have thought by now the contract with ICANN would have required them to do a little better. Geoff Huston 48:17 No, no. Batting averages. There's no spec on this. There's nothing, no standard that anyone can point to, because no one has ever written this down as a specification. It's kind of word of mouth. George Michaelson 48:29 ICANN lawyers can't obligate them to do a thing that hasn't been well specified. Geoff Huston 48:33 Yes, it's equivalent to making stuff up, which, even for lawyers, is a little bit rough. So eight still do only unicast. Interestingly, of those 1400 or so, 378 actually have a mix of anycast and unicast. And I kind of wonder why, because if your anycast system is good enough, you don't need unicast. [George: Yeah]. And the other 1067? Interesting.
289 are doing an amazing job. They're served by diverse anycast only. So 289 domain names are served by servers that are close to the major continents. They're in there, and they're being well served. Great. There's a similar number, 202, that are served by very limited anycast. They're within a particular country or a particular region, and there's nothing remote. So it's kind of not very good anycast; it's giving poor performance. And the rest, 576, are kind of a mixture of both wide and local anycast; they're diverse. So now we start to look a bit deeper into this and go, well, when I set up a name server in the root zone, do I use just one provider for my name servers, or have I learned the lesson about resilience and performance and use multiple providers from different ASes, using anycast? George Michaelson 49:22 They're getting some resilience against failure in the routing plane by picking diverse operators of the routing mechanism that already provides diversity. It's not just: I am anycast and have diverse platforms that are optimal in routing. It's: I have multiple diverse platforms from multiple different BGP speakers, BGP configurers, giving me two levels of resilience. Geoff Huston 50:36 Right? So there are only six TLDs in the root zone where there's just one provider giving them their name servers. Five are unicast. Wow, that's crazy. One of them is diverse anycast, but they still only have one. So if that provider goes, you know, oops, I'm having a bad day today, they're gone. They're out. So there are only six of them that have gone with one. So how many is the right number? Well, I would actually argue two, because, you know, if you're diverse, the chances are George Michaelson 51:08 Geoff, that's a very specific number. Why did you pick two and not 13?
Geoff Huston 51:13 Because if both are down at the same time, it's a cosmically bad event, or it's you. It is extremely unlikely for an act of a malicious deity to take those two out and not all the rest. So if you're after realistic resilience, two is a fine number for anycast, because it's unlikely those two will be reliant on a single critical piece of infrastructure. You've got diversity, and when one's down, the other's still up. Three? You're just wasting the money. And the way it works is, there are about 600 of these top level domains that are served by two. 400 are served by four. Why four? Do you enjoy the paperwork? What is going on? George Michaelson 51:59 The logistical arithmetic you've performed suggests the additional benefit is declining. And it may be that it's performant, and it's a belief that more is better, but there's actually no objective improvement. Geoff Huston 52:13 But there's a cost. Yeah, it's not getting you anything better. There are 250 that are served by five, and there are even 80 served by six different providers. On average, each of them has 10 distinct IP addresses, and we said before the most common recursive resolver, BIND, goes six, seven, I give up. So overkill, extreme overkill. So what would I suggest now, if I try and wrap all this data together? DNS recursive resolvers don't do a good job. Don't count on them. Nothing is going to change, because the clients all use anycast. So it doesn't matter what the recursive resolver does in trying to select the best anymore, because we've moved on. So just accept it for what it is. George Michaelson 53:03 You're not arguing for a heavyweight process in standards to nail down what the recursives should do. You're actually arguing, let's pull the capability to do this out from the system and not require this, because it's better done in a different place. Geoff Huston 53:19 I'm arguing that, informally, the community have moved on.
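The "two is enough" argument is back-of-the-envelope probability: with independent providers, both being down at once is the product of their individual outage probabilities. The availability figure here is an assumption purely for illustration:

```python
# If two providers fail independently, each up a fraction `a` of the time,
# the chance both are down simultaneously is (1 - a) squared.
a = 0.999  # assumed per-provider availability (99.9% uptime)
both_down = (1 - a) ** 2
seconds_per_year = 365 * 24 * 3600
print(f"joint outage fraction: {both_down:.1e}")
print(f"expected joint downtime: {both_down * seconds_per_year:.0f} s/year")
```

A third independent provider shrinks an already tiny number by another factor of a thousand, which is Geoff's point: past two, you're mostly buying paperwork.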
And if you're going to write the Bible of how to do it, you would say anycast, and you would simply say, oh, you guys in recursive resolvers, continue on with your batting averages. Fine concept. It just doesn't matter, dude. It just doesn't matter. George Michaelson 53:37 I think I'd probably be slightly different in this, Geoff. I heard a lot of magic numbers. There's a word I love here, reified, and it's when something which is an abstract quantity, like 13, is converted into some binding magical quality. It shall be 13. You and I know that 13 is because of beliefs around constraints on packet size bootstrapping the DNS when it was UDP, 512-byte packets. But the fact remains that 150 milliseconds of observed period that puts things in the same bucket as "good", I'd probably say, let's dial that back and put it closer to 15 milliseconds. Let's make a distinguishing mark here in measurement that is a lot closer to a light-speed distance that stays within a continent or a technology basis. But the thing is, I'm falling into another sin. I'm believing I can fix it in technology. What you're doing is actually fixing it in politics, sociology and behavior. You're saying, stop trying. Geoff Huston 54:42 Yeah, and don't assume they do anything. So if they're not doing the job, you have to do it with anycast. And anycast is a fine solution for the DNS. So the trick about anycast, to make it really work, is to pick high density anycast constellations. I have ten nodes in my anycast service? Now, go away. George Michaelson 55:05 I have 300. Oh, I'm going there. Geoff Huston 55:08 Even 300 is kind of getting there. I have 1000? It's kind of, come on down, let's talk. Because once you get to that point, don't forget, there are only 6000 transit networks on the planet. Everyone else is an end point. Once you have around 6000 points of presence, you are literally everywhere, [George: yeah], network wise. So dense anycast is where you want to go.
And once you've got dense anycast, it doesn't matter what the recursive resolver does; you're right beside them. Hi, I'm here. Here's an answer. Limited anycast? I've got Italy covered top to bottom. Well, what about the rest of the world? Ah, I don't care. I don't care. Yeah, don't go light with anycast. Do the lot. Be everywhere. George Michaelson 55:53 And probably a corollary of this is, if you've gone with two anycast providers, think about why you have so many NS labels, because you don't need them. Geoff Huston 56:04 Oh, God, you don't. You just don't need them. Two anycast labels, four IP addresses: two in v4, two in v6. You're done. Yeah. Anything else is just wasting everyone's time, including the stub resolvers. So anycast is good, and the denser, the better. Use them. And resilience is about multiple anycast platforms. And how many is enough? My own view is, okay, Dijkstra, the great, renowned computer scientist, once said there are only three numbers in computing: there's none, there's one, there's more than one. And I can see what he's saying. But I'm going to offer one more number, and it's not 13. I'm going to offer two, two as failover. And as long as you pick your two carefully, so that they don't have a common point of dependency, then realistically, if you can't get to both of them, if both of them fail, you're the problem. It's you that's failed, not the outside world. So quite frankly, two highly diverse anycast platforms should see you through with serving a name with both resilience and performance. So I've looked at the root name service system, and in the next piece of work, just to make sure that the computers we have at our fingertips are not sitting idle, just counting the time, I'm going to take the top 1 million domain names and perform a similar analysis: if you take what the world actually uses, is this true for everyone, not just the names in the root zone?
Do we all do anycast these days? [George: Yeah]. George Michaelson 57:38 What's going on? I feel a follow-up coming, Geoff. I think this is one for another day. Geoff Huston 57:44 It may well be. It's fascinating work to actually understand how much the DNS has evolved, almost without specification. It's a community that's kind of constantly talking to each other and looking at what each other does, and it's moved on. The scribes and the standards people are running a little behind, but it's sort of, nah, dude, we've moved on. We're doing it this way now. George Michaelson 58:07 Yeah, keep it simple. Nice work, Geoff. I think that's a really nice piece of measurement. Geoff Huston 58:12 Thank you. Thank you. And for those of you who hung around this long, thanks for following through. George Michaelson 58:19 If you've got a story or research to share here on Ping, why not get in contact by email to ping@apnic.net or via the APNIC social media channels. Also remember that the measurement@apnic.net mailing list on Orbit is there to discuss and share relevant collaborative opportunities, grants and funding opportunities, jobs and graduate placings, or to seek feedback from the community on your own measurement projects. Be sure to check out the APNIC website for all your resource and community needs. Until next time. Transcribed by https://otter.ai