Geoff Huston 0:00 So if you were writing an RFC today, in 2025, and you were trying to talk about this, the IETF and its entire review process wouldn't let you get away with vague hand-waving. They would nail you down and go, right, you've got to talk about seconds, timers, selection algorithms. You've got to get it to the point where anybody who has some slight code capability could write basically the same code, [George: yeah]. But it was a different world. In November 1987, when Paul Mockapetris wrote RFC 1034, "Domain Names: Concepts and Facilities", it was a very different world, [George: yeah]. And his standard said, and I quote this because I think it's great: "The sorting of name servers may involve statistics from past events, such as previous response times and batting averages." [George: Batting averages?] Now, I know Paul. He's not a cricketer, so I'm not exactly sure what he's referring to. Not being cricket, it must be something to do with baseball, I guess, batting averages. And that's the end of the advice. George Michaelson 1:16 You're listening to Ping, a podcast by APNIC discussing all things related to measuring the Internet. I'm your host, George Michaelson. This time I'm talking to Geoff Huston from APNIC Labs again, in his first regular monthly spot on Ping for 2025. Geoff and I discuss the DNS again, but this time looking at a quirk of resolver behavior. Resolvers are the part of the domain name system which perform queries on behalf of users. They're meant to consider the diversity of sources of authoritative information from the delegated name servers they're told about, and perform a sort of heuristic to periodically check that they're using the best one in terms of end-to-end delay. Geoff has been exploring both how this is defined and how it performs in practice, using the APNIC Labs measurement system.
There are a few surprising outcomes from this study, and a view to the future which might be less about IETF standards and code changes, and a lot more about DNS delegates making some wiser choices in who they use to run their authoritative DNS servers. Geoff, welcome to 2025 and welcome back to Ping. What should we talk about this time? Geoff Huston 2:36 George, it's a pleasure to be back. And continuing with an almost persistent theme from 2024, I have spent more time than is healthy for normal humans inside the domain name system, moving slightly away from addresses and addressing infrastructure and routing, across into the other area of common infrastructure, the DNS, the name space. And I must admit, the DNS is deceptively complex. It's been likened, by many folk who have put their brain through the DNS mill, to a game of chess: the rules are deceptively simple, there are very few, but the combinations are mind-bending, and the DNS is indeed a remarkably complex beast. George Michaelson 3:23 It has a lot of moving parts, it's true, and the fundamental playing pieces are not that hard to describe to people, but when you bang them together, the machinery gets weird. Geoff Huston 3:37 It gets weird. There's no common manual on what you must do to play in the DNS. Different people's software does subtly different things. There are different operational practices out there, and so it's kind of a loose consortium of folk who kind of vaguely play by kind of the same rules, sort of, and some of the time we obey a common protocol, and other times we kind of push it into strange places. And in some ways, you go, it's a miracle. It's just a miracle that it works at all. And every time you get an answer back from the DNS, you should be grateful, pathetically grateful, because you know what occurred was unnatural. It's not a tightly bound machine.
And oddly enough, we sit there in the DNS, even the folk who build it, with assumptions and mythology that date back almost 40 years, while other parts of it change before our very eyes. And this mix is weird. So I want to start with a question. Why does the root zone, the top of the DNS hierarchy, have 13 name servers? George Michaelson 4:52 Yes, not a number we're used to seeing in arbitrary choices of counts. We're used to the idea of one, it's everywhere. Three is everywhere. Five is not unusual. 10 is not unusual. 12, duodecimal; you and I are old enough to remember being trained in that weird, mechanistic way of counting. 13? That's not a common number. Geoff Huston 5:14 1, 2, 4, 8, 16, what's 13 doing? A bit like, you know, why did ATM choose 53 octets? It's kind of, wow, what were you smoking? What were you thinking, to come up with a number like 13? And let me explain the thinking at the time, because this number 13 has entered mythology in the DNS, and quite a few folk still use 13 name servers. You sit there and go, why? But the original idea was that what you wanted was both performance, speed, and resilience. Now, how do you get resilience? Have more than one. So two is better than one, three is better than two. Yeah, you could possibly apply this infinitely. George Michaelson 6:00 If you have a single point and you're looking for the natural thing to do to make it more resilient, having two ways of performing the function is the first go-to, but two very quickly becomes stale. So you want to have as many as you can. And in modern behavior, you actually try not to put a constraint on how many instances there are, although the act of getting from the thing I want to the local instance can itself sometimes become a dependency that you don't like. However, that's off in the weeds. But you would not pick 13 as your first choice, would you? Geoff Huston 6:36 You wouldn't. But I suppose part of that thinking was, 13 is better than 12, 14 is better than 13.
You know, you can keep on playing this game. But consider the thought experiment where, instead of having 13 simultaneous outages out there somewhere on the net, you break the wire from your computer to all of the net, and you kind of have the suspicion that there are 100 name servers out there. Your poor blighted machine is going to ask 100 queries, 1, 2, 3, before it comes to the conclusion that nothing's happening. So sometimes there is such a concept as too many, [George: yeah], because no one's that patient. And so 13 was partially a compromise between "two is better than one, let's keep adding" and "no one's that patient". Okay, park that thought. But there was one other criterion that applied when this system was being set up in the 1980s, and that was that these recursive resolvers, these people, these entities, these machines that ask the questions, were meant to examine their own performance. I just asked server number 52 and it took five seconds. I've just asked server number three, and it took a tenth of a second. I'm going to keep on asking server number three, because obviously that's faster. And so the whole idea was you took these 13 servers and you spread them around the world. If you were doing this properly, you'd kind of look at where there's population and put one here, one there, and evenly smear these 13 unique name servers so that no one would be waiting an eon to get an answer. Why is this important? Because we're using UDP, and UDP is a really, really odd protocol. It doesn't say no, it doesn't say, I haven't got anything. You just don't get an answer, because it's a datagram. George Michaelson 8:35 It's the protocol that doesn't have a protocol. It's send it and hope, and up in the application layer, if you receive a response that makes sense in your application state, you like to believe the other end must have got the packet you sent, because otherwise, why did you get the answer?
But there's no formalism of packet counting and tracking and state. Geoff Huston 8:58 When you get an answer, that's great, go out on the street to celebrate. That's fantastic. But how long do you wait before you conclude that no answer is coming? George Michaelson 9:07 That's a problem for another day, said the network engineer, walking away from the keyboard. Geoff Huston 9:12 So performance matters. Because if you really think, oh, I'll wait for 10 seconds, I'm a very forgiving person and I'm enormously, enormously patient, then you send out a DNS query and watch the clock go tick, tick, tick for 10 whole seconds before you ask the next server. How long would it take me to go through six servers? An entire minute. Continents move in that time scale. We're not that patient, [George: yeah]. So the other part of resilience over UDP is actually latching pretty tightly onto the server that's closest to you. So recursive resolvers were designed to be introspective when they ask an authoritative name server: hi, I've got a question, and you're meant to be the entity that's going to answer me. The resolver starts the clock when it sends the query, and when it gets back an answer, yahoo! It says, okay, it took you 50 milliseconds, it took you 100 milliseconds, whatever. And if a zone like the root zone is served by all 13 different name servers, it will run 13 clocks, and it will tend to ask the one that's fastest. But just to be on the safe side, it occasionally asks the others, so that if anything changes, it'll sort of move towards the fastest. George Michaelson 10:40 Now, as described, this is a sort of heuristic, because it isn't totally algorithmic. This isn't such a bad mechanism, is it, Geoff? You're going to not depend on one thing, and you're aware that things are physically distributed in the world and may have variable delay.
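The heuristic Geoff is describing, keep a response-time estimate per server, prefer the fastest, occasionally re-probe the rest, can be sketched roughly as follows. This is a minimal illustration only; the class name, the smoothing factor, and the probe rate are all invented for the sketch and are not taken from any real resolver implementation.

```python
import random

class ServerSelector:
    """Track a smoothed round-trip time per name server, prefer the
    fastest, but occasionally re-probe the others so that a change in
    the network gets noticed (roughly RFC 1034's 'batting average')."""

    def __init__(self, servers, probe_rate=0.05, alpha=0.3):
        self.srtt = {s: None for s in servers}  # smoothed RTT in seconds; None = never measured
        self.probe_rate = probe_rate            # fraction of picks spent re-checking others
        self.alpha = alpha                      # EWMA weight given to each new sample

    def choose(self):
        untried = [s for s, rtt in self.srtt.items() if rtt is None]
        if untried:
            return random.choice(untried)          # measure every server at least once
        if random.random() < self.probe_rate:
            return random.choice(list(self.srtt))  # the occasional honesty check
        return min(self.srtt, key=self.srtt.get)   # otherwise: fastest so far

    def record(self, server, rtt):
        old = self.srtt[server]
        self.srtt[server] = rtt if old is None else (1 - self.alpha) * old + self.alpha * rtt
```

With `probe_rate` at zero the selector latches permanently onto whichever server measured fastest first, which, as the measurements later in this episode show, is close to what some deployed resolvers appear to do.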
And we've arrived at this magic number; by the way, I think we elided over why 13, but we can come back to that. And the thing is, you want to find the one you like best, and so the way to do it is to try all of them periodically. Bit of a question there: how often do you try a new one and decide which of them is fastest, and make that the one you prefer? And you know, I think as a mechanistic view of how to pick the one you prefer, that's not too bad. Geoff Huston 11:28 So let's wander sideways a bit and look a little at the language of RFCs, the canons of Internet technology, the standard specifications that are meant to be so good and so well written that, we tell ourselves, independent implementations, looking solely at the RFCs, can produce interoperating code. The whole idea of the RFC was not to just be some paperware template where, if you did this ideal thing, everything would happen. It's that people have used this specification and they have built code, and it works with itself, it works with other implementations. So you'd think that for something as critical as performance and resilience of the DNS, you'd find a specification in the RFCs that really does get down to the heart of this. How often should you check other name servers? How many name servers do you need? George Michaelson 12:27 Yeah, so these are not the kinds of things that, when you're writing a definitions document about behavior, you just make up your own mind on. You're looking for some commonality of expectation and behavior. You're going to nail down frequency and persistence and choice and algorithmic selection. It's "the left-hand side of the cow, visible from the train window, is brown" material. Nobody just says, there is a brown cow. They go as far as they can go. Geoff Huston 12:56 So if you were writing an RFC today, in 2025, and you were trying to talk about this, the IETF and its entire review process wouldn't let you get away with vague hand-waving. They would nail you down and go, right.
You've got to talk about seconds, timers, selection algorithms. You've got to get it to the point where anybody who has some slight code capability could write basically the same code, yeah. But it was a different world. In November 1987, when Paul Mockapetris wrote RFC 1034, "Domain Names: Concepts and Facilities", it was a very different world. And his standard said, now, I quote this because I think it's great: "The sorting of name servers may involve statistics from past events, such as previous response times and batting averages." George Michaelson 13:50 Batting averages? Geoff Huston 13:53 Now, I know Paul. He's not a cricketer, so I'm not exactly sure what he's referring to. Not being cricket, it must be something to do with baseball, I guess, batting averages. And that's the end of the advice. A different world. George Michaelson 14:08 Today it's, it's got to look like this on the wire. You are incredibly prescriptive about how you decide where it's going and why you picked it. You have a number of ways of choosing that I've written in the margin of this document, and there is not enough space to detail, Geoff Huston 14:22 all that kind of stuff. So I started to get curious about this, and I did a couple of bench tests on one of the most popular recursive resolvers. Quick digression: the DNS has a number of different components, typically produced by different people, and certainly operated by different parties. Inside your machine, whether it's a handheld device, a laptop, whatever, is what we call a stub resolver. It's a library that says, I need to resolve a name, who's going to help me? And normally, when you boot up, you get configured with a set of the addresses of so-called recursive resolvers, provided by your ISP normally, but other people can do it. And your stub resolver on your machine goes and asks a recursive resolver. It's normally given two or more, because when one doesn't answer, you go and ask the other one, resilience, and that's kind of the end of it.
So you get stub resolvers. Recursive resolvers do all the work, normally operated by your ISP, but other folk do it because either they're crazy or because they think it's a good thing, and they actually go out into the DNS and start doing the discovery thing: who is the set of machines that are authoritative for the name that I'm after? We call them name servers, but you don't know them in advance, because there are a lot of names, so you've actually got to discover which name servers to use, and that means a whole bunch of queries. You start at the root, query the root for the name servers for the next level down, and so on. Let's not worry about that. But let's look at the behavior of this recursive resolver when given a domain that has, I don't know, 60 different name servers. How many will it query before it says, I give up? Because I'm really concerned about resilience, I actually provision 60 authoritative name servers for my domain name, because I really, really, really care that this domain has to be always there. George Michaelson 16:22 I have a number of ancillary questions about this, but you know, let's just go with it. If there is an insane number of listed name servers, how far do you go is a really good question. Is it infinite? Because that, actually, Geoff, would be an attack, right? I could make a label and put 200,000 NSes in there. And if I could make every resolver in the world have to go down the list looking for number 200,001, I'd be wasting a lot of people's time. Geoff Huston 16:55 Yes. Look, the best answer is, give up. Just give up. So the most common recursive resolver code out there, we think, and it's kind of hard to do these censuses, but the most common, we think, is bind, originally the Berkeley Internet Name Domain software or something. Bind 9 is the most common one out there. And I tested bind 9. You set up a domain with two name servers, it asks the two and then says, neither of those is actually working, and stops.
Set up five, and it'll ask all five, 1, 2, 3, 4, 5, and it starts to take a bit of time. It asks the five over a period of nine seconds. Set up six name servers, and in 9.6 seconds it says none of those six were answering correctly. And you can watch it query all six, and it'll query each of them a few times just to make sure that they're really dead, not just unresponsive. I set up seven, it only queries six. Eight, it only queries six. George Michaelson 17:52 You performed the experiment, and you demonstrated it does have a built-in limit. Geoff Huston 17:57 It stops after 10 seconds, that's the first thing: it won't try forever. And secondly, this recursive resolver kind of goes, look, six. I did push it once into querying seven, and I thought, wow, success. But I had to configure 13 name servers for it to query seven of them. So if I'd set up 100, it would still go, after 10 seconds, that's your answer. If I can't get six or seven to respond, that's it. So the recursive resolver kind of says, no, I'm not going to do this forever. In essence, what's going on here is that there is an upper limit in some of these recursive resolvers that goes, look, time is not infinite. After 10 seconds we're going to stop, and I'm not going to query like a maniac, because that's a denial-of-service response. I'm going to pace through evenly, query every third of a second or so, and re-query selectively until I get to my time limit, and that's it. If I don't get an answer in 10 seconds, I'm going to say there is no answer. I've done what I can. It's gone. George Michaelson 19:00 So you're saying that the primary driver is the time bound for the complete function to be performed. And if, on average, the timer to give up and move to the next one is of the order of 10 seconds, it will naturally tend to hit six as a limit, because it's got an outer bound of a minute? Geoff Huston 19:21 No, no. You can set up the servers one millisecond away, and it will still be the same limit. It'll just do more queries.
So it's kind of got two things. I'm only going to query six, possibly seven, because after that there's no point, it's you, it's not the net. And secondly, there's an overall time limit for this exercise. And all these queries are measured: it's normally one query every 0.34 to 0.38 seconds, around three queries a second, a measured pace, and once you get to that 10-second time, which is approximately 30 queries, bind 9 goes, nah, that's it, there is no answer. So if you're configuring a name and getting it served on the net, how many name servers should you use? Well, we've gone through the first discussion that says two is better than one, three is better than two, four is better than three, five is better than four, and you keep on going. But once you get to eight, well, eight is better than seven, isn't it? The answer is, well, from bind's perspective, no. I'm not going to ask all eight, dude, you're wasting your time, I'm only going to query a subset of these. So don't bother doing more than six or seven name servers for your name, because literally, the clients aren't going to look. They just don't care. Now, that's bind. There's another one, and it was adopted by default in the FreeBSD distribution, so while it's not as common as bind, you know, it's not everywhere, it's certainly well used. And that's unbound. And unbound comes from a different world. I set up eight name servers, it queries all eight. Nine, it queries all nine. Thirteen, it queries all 13. When does it give up? What if I set up a domain that has 13 unreachable name servers? So I set up this domain and I go, query as much as you like, dude, you're never going to get an answer. So whereas bind says, look, after 10 seconds the world has moved on, not interested, I've got another life to lead, I've got other questions to answer, stop this nonsense, unbound goes, no, no, no, you asked me a question.
And I noticed, in doing these bench-top tests, 500 seconds, [George: no], 580 seconds, and the worst I found was 1,577 seconds. George Michaelson 21:47 It persisted in trying Geoff Huston 21:49 for 30 minutes, and over that period did 149 queries. It's kind of, you gave me a question, I'm a computer, I'm not stopping. I'm just going to do this until, you know, hell freezes over. Wow. Who is interested in an answer 30 minutes late? George Michaelson 22:05 Was it doing this asynchronously, having come back to the user with a failure state earlier than that? Geoff Huston 22:11 Oh, what actually goes on? Remember we said stub resolvers and recursive resolvers? Yeah. So the stub resolver sends a question to the recursive, and don't forget, the recursive can't make things up. If it doesn't get an answer, it can't report back to the stub resolver; it doesn't report failure. You can't make up that the name doesn't exist. That's a lie. [George: Yeah]. So if all of these name servers are unresponsive, then from the stub resolver's perspective the recursive resolver is unresponsive, and so the stub has its own fail-safe. Most implementations go somewhere between six and 10 seconds, and it's normally around eight. It goes, that's it, I'm going to go back to the application, whoever asked me to resolve this name, and say, it doesn't resolve. That's not the same as the name doesn't exist. [George: Yeah]. I didn't get told it doesn't exist, I just can't find the answer. It may be a non-existent name, but that's not something I can tell you definitively. What I can say is, I can't find it. [George: Yeah], a subtly different answer. So if I want resilience, I've got to factor in that stub resolvers and recursives tend to have a finite work capability, and after that they simply go, nah, not going to do this, going to give up. Yep, George Michaelson 23:37 yep. This feels like an interesting quality in its own right, that we have objective tests: how far will you go?
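The bind 9 give-up behavior Geoff measured, roughly three queries a second, at most six or seven distinct servers, a hard stop around ten seconds, amounts to a paced retry loop inside a wall-clock budget. A toy sketch of that shape follows; the function and parameter names are invented, and the constants are the approximate observed values from these bench tests, not configuration that bind actually exposes.

```python
import itertools
import time

def resolve_with_budget(servers, send_query, budget=10.0, pace=0.35, max_servers=6):
    """Sketch of a bind-9-style bounded resolution attempt: pace queries
    at roughly three a second, re-query only a handful of distinct
    servers, and abandon the whole attempt once the budget runs out."""
    deadline = time.monotonic() + budget
    candidates = servers[:max_servers]          # servers beyond the cap are never asked
    for server in itertools.cycle(candidates):  # walk the short list round-robin
        if time.monotonic() >= deadline:
            return None                         # out of time: report failure to the stub
        answer = send_query(server)
        if answer is not None:
            return answer
        time.sleep(pace)                        # measured pace, not a query flood
```

Unbound's observed behavior is, in effect, this loop with the deadline and server cap removed: it keeps cycling until something answers, which is how you arrive at 149 queries over 30 minutes.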
And we have at least two patterns of divergence: a time-driven limit that sets a fairly concrete end, and another suite that implemented the same behavior against an ill-defined "what to do" and made a different choice. Are there other implementations out there? Is there potential for a third way of dealing with this problem? Geoff Huston 24:08 Well, don't forget, the RFC said batting averages. It had no guidance. And so, in essence, every implementation, the implementer or the crew, have been creative. Now, by and large, programmers aren't known as creative types, for a very good reason: we're crap at it. In essence, this is a bit like NATs. As soon as you give folk some degree of latitude, they implement every known thing under the sun and every possible variation. So yes, folk do all kinds of odd things. So then, after these very simple bench-top tests that kind of go, this is not what you originally thought, more is better for resilience only up to a point, and while using batting averages, I still can't get over that, sounds great for performance, select the server that answers the fastest, the next question is, what actually happens? Time for, ta-daa, measurement. George Michaelson 24:20 Oh, good. I'm glad you brought that up, Geoff, given this is a measurement podcast. Geoff Huston 25:19 Time for measurement. And so what we did was actually pretty simple. We set up four name servers. And I'll say right now, because I think it's giving away some of the answer, we're using what we call unicast: each server is a machine at a point in the geography. One is in Atlanta in Georgia, one is in Frankfurt in Germany, one in Mumbai in India, and one in Singapore.
And when you ask for a domain served by these four name servers, it will send you back the IP addresses, or in effect it will send you back the names of all those four servers, in v4 and v6. So you'll translate those names into IP addresses, and you'll end up with eight IP addresses, four in v4, four in v6, right? [George: Yeah]. So what I want to do is test millions of users: how do they pick a name server? Which one do they use? And we've talked before, APNIC runs this ad-based system with the generous assistance of Google. We use Google's ads to enroll users to do some simple web fetches. But to fetch a web object, you've got to resolve the domain name, and to resolve the domain name, ta-daa, you've got to use the DNS. And currently we run about 25 million of these ads a day, and we pull in users from literally all over the planet, because, you know, ads. So we have these four servers, and this time, rather than being unresponsive, because that's kind of a bit of an attack really, they're responsive. They try and answer everything they get. George Michaelson 26:57 So you're not deliberately not answering. These services that have been configured to be asked are going to do best-effort service delivery. Geoff Huston 27:07 Yep, they're always going to answer. So we know which user got that ad, and we know from the query which DNS server, which name server, they're asking. And if batting averages mean performance, then the folk in East Asia, close to Singapore, should ask Singapore in preference to, say, asking Frankfurt or Atlanta. The folk in Europe should ask the server in Frankfurt, and so on and so forth. But also, just to make sure that each recursive resolver is being honest, we would expect each recursive resolver to occasionally query any one of the other three, just to make sure that the one it has selected still has the best batting average. George Michaelson 28:00 Right?
You would see a tendency to weight the traffic in a given geolocation. Okay, good question about BGP-to-geo mapping, but let's take it as read that there is this concept of close, and for the things that you think are close to Singapore, you would expect most of the queries to go to Singapore. That means they've selected a good one. But equally, you would expect to see periodic attempts to validate that it is the best one. They should be asking the others, but the intensity would be at the level of "are you any better?", not every single query. Geoff Huston 28:39 Well, you're trying to be the best for most of the queries. So as a rule of thumb, let's say on a very, very intensively used recursive resolver, with a very intensively used name, you'd expect it to ask the other three about, I don't know, once a minute, once every five minutes, but most of the time it would keep on pounding away at the one name server that appears to be the fastest. It's not routing, it's not geolocation, it's just the wall clock: how long did it take for this query to get answered? Whoever's fastest is the one I'll keep on sending queries to, and very occasionally I'll send queries to the others. [George: yep]. That's certainly logical, isn't it? So here we are with these name servers, and let me repeat this, because it's kind of interesting: Mumbai in India, Frankfurt in Europe, Atlanta in North America, and Singapore in the Asia Pacific. And we look at one of the largest, or most intensively used, recursive resolvers out there, the one run by Bharti Airtel in India. Interesting. They have a lot of people in India, a lot of Internet users. It's cheap, it's effective, and this recursive resolver handles a lot of queries. So when we look at the spread of these queries across our four name servers, what we expect to see is that the server in Mumbai should get hammered. The other three servers should get the occasional query, but certainly not be hammered. Okay.
So the one in Atlanta, over a, geez, 12-hour period, got 87 queries. This is good. The one in Mumbai got 300,000 queries. George Michaelson 30:25 Okay, that seems good. Geoff Huston 30:27 The one in Frankfurt got 581,000 queries, almost double. And the one in Singapore, which is still further away than Mumbai, got 611,000 queries. It's kind of, wow. So this resolver is certainly doing what we expect. It's asking all of them. George Michaelson 30:46 It's doing some of what we expect. Geoff Huston 30:50 But it's not latching onto the one that we know is, for them, the fastest to respond. When we look at timings, the one in Mumbai is the fastest to respond. But this resolver is not picking it up and putting all its queries there and only occasionally querying the other three. It's actually settled on the one in Singapore and the one in Frankfurt as the major preferences, with a much lower intensity, 8% of queries, on the one that really is the fastest, and it's only Atlanta that gets almost none. Okay, interesting. We see a lot of recursive resolvers. Can we take the busiest of those and look at their signature? In other words, whom do they attach to, and is this common? And it's a lot of work. We found around 166,000 unique resolver IP addresses querying in our little form of measurement, and we needed a pretty intensive query rate. So we took the top 1,000 resolver IP addresses, who between them varied from 36,000 queries over this 24-hour period up to 1.5 million queries. So these are the ones that really did query like crazy, where you'd think there'd be a bias towards performance, yeah, [George: yeah]. And we found that of those top 1,000, only 616, only two thirds, had what we called a strong attachment preference. In other words, they hit one of their four servers more than 60% of the time. So that's the first lesson. It's kind of, 40% of these resolvers actually didn't really care about performance.
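The two-step classification Geoff applies here, does a resolver send more than 60% of its queries to one server, and if so, is that server actually its fastest, can be written down directly. The function name is invented; the 60% threshold and the Bharti Airtel query counts are the ones quoted in the conversation, while the RTT figures are illustrative (only their ordering matters, with Mumbai the genuinely fastest).

```python
def attachment_profile(query_counts, rtts, threshold=0.60):
    """Classify one recursive resolver from its per-name-server query
    counts: strong attachment means more than `threshold` of all queries
    went to a single server; picked_fastest asks whether that favourite
    is also the lowest-RTT server measured independently (e.g. by ping)."""
    total = sum(query_counts.values())
    favourite = max(query_counts, key=query_counts.get)
    strong = query_counts[favourite] / total > threshold
    fastest = min(rtts, key=rtts.get)
    return {"favourite": favourite,
            "strong_attachment": strong,
            "picked_fastest": strong and favourite == fastest}

# The Airtel counts quoted above, with illustrative RTTs in milliseconds:
airtel = attachment_profile(
    {"atlanta": 87, "mumbai": 300_000, "frankfurt": 581_000, "singapore": 611_000},
    {"atlanta": 220, "mumbai": 5, "frankfurt": 140, "singapore": 60})
```

Run on the Airtel numbers, this reports no strong attachment at all: the busiest server, Singapore, carries only about 41% of the load, which is exactly the kind of resolver that falls into Geoff's "didn't really care about performance" 40%.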
George Michaelson 32:38 The thing is, it could be that there's a subtle bug in the way this stuff works that hasn't been unpacked. Or it could be that the tuning is just completely adrift from the reality of a network, and so whatever it's doing is maladapted to the behavior of a modern network. Or it could be some other thing. But I absolutely get it: 40% don't show a preference, and that is a huge number. That's huge. Geoff Huston 33:06 Right. They kind of spread their queries all over the place, when everybody should settle down and go, you, resolver, sorry, name server number three, you're the fastest, you get all my queries, and then I'll just occasionally monitor the other three to make sure that my choice is the right choice. That strong preference only happens 60% of the time. Okay, so spreading things around the world to cater for most people most of the time doesn't work the way you think. But then comes the next question, which is also interesting: of the ones that show a strong attachment preference, how many are making the right choice? For example, we have this name server, the authoritative server, in Frankfurt, and we look at a recursive resolver operated by an ISP in France, Free.fr. It's a pretty big ISP, it's got a fair chunk of the market share, [George: yeah]. And we go, well, I see you have a strong attachment preference, you really want to query one of these name servers. Whom are you querying? And it says, I'm using Atlanta. What? Hang on a second. You're using a name server that's 95 milliseconds away from you, because we can measure it the other way with ping, and you're ignoring the one in Frankfurt, which is only 9.8 milliseconds, 10 times faster. George Michaelson 34:27 So you do see the tickling, the testing queries. You see the low level of background where it says, what is the apparent response time? Or at least the belief is that's what it's doing.
But you're not seeing it skew to preferring the local one. Geoff Huston 34:40 70% of the queries from this recursive resolver head to Atlanta, Georgia, and it's kind of, I can see you're making an attachment preference, you're preferring one name server. It's just, you're making a bad choice, dude. And I go and look at all of those ones that show a strong attachment preference, right, and go, well, hang on, how many of those are right, have made the best selection? And again, an interesting answer. Only 40% of those resolvers have actually latched onto the one of the four that ping shows is the closest, is the fastest to respond. The rest, yeah, they kind of go, well, if it's within 150 milliseconds of the others, good enough. It's as if they're not actually counting actual time; they're counting in units of a sixth of a second, [George: yeah], or some other large number. And if two name servers happen to be in that same bucket, it kind of goes, flip a coin, who cares? George Michaelson 35:42 I was going to say, this feels very strongly like a pick-any-and-stay-there, a random selection from some kind of aggregate: you all fit into a set I'm going to call acceptable, simply pick one. And then it never arrives at a circumstance where "pick one" picks a different one from that set. In effect, it's done the initial round robin, or whatever the mechanism is, and whichever one drifted to the top, that's okay, I'm staying with you. Geoff Huston 36:13 And the unit of best, the unit of best, appears to be a granularity of around a sixth of a second, or worse. A sixth of a second, yeah, about 150 milliseconds. George Michaelson 36:24 That is an amazing number. Geoff Huston 36:27 There have been a number of other working groups in the IETF, and one of them was, of course, QUIC.
And indeed, there was a massive effort by Google that went under the code name of SPDY, where they obsessed about the billion-dollar millisecond, and they were trying to shave milliseconds out of the time from click to response, trying to get their entire system to go faster. So QUIC tried to cut down the number of round trip times in setting up the secure session; we tried to up the initial window size in TCP. We've been doing everything we can possibly do to make the Internet faster, and by faster, I'm not talking one second, 10 seconds or anything else. We are really talking milliseconds. The whole idea of CDNs, those content distribution networks, is to move the traffic closer to the user, so close that things happen literally in milliseconds. George Michaelson 37:28 I was going to observe that across the same time the DNS has had this "how do I pick the best", we've actually had exactly the same drive implemented in switching hardware and in routing technology and in proxies and intermediaries all through the Internet. We actually have software systems which are doing things like fronting for a web server, saying, I need to select the best option for you, and if that one is non-responsive, I need an alternative for you, so that I can give you high availability. The DNS is actually not that special anymore in needing to pick the best. But the interesting thing is, it might have been doing it for longer, and it might be using an older model of how do I do this. Geoff Huston 38:14 Oh, it's using horses in a world of very fast cars, or whatever it might be; it's a different generation. And I suppose the next question is, and this is the question that fascinates me: you seem to be measuring something that people don't care about, Geoff. Because if this was truly as bad as you're making out, and it is bad, then surely we would have fixed it. Surely the pressure would have been put back on.
The recursive resolvers would have been told: guys, better code, better code. Do this better. But it's the Internet, and there are 50 ways to solve every problem, and on the Internet, we try all 50. And so then I started looking at, because I can, what do people actually do to set up their name service? And again, that's a fascinating question. You see, the whole idea of 13 root name servers started with this model of 13 machines in 13 points around the world. And what you were trying to do was get at least one of them, as much as you could, close to most of the users. You use the one in America, I'll use the one in Australia. And in essence, everyone will get a fast answer, as long as the recursive resolvers are doing their work. So it's pretty clear that recursive resolvers are doing a pretty crap job. So what do you do instead to make this better? Well, let's look at the root zone. And the root zone is kind of interesting. It contains 1445 top level domain names. If you actually look at the root zone contents, you'll find 5998 different name server names. On average, every name in the root zone has a little over four name servers. George Michaelson 40:01 So the entities that register high in the name space take their obligation to be available seriously, and they've settled on a number around four as the way to say, we expect this to be highly available. Four gets us there. Geoff Huston 40:18 Well, oddly enough, 35% settle on four, 40% settle on six, and a whole bunch of others settle on other numbers, up to a maximum of 13. There's no agreement, but most names are either served by four name servers or six name servers. Interesting, but there's something else going on inside those choices of four or six, and that is how many are actually unicast, or have we all jumped over to anycast? The root itself has 13 different name servers by name. And up there, there are approximately 26 different IP addresses, 13 v4 and 13 v6. But each IP address is anycast.
It's actually injected into the routing system in a whole bunch of locations. I'm not sure I have the exact number, but the root zone, I think, has around 4000 separate machines. George Michaelson 41:14 At this point, we should probably observe anycast is one of those things where it's, in effect, a statement: it's available at these anycast locations. And what it means is, I might be selecting one place as the closest one, in BGP distance terms, to be the place I go and get it from, and you might be selecting another. There's no intermediary who, at the point I ask the question, says you flick left, you flick right. It's just innate in the routing system. It's made an "optimal", in air quotes, selection for me, and if the one that was optimal goes away, and the routing system knows, it finds another optimal one, and routing takes care of this decision. So "best" is sort of nobody's tasting the packet delay. This is just the routing system saying, yeah, that one. Geoff Huston 42:09 Yeah, that one. And on the whole, as long as you have, I suppose, enough variation, enough density of deployment, the routing system will do a pretty good job of getting you to the one that really is the lowest delay. Even though we don't route on delay, with a sufficiently dense anycast deployment, it'll end up sending you to the one that is the quickest to get to. On the whole you need a pretty dense deployment, but you know, it'll kind of work, yeah, George Michaelson 42:39 yeah. It'll both be the fastest and it's reliable. It gets two goals out of this. Geoff Huston 42:46 Interesting. So I don't need to worry about how recursive resolvers perform or not perform if I serve my domain name using an anycast system. Hmm, let's go back to that root zone, which had those 1445 domain names with 5998 name servers, of which 5687 names were dual stacked, 321 were v4 only, and, phenomenally, three of them were IPv6 only. George Michaelson 43:19 Brave, very brave. Geoff Huston 43:21 I thought so too.
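Tallies like the ones Geoff quotes (name servers per TLD, counted out of the root zone) could be reproduced with a sketch like the following. The zone lines here are made-up examples; the real root zone is published as a plain-text zone file by IANA:

```python
# Minimal sketch: parse root-zone NS records and count name servers per TLD.
from collections import Counter, defaultdict

def tally_ns(zone_lines):
    """Map each TLD to its set of name-server names from NS records."""
    ns = defaultdict(set)
    for line in zone_lines:
        fields = line.split()
        # A zone-file NS record looks like: "example. 172800 IN NS a.nic.example."
        if len(fields) == 5 and fields[2] == "IN" and fields[3] == "NS":
            tld = fields[0].rstrip(".")
            if tld:  # skip the root's own NS records ("." becomes empty)
                ns[tld].add(fields[4])
    return ns

# Illustrative zone fragment, not real data:
zone = [
    "example. 172800 IN NS a.nic.example.",
    "example. 172800 IN NS b.nic.example.",
    "test. 172800 IN NS ns1.test.",
]
counts = Counter(len(servers) for servers in tally_ns(zone).values())
print(counts)  # distribution: how many TLDs have 1, 2, ... name servers
```

Run over the real root zone, the same distribution is what yields the "35% settle on four, 40% settle on six" figures above.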
And the question is, and it's an interesting question, can we tell which of those addresses are actually anycast? Because whenever you sort of set up a measurement point, the routing system will get you to the closest one. You can't tell if there are more. George Michaelson 43:38 If you only have one point of measurement. Geoff Huston 43:41 Right. You can't say to the routing system, take me to the alternate, any alternate. The answer is no; routing picks just one. George Michaelson 43:48 We do have multiple points of view into routing, so the collectors worldwide should be able to give you at least some belief, you know, that some things are anycast. Geoff Huston 43:59 Well, we actually ask the DNS, that's one of the other things too, because in the DNS, there's this flag in a query that says, tell me your unique identifier, your name server ID. And if you query in anycast, let's say you are where you are in one part of Australia, I'm in another part, and there's an anycast system in operation in the DNS that has a different server for you than the one that serves me. If both of us ask this IP address for the NSID, the name server ID, we'll get back a different answer. George Michaelson 44:33 If they are well behaved operators and they honor the principle: you shouldn't lie about that value. Geoff Huston 44:40 You shouldn't use a common NSID, because it'll make your debugging a nightmare, as well as everyone else's. So you know, don't do it. You're shooting your own foot off as well. So yes, let's just kind of assume that. So let's do both. Let's take our four points that we're measuring from and ask each of these thousands of name servers what their NSID is.
And also do ping tests, because, you see, if it is just at one point on the globe, and we measure from four points all around the globe, we should see a pretty high variation in ping times. Whereas, if the thing is well anycast, each of those four locations, Atlanta, Mumbai, Singapore and Frankfurt, should go, oh, this is right beside me, 10 milliseconds, dude. It will appear to be close to all four at once. George Michaelson 45:30 Yeah, that seems reasonable. You've got four widely distributed points of measurement. The likelihood of a poorly designed anycast being equally distant from all of them is low. If it was weighted to Asia, the ones in Europe and America would see the difference. So in order to be well designed, it has to have sufficient coverage globally to be approximately the same distance cost to any of these. I mean, I could argue that eight would have been better than four. We're down in the weeds. The principle is sound. You should be able to tell. Geoff Huston 46:01 You should be able to tell. And so I put this together. I've got 9014 different IP addresses, because I just ignore the distinction between v4 and v6 and go, well, they're all just IP addresses. So I test all 9000-odd. Interesting: 587 of the IP addresses for domains in the root system are clearly unicast, same NSID no matter where you ask from, and the variance in round trip time is well over 150 milliseconds. So obviously my packets are traveling to one point, and where I'm far away from that point, no matter where it is, it will take me some time to get there. Now, what if I take this arbitrary number, 150 milliseconds, and go: if the maximum variance between those four pings is less than 150 milliseconds, I'll give your anycast system a big tick. So different NSIDs and an RTT variance of less than 150 milliseconds, I will call you diverse. Excellent.
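The classification rule just described (compare NSID answers and ping RTTs from four vantage points) can be sketched like this. The threshold and category names follow the discussion; the data structures and example NSIDs are assumptions for illustration:

```python
# Illustrative classifier: probe one server IP from four vantage points,
# collect the NSID answer and a ping RTT from each, then label the address.
VARIANCE_MS = 150  # max RTT spread allowed for "diverse" anycast

def classify(probes: list[tuple[str, float]]) -> str:
    """probes: (nsid, rtt_ms) pairs, one per vantage point."""
    nsids = {nsid for nsid, _ in probes}
    rtts = [rtt for _, rtt in probes]
    spread = max(rtts) - min(rtts)
    if len(nsids) == 1:
        return "unicast"          # the same server answers everywhere
    if spread < VARIANCE_MS:
        return "diverse anycast"  # different servers, all nearby
    return "limited anycast"      # different servers, but some far away

# Four vantage points, e.g. Atlanta, Mumbai, Singapore, Frankfurt:
print(classify([("fra1", 9.8), ("atl1", 12.0), ("bom1", 15.0), ("sin1", 11.0)]))
```

A unicast address would also show a large RTT spread, but the identical NSID already gives it away, so the spread check is only needed to split the two anycast grades.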
There are, of course, other folk who think that setting up an anycast system with one node in California and another node in Washington is a fine anycast system. It's not. It's just not. George Michaelson 47:13 Well, it is for your market. It is for your market in America, but I'm less sure for anywhere else. Geoff Huston 47:18 Yeah, the rest of the Internet kind of goes, you know, no. And for those folk who give me different NSIDs, but their round trip time variance is greater than 150 milliseconds, let's call them limited anycast. It's sort of there, but it's sort of not that good. So I go back to my root zone and go, well, obviously the folk who have got themselves into the root zone want to maximize performance and resilience. And these days, if you're using unicast name servers, the recursive resolvers are actually working against you. They're not helping. And so you're really behind the eight ball. It's not working. But for eight TLDs, that's what they're still doing. Eight of them. That's kind of, guys, don't. Stop it. That's crazy. You've invested all this effort in being a top level domain, and your infrastructure for serving them is just, George Michaelson 48:11 I would have thought by now the contract with ICANN would have required them to do a little better. Geoff Huston 48:17 No, no. Batting averages. There's no spec on this. There's nothing, no standard that anyone can point to, because no one has ever written this down as a specification. It's kind of word of mouth. George Michaelson 48:29 ICANN lawyers can't obligate them to do a thing that hasn't been well specified. Geoff Huston 48:33 Yes, it's equivalent to making stuff up, which, even for lawyers, is a little bit rough. So eight still do only unicast. Interestingly, of those 1400 or so, 378 actually have a mix of anycast and unicast. And I kind of wonder why, because if your anycast system is good enough, you don't need unicast. [George: Yeah]. And the other 1067? Interesting.
289 are doing an amazing job. They're served by diverse anycast only. So 289 domain names are served by servers that are close to the major continents. They're in there, and they're being well served. Great. There's a similar number, 202, that are served by very limited anycast. They're within a particular country or a particular region, and there's nothing remote. So it's kind of not very good anycast; it's giving poor performance. And the rest, 576, are kind of a mixture of both wide and local anycast; they're diverse. So now we start to look a bit deeper into this and go, well, when I set up a name server in the root zone, do I use just one provider for my name servers, or have I learned the lesson about resilience and performance and use multiple providers from different ASes, using anycast? George Michaelson 49:22 They're getting some resilience against failure in the routing plane by picking diverse operators of the routing mechanism that already provides diversity. It's not just: I am anycast and have diverse platforms that are optimal in routing. It's: I have multiple diverse platforms from multiple different BGP speakers, BGP configurers, giving me two levels of resilience. Geoff Huston 50:36 Right? So there are only six TLDs in the root zone where there's just one provider giving them their name servers. Five are unicast. Wow, that's crazy. One of them is diverse anycast, but they still only have one. So if that provider goes, you know, oops, I'm having a bad day today, they're gone. They're out. So there are only six of them that have gone with one. So how many is the right number? Well, I would actually argue two, because, you know, if you're diverse, the chances are George Michaelson 51:08 Geoff, that's a very specific number. Why did you pick two and not 13?
Geoff Huston 51:13 Because if both are down at the same time, it's a cosmically bad event, or it's you. It is extremely unlikely for an act of a malicious deity to take those two out and not all the rest. So if you're after realistic resilience, two is a fine number for anycast, because it's unlikely those two will be reliant on a single critical piece of infrastructure. You've got diversity, and when one's down, the other's still up. Three? You're just wasting the money. And the way it works is, there are about 600 of these top level domains that are served by two. 400 are served by four. Why four? Do you enjoy the paperwork? What is going on? George Michaelson 51:59 The logistical arithmetic you've performed suggests the additional benefit is declining. And it may be that it's performant, and it's a belief that more is better, but there's actually no objective improvement. Geoff Huston 52:13 But there's a cost. Yeah, it's not getting you anything better. There are 250 that are served by five, and there are even 80 served by six different providers. On average, each of them has 10 distinct IP addresses, and we said before the most common recursive resolver, BIND, goes six, seven, I give up. So overkill, extreme overkill. So what would I suggest now, if I try and wrap all this data together? DNS recursive resolvers don't do a good job. Don't count on them. Nothing is going to change, because the clients all use anycast. So it doesn't matter what the recursive resolver does in trying to select the best anymore, because we've moved on. So just accept it for what it is. George Michaelson 53:03 You're not arguing for a heavyweight process in standards to nail down what the recursives should do. You're actually arguing, let's pull the capability to do this out from the system and not require this, because it's better done in a different place. Geoff Huston 53:19 I'm arguing that, informally, the community have moved on.
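The "two is enough" argument is back-of-the-envelope probability: with independent providers, both being down at once is the product of their individual outage probabilities. The availability figure here is an assumption purely for illustration:

```python
# If two providers fail independently, each up a fraction `a` of the time,
# the chance both are down simultaneously is (1 - a) squared.
a = 0.999  # assumed per-provider availability (99.9% uptime)
both_down = (1 - a) ** 2
seconds_per_year = 365 * 24 * 3600
print(f"joint outage fraction: {both_down:.1e}")
print(f"expected joint downtime: {both_down * seconds_per_year:.0f} s/year")
```

A third independent provider shrinks an already tiny number by another factor of a thousand, which is Geoff's point: past two, you're mostly buying paperwork.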
And if you're going to write the Bible of how to do it, you would say anycast, and you would simply say, oh, you guys in recursive resolvers, continue on with your batting averages. Fine concept. It just doesn't matter, dude. It just doesn't matter. George Michaelson 53:37 I think I'd probably be slightly different in this, Geoff. I heard a lot of magic numbers. There's a word I love here, reified, and it's when something which is an abstract quantity, like 13, is converted into some binding magical quality. It shall be 13. You and I know that 13 is because of beliefs around constraints on packet size bootstrapping the DNS when it was UDP, 512-byte packets. But the fact remains that 150 milliseconds of observed period that puts things in the same bucket as "good", I'd probably say, let's dial that back and put it closer to 15 milliseconds. Let's make a distinguishing mark here in measurement that is a lot closer to a light-speed distance that stays within a continent or a technology basis. But the thing is, I'm falling into another sin. I'm believing I can fix it in technology. What you're doing is actually fixing it in politics, sociology and behavior. You're saying, stop trying. Geoff Huston 54:42 Yeah, and don't assume they do anything. So if they're not doing the job, you have to do it with anycast. And anycast is a fine solution for the DNS. So the trick about anycast, to make it really work, is to pick high density anycast constellations. I have ten nodes in my anycast service? Now, go away. George Michaelson 55:05 I have 300. Oh, I'm going there. Geoff Huston 55:08 Even 300 is kind of getting there. I have 1000? It's kind of, come on down, let's talk. Because once you get to that point, don't forget, there are only 6000 transit networks on the planet. Everyone else is an end point. Once you have around 6000 points of presence, you are literally everywhere, [George: yeah], network wise. So dense anycast is where you want to go.
And once you've got dense anycast, it doesn't matter what the recursive resolver does; you're right beside them. Hi, I'm here. Here's an answer. Limited anycast? I've got Italy covered top to bottom. Well, what about the rest of the world? Ah, I don't care. I don't care. Yeah, don't go light with anycast. Do the lot. Be everywhere. George Michaelson 55:53 And probably a corollary of this is, if you've gone with two anycast providers, think about why you have so many NS labels, because you don't need them. Geoff Huston 56:04 Oh, God, you don't. You just don't need them. Two anycast labels, four IP addresses: two in v4, two in v6. You're done. Yeah. Anything else is just wasting everyone's time, including the stub resolvers. So anycast is good, and the denser, the better. Use them. And resilience is about multiple anycast platforms. And how many is enough? My own view is, okay, Dijkstra, the great, renowned computer scientist, once said there are only three numbers in computing: there's none, there's one, there's more than one. And I can see what he's saying. But I'm going to offer one more number, and it's not 13. I'm going to offer two, two as failover. And as long as you pick your two carefully, so that they don't have a common point of dependency, then realistically, if you can't get to both of them, if both of them fail, you're the problem. It's you that's failed, not the outside world. So quite frankly, two highly diverse anycast platforms should see you through with serving a name with both resilience and performance. So I've looked at the root name service system, and in the next piece of work, just to make sure that the computers we have at our fingertips are not sitting idle, just counting the time, I'm going to take the top 1 million domain names and perform a similar analysis: if you take what the world actually uses, is this true for everyone, not just the names in the root zone?
Do we all do anycast these days? [George: Yeah]. George Michaelson 57:38 What's going on? I feel a follow-up coming, Geoff. I think this is one for another day. Geoff Huston 57:44 It may well be. It's fascinating work to actually understand how much the DNS has evolved, almost without specification. It's a community that's kind of constantly talking to each other and looking at what each other does, and it's moved on. The scribes and the standards people are running a little behind, but it's sort of, nah, dude, we've moved on. We're doing it this way now. George Michaelson 58:07 Yeah, keep it simple. Nice work, Geoff. I think that's a really nice piece of measurement. Geoff Huston 58:12 Thank you. Thank you. And for those of you who hung around this long, thanks for following through. George Michaelson 58:19 If you've got a story or research to share here on Ping, why not get in contact by email to ping@apnic.net or via the APNIC social media channels. Also remember that the measurement@apnic.net mailing list on Orbit is there to discuss and share relevant collaborative opportunities, grants and funding opportunities, jobs and graduate placings, or to seek feedback from the community on your own measurement projects. Be sure to check out the APNIC website for all your resource and community needs. Until next time. Transcribed by https://otter.ai