Duane Wessels 0:00 As Moritz said at the start, if we talk about the root server system and all of its instances and anycast sites and whatnot, it's on the order of 2000, say, today, right? Which is a lot. And then if you look at the RSSAC 047 specification for how you build this measurement platform, it specifies 20 vantage points. That's two orders of magnitude smaller. That, I think, is unfortunate in a way, because it does come down to money. You know, how much money are you willing to spend on a measurement platform and these vantage points? And I certainly think that, you know, the more vantage points you have, the higher quality measurements you're going to get. George Michaelson 0:45 You're listening to Ping, a podcast by APNIC, discussing all things related to measuring the Internet. I'm your host, George Michaelson. This time, I'm talking to Duane Wessels from VeriSign again, along with Moritz Müller from SIDN Labs in the Netherlands. It's a strangely recursive conversation: we're talking about measuring measurements. We discuss RSSAC 047, a document of the Root Server System Advisory Committee, to understand four critical qualities in root DNS operations. The four measurements and performance metrics are carried out against all 13 root server letters using systems developed by ICANN. VeriSign and ISC commissioned SIDN Labs to look at the measurement and, in effect, measure how well it was measuring. The results are interesting and point to some issues in software and in the nature of monitoring large distributed systems. Moritz, Duane, welcome to Ping. Could you introduce yourselves and tell us a little bit about what you do? Moritz, you first. Moritz Müller 1:52 Sure. Thanks for having us. My name is Moritz Müller. I work for SIDN, which is the registry for .NL. I work in the research department there at SIDN Labs, I'm also affiliated with the University of Twente, and I'm one of the co-chairs of the DNS Working Group at RIPE.
Duane Wessels 2:08 And I'm Duane Wessels. Thanks for having me back, George. I work at VeriSign, where I do a lot of work on the root zone, root servers, and data analysis and measurements. So thanks for having me back. George Michaelson 2:21 Moritz, you recently authored a blog article at APNIC that's about monitoring distributed DNS deployments, in fact the highly distributed ones, typically things like the root servers, where there are hundreds and possibly thousands of nodes. And I wondered if we could talk a little bit about that. This was something that came up as a result of a request from VeriSign and ISC, is that right, Duane? Duane Wessels 2:48 We'll be talking today, I think, about a document called RSSAC 047, which describes various measurements and performance metrics for root servers. That work was done in ICANN over the last few years, and VeriSign and ISC asked SIDN and their collaborators to sort of check the results that we were seeing from those measurements. And that's what Moritz is here to tell us about. George Michaelson 3:10 And the collaborators would be NLnet Labs, Moritz. Is that right? Moritz Müller 3:14 Correct. George Michaelson 3:15 This is quite an interesting moment, because measurement is something that we all do. I mean, that's what Ping is all about. But there's this quality that people define a set of measurements, but then you kind of want to sit back and say, hmm, have we got this right? I think that's really very interesting. So you actually did some practical activities, didn't you, both to assess whether the measures could be made and also their effectiveness? Could you talk a little bit about that? Moritz Müller 3:40 Sure. The measurements themselves have different goals, to measure different performance metrics of the root servers. So there are already two different aspects to it.
There's the measurement part, and then aggregating these measurements to come to a certain metric that then tells people whether the root server system performs in a certain way. There are, of course, many different interesting aspects to that, because the root server system is highly distributed. It has 13 different servers, which are distributed across many, many different sites. I think there are almost 2000 nowadays. And measuring that reliably is apparently quite challenging, and we had a look into whether the measurement system that is already being operated in a test mode by ICANN achieves the goal of measuring the root server system in a reliable way, and also whether the metrics actually reflect what has been measured. George Michaelson 4:35 So ICANN developed their own measurement software. I think that's available in the public domain, isn't it? You can download that. Moritz Müller 4:43 That's correct. So I think they took the RSSAC 047 document and wrote a quite literal implementation of this document, and then they set up this software on a number of different nodes, on 20 different nodes. And they perform measurements every five minutes towards all the different root servers, across IPv4 and IPv6, and UDP and TCP. You can think about SOA queries, for example, to check whether the root servers respond with a certain zone file, for example. George Michaelson 5:18 Duane, can you tell us a little bit about the goals of this exercise? Because RSSAC 047, it says metrics at the root. So I'm guessing there's like a definitional set of things here that people are trying to capture. Duane Wessels 5:31 Sure, that document specifies four types of measurements or metrics that are applied both to individual root server identities, like A root, B root, and to the root server system as a whole.
So those are availability, request latency, correctness, and then what we call publication latency: ensuring that the root servers are serving a sort of up-to-date version of the root zone. George Michaelson 5:56 This system is quite complex in some ways, because you have broadly independent agencies running their own choice of software, their own choice of routing infrastructure, their own choice of hardware, and they all have to try and deliver, as much as possible, the same thing. But the same thing has this other dimension, that it's constantly being updated. So if you imagine someone like J root, that you operate as VeriSign: you have hundreds of machines that have to offer this zone file state, which itself is constantly changing, so you have to find a way to cohere and eventually be consistent for a version. But outside of that, all the other letters, A root, B root, L root, M root, also have to converge to the same version of the zone that you're serving. It's like two dimensions of convergence, isn't it? Duane Wessels 6:48 It is. You know, one thing that makes it sort of, I don't want to say easy, but, in the case of the root zone, you know, the root zone doesn't change really all that often. Generally, about two times a day a new version is published. So it's pretty stable, and it usually doesn't take very long for a new zone to propagate out to all of the root servers and all of their instances. That usually happens pretty seamlessly and pretty quickly. But as we'll hear, you know, that's why this document exists: we need to measure it, right? I can't just say that it works that way. We need measurements to show that it's working the way we want. George Michaelson 7:16 There's a set of four top level measures, and I'm guessing there are actually much more fine-grained specific things that are done to enact the measurement against those criteria. Moritz, could you give us an example of the kind of things that are done?
Moritz Müller 7:30 So, for example, for the response time measurements, all the different vantage points that are being used in the measurement platform send an SOA query to all the different root servers every five minutes, and then basically measure how long it takes for the root servers to respond. It does this, as I mentioned, across IPv4 and IPv6, and UDP and TCP, and it then uses these measurements across a period of a whole month to calculate the response time for each root server letter, so A to M root, and also for the root server system as a whole. And it uses a certain formula to do that as well, and then it reports, basically, in the end, whether the root server letter or the root server system as a whole met a certain threshold. And next to that, there are also traceroute measurements that are performed from each vantage point to the root servers. And there it tries to test whether, I guess, the network paths towards the root servers are functional. George Michaelson 8:33 Root servers, if I remember correctly, have the option of actually giving ID information in a specific query. In general, they're trying to look like just an anonymous root server, but they actually can say, literally, I'm the one at LAX, or I'm the one in Düsseldorf, can't they? So were you also querying to understand which specific server you had successfully connected to? Moritz Müller 8:57 Interestingly enough, the initial implementation did send out a flag, or an EDNS option, to also ask for the NSID information from the server, but the system itself did not use this information. So when we looked deeper into the measurements, we also made sure that this information was being processed, so we could be sure which server was actually reporting, for example, a high response latency, which then allowed us to dig a bit deeper into why maybe certain timeouts occurred and why others performed well.
George Michaelson 8:57 So in a broad brush sense, it's like, I mean, I'm expecting, as a consumer of DNS services, there's somewhere I can go and look, there's a dashboard to see this behavior? Or is this a report that's going into an ICANN process as sort of a contract compliance thing? Because it would be incredibly interesting to be exposed to this measure. Is there a view of this? Duane Wessels 9:15 There is not a view, I think, the way that you're thinking of, George. The impetus behind RSSAC 047 was really driven by some ongoing work in ICANN on root services and governance, and that is still ongoing, right? But 047 doesn't provide like a dashboard; as Moritz said, it provides monthly reports, and those reports are simply sort of pass/fail on these various metrics. So it's, [George: yeah] one of the specific non-goals of 047 was not to, like, compare the different root servers, to say, you know, B is better than C; it's just pass/fail for all of them. George Michaelson 10:31 Yeah. So it's not a beauty contest or a ranking, it's a compliance thing. Maybe there's a future state where this can start to be part of the public reporting. But I think there's a trend in industry at large that when people start saying, we want to look at the governance questions, it's a really good idea to give the community an opportunity to lay down some tracks for what are the baseline expectations and behaviors. So this feels to me like the way things work in general. You kind of want this to happen, and it's good that you guys came together and defined a set of measures. So two things going on: a measurement activity, and a bunch of people measuring the measurement. So Moritz, what did you find? Moritz Müller 11:13 Yeah, so VeriSign and ISC approached us because, I think, they, and maybe other root server operators, stumbled upon two things.
And the first thing was probably that the reports noted that the root server system as a whole did not perform as well as expected. So the root server system did not meet the availability threshold on multiple occasions in multiple months, which I guess came as a surprise, at least to us as researchers, because the root server system is highly distributed and people know what they're doing. So this came as a surprise. And a second thing that popped up while we were doing this research was that one of the root server letters did not publish a number of root zones on time, and this is what we studied. We wanted to understand whether the delays were actually caused by the root server system, or whether they were caused maybe by some flaw in the measurement platform. And we also wanted to understand whether the measurement platform actually noticed that one of the root letters did not publish the root zones on time, or whether this maybe was also caused by some other issue. George Michaelson 12:24 So before we go too far down that path, I think the important thing to understand as a consumer of DNS is that with 13 letters available in a highly distributed system, it's obviously not great if one of them is having some problems in terms of availability. But in terms of end user experience, it's incredibly unlikely anyone mechanistically depending on DNS would have had any kind of problem, because the root server selection behavior would simply have found an alternate, wouldn't it? This is not a problem at the level of availability of service to the global Internet community, in that sense? Moritz Müller 13:01 Yes, I think this is correct. Also, the root servers are not queried that often by recursive resolvers, compared to, for example, TLD name servers. So indeed, the impact probably would not have been that high.
George Michaelson 13:15 So were you able to unpack whether this was a problem in the measurement framework, or was actually a real problem at scale with coherence and publication delay? Moritz Müller 13:26 We came to the conclusion that it was probably not a big issue, or caused by problems at the root servers. But of course, it's very hard to say that with 100% certainty in hindsight, because we had to look into all the raw measurement data that was being collected by ICANN, which they also provided to us. And we also took other sources to verify whether we could also see issues with the root server system. And there we found that many of these timeouts were more likely caused by the measurement platform itself, and not so much the implementation, but more the nodes on which the implementation was running. So they are relying on a bunch of different hosting providers, virtual machines sometimes. And we saw that at least some of these nodes had quite some availability issues themselves, which then looks from the outside as if the root servers were unavailable, where, in fact, probably the vantage point that carried out the measurements had some connectivity issues. George Michaelson 14:26 We've had a conversation recently with Cristel Pelsser from Louvain University, who's looking into BGP, and she had quite interesting ideas about the impact of selection of your vantage point into a distributed system, and the impact that has on the nature of the data you see. So I think it's kind of interesting that you had an analogous situation here. Duane, do you think that there's a bit of a latent problem here, that the amount of investment you make in a measurement framework probably has to be comparable to the quality of investment you make in the actual platform itself? Duane Wessels 15:01 I do actually, yeah, I think that's a very good point.
I mean, as Moritz said at the start, if we talk about the root server system and all of its instances and anycast sites and whatnot, it's on the order of 2000, say, today, right? Which is a lot. And then if you look at the RSSAC 047 specification for how you build this measurement platform, it specifies 20 vantage points. That's two orders of magnitude smaller. [George: right] That, I think, is unfortunate in a way, because it does come down to money. You know, how much money are you willing to spend on a measurement platform and these vantage points? [George: Yeah] And I certainly think that, you know, the more vantage points you have, the higher quality measurements you're going to get. George Michaelson 15:39 But reading the report, I got a feeling that there were also multiple dimensions to what good measurement points are, because you need the points to reflect the volume of query and the attraction of query to the node. It's kind of, you want them to be important to the clients as much as you want them to be widely distributed. And there are some risks in choosing points. If you deliberately chose points of measurement that are located behind, say, VSAT networks in island communities in the Pacific, you're going to have a massive component of your measurement behavior that's really nothing to do with the root server; it's about that location. So diversity of measurement points, yes; scale, yeah, I can see that; but I'm not sure it's as straightforward as "more". Duane Wessels 16:24 When we were writing that document, it's something that we spent a lot of time talking about, and in a way we sort of struggled to come to a conclusion, you know? The conclusion that we did come to was that there should be 20, and they should be spread among these geographic regions evenly. [George: Yeah] That's one way to do it, but it's maybe not, you know, it's maybe not the way that you would do it, or that I would do it myself if I was, you know, starting from scratch.
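Duane's point about vantage point selection can be made concrete. RSSAC 047 spreads the 20 vantage points evenly across geographic regions; an alternative, sketched below, would allocate them in proportion to each region's share of query traffic. The region shares here are made-up illustrative numbers, not real measurements, and the largest-remainder method is just one reasonable apportionment choice.

```python
# Toy allocation of 20 vantage points: proportional to each region's
# (hypothetical) share of query traffic, rather than evenly per region.
def proportional(shares, total):
    """Largest-remainder apportionment of `total` slots by fractional share."""
    quotas = {region: share * total for region, share in shares.items()}
    alloc = {region: int(q) for region, q in quotas.items()}
    leftover = total - sum(alloc.values())
    # Hand any remaining slots to the largest fractional remainders.
    for region in sorted(quotas, key=lambda r: quotas[r] - alloc[r],
                         reverse=True)[:leftover]:
        alloc[region] += 1
    return alloc

# Illustrative traffic shares only, not real data.
shares = {"AF": 0.05, "APAC": 0.30, "EU": 0.25, "LAC": 0.10, "NA": 0.30}
print(proportional(shares, 20))
# {'AF': 1, 'APAC': 6, 'EU': 5, 'LAC': 2, 'NA': 6}
```

Either scheme is defensible; the trade-off George raises is exactly that a traffic-weighted allocation better reflects what clients experience, while an even spread better covers the long tail of hard-to-reach places.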
So there are different trade-offs you would make in choosing these vantage points. You know, that's honestly one of the things that we wanted Moritz and their team to study: are the vantage points distributed the way that they should be? Did we get that part right? George Michaelson 16:59 Moritz, have you considered use of mechanisms like Atlas to get a higher volume of measurement, perhaps to run asynchronously alongside this platform? Moritz Müller 17:08 We actually used RIPE Atlas to confirm whether the root servers had a timeout or not. So the idea here was to use kind of the swarm intelligence of RIPE Atlas vantage points, and to see whether they also observed timeouts for the root servers at the same time the initial implementation of ICANN also observed timeouts, because the RIPE Atlas measurement platform has built-in measurements towards all the root servers itself. So we could download all the measurement data from RIPE Atlas and see whether we would suddenly see a drop in reachability by RIPE Atlas probes for a certain site at the same time we saw unavailability issues with the root server system in the initial implementation. And here it showed that it was indeed useful to have a large number of vantage points, because we think thereby we got a quite strong signal in some cases where the root server system, or a certain node of the root server system, actually became unavailable for some reason or another. That turned out to be quite useful. But I guess the RIPE Atlas platform has some other issues as well. For measuring availability it was probably fine, but if you would measure things like response time in general, it would probably have been... George Michaelson 18:25 They're incredibly small devices, and they are quite bound up in their own clock cycle and timing. So I think you would want to have perhaps the Atlas anchors, or larger machines at scale, playing a role.
But more to the point, given this is a governance question and this is a public process outcome, because RSSAC 047 isn't just some random document, it's the result of a group decision about what should be done. And so to the extent you're saying maybe there needs to be improvement here, we're really looking at a -bis process on this document, aren't we? There's got to be a discussion to revise the basis of measurement. Do you think that could happen, Duane? Duane Wessels 19:04 I certainly think that with the report from Moritz and the consortium, there will be a revision of the document to update a few things as well. George Michaelson 19:12 Moritz, without an infinite list of things you didn't look at, there were some specific things, you know, that you didn't address here, weren't there? Moritz Müller 19:19 Yeah, that's correct. So, for example, we did not look into the correctness measurements that also should be part of the initial implementation, which are actually not implemented yet. And we also didn't look into other aspects, like the cost of the deployment and how to scale things. We mainly focused on the two aspects where operators found that the results were a bit incoherent, and we didn't look specifically into the other aspects, yeah. George Michaelson 19:46 Well, that seems like a reasonable prioritization, because if you know there are questions on the table, you want to address those immediately to make sure that things are on track. But it sounds like there might be a second round of activity here, because clearly, having four defined measures, if one of them is not available, that's something that's going to have to be addressed, isn't it? Do you think that's likely, Duane? Duane Wessels 19:46 You know, one of the things that I was particularly interested in was to get Moritz and his team's perspective on whether 047 got some of the formulas and metrics right.
We've talked about that a little bit already with respect to the publication latency, but also, for me, you know, the availability formulas are kind of tricky, and I'm not sure that we have the right thresholds, for example. So Moritz said that in many of the months, the root server system did not meet the defined availability threshold, which was 99.999%, so five nines of availability, which is a pretty high bar. [George: I have to say that is an astronomically high goal. I mean, when we discussed this inside APNIC, we were very explicit that was simply unachievable at any cost basis we could afford to implement. Even four nines of availability is a real struggle to implement; you have to do so much work.] Yeah, so that is the bar that, you know, the RSSAC caucus set for this specification. But as we found, it's hard to meet. You know, we come up with these formulas and rationales, but I suspect that if we go back and revisit this, that may be one of the discussions we have. George Michaelson 21:17 Yeah, I think this is quite interesting, because to come back to the governance point, it's the kind of thing that we can talk about round the table and say, well, these are just incredibly difficult to do, but it's kind of a literal requirement. It's a goal here. This isn't something people just randomly said. There was a reason behind picking that number, and so entering a room to say, maybe we got that wrong, you really need a lot of evidence behind you to take that argument forward, don't you? So I think these kinds of exercises are high value, with a high quality outcome. Moritz, did you actually make any recommendations going forward? Moritz Müller 21:52 We didn't make any recommendations regarding thresholds, because I think this is probably more up to the community or the RSSAC caucus.
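A quick back-of-the-envelope shows why five nines is such a high bar for a platform this small. Assuming the parameters mentioned in the conversation, 20 vantage points probing every five minutes over a 30-day month, the monthly failure budget is tiny. This is an illustration only, not the actual RSSAC 047 aggregation formula.

```python
# Failure budget under a "five nines" monthly availability threshold,
# assuming 20 vantage points and one probe every 5 minutes (per letter
# and transport); illustrative, not the RSSAC 047 formula itself.
vantage_points = 20
probes_per_hour = 60 // 5          # one probe every five minutes
hours_per_month = 24 * 30

total_probes = vantage_points * probes_per_hour * hours_per_month
threshold = 0.99999                # five nines

allowed_failures = total_probes * (1 - threshold)
print(total_probes)                # 172800 probes per month
print(round(allowed_failures, 2))  # 1.73 failed probes allowed
```

In other words, a single flaky vantage point, or a couple of dropped UDP packets somewhere on the path, is enough to push a letter or the whole system below the threshold for the month, which is exactly the failure mode the study traced back to the measurement nodes themselves.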
We did make a recommendation regarding the publication delay of root zones, because, as I've mentioned, the initial report showed that there were no problems with the publication delay for the root server system as a whole. However, there were public reports as well that one of the root servers did not publish the root zones on time on multiple occasions in one certain month. So we looked a bit deeper into that, and we found that the initial implementation actually did measure that this certain root server did not publish the root zone on time. However, the metric that was being used, and that aggregated the measurements, did not reflect this, because it was using a quite simple way to calculate this. And with this calculation, root servers could have a very high publication delay on many occasions and would still perform adequately. And so there we suggested to maybe use a different way of calculating this metric. However, it's probably still up to the RSSAC Caucus to decide; maybe this was on purpose in the end. But I think we showed that the measurement system in this case was reliable enough, and maybe the metric could be fine-tuned in order to show these kinds of large delays. Duane Wessels 23:23 I do want to say that root server system publication latency is really kind of a tricky thing, because, you know, on the one hand, absolutely, you want things to be updated as quickly as possible. That's the best possible outcome, of course. But it's also OK if things are not instantly updated, right? I mean, the zone changes very slowly, very deliberately. If a root server is delayed by an hour, 12 hours even, you know, sometimes days, [George: yeah] in most cases things still continue to work, right? It's not the end of the world. George Michaelson 23:59 Yeah, when we think about defined behaviors like key rollover, the raise-signal bit transaction, where you turn on "we're about to do this".
It isn't a five-minute signal, is it, Duane? It's weeks. There's a week or more of run time for people to cohere to the existence of this new state. Duane Wessels 24:17 In the DNS, right, there's the SOA record that has a parameter which says how long you can continue to serve a zone that you've been given as a secondary name server, until it expires. And for the root zone, that is set to seven days; that's a week. George Michaelson 24:31 But set against that would be, if I for a moment put myself in Moritz's shoes: if .NL had some catastrophic problem, and I'm touching wood in every direction, but if they had a need to change records or other parameters or NS listings against their data in the zone, in a strict sense, that's a change of the zone, right? And so they're looking at their lords and masters directing stability of the .NL namespace, asking, why isn't this visible globally in the Internet, like that? And the technical response is, well, it kind of is defined that it doesn't have to be. It's somewhat of a mismatch with some public expectations, isn't it? Duane Wessels 25:11 Yeah, absolutely. That's why I said it's tricky. You know, on the one hand, you definitely want the system to be resilient. You know, you want a root server instance that maybe has lost contact with its upstreams to continue serving data as long as it can, if it's disconnected. But as you said, there are also instances where you want things to propagate quickly, maybe in an emergency situation. So it's hard to find the right balance sometimes, and especially hard to know how to reflect that in a measurement, pass/fail sort of a situation. George Michaelson 25:42 Yeah, I like this quality in measurement, that it is ultimately both telling you and the governance circuit what's going on with the machine. But I do think it's also beneficial for public confidence. You guys aren't operating root servers in a vacuum.
You actually are trying to assess how you perform against some criteria. So it's reassuring for me as a consumer that there is a measurement exercise. It can always get better; I mean, that's kind of what this is about. The ICANN software had some problems; it was written to spec. There's room here perhaps for some alternative implementations. Do you think that could possibly emerge? Duane Wessels 26:19 That was, I think, the long-term goal of RSSAC 047, right? That, you know, first we produce the specification. Next there's what we call the initial implementation, which has been done, and we learned some lessons from that. And then the final step is a production implementation, perhaps by a contracted party or something like that. George Michaelson 26:39 So Moritz, do you as .NL, do you have a downstream view on this? Are you thinking about applying similar measures to your own governance needs in the quality of the anycast cloud for .NL? Moritz Müller 26:50 I think this is a very interesting question. So we, of course, have different monitoring systems in place to measure our own performance, but measuring externally, which RSSAC 047 tries to do, is probably a very interesting aspect to include as well. So here we already rely on the RIPE Atlas measurements. They have something they call DNSMON, which specifically measures all the TLDs and the name servers of the TLDs as well. And we look at those statistics, but we, at this moment, do not have any system similar to RSSAC 047 in place. George Michaelson 27:29 So Moritz, reading the blog post, I got the feeling that you have some ideas that might be more applicable to large, distributed systems in the wide. Do you see opportunities for people to reflect on what's going on here and apply it to other systems and other contexts? Moritz Müller 27:44 Yes, I think there are definitely some lessons to be learned, even though maybe these lessons might not be that new to people experienced with measurements.
But I think what the initial implementation of RSSAC 047 showed is that choosing only 20 vantage points for measuring a system with maybe 2000 different sites or instances is probably not sufficient to get a clear picture, for several reasons. First of all, it's impossible to measure all the different sites, so it's also probably a bit harder to say something about the quality of the measurements. And second, the measurement results can be quite heavily influenced by the performance of the vantage points themselves. So I think we make the recommendation to look closer into the performance of the different measurement vantage points, by monitoring them as well, and maybe to increase the number of vantage points. George Michaelson 28:37 It's turtles all the way down. I no longer have obligations in public service delivery for APNIC, but I remember when we were trying to do liveness checks and up checks, they were incredibly affected by site selection, and people would drive really hard to green lights on the dashboard. But there would always be one point of measurement, perhaps in India or in China, behind adverse conditions. And as a system administrator, it's very easy to say, oh well, that's a very difficult place to deliver service to. But then you sit back and you think, wow, there's a billion people in that economy, and if it's saying it has unreliable access to our services, we really have to think about that. So I think the site selection thing is absolutely worth exploring. Duane Wessels 29:24 Yeah, I think so as well. Like I said before, we did spend a lot of time on that in producing the document. I suspect that if another RSSAC work party forms to revise that document, we're going to spend even more time on it. You know, maybe we'll have some new insights at this point, and maybe we'll come up with something different. George Michaelson 29:41 So the original RSSAC 047 and your outcomes report, these are available online,
Moritz? Moritz Müller 29:49 Yes, that's correct. So we did publish a short blog post on APNIC and on our own website, and we also made the report publicly available. George Michaelson 29:59 Oh, that's really great. I'll make sure that links to those are put online along with the blog post for this podcast. Thanks very much, both of you, for coming on the program. It's been really interesting. Duane Wessels 30:10 Yes, thank you, George. It's been great. Moritz Müller 30:11 Thank you. George Michaelson 30:13 If you've got a story or research to share here on Ping, why not get in contact by email to ping@apnic.net or via the APNIC social media channels. Also remember that the measurement@apnic.net mailing list on Orbit is there to discuss and share relevant collaborative opportunities, grants and funding opportunities, jobs and graduate placings, or to seek feedback from the community on your own measurement projects. Be sure to check out the APNIC website for all your resource and community needs. Until next time!