Emile Aben 0:00 This started as a conversation about we have, we really have this bias. If you just count BGP peers, we see this upstream in 50% of the peers, and 50% we see another upstream for a particular network. That tells you these two upstreams are used, but not their relative value within quotation marks, because it's not random sampling. AS hegemony. Started as a conversation around, how can you undo this bias that we have in our system? In graph theory, you have this concept of betweenness. So you have any two points in a graph, if there's a point that's more in the middle, how between is that point between all of the other ones, like an important say you have a star topology where everything is connected via a very big central node. Then that central node will have a betweenness of 100% George Michaelson 1:02 you're listening to ping, a podcast by APNIC discussing all things related to measuring the Internet. I'm your host, George Michaelson, this time, I'm talking to Emile Aben from the RIPE NCC about his approaches to measurement and managing the data in his research, Emile has access to a wealth of raw inputs from the RIS BGP collectors and the Atlas system deployed by the RIPE NCC worldwide. Unfortunately, the distribution of the data collection points isn't even and without some careful thought, it can bring distortions to an analysis. Emile and his collaborator Romain Fontugne from IIJ research labs in Japan, have worked on methods adjusting for the asymmetries to get a clearer picture of what's going on with routing changes. Emile has also been exploring ad hoc analysis with his co workers in RIPE. he has some novel visualizations and newer approaches to holding and processing data for analysis, and that makes this ad hoc analysis easier in the face of the ever increasing volumes of data being collected. Emile, Emile Aben 2:15 yes, George Michaelson 2:16 welcome to ping. Emile Aben 2:17 Thank you. 
George Michaelson 2:18 So could you briefly, in one or two sentences, introduce yourself to the audience? Emile Aben 2:25 I'm Emile Aben. I work at the RIPE NCC as a data scientist. My topics of interest are how we can actually use Internet measurements to make the Internet better: what really useful insights can we get out of this vast trove of data that we're collecting? George Michaelson 2:42 So you've been working at the NCC in a research role, and before that, you were working with CAIDA in the USA. Emile Aben 2:49 Yes, I was working in San Diego. That's actually how I started on this topic. [George: Yeah]. I came in as a data administrator there and got into it. Originally, I'm a chemist. George Michaelson 2:57 That's a hell of a bridge. Emile Aben 2:58 There's nice analogies. I actually worked on mushrooms, like mycelia, which is like, George Michaelson 3:04 when you think about some of the visualizations of connectivity in the Internet, [Emile: Yes], these are the images that come to mind. Exactly, exactly. [Emile: Yes]. So this is a conversation about visualization, data analysis, looking at BGP and Internet technology. But you kind of hinted at the "use" question there. And it feels like the fundamental point here is, what can we do to improve resiliency, robustness in the Internet in general? Emile Aben 3:31 And there's quite a lot of what I think of as unsolved problems, where we would like to kick the ball forward a little bit. We're sort of aware there's a physical layer, right?
[George: Yeah], cables, routers, and then there's what we see in Internet measurements, which is ASes and IP addresses, George Michaelson 3:49 Which is in the BGP system, Emile Aben 3:51 in BGP, or in traceroutes, or in pings, [George: yeah]. Pings are fairly easy. Whenever there's a path, like in BGP, or in a traceroute, where there's a sequence of IP addresses, the locality of these, or what types of infrastructure they're using, well, there's a lot of unsolved problems there. Take IP geolocation, for instance: I would consider the edge pretty much solved by the geoloc providers, for the most part, but then if you go into the core of the network, there's still a lot of guesswork. If you hit an IXP, for instance, you see the IXP LAN. You get hints out of host names for some. George Michaelson 4:30 We're hunting in hints. We're hunting in the domain names, looking at things like airport codes, trying to infer location, because we don't have strong positive signals, declarative signals about real world location. So this mesh of how the physical layer interacts with routing is kind of right down at the pointy end of this problem, isn't it? [Emile: Yes]. There's a concept that you have discussed several times with me that also relates to your reasons for doing a bit of a research journey. At the moment, you're en route here at the APNIC offices, but you're heading to IIJ research lab. Emile Aben 5:08 I have a strong collaboration there, and this is on the Internet Health Report project, [George: right], which is also using a lot of data that we're collecting at the RIPE NCC, with RIS, our Routing Information Service, and with RIPE Atlas, to make a system on top of that. And out of this collaboration, there's a lot of things that already came to unravel pieces of the puzzle. George Michaelson 5:31 So IIJ labs, that's a small research group based in Tokyo. [Emile: Yes].
Kenjiro Cho is the chief scientist there, [Emile: Yes]. And you're working Emile Aben 5:40 with Romain, Romain Fontugne, who is, I think he's deputy head right right now, and phenomenal collaborator on on these types of topics. George Michaelson 5:50 A problem, the way it's been described to me, is this idea inside BGP, the relationships between BGP speakers forms this thing that some people call the customer cone, and you guys have been calling AS hegemony. Could you talk a little bit about that? Emile Aben 6:08 Well, I think it's fair to say this started as a conversation between me and Romain on the bias in BGP data, because, like, what we do in RIS is we have collection points, and it's a very nice set of collection points, and we're very grateful for the people feeding data into it. And one downside of the way we collect it is that it's like, it's an opportunistic sampling of the Internet, right? We get the data feeds that we can get. It's not like, if a statistician would want to have, like, a random sampling, but this is not a random sampling. George Michaelson 6:39 This is a self selecting community of people prepared to make data available, Emile Aben 6:45 yes, or as Randy Bush calls it, this is the clue core that you're measuring, George Michaelson 6:49 enough clue to actually be prepared make information available. Emile Aben 6:53 Yes, that's one of the biases. And there's geographic bias, and there's types of networks who would like to peer with you, who have, like, incentives or disincentives to peer, cultural differences, all that there's a large fascinating group now, that's the personal use ASNs. That's like individuals who want to learn more about the Internet, they would like to feed all of their data, but typically they have the same upstream. So we have, like, an over sampling of these networks currently. So that's just an example of the biases you get trying to measure a complex system like the Internet. 
And this started as a conversation about, hmm, we have, we really have this bias. If you just count BGP peers, we see this upstream in 50% of the peers, and 50% we see another upstream for a particular network. That tells you these two upstreams are used, but not their relative value within quotation marks, because it's not random sampling. AS hegemony started as a conversation around, how can you undo this bias that we have in our system? Aha, in graph theory, you have this concept of betweenness, so you have any two points in a graph, if there's a point that's more in the middle, how between is that point between all of the other ones, like an important say, you have a star topology where everything is connected via a very big central node, then that central node will have a betweenness of 100% right? George Michaelson 8:19 Because everything has to pass through. Everything has communicates between any other two points. Emile Aben 8:24 So that's like this. Betweenness is sort of like very common thing used in graph theory, [George: right] or in analyzing graphs. But if you do this for the BGP graph, because of the bias, it is very biased if you do that, and AS hegemony is basically a way of countering the bias. So we can sort of have a betweenness metric. So we have for each of the nodes, we have a number saying, This is how important it is for the connectivity of other nodes. It can also quantify the importance of nodes for each of the edge nodes. George Michaelson 9:00 So if we consider nodes in BGP as being stub nodes or transit nodes, and the nodes that are both a bit stubby and a bit transit-y for the stub nodes, it's really very clear to understand their role in the connectedness. They have very high dependency on someone else. Emile Aben 9:20 These are typically not hard, right? You have, like, if I have a stub with, like, one upstream, then it's like, it's abundantly clear that this nodes needs, like, 100% dependent on the upstream. 
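The star-topology example can be checked with a few lines of code. The following is a minimal sketch, assuming an unweighted, undirected graph stored as a plain dictionary; the node names and the `betweenness` helper are hypothetical and are not taken from the AS hegemony work. It only shows the classic graph-theory notion: the share of shortest paths between all other node pairs that pass through a given node.

```python
from collections import deque
from itertools import combinations


def bfs_counts(graph, source):
    """Return (distance, number of shortest paths) from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                sigma[neighbour] = 0
                queue.append(neighbour)
            if dist[neighbour] == dist[node] + 1:
                sigma[neighbour] += sigma[node]
    return dist, sigma


def betweenness(graph, node):
    """Share of shortest paths between all other node pairs that pass through `node`."""
    info = {n: bfs_counts(graph, n) for n in graph}
    dist_v, sigma_v = info[node]
    others = [n for n in graph if n != node]
    total, pairs = 0.0, 0
    for s, t in combinations(others, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue  # unreachable pair, nothing to count
        pairs += 1
        on_shortest_path = (
            node in dist_s and t in dist_v
            and dist_s[node] + dist_v[t] == dist_s[t]
        )
        if on_shortest_path:
            total += sigma_s[node] * sigma_v[t] / sigma_s[t]
    return total / pairs if pairs else 0.0


# Hypothetical star topology: every leaf connects only to the hub.
star = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"],
}

print(betweenness(star, "hub"))  # 1.0: every leaf-to-leaf path crosses the hub
print(betweenness(star, "a"))    # 0.0: a leaf is never in between
```

The same reasoning covers the single-homed stub mentioned above: with one upstream, every path to the stub crosses that upstream, so the dependency is 100%.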
But that becomes less the more connected the node is, like, oh, it has two transits, or it had now, it's like it has two transits and IXP George Michaelson 9:40 or it has one or two transits, but it's connected to an exchange point and is picking up a large amount of no transit connectivity in the city. Then it becomes more complicated. Emile Aben 9:50 Yeah, there's all kinds of like where, depending on if we have viewpoints and RIS or not, your view gets distorted, [George: right] Just like with vision. I. Just look at a scene, and I see the things that are close to me, but things far away, they're less clear. And in RIS, if you just take the concept of like BGP is an information hiding protocol, the further away you are, in AS hops, you will probably see transit type things or things that are close by. You see a fuller picture, but you also are not George Michaelson 10:20 so in RIS, you have more information and more certainty from close objects, but things that are further away in path terms from the RIS collector, [Emile: yeah], you have less detailed knowledge Emile Aben 10:31 about them, yeah. So BGP is an information hiding protocol, so you only see the best path, the George Michaelson 10:37 best path to you, but someone else measuring from another vantage point. Might have seen a different best path. Emile Aben 10:44 Yeah, like we're only collecting the things that we see in RIS. We are not collecting the things we're not seeing right George Michaelson 10:49 RIPE is providing the data. Romain implemented a weighting function using a statistical approach or a mathematical approach. How would you say that this is done. Is it a statistical weighting function? Emile Aben 11:02 I wouldn't call it a weighting function. It's more of like it sort of like a smart way of averaging. It takes out the extremes on either side of a distribution George Michaelson 11:12 problem with RIS, it's self selecting people. 
Your points of measurement aren't necessarily reflecting the real distribution of network relationships. The hegemony method takes out the outliers and leaves you with a more valid core set of data. [Emile: yeah], it gets rid of the extreme cases, and it makes it easier for you to use a biased sample set and work. Emile Aben 11:33 Yeah, it's an unbiasing. And I've seen examples of raw RIS data being used for AS relationships. And then, for instance, because we have a peer in a particular African country, it almost looks like that country becomes the core of the African Internet, whereas, if you ask people in the know, they're pretty sure that this country is not at the core of the African Internet. George Michaelson 11:57 This idea, AS hegemony: it's a technique that was developed quite a while ago. There is a paper that you worked on with Romain in Emile Aben 12:07 Yeah, yeah, I think it was published at PAM quite some years ago. [George: Yes, we'll put it in the blog. But can you just give us, like, a brief summary of what this model tries to do?] Well, it tries to take care of the outliers, right? It just removes the bias that is in RIS, because we have certain peers in certain countries, and this tries to get more unbiased data. As a computer scientist, you might be able to work this out yourself. But if you also want to go across science disciplines into economic data, then you need understandable data sets on which networks are of importance for a certain economy, I should say, right? For a certain economy, AS hegemony is a great way of getting rid of all of the bias and outliers, right? George Michaelson 12:53 So if you were collecting data on things like wheat markets, and you just happen to have an overwhelming amount of data about shipment of wheat in the Mediterranean.
You might be led to believe Mediterranean wheat trading was the center of the wheat universe, but it actually isn't, because there is a huge amount of wheat that's been shipped in Canada and Australia. It's simply that you're measuring it too much. So in Internet terms, for instance, I think you've said if you had a point of collection in RIS in Africa, it might give you a bias in beliefs about the importance of that economy. Emile Aben 13:28 I've actually seen that happen that we're lucky to have a peer in one of the African landlocked countries. People have used data from that specific peer. And if you don't take the bias out with a system like AS hegemony, it will appear like this particular country is the center of the Internet in Africa. George Michaelson 13:44 But if you talk to people who really know, Emile Aben 13:46 To the people in the know of the African Internet, they say, no, that's just an artifact of the raw data. But if you use something like AS hegemony, you get rid of that, that bias. So that's one of the examples of where such summary statistics instead of raw data, makes total sense. George Michaelson 14:04 So you've been looking particularly at visualizations of this, and you were showing me something earlier on today. Little hard to do diagrams on radio, so people are going to have to look at the links on the blog. But could you talk a little bit about how you typically visualize this kind of data, Emile Aben 14:21 say you have a particular country, and you want to understand which are the important nodes for interconnectivity to the rest of the Internet. 
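The "smart way of averaging" that takes out the extremes on either side can be illustrated with a trimmed mean over per-peer scores. This is a minimal sketch of that one idea only, not the published AS hegemony algorithm: the 10% trim, the `path_share` and `trimmed_mean` helpers, and the toy AS paths (built from documentation-range AS numbers) are all assumptions made for illustration.

```python
TRANSIT = 64500  # hypothetical transit AS (documentation-range AS number)


def path_share(paths, asn):
    """Fraction of one peer's AS paths that contain `asn`."""
    if not paths:
        return 0.0
    return sum(1 for path in paths if asn in path) / len(paths)


def trimmed_mean(values, trim=0.1):
    """Average after discarding the lowest and highest `trim` fraction of values."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values dropped on each side
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def make_paths(n_via, n_total=4):
    """Toy AS paths for one peer: n_via of n_total paths cross TRANSIT."""
    via = [(64496, TRANSIT, 64511)] * n_via
    other = [(64496, 64497, 64511)] * (n_total - n_via)
    return via + other


# Ten hypothetical peers: eight see TRANSIT on a quarter of their paths,
# one never sees it, and one sits directly behind it and always sees it.
peers = {f"peer{i}": make_paths(1) for i in range(1, 9)}
peers["peer0"] = make_paths(0)
peers["peer9"] = make_paths(4)

shares = [path_share(paths, TRANSIT) for paths in peers.values()]

print(sum(shares) / len(shares))  # 0.3: plain mean, pulled up by the one peer behind TRANSIT
print(trimmed_mean(shares))       # 0.25: extremes on both sides removed
```

The single peer sitting behind the transit AS is the toy equivalent of the one well-placed collector peer that makes a country look like the core of a region in raw data.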
What you can then do is, there's a flavor of AS hegemony that just looks at the resources in a country and then tries to figure out, for each AS, how relevant that AS is for all of the resources in the particular country. So you get a score per AS, and you can visualize these scores as circles: each AS is a circle of a certain size, and the more important, the bigger the circle. George Michaelson 14:59 You put it into a mesh drawing system, and say, you figure out how to balance out the relative sizes and connectivities to have the least number of overlapping lines, and then the relative size of the nodes is an indication of their relative importance. Emile Aben 15:16 I've been experimenting with a 3D version of this. If you just do a force-directed graph, you have the nodes and edges of a graph, and you have these force-directed drawing libraries, and you just place them randomly and let the forces act, like the nodes are attached by springs. George Michaelson 15:34 And the strength of the spring is a function of how important the node is. So the balancing of the forces in this network, Emile Aben 15:41 these things will eventually balance. There's other ways: for instance, the CAIDA AS core poster uses the AS Rank importance function for where it draws all of these. George Michaelson 15:51 So there are other methods that people use for displaying AS connectivity, graph placement models that work in the browser. So is this like a JavaScript library running in your browser when you do this visualization? Emile Aben 16:05 There are d3 force graph type things. A former colleague of mine, Vasco Asturiano, did some great work on d3-based things that you could actually use to do all of this.
As an example, I did a lightning talk at the last RIPE meeting, and I included some of this. This is very experimental, a 3D visualization of these AS graphs. And I'm still not sure whether this 3D works or is a gimmick. Maybe 3D gives you more space to place the nodes, because what you actually see is that this is a very dense graph, and if you put it in 2D you'll just get a beautiful mess. Well, the whole Internet is a beautiful mess, but that's a different conversation. George Michaelson 16:46 This allows you to maybe hone in and look at particular sub parts without having them occluded by other components. You can kind of move around in your model. Emile Aben 16:55 Yeah. For this lightning talk, I chose Spain and Portugal, the Iberian Peninsula, during the power outage that they had there. What I wanted to express is just the complexity of the Portuguese network and the Spanish network next to each other. And you can see the Portuguese one is a bit simpler. It's a smaller country, of course, and the Spanish one is more complex. And if you have them side by side, you really see the differences in their complexities. George Michaelson 17:20 So we might come back to that specific one in a few minutes, because I think you want to talk about some of these visualization toolkits. But can we just talk a little bit about the other half of this question space? Because, as well as RIS and the AS hegemony, you've been looking at ways of using the Atlas anchors to do some measurements as well. Could you talk a little bit about that? Just to be specific, what is an anchor? Emile Aben 17:44 Okay, yes. So we have RIPE Atlas, one of the biggest measurement platforms out there. We have over 13,000 nodes, vantage points, which you can measure with. And we have a bit over 800 of these which we call RIPE Atlas anchors.
And so they're the stable points in these topologies, because probes are typically in homes, George Michaelson 18:07 yeah, I have a probe at home, [Emile: yeah], and it comes and goes as I unplug it, put the vacuum cleaner on, [Emile: yeah]. I'm one of many, [Emile: yeah]. But an anchor tends to be in a machine room, in a rack, in a reliable, stable location, [Emile: yeah], with quite solid connectivity. So the anchors, when you use them, how are you typically using them to perform a measurement? What are you doing here? Emile Aben 18:32 One of the nice things about these anchors is they operate in a mesh measurement. Every time an anchor gets added to the network, it will be put in this mesh, basically meaning that for certain types of measurements, pings, traceroutes, etc, it will start performing these towards all of the other anchors in IPv4 and IPv6. George Michaelson 18:50 So there's a fully connected mutual measurement. If I'm an anchor and you're an anchor, I measure you and you measure me, yes, whereas if I'm an Atlas edge probe, I will get recruited into someone's random measurement and I might be pinging a root server and maybe one or two anchors, but nobody's necessarily measuring me. It depends. With the anchors, you guarantee Emile Aben 19:10 the probes are sources of measurements only, and the anchors are sources and destinations. So you have a pretty reliable geolocation of where these things are at both ends, because if you just measure towards an IP address, again, IP geolocation not being a solved problem, you don't know exactly where the other point is. With the anchors, if I have 15 anchors in Sweden and I have five in Latvia, I know that if something between Sweden and Latvia goes wrong, for instance, if a submarine cable gets cut, my vantage points will have measured this.
And it's also like, a lot of times you hear about something happening on the Internet, and then you start to look: hey, what types of measurements have we, as a collective of things that measure the Internet, actually collected here? George Michaelson 19:59 That's quite topical, because recently connectivity in the Baltic has been cut. We don't need to go into the geopolitics of the situation. Emile Aben 20:07 The question on intentionality is actually quite interesting, because you saw a lot of press on one side saying, ah, this was sabotage, there's something going on. And then, on the other hand, you have the cable industry saying, yeah, but we have cable cuts all the time. It's just that, due to the current geopolitical climate, this has become very visible. I don't know where the answer is, and our data is not going to tell you where the answer is in this, but what we do have is neutral data on the effects. George Michaelson 20:37 So if you knew before this happened that Latvia and Estonia, or Estonia and Finland, had a certain delay and a certain connectivity, a certain reachability and path, [Emile: yeah], you have some certainty from neutral measurement using the anchors. How did that get changed? Emile Aben 20:55 Yes, and that's basically what I've been looking into for these. We know when there was an outage of certain cables, so, say, the C-Lion cable, which runs from Finland to Germany. And we have vantage points on either side of these, where we are pretty sure they're on either side. And we have latency measurements that are sort of like a baseline, a baseline latency. Say it's usually 25 milliseconds to go from one place to another place. Occasionally you see some spikes and you see some noise, because the Internet is a remarkably noisy place, a beautiful mess again.
George Michaelson 21:30 But when one of these events happens, Emile Aben 21:32 when one of the events happens, then you actually see a level shift for some of the paths between our vantage points, not for all of them, but the fact that we actually see this level shift happening in latency says, yes, we are actually collecting this event. And this is just looking at the latency data that we collect with RIPE Atlas. The other part of it is the loss data, right? You have a stable state of loss over time. Say, ideally, the Internet is heavily over-provisioned, so you have no loss at all. George Michaelson 22:06 But the reality is, there's a certain level of loss we have come to accept and expect. Emile Aben 22:11 Yes, or you see paths that have this diurnal pattern of loss, and there's all kinds of beautiful variations of this. But for instance, what we saw for the C-Lion cable cut: we see the latency shift between some of our nodes, but not all of them, but we do not see a packet loss shift, which I interpret as there's no George Michaelson 22:33 There was no capacity issue. Emile Aben 22:35 There's no capacity issues, because if you see structural changes in packet loss, that probably means that your links get saturated. And the conclusion I draw from this is that, in this case, the Internet routed around damage. George Michaelson 22:49 There was sufficient capacity available on alternate paths. [Emile: yeah], even though delay may have risen, loss over the path didn't rise. Emile Aben 22:57 Yeah. And also, take the highway system. There's like five highways between point A and point B. One highway has road works, and the other ones don't have traffic jams. That's kind of the situation in the Baltic. Whereas if you would have road works on one of the paths, and then you had huge delays on all of the other available paths.
You know there was a capacity problem. In this case, as far as we can observe with RIPE Atlas, of course, we did not see these capacity problems, because there was no packet loss, no traffic jam equivalent on the Internet. There was an earlier event that we later started analyzing, where three or four cables off the coast of Africa, there was a landslide, George Michaelson 23:41 so quite a lot of loss in the system, rather than a single point loss. Emile Aben 23:45 Yes, and if you then look at the anchors between South Africa and the UK, or Great Britain, their traffic must have gone through one of these cable systems, as far as I understand the Internet there, at the point where there's an underground landslide that took out these four submarine cables. George Michaelson 24:03 And that was quite a catastrophic event. Emile Aben 24:05 That was quite a catastrophic event indeed, because what you see there is latency increases, but also a lot of packet loss over these paths, which indicates that there was a capacity problem on the links that were used between South Africa and the UK, which is roughly the cable paths that are followed there. George Michaelson 24:26 Anchors are more stable, more reliable, but they're also a smaller set: you say 13,000 probes and about 800 anchors. [Emile: Yes]. This is something you'd like to see more investment in, Emile Aben 24:37 yes, and especially if we want to be able to measure the resiliency of the Internet in certain places, again, coming back to the mapping between the physical layer and the Internet we actually observe in our data. This link is rather complicated and changes all the time. Questions of how resilient is the Internet are either
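One way to separate the occasional spike from a sustained change in the baseline is sketched below. It is a minimal illustration under stated assumptions, not the analysis method used on the RIPE Atlas data: the `find_level_shift` helper, the window of five samples, the 5 ms threshold, and the round-trip times are all hypothetical. The baseline is the median of everything seen so far, which a single spike barely moves, and a shift is reported only when a whole window of samples departs from it.

```python
from statistics import median


def find_level_shift(rtts, window=5, threshold_ms=5.0):
    """Return the first index where `window` consecutive samples all differ
    from the median of the earlier samples by more than `threshold_ms`,
    or None if no such sustained shift exists."""
    for i in range(window, len(rtts) - window + 1):
        baseline = median(rtts[:i])
        if all(abs(x - baseline) > threshold_ms for x in rtts[i:i + window]):
            return i
    return None


# Hypothetical round-trip times in milliseconds between two anchors:
# a baseline of about 25 ms, one isolated spike, then a sustained shift to about 33 ms.
rtts = [25.0, 25.4, 24.8, 25.1, 25.2, 24.9, 25.3, 60.0,
        25.0, 25.1, 33.0, 33.2, 32.9, 33.1, 33.0, 33.3]

print(find_level_shift(rtts))       # 10: the spike at index 7 is ignored
print(find_level_shift(rtts[:10]))  # None: noise and a spike, but no sustained shift
```

Requiring every sample in the window to deviate is a deliberately strict choice; a noisier path would need a more tolerant rule, for example comparing window medians.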
Super trivial or, well, I guess I'm not an expert in the Pacific, but if I just look at the cables, if there's a single cable going into an island, just not looking at satellite type connectivity, which is an up and coming thing that's not able to carry the capacity that the cable has if you cut the cable, or if the cable has an outage, or if there's a volcano, it's trivial, right? There you go from 100% capacity to zero capacity. So then your resilience question is trivial. One cable, not very resilient. George Michaelson 25:32 If you look at a situation like the Taiwan Straits or the Straits of Luzon, there's much path diversity, but the actual physical component, it's a relatively small space. So although you might have apparent path diversity and apparent resiliency because of different bearers, they may all come down to the weakness you just mentioned in Africa, a single landslip might actually expose that you have less path diversity in the physical layer than you think you have. Emile Aben 26:01 Yeah? And because we don't know which carriers have, what traffic over, what cable and George Michaelson 26:07 Or what alternates they have, Emile Aben 26:08 what alternates they have are these protected or non protected path and all of that. It's very, very complex. George Michaelson 26:15 More anchors in Asia, in the Asia Pacific, would be beneficial, Emile Aben 26:18 yeah, because of this complexity you can only for now my conclusion is you can only measure this after the fact, like, if the cable has a problem, then you can measure what the effect is of this type of problem. It's not something you can predict. There's not we have a perfect world where we know all of this. So if this cable goes out, we will know what happens. I don't think anybody has this knowledge. George Michaelson 26:44 Perhaps that's something we should be aiming for in using these systems. Can we have a predictive component towards models of resiliency? Emile Aben 26:52 Yes. 
And what would these models look like, actually? So how would you properly model this under? What assumptions can you make so you can make this very complex problem, tractable? George Michaelson 27:02 Yeah, so you mentioned earlier that you looked in particular at the power outage on the Iberian Peninsula. And I think this might be an opportunity to talk a little bit about a technology stack that's been emerging. I think you said that you've got started to look at different ways of building tools and techniques. [Emile: Yeah], that are kind of about ad hoc measurement. Could you talk a bit about Emile Aben 27:27 that? Yes, so we had our ride meeting in Portugal, in Lisbon last May, and not that long before that, there was like a power outage, which cut for the Iberia Peninsula, right, Portugal and Spain. George Michaelson 27:41 And it kind of also crept up into parts of southern France and into Europe in general, because of the interconnected interdependencies of power, the electricity power network in Europe, but principally it was the Iberian Peninsula that just went black. Emile Aben 27:57 Yes, I got asked the question, what can we actually see about this in data. And first I looked at the Atlas anchor mesh. Again, we see some effects of this. Like, basically, conclusion of the analysis I did was you have like, three types of connectivity, three parts of the Internet to this. There's, like, this part of the Internet that goes out, which is typically the edge, [George: yeah], so like end users, there was no Internet, and there's parts that have, like backup power that's really last longer, but also didn't last until the power came back. And there's just these things that are stable as a rock. 
George Michaelson 28:30 So the data centers have DC power, and they have generators and electricity, but the consumer side, their telecommunications infrastructure, if it's old enough, has huge lead acid battery stacks in the phone exchanges, but they're not equal to an outage of this duration, and none of the homes had it reliable, independent power, Emile Aben 28:52 if you just look at RIS data. So the routing data, a lot of like Spain and Portugal, was online. So the Internet was connected. It's the Internet is a network of interconnected networks, and that part of the Internet was still functioning. George Michaelson 29:04 The connection part the BGP speakers were all in facilities that maintained independent Power. Emile Aben 29:11 if you just look at the outside, there was like the majority of the Spanish and Portuguese Internet was online, even through most of the outage. A very strange story to be telling, George Michaelson 29:23 yeah, because for the experience of any average Portuguese or Spanish person, Emile Aben 29:27 they say, We have no Internet. George Michaelson 29:29 You could not use pay wave, you could not touch the pay you could not make phone calls, you could not watch TV, you could not do anything, [Emile: Yeah],, but BGP was going along quite nicely, Emile Aben 29:40 yeah, this is also like looking at the Atlas anchor mesh. For a long time, we saw like there was the outage happening, and you basically hardly see it on the there's no latency shifts or loss shifts. The interconnected Internet measured there, which just worked. And then after a couple of hours, you see things starting to deteriorate. Right? That's also interesting, because what I used for this was the anchors in Spain and Portugal in a single mesh. What do we actually see in the country? The problem there is that your measurement infrastructure is part of the phenomenon that you're measuring. 
So if power goes out, your vantage points go out, and that's like, on the one hand, it's very obvious, but it also like you're measuring a phenomenon from things that are affected by the phenomenon. George Michaelson 30:27 So it's a bit meta. The measurement itself starts to change as a function of the situation it's measuring. Emile Aben 30:34 Yes, that's why the BGP data is the more obvious thing to look at, because that just has like this part of the Internet connected, and how is it connected? On the AS level George Michaelson 30:46 RIPE have been working on making tools to make it a bit easier for people to do this kind of ad hoc analysis. Emile Aben 30:53 So we're working towards that. What you have for RIS, if you go into the nitty gritty of actually wanting to consume all of that data we have this the MRT format, and we have data in the MRT format, and there's a handful parsers and George Michaelson 31:08 not the easiest data format in the world to work with. Emile Aben 31:12 The nice thing is, it's standardized, but it's not the easiest data formats in the world to work with. So one thing that my colleague, Ties de Kok is experimenting with easier ways of access to this data, and like data science has moved on, and this is maybe surprisingly simple, or there's this parquet data format, just sort of like a column or storage that's from what we've been looking at is very, very well suited for the types of ad hoc analysis you would want to do on BGP data. George Michaelson 31:45 So Ties is now looking at constructing services for researchers exploiting this more tractable way of holding and looking at data to make it easier for people to go into the data and find things of interest during events that are a bit more ad hoc, and perhaps questions being asked without any prior ability to construct an experiment. Emile Aben 32:09 Yeah. 
And the amount of experiments you can do in a certain amount of time increases the more you improve the ergonomics of your data. This is all about time to insight, right? If you're collecting a vast amount of data, well, the more data you're collecting, you're actually increasing, not decreasing, your time to insight. What you're hoping to get with more data is a more comprehensive insight, but in reality, you're only adding a little bit of that; you're mostly adding to the amount of compute time that you need to come to an answer. And George Michaelson 32:48 It's kind of counterintuitive. More data should mean better answers, but it winds up meaning more work to find the information, to construct the answer. Emile Aben 32:57 And it's worse. It's not only more work, but the more data you collect, the more outliers there are going to be, and the outliers are George Michaelson 33:05 So this is quite similar to the problem with AS hegemony. You have to understand how to filter your data, to get rid of the outliers, to have a reliable data set to work on. So will Ties be presenting on this? Emile Aben 33:19 We're testing this, and yes, once we're happy with the format of this. But what I want to go back to is that for the Iberian Peninsula situation, I was asked the question a week before the RIPE meeting: what do you guys actually see in RIS for this? And I knew Ties had an early prototype of this ready, so I was able to test a lot of hypotheses on what we could actually see in this data. Is it this? And with these types of ergonomic tools, you can just test a lot of hypotheses, like, is this actually visible in this way, or is it that way? And actually, in the presentation, I started looking at using BGP communities that signal that a route was picked up in Spain via one of the tier ones. We didn't have any tooling for that ready.
But because this data is formatted in such a way that you can easily access all of the nitty-gritty and do all kinds of data transformations, George Michaelson 34:17 you were able to do an ad hoc development to make a lightweight tool to summarize that BGP community information. [Emile: Yes], it was easier to construct a purposeful test of the hypothesis because the data was in a more tractable form. Emile Aben 34:32 Yes, so that's the key there. And Parquet files, they're not rocket science. It's just one of the columnar storage formats currently available, but having that as a match to the data that we're using, and having the ability to easily scale things, I'm very excited about how this just decreases the amount of time you need to go from an idea of, hey, is this actually what is happening, towards George Michaelson 35:02 it's easier to see it in the data quickly so you can formulate a proper approach to it, because you don't have to scrabble around in a complicated data set. It's just easier to test your hypotheses. Emile Aben 35:15 So that's the realization: the ergonomics of your data are very important to this, especially in complex data like BGP data, or, say, traceroute data. That's another one that maybe we will be able to take a step on at a later point. You really need ergonomic tools to test hypotheses on how you could analyze it. George Michaelson 35:36 Emile, anchors are emerging as an interesting way of getting stable mesh measurements for events taking place. Are there economies in the Asia Pacific footprint where it would be helpful if you could get more anchors? Emile Aben 35:49 Yeah, definitely. Because I'm just looking at a list here of where cables are landing and crossing that with where we have RIPE Atlas anchor deployments. And for instance, I see 71 cable landings in the Philippines and only two anchors.
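The ad hoc BGP community summary described earlier in this exchange can be illustrated with a minimal sketch. It is written in plain Python so that it runs without dependencies; it only mimics a column-oriented layout and does not read Parquet or MRT files. The column names, the community value 64500:1034 and its "learned in Spain" meaning, and all sample rows are hypothetical (documentation ASNs and prefixes), not the RIS schema or any real operator's community scheme.

```python
from collections import Counter

# Hypothetical column-oriented table of BGP route observations.
# Each key is a column; position i across all columns forms one row.
routes = {
    "peer_asn": [64496, 64497, 64498, 64499],
    "prefix": ["192.0.2.0/24", "192.0.2.0/24",
               "198.51.100.0/24", "203.0.113.0/24"],
    "as_path": [[64496, 64500, 64510], [64497, 64501, 64510],
                [64498, 64500, 64511], [64499, 64502, 64511]],
    "communities": [["64500:1034"], ["64501:200"],
                    ["64500:1034", "64500:9"], []],
}

# Assumption for illustration: transit AS 64500 tags routes it
# learned in Spain with this community value.
LEARNED_IN_SPAIN = "64500:1034"


def rows_with_community(table, community):
    """Return the row indices whose communities column contains `community`."""
    return [i for i, comms in enumerate(table["communities"])
            if community in comms]


def count_by_origin(table, row_ids):
    """Count the selected rows per origin AS (last hop of the AS path)."""
    return Counter(table["as_path"][i][-1] for i in row_ids)


# Only the communities column is scanned to select rows; the other
# columns are touched just for the rows that matched.
matches = rows_with_community(routes, LEARNED_IN_SPAIN)
prefixes = sorted({routes["prefix"][i] for i in matches})
per_origin = count_by_origin(routes, matches)

print(prefixes)
print(dict(per_origin))
```

With a real columnar file the same select-then-summarize step would normally be handed to a dataframe library or query engine; the point of the sketch is only that a hypothesis such as "is this community visible on these routes?" reduces to a filter on one column plus a small aggregation.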
George Michaelson 36:04 So definitely, more anchors in the Philippines would be interesting. Emile Aben 36:08 Yeah, and it's not just about the number of anchors. We want some diversity there. If they're all in the same AS in Manila, that would not be George Michaelson 36:17 as good as if you had diverse ASes within the Philippines. Emile Aben 36:20 Yes, we're looking into how to do that. George Michaelson 36:23 And are there other economies in our region? Emile Aben 36:26 I must say we have two in the Philippines. So I'm very, very glad that we have these two. It's just that more will allow us to see a broader range of things happening. And I'm just looking through the list. China is obviously George Michaelson 36:40 always challenging, but nonetheless, it would be useful to have more in China. Emile Aben 36:44 And we have four there, but also 25 cable landings. I see here Malaysia: 20 cables, four anchors too. George Michaelson 36:51 So more anchors in Malaysia and in the Philippines would be attractive. Emile Aben 36:54 Yes, definitely. So if I get a Santa Claus, then these would be on my list. We're getting more into this, because these cable landings, that's where you can connect to countries that are not bordering your country, right? So they are very important for your physical interconnectivity, and then you want a way of seeing how the stuff on top of that actually reacts to what is there. So that's why it's important to have a diversity of measurement points on them, to see the effects of things happening. George Michaelson 37:32 Emile, that's been really great. Thank you for coming on Ping. Emile Aben 37:35 Oh, thank you. Thank you for having me. George Michaelson 37:38 If you've got a story or research to share here on ping.
Why not get in contact by email to ping@apnic.net or via the APNIC social media channels? Also remember the measurement@apnic.net mailing list on Orbit is there to discuss and share relevant collaborative opportunities, grants and funding opportunities, jobs and graduate placings, or to seek feedback from the community on your own measurement projects. Be sure to check out the APNIC website for all your resource and community needs. Until next time.
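As a worked illustration of the anchor coverage figures quoted in the conversation, the following sketch ranks the three economies mentioned by cable landings per RIPE Atlas anchor. The ratio is a hypothetical metric chosen for illustration, not one named by the speakers, and the Malaysia figure takes "20 cables, four anchors too" at face value. It also ignores the AS diversity point raised above, so it should be read as a rough first cut only.

```python
# Figures as quoted in the conversation: (cable landings, anchors).
coverage = {
    "Philippines": (71, 2),
    "China": (25, 4),
    "Malaysia": (20, 4),
}


def landings_per_anchor(landings, anchors):
    """Hypothetical coverage metric: higher means fewer anchors per landing."""
    return landings / anchors


# Rank economies from least to best covered under this metric.
ranked = sorted(coverage,
                key=lambda economy: landings_per_anchor(*coverage[economy]),
                reverse=True)

for economy in ranked:
    landings, anchors = coverage[economy]
    ratio = landings_per_anchor(landings, anchors)
    print(f"{economy}: {ratio:.2f} landings per anchor")
```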