Geoff Huston 0:00 I suppose that's why routing has been such a fascinating and such a difficult problem for a very, very long time. Because, like I said, there's no external agency drawing the map for you. There's no configuration tables. You've got to talk to each other and mutually discover how this universe is put together. And you know the idea that there's no external help, you're on your own kiddies, talk to each other and work out the topology of the network, that you all have a consistent view which local decisions bring packets closer to their destinations? Fair enough. It's the topology maintenance protocol. George Michaelson 0:37 you're listening to PING, a podcast by APNIC discussing all things related to measuring the Internet. I'm your host, George Michaelson, this time I'm talking to Geoff Huston from APNIC labs again in his regular monthly spot on PING. Geoff and I talked about his long lived activity, reviewing the behavior of BGP across the years. What's changed in a day of BGP? What's the current size in all the dimensions we measure BGP. Normally, Geoff is looking at aggregate behavior spread out through the year, but this time, he's focusing on the micro mechanics of a single day's events. Where is all the noise coming from, and what is the cost of processing? Do we actually have a widespread problem with excessive chatter in BGP? And is it something everybody does, or is it just confined to a few BGP speakers whose traffic necessarily propagates out into the world. Geoff, welcome to PING. What should we talk about this time? Geoff Huston 1:51 Well, thank you, George. Look, I've spent probably more time than I should wandering in the byways and labyrinthine maze of the domain name system. So I thought it would be time to pop out for a second or two and head to the other Byzantine labyrinthine maze, which is inter domain routing. And I actually want to talk about a weird snapshot of the inter domain routing system and its current state. 
George Michaelson 2:17 Now you've been looking at the long term behaviors of BGP for multiple decades now. Geoff, I mean, you actually have a massive back record of scale of growth behavior of the system, sort of in growth across the entire surface. And you've also been doing an annual State of BGP report. But it sounds like you're heading down a slightly different track. This time. Geoff Huston 2:41 It is, it is a different track. And this time I thought I'd use a micro view and take the minimal possible data set, one day, just one day, and one BGP session, not two, not three, just one, and look at what happened on that day. Now you know the reason why I think we're interested in BGP is that this is a system that has almost no natural constraint, no rules, no admission factor, nothing. If you want to advertise addresses in the routing fabric into BGP, find yourself an upstream provider. Find some addresses, and, you know, let loose, do it. There's no gating factor. There's no price for playing. There's no ticket required. George Michaelson 3:27 So there's no minimum height barrier here that says you must be this competent to speak BGP, someone who's prepared to peer with you. Geoff Huston 3:36 Well, that's part of the problem. There is no competence barrier. You don't need to demonstrate that you're capable of understanding what you're playing with. To play, you just play. And in some ways, BGP is one of these classic everyone needs it. Everyone uses it. No one's in charge. There is no, what we call routing police. There's no one there to say, look, that was really stupid. Don't do that. Tell you what. 10 minutes in the Sin-bin, contemplate your stupidity and don't do it again. Nothing like that. George Michaelson 4:06 So it's the game of football with no ref and no linesmen. That's kind of tricky. 
The idea, I mean, the mantra, "we're not the routing police", is something that people like us who work in the registry system are very used to repeating, because in handing out addresses, there was an assumption among a lot of members of the wider community that that gave us some functional role, monitoring the state of BGP in an active sense, and to have to say, "No, we had no permission to instruct or guide or require in that space, that's not our role". It's really quite a moment, isn't it? Geoff Huston 4:43 Well, it is, but it's quite a moment for the folk who go, hey, those people over there, they're doing naughty things with their address in the routing space. Take that address away from them. And we go, "No, it's not our job". We have no recognized... George Michaelson 4:57 Come back with a judge's order that we are obliged to accept and we will act on it. But just because you asked us to, no, we're not going to do this. Geoff Huston 5:07 And even that's a bit thin. Which judge? Which country? You know, the Shire Council of Grong Grong has said you shall not route this prefix address. You kind of go, yeah, right, you know, so what? [George: Yeah]. So it's kind of hard to set up authority models. And in some ways, you could talk about the inter domain routing space as a cooperative anarchy, [George: yeah], but I don't see it in such a positive light. George Michaelson 5:31 Oh, I still mentally see it as a mutuality, Geoff. But are you heading to a more adversarial state? Geoff Huston 5:39 I'm heading for a more depressed state, I think. Bear with me. But this is a hard talk, because generally, when you look at BGP, you start rolling out numbers, numbers, numbers, numbers, numbers, numbers, numbers. And I've noticed you can certainly get away with a whole raft of numbers in a learned paper or some kind of, you know, written screed. You could possibly even get away with numbers in a PowerPoint.
But, you know, I think it was Stephen Hawking who said you lose half your readers every single time you wheel out an equation. And he's kind of right, [George: yeah], they drop off very quickly by the third equation, [George: yeah]. And I suspect the same thing kind of holds for here, [George: yeah], that the numbers are less interesting than the underlying story. So I'm going to try and tell the story of just one day in BGP for just one session, but try hard not to get obsessed over the numbers and talk instead about, you know, trends and patterns and figure out what happened on that day. Okay, [George: yep], so I picked a day. George Michaelson 6:42 What day did you pick? Geoff Huston 6:44 The eighth of May. And it's kind of, why the eighth? And I go, I don't know. It was a week day. I think it was kind of better than a weekend. But beyond that, it's kind of, nothing happened. George Michaelson 6:54 An average day, nothing in particular known to be bad in the global network. It's just a day. What happened? Geoff Huston 7:00 No cable snaps, no hissy fits, no cyber attacks, nothing out of the ordinary cut and thrust of the Internet. Just a day, and I pick one session, and it's the feed to my network. So it's the one upstream, and it's AS4608, which happens to be the code, the number of the network used by APNIC. It's, I think, quadruply homed. It's got four exits. So it's not especially complicated. It's not a major transit for anyone. It's just a stub network out on the end. It's a bit like, you know, the planet Earth is an undistinguished planet in an undistinguished solar system, at the edge of a relatively undistinguished galaxy. You know, there's nothing special about us. George Michaelson 7:46 So we should observe at this point that there's two or three kinds of AS in an informal sense: ones that pass messages between other BGP speakers and have that role of providing that transit service, and stub ASes, which is what we are.
We are absolutely on the edge, in a very literal sense, nobody hangs behind us. We are the sole party with routing intent, asserting it out into the world and seeing it coming in. We've got no role in the transit of other people's packets. Geoff Huston 8:17 Well, and in some ways, again, this is entirely average. So first set of numbers, are you ready? [George: Right] In the v4 network, there are 77,000 networks, blobs of connectivity that act as a uniform entity, what we call an autonomous system. 77,000. George Michaelson 8:37 Let's just pause a minute there. We have a network that is effectively global in coverage. All of the population of the world that has an electronic device capable of emitting IP packets is now, at some stage in their daily cycle, connected, and it's done by less than 100,000 entities asserting things into the world. In some ways, Geoff, that's a remarkably low number. Geoff Huston 9:02 Well, the telephone network did that with 200. From the telephone network's perspective, that's an astonishingly high and inefficiently large number. [George: Oh], you know, in some ways it's kind of, who gave you the license to be an autonomous system? And the answer is, I just said I was, and other people believed me, and that's kind of the way it works in the Internet. A bunch of autonomous networks, a bunch of homes, a bunch of enterprises, whatever, have a common provider, a common sort of grouping, and that service provider gets an autonomous system number and starts to play a role. So there are about 77,000 of them in v4. George Michaelson 9:42 Those 77,000 [Geoff: Yeah] cover quite a range of size, don't they? Some of them are tiny and some of them are giant. Geoff Huston 9:49 From massive to tiny. But let's, let's do the one cut. You're at the edge. You're not anybody else's provider. You're the customer. You're right at the edge. All you do is effectively service endpoints.
Of the 77,000, 65,000 of them, that's 66,000 if you want to be getting down to it, are just edges. They don't provide service for anyone else. They're just what we call stubs at the edge. So the average network is a stub at the edge. George Michaelson 10:22 By a country mile, the vast majority. Geoff Huston 10:25 Oh, vast majority. Interestingly, the rest, around 10,000, do both. I originate networks, I'm a bit of a stub, but more realistically, I carry other people's traffic. I'm a service provider for others. I'm what we call a transit network. In a road system, those are the inter capital highway operators, and so they're actually the minority. Most of the network is just little stubs attached to this common sort of transit network of 10,000. George Michaelson 10:59 I interrupted you earlier. You were on the brink of saying that there might be a difference between v4 and v6 here. Do you want to go back there? Geoff Huston 11:07 Oh, not really. It's just that v6 is these days, in networking terms, about half the size. So while there are 77,000 networks that offer v4, some of those networks also offer v6, but not all. There are 35,000 networks in the v6 network. And again, most of them, 29,000, are stubs. And again, around half the number of v4, 5,000 are transits in v6. So if you pick an average network, it's a stub. It's an edge. George Michaelson 11:43 There's more of the edge than the core in this network. Geoff Huston 11:46 Way more. Interestingly, each of these networks originates around 13 addresses. Each of these networks is, on average, not very big. George Michaelson 11:57 13 chunks of address. Geoff Huston 11:59 Address prefixes, 13 chunks, different chunks of addresses. Most of them originate only one or two. It's very heavy tailed, and a small number originate a huge amount.
The record holder in v4 is Amazon AS 16509, it has taken the dictionary and thrown it at the wall and picked up the pieces and it originates 13,283 at last count, address prefixes in v4 George Michaelson 12:28 a prefix is kind of a scale free thing. It's an integral block of numbers that form a sequence, but it could only have one or two things in it, or it could have 16 million things in it, in the current v4 routing model, those blocks aren't size. So when you say Amazon is announcing 13,000 of them, in some ways, it's the history of Amazon being somewhat late to the party, isn't it? Because if Amazon had spawned out of something like AT & T or British Telecom or MIT, it would have had a single prefix that encompassed millions of addresses. But Amazon had to go out into the marketplace and buy little bits and pieces everywhere it could to get 13,000 blocks that it cannot aggregate into a smaller chunk because they don't line up next to each other. It needs those addresses to do its job. Geoff Huston 13:27 I'll come back to that. I'll say that's half true and half not, but it's certainly true that this is not a pre planned system. Interestingly, in the telephone network, local network started regional, national networks follow international networks ensued afterwards, and the way we glued them together was to add prefixes on the left, if you will, George Michaelson 13:51 the front. Geoff Huston 13:52 Well, I don't know what's front and back. You get confused, but you could be telephone address, number two, and I'll be four, and you kind of we group all our mates together, and then along comes another group of mates from a different city, saying, but we have two and four as well. So in a room, we say, well, I'll be city number one, so you're now prefix one number two, and I'm prefix one number four. And the people in the other city over there can be prefix city number two. 
And then all the cities group together and say, Well, I'm country code 61 and you're country code 64 and so on. So you end up in an arrangement where there are just 200 of these kind of massive prefixes that express entire national domains, and it was all a case of adding and incrementally changing the Numbering Plan by a coordinated mechanism. IP addresses don't do that. George Michaelson 14:48 Oh, The Road Not Taken. I mean, it's so tempting, so tempting to say what a simple world in routing we would live in had we made this decision. Geoff Huston 14:59 Telephone Doesn't know its own number, George Michaelson 15:01 doesn't need to, Geoff Huston 15:02 hey, what's my phone number? And it goes, that's a network property. Dude, go ask the network what my number is. I don't have a clue. I'm just a telephone. And you could change the Numbering Plan without changing the telephones, because the network had the numbers, not the endpoints. One of the cut throughs we did with computer networking, and I possibly going back to the 50s and 60s in design architecture, was that computers were told their own addresses. And so if you wanted to change the address, add a prefix, change or whatever, you had to go to every single computer going new address, [George: yeah], and you sit there and go, Well, that's easy, isn't it? How many devices are connected to the world's networks today? Oh, only a few billion, maybe 30 or so. Easy. not. George Michaelson 15:52 Yeah, we've backed down a narrow pathway of doing things different that is never going to admit a telephone number style, address distribution model without a significant cost that we're simply not going to wear. Geoff Huston 16:06 Right? In essence, the routing system can't redefine the addressing system. The addressing system is sort of, as you say, an artifact of history. A whole bunch of things happened, and folks started buying and selling addresses at one point, just to make it even weirder. 
And so the addressing plan is kind of a mixed bag, and the routing system just simply has to cope with what it has. It can't change it. So in v4, most folk have one or two, but like I said, there are a few outliers with a phenomenal number of addresses. What about v6? That's a new network. We've had the chance to get organized and apply all of our experience in v4 to v6. Yeah, right. George Michaelson 16:48 Oh, come on now, be generous, Geoff. Geoff Huston 16:51 Well, we haven't, have we? We've done a lousy job, too. Each network originates, on average, around 6.4 address prefixes per AS. George Michaelson 17:00 And the maximum? Geoff Huston 17:01 6598. George Michaelson 17:04 Wow. So the maximum is actually in the same order of scale as an Amazon, even though the average is somewhat smaller. Wow. Geoff Huston 17:11 AS9808, China Mobile. Again, just a massive sort of heavy tailed distribution where most folk, most folk originate a small number of networks, one or two, [George: yeah], and some just go no limit, just whatever, and just bang away. George Michaelson 17:27 Most here is the overwhelming majority, and some is the long tail of one or two that just have structural reasons they behave differently. If you looked in v6, the typical network is a stub entity announcing a smaller set of prefixes. That's the norm. Geoff Huston 17:45 The majority of networks originate, you know, two or three, as many as 10, and a very, very small number do more than 10, but some of them do enormously more than 10. And the same in v4: a very, very small number do more than 10, but when they do, they do a very large number of address prefixes. George Michaelson 18:03 Right. Short podcast, good program. We'll see you all next month. That was wonderful. Story over? So, not quite, is it? Geoff Huston 18:11 It's a big routing system. These days there are 1 million, 1 million entries in the v4 network. George Michaelson 18:18 Sorry, how many? Geoff Huston 18:19 1 million.
We finally hit the number [George: 1 million]. 1 million. 1,001,332 at the start of May the eighth. George Michaelson 18:29 It used to be when we tripped over these large scale changes in BGP, the whole of the operations list would go bananas with people saying, oh my god, I had a limit that said, I've never seen more than 100,000. What am I meant to do? Oh, my God, I'm using a routing box from 1987, it doesn't have enough memory. I haven't heard anyone talking about hitting the million routes mark. It just kind of came and went. Boring. Geoff Huston 18:53 It's not boring. It's actually quite difficult, and it keeps on growing very quickly. It grew on the eighth of May: we added 674 new v4 prefixes, net. So at the start of the day, it was 1,001,332; at the end of the day, it was 1,002,006. George Michaelson 19:12 600 a day. Geoff Huston 19:14 We keep on adding, adding more entries. George Michaelson 19:17 So on a typical day, a typical random day, nothing special, 600 new prefixes come into the BGP system. Geoff Huston 19:26 V6: a mere 221,004 unique v6 prefixes. George Michaelson 19:32 That seems, at scale, a very much larger number relative to the size of the prefix total, like v6 is growing quicker. Geoff Huston 19:42 Yeah, but in v6 the addresses are four times the size. So oddly enough, the table's still slightly lower. It's 221,000 times four, 880,000, versus 1 million in v4, so in memory size, they're comparable. And the v6 network grows, and you kind of go, well, surely it's growing faster. Yeah, no, it's not. It's growing slower: 140 new prefixes that day. So you go, well, it's not a problem. We've coped with that before. Nothing to see here. Let's move on. Hang on, apply the brakes. Now, a million entries: our binary search algorithms, no matter how good you get, typically work in, well, it's a binary algorithm. You halve the space for every look up. So if you have a decision table, the first thing you do is go, is it at the top half or the bottom half?
You go, well, pick one of those halves. Is it in the top half of that or the bottom half? You're dealing with quarters, eighths, sixteenths. It will take you 20 look ups on average to find what you're looking for in an ordered table of a million entries. George Michaelson 20:43 So the number of operations you have to do to find a unique prefix in the set of all prefixes keeps on getting more expensive in time and effort. Geoff Huston 20:55 Well, it's 20. You know, when we only had, in those halcyon days, halcyon days of, let's say, 4000 entries, and I'm going back to, I don't know, 1988 or 89, you could do that in, well, 12 lookups. But now we're at 20. Now you kind of go, 12 to 20, not a big issue. Yes, it is, because memory is not getting faster. Memory in silicon has been the same speed for over two decades. So in some ways, if I'm now dealing with petabyte capacity, and I've got to spew packets out of the cable at a few hundred million per second, and to figure out what to do with these packets, because this is hop by hop, destination based forwarding, yes. So I take the destination address, yes, and I look up the routing table to figure out where to send it. In a very small network, it used to take me 10 cycles to actually make that decision. In a table today, using a binary lookup, it takes me 20. George Michaelson 21:44 And the component you're looking into, memory, unlike CPU speed, has not got significantly faster. Geoff Huston 22:06 Not getting any faster. George Michaelson 22:07 Look up is somewhat constant per cycle. So Geoff Huston 22:11 the challenge down at the silicon level, posed by these larger and larger and larger tables, is not diminished. It's a very, very unique challenge to try and maintain line rate speed in packet delivery, because packets aren't getting any bigger. We've talked about this before. Packets are still the same size.
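The 12-versus-20 figures quoted above fall straight out of the halving argument. A minimal sketch, assuming a plain binary search over an ordered table (real forwarding hardware uses other structures; the function name `halving_steps` is ours, not from the conversation):

```python
import math

def halving_steps(entries: int) -> int:
    """Number of halvings needed to narrow an ordered table down to one entry."""
    return math.ceil(math.log2(entries))

# Table sizes quoted in the conversation.
for label, size in [("late-1980s table", 4_000), ("v4 table, start of 8 May", 1_001_332)]:
    print(f"{label}: {size:,} entries -> about {halving_steps(size)} lookups")
```

Growing the table 250-fold adds only eight steps, but each step is a memory access, which is the part that has not been getting faster.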
But as you up the transmission clock rate, as you increase the number of packets per second, and you increase the size of the decision table, you're kind of putting more and more faith in the fact that we can do clever tricks in silicon to fake out the fact that this is just not getting any faster. In fact, if nothing else, it gets slower. So it's quite a challenge to design basically ternary content addressable memory these days that can give you line rate across such a massive decision table. George Michaelson 23:02 If only we'd come up with a way to use the Flow Label mechanism that's built into the packets to actually construct some form of on the fly routing signal to do this more effectively after the first one or two packets. But that's just never worked. Never worked. Geoff Huston 23:20 Oh, hang on. Hang on, George. Oh, if only we had reinvented virtual circuits, how much better the world would have been. And all the ghosts of the telephone industry go, Aha, the packet guys have finally met their nemesis. We were right all along. Yeah, you're not. You know, there was a big argument at the time, and it is a really fascinating argument in design technology as to whether you pack all the goodies into the network and make the network the focus of attention, or you pack all the goodies in the edge and strip the network of anything remotely resembling intelligence or even investment. Part of the cut through in the original takeover of packet based networking against circuit based was the fact that silicon at the edge was cheaper. Silicon in the middle was a real problem, and it was kind of cheaper to build fast, dumb networks and repair the mess in the computers at either end. And that's why packet based networking gained ascendancy. It was just cheaper. It just offloaded cost. George Michaelson 24:23 Yeah, cheaper doesn't get you faster all the time.
I mean, when we were still within the bounds that the increasing speed of CPUs gave us a win, all well and good, but you've just told the story that part of the game has somewhat diminished in its ability to deliver an outcome here physics is physics and memory isn't faster. So if we keep on growing, things are actually always going to be getting a little more costly and slow to compute. Geoff Huston 24:51 Yes, if you continue to build faster, wider super highways and pack all the traffic into a small number of trunks. Of course, that's not the only way you can do networking. And you know, there are many ways to get around this problem, and one is disperse the traffic, reduce the concentration points, push stuff out towards the edge, and Oh, guess what we've been doing. CDNs, avoid this entire problem by pre delivering the content to the edge and actually getting around the bulk of the problems we're creating in the middle. We don't use transit super highways the way we used to, and that's our current way out. So I'm not sure the problems and solutions are as simple as you're making out. And there are many ways to solve this problem, but I suppose I just wanted to point out that inexorably, the routing table grows, and that growth has a number of impacts, including, if you will, the impact on switching ability down silicon level. George Michaelson 25:50 Yep. So that's one dimension of growth you've been talking there are other ways you've been looking at this typical day? Geoff Huston 25:57 The next thing I want to look at is this weird thing about what is the role of a routing system? Oh, George Michaelson 26:06 do the least work possible? Geoff Huston 26:09 That's possibly it. But, you know, the role is to know where everything is. You know, there's no external view of the network. No one's out there doing the cartography of the Internet. There's none of those wonderful 15th century maps of the Earth, or even earlier. You know Jerusalem in the middle. 
You know North is heavenward, South is onwards. George Michaelson 26:29 We've both seen some rather beautiful maps from TeleGeography, but they're much more about the lay of fibers in the physical world. There isn't a good equivalent for the Internet component. What there is is a blob that looks rather like a weird, amorphous space brain floating in the black void, which is the map of what actually is the connectivity, but you couldn't steer your ship through it, Geoff. Geoff Huston 26:54 Right. The space brain is kind of the best we've come up with. But the job of the routing protocol is to take each of the nodes, the individual networks, and let them be informed of their connectivity options. What do I mean by that? Well, let's say you have a network that has three external interfaces, connections, three links to someone else, and you've got a packet. God knows how you've got it. You got given it from on high. Here's the packet, Dude, get rid of it. Which of the three do you choose? The left, the right, the middle? How does it leave my network to get closer to its intended destination? Well, that's the job of the routing protocol. Because what the routing protocol does is it informs every network of which addresses are reachable and which are duds. The duds, you just throw away the packet: I can't reach there. Goodbye. And if they are reachable, if you have a variety of external choices to use, it informs you of what I say is the local best choice. Use the left interface for this packet, use the right one for this address packet and so on, which is consistent with the decisions everyone else is going to make to get that packet to its destination. George Michaelson 28:05 If you think about this as a router with three hands, a Martian router, your decision on the million prefixes coming in all has to reduce down to left, middle, right. [Geoff: Yep]. That's the only choices you've... well, four choices: left, middle, right, throw it away.
So all of this information has to map into pick one of four things, if you've got three choices. Geoff Huston 28:29 If you pick left and the network to your left has picked you, [George: oops], we've all got a problem. Me, you, me, you. No, no, it's yours. No, it's mine. No, it's yours. Stop. You have to make a decision that's consistent with everyone else. And so the job of the routing protocol is to sort of push information around the entirety of the network, so we all make consistent choices, so that every time we get rid of a packet to a neighbor, it gets closer. That's a very deliberate choice, closer to its destination, [George: yeah] because if you then think I'm closer, then you and I have a different view of the underlying world, and that's bad. That's a loop. George Michaelson 29:11 So closer in this discussion is a somewhat undefined quality. Geoff Huston 29:19 Think of it as packet miles, but it's not really, but it is closer. If everyone makes a forwarding decision that gets the packet closer, then, Zeno's paradox notwithstanding, it'll get to its destination eventually, because, you know, closer eventually gets to: it's beside me, I've arrived. And so the first job of the routing protocol is to keep the network together, to keep everyone with the same view of the world, the same sort of abstract map. Now they don't need to know, and perhaps can't know, exactly where this packet is going to travel. This is not, if you will, a future prognostication protocol. Packets aren't bound to circuits. Everyone makes independent decisions, but what it's trying to do is to reach a local decision that it believes is consistent with the future decisions that are going to be made about that packet as it goes through other networks. George Michaelson 30:14 This is Byzantine in the truest sense.
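The "me, you, me, you" problem can be made concrete with a toy model. This is a sketch under our own assumptions: each network is reduced to a single next-hop choice for one destination, and the names `trace`, `me`, `left` and `dest` are illustrative only, not anything BGP defines.

```python
def trace(next_hop, start, destination):
    """Follow each network's local next-hop choice and report the outcome."""
    path = [start]
    node = start
    while node != destination:
        node = next_hop.get(node)
        if node is None:          # nobody to hand the packet to: discard it
            return path, "dropped"
        if node in path:          # handed back to a network already visited
            return path + [node], "loop"
        path.append(node)
    return path, "delivered"

# Consistent views: every hand-off moves the packet closer.
consistent = {"me": "left", "left": "dest"}
# Inconsistent views: I picked left, and left picked me.
inconsistent = {"me": "left", "left": "me"}

print(trace(consistent, "me", "dest"))
print(trace(inconsistent, "me", "dest"))
```

Each table entry is a purely local decision; the loop only shows up when the local decisions are put side by side, which is why the protocol has to keep everyone's view consistent.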
You're not allowed to affect the future fate, but you're meant to get enough inference from the information coming in that you can at least try to do better when you push it out. Geoff Huston 30:29 Exactly. And I suppose that's why routing has been such a fascinating and such a difficult problem for a very, very long time. Because, like I said, there's no external agency drawing the map for you, there's no configuration tables. You've got to talk to each other and mutually discover how this universe is put together. And you know the idea that there's no external help, you're on your own, kiddies, talk to each other and work out the topology of the network, that you all have a consistent view which local decisions bring packets closer to their destinations. Fair enough, it's the topology maintenance protocol. George Michaelson 31:07 Topology maintenance. Here's the landscape out yonder with the Internet weather report. Geoff Huston 31:14 Well, yes, and quite frankly, links are not permanent. Sometimes they go down. Catastrophic events happen. People lean on the plug. Someone leans on the power switch. You know, things happen and breaks happen, and oddly enough, you want the system to self heal. So, oh, there's a break to ticket. Is there an alternate way to get towards B? Because I've got a packet addressed to B. So topology maintenance, not just construction, but keeping it together, is its primary goal. But that's not enough. George Michaelson 31:48 There's something in that that is just, it's eating at my brain as an ear worm. Something breaks over in Türkiye because of a local matter with a backhoe in a farmer's field, and me as a stub network down here in Australia, I get told it and I have to try to optimize anything from me heading that direction in a belief I should heal it somehow. This seems very strange. Geoff Huston 32:17 Well, the issue is, let's go back to my three arm router.
I might have been using the interface on the left to reach this field in Türkiye. And when something breaks because of a catastrophe in Türkiye, it's not that it's unreachable. It's just that it's reachable on an entirely different set of, you know, paths. It's a different thing, and I should have to go to the right, even in Australia, [George: right]. Maybe I was going up through Asia, and all of a sudden I have to go, Wow, I can only get to Türkiye through America. I need to go to the left or something similar. So it's not as bizarre as it might sound in first instance, that you need to propagate this stuff in order to make sure that everyone is making consistent decisions locally. George Michaelson 33:03 Every BGP speaking router is both receiving this update about the connectivity issues in the global space and adjusting its own idea of the best forwarding in the light of the circumstances as it sees it. Geoff Huston 33:18 Everyone. Now, if that was all we were doing, this network would be pretty crap, because BGP doesn't find you a bunch of paths. It finds you the best path. It actually coagulates. It coalesces, it sends everyone down the same path at some point, because there's only one best path. But what if you don't have enough capacity? And even though there's an alternate way of getting there that's slightly longer, BGP goes, no, no, I give you the best path, [George: right] the best path, singular. George Michaelson 33:53 Best in a dimension of a selection for shortest, irrespective of any other consideration of congestion, behavior, consequence. Geoff Huston 34:03 I'm giving you the best path, and you kind of go, but I want you to be cleverer than that. I've got a mesh of global connectivity. I just don't want to melt the one path between New York and London or whatever. I don't want to do that. I want to splay my traffic across all the paths. I want to do what motorists in cars do when there's congestion on Highway One.
Some of us peel off and take Highway Two. It might be longer, but oddly enough, if Highway One is congested, it will be faster. So you want to do the thing that's actually, I suppose, hardest to describe. It's called traffic engineering: you want to not only maintain the best paths across this network, you want to actually maintain a range of choices that let you do what we call traffic engineering, that allow me to say, well, you can reach me on Highway One, but you know, quite frankly, Highway Two and Highway Three are also an option. George Michaelson 35:00 But BGP is not written in a way to make that easy. Geoff Huston 35:03 Oops, it's not. And so what we do is all kinds of wacko tricks. And part of the issue, too, is that forwarding is a two part problem. I have the packet. I want to get it to you in a way that minimizes my economic cost. You want the packet, and you want me to make a choice as to how to reach you that minimizes your economic cost. George Michaelson 35:30 And the two may not line up. Geoff Huston 35:33 You might want me to use transit path A, but I might want to use transit path B. Somehow we have to reconcile these two, and someone needs to win and someone needs to lose. Now, there is an odd trick in routing called more specifics. So I'm going to just briefly digress and say you don't just advertise an address prefix and go, well, it's a country at +61, or it's, you know, +64, etc, etc. Every bunch of addresses has a size. We commonly use this weird thing called prefix notation, which is bit length masks. And I'm going to stop there because, you know, brains start to melt. But the issue is, you can advertise a big block of addresses, and you can also take out the cleaving knife and cleave it into smaller blocks and advertise a whole bunch of smaller blocks of addresses. And if you're so inclined, you can do both. Now, the routing system prefers the more specific, the smaller blocks.
George Michaelson 36:33 So the best decision includes: if you were told this is a big chunk, but you were also told the specific thing you're looking for is in a smaller chunk, prefer the smaller chunk. Geoff Huston 36:45 So what if I want to bias my traffic to have most of it coming in through circuit A, but have a small amount coming in through circuit B? Well, I advertise my large block of addresses down circuit A. Hey, this is all my addresses. You can reach me here. But as well, I slice and dice a few more specific address blocks, and say you can only get to these specific address blocks through circuit B. Now the other end goes, oh, more specifics win every time. I can see this aggregate, and I'm forced to use circuit A, but there's these weird little addresses that are using circuit B. I have a packet addressed to one of these weird little address blocks. I have to use circuit B. And so people play this game like crazy. When we said that there are a million entries in the v4 routing table, what I didn't say is that there are around 600,000 entries that actually aren't new. They're more specifics. They're doing the address traffic engineering. George Michaelson 37:49 So more than half of the surface of the count of prefixes, more than half of it is people optimizing in this kind of way. It's not strictly necessary just to reach them. It's to reach them the way they want you to come in, that is optimal for them. Geoff Huston 38:09 To minimize their cost. Networks cost. It's all, you know, fair in this game. And so, yes, more than half of the advertised prefixes in v4 and v6 actually don't define reachability. They refine the choices that are being made, to bias the folk who are sending them traffic to go down paths that suit their incoming policy. And BGP carries that load. You know, it continuously has got this sort of background hum of activity around refining how to reach you, particularly where more specifics are concerned.
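The "more specifics win" rule described in this exchange can be illustrated with a minimal sketch. It assumes a toy table of (prefix, circuit) pairs; the documentation prefix 192.0.2.0/24, the circuit labels, and the helper name `best_route` are all hypothetical and are not taken from the episode or from any router implementation.

```python
import ipaddress

def best_route(routes, destination):
    """Return the (prefix, circuit) entry with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    # Keep every route whose prefix covers the destination address.
    matches = [(net, circuit) for net, circuit in routes if addr in net]
    if not matches:
        return None
    # The more specific (longer) prefix wins over the covering aggregate.
    return max(matches, key=lambda entry: entry[0].prefixlen)

# Hypothetical table: one aggregate on circuit A, one more specific on circuit B.
routes = [
    (ipaddress.ip_network("192.0.2.0/24"), "circuit A"),
    (ipaddress.ip_network("192.0.2.128/26"), "circuit B"),
]

print(best_route(routes, "192.0.2.10"))   # covered only by the aggregate
print(best_route(routes, "192.0.2.130"))  # covered by both; the /26 is chosen
```

Real routers use tries or hardware lookup tables rather than a linear scan; the sketch only shows the selection rule.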
George Michaelson 38:46 So you say background hum, but you actually haven't yet said anything about chatter, about how much people are saying things. Because surely, in BGP, once you've said it, you're not shouting it every day. It's said once and then never again. Geoff Huston 39:01 BGP uses TCP, a reliable transport protocol. So it's like a conversation where whatever I tell you, I assume you're going to remember it forever, and only if I tell you something different do I need to update what I told you before. So BGP is a very, very economic protocol because it uses TCP. When I tell you, I know that you know, you've ACKed it, and as far as I'm concerned, that's it. You now know it. So the only thing I have to tell you is what changes. I only have to tell you when things are different. I don't have to say, yes, you can reach network one through me. Oh, you can reach network one through me. Oh, by the way, did I tell you? You can reach network one through me. BGP doesn't do that. It simply says, I told you once, that's it. I'm not going to tell you again. You're expected to remember what I tell you. You're a computer. So on an average day, when nothing happens, nothing, my little stub router connected at the edge received in v4 722,000 address prefix updates. George Michaelson 40:11 Wait. You said nothing happened, and you got 700,000 updates because nothing happened. Geoff Huston 40:18 8.3 updates a second. V6, much better behaved: 270,000 v6 prefix updates per day, or 3.1 updates a second. George Michaelson 40:30 So Geoff, you said, like three paragraphs back, very efficient protocol, uses TCP, if I told you once, I don't have to tell you again. That's a lot of chatter for not much happening. Geoff Huston 40:43 There's an awful lot of chatter. And this isn't just a little bit of background hum. It's like me shouting into this microphone, and the shout overwhelms the signal.
It's kind of, the noise is almost everything. So a small number of times, oddly enough, instead of two or three updates a second, there are a couple of seconds on that day where I got 800 updates per second, held steady across a five minute interval. It wasn't just a little blip, it was sort of, wow, let me just shout at you as fast as I possibly can and send you a few thousand updates as quickly as I can, because this is stuff I think you need to know, whether you want it or not. George Michaelson 41:25 But the way you've described BGP, except in the very specific circumstance that someone you talk to directly, a neighbor, a peer in BGP terms, except in the circumstances where they're the one that is emitting these messages, if you see it, every other BGP speaker saw it as well, necessarily. Geoff Huston 41:48 So. Okay, let's now sort of go through why you say something in BGP. And so, as I said, on the whole, there are about 700 new announcements, announces in v4, there are about 200 in v6. And when a network appears that wasn't there before, reachability, that's what we call an announce. 47,000 announcements were made in v4 on that day. Obviously, some of them got withdrawn and announced and withdrawn and announced, because these are computers. But you know, 47,000 announcements. In v6 it was half that, 28,000. So in rough figures, around 10% of the traffic is: hey, this is a new prefix, wake up. This is reachability you didn't know the second before I told you. George Michaelson 42:37 Which, to put a value judgment on it, is necessary and, if you like, good information. [Geoff: right] You have to hear that. Geoff Huston 42:44 You had to hear this. And the other thing you really should hear, because otherwise packets, you know, sort of drift in this nowhere land, is that that network, that address prefix, can't be reached no more. And interestingly, announces don't equal withdrawals. The withdrawals, "I can't get there anymore", are around about a third the size.
George Michaelson 43:03 In some ways, that says things in BGP tend to be stable. You don't wind up having as many things disappear as add. It's growing more than it's shrinking. Geoff Huston 43:13 At least in v4. In v6, 28,000 announcements, 25,000 withdrawals. Up, whoa, down, up, down. George Michaelson 43:23 Whoa, that's huge. Geoff Huston 43:24 It is just unstable configurations where you think it's not reachable, but it's reachable, then you think it's not, then it is, then it's not. And so a certain amount of BGP traffic is unavoidable, and simply all about changing reachability, and some of that change is permanently unstable. Okay, what about all the rest? You see, the way we prevent loops and measure things in BGP is it keeps what you'd call a snail trail of where that update has been, so that if it ever loops, you can see the loop, because the snail trail, the list of networks that have handled that update, has you in it. I've seen this before. Well, throw it away, dude. It's a loop. And a huge number of updates, more than half, is actually refining that path. It doesn't say this is new reachability information. This is not a prefix that either is or is not there. It's always been there, but it's refining the way that that update has traveled to get to you. George Michaelson 44:29 So the way things travel sounds to me like something that is closer to the traffic engineering behavior space of BGP than the fundamental routing property of BGP. If it's happening to change path, does that mean it's necessarily happening because people are fiddling with the speed dials? Geoff Huston 44:48 And you can think of a kind of topology where you might have a different way of reaching that destination, you might, and the only way that I'm going to trigger that is to say that the existing way you're using isn't usable anymore. BGP doesn't understand the topology that it's maintaining. It just sends the atomic operations.
So when a path changes, when the inter-AS connectivity of the propagation of that update changes, you have to push it around everywhere, and this accounts for a lot: in v4, more than half the updates are path changes. In v6, two thirds of the updates are these kinds of path changes. They don't reflect increased or diminished reachability, but they refine the choices being made on what we'd call the best path. George Michaelson 45:42 So they might functionally alter Geoff Huston 45:44 They might. George Michaelson 45:45 my local routing decision, because, as you've explained before, it could wind up altering my choice of best path. But they don't alter the overall surface of things I have to have in memory as a map to find things. Geoff Huston 46:00 And then we come to the last classification of updates, which is what we call attributes. Because, you know, we're never happy with things that are simple and easy, and adornment is everywhere, and folk wanted to adorn BGP with overlays of what we call commentary. Hi, I'm a really good provider in Asia, reach me for Asian destinations. How do I say that in BGP? Oh, good question. I don't know. Well, let's define a sort of a space, and let the provider say, if you send me a BGP update and it contains this digital tag, this attribute, I will do something funky with it, just for you. Bizarre, but so useful. George Michaelson 46:46 To come back to a thing said at the beginning, we're not the routing police. Is there an attribute police? Does anyone say what these attributes mean? Or is it private between you and me? Geoff Huston 46:58 Every network defines its own attributes. George Michaelson 47:01 Oh, my. Geoff Huston 47:02 There are a few that the IETF did define, NOEXPORT, even a bizarre one called NOPEER.
There are a few that are defined by standards, but everything else, now, each provider can, you know, divide up this digital space, which I think has now got to about 64 bits long, it's quite big, and say, well, if you send me a route with these attributes, I will do the following things. And interestingly, in v4, 30%, one third of all the updates, are a change in the attributes of a route. It's kind of adornment change. It's nothing more fundamental than that. George Michaelson 47:37 So some BGP speaker, and I'm going to go out on a limb and say probably one that has a direct relationship, at most one away from that source, changes their behavior, because this adornment, this attribute, does something. But for you, Geoff, as a stub network far, far away in Faroffia, that attribute change isn't going to alter any decision you make about how things are done, because you don't know what the magic code means. But nonetheless, you were told it. Geoff Huston 48:08 Because BGP doesn't know if you have any capability to alter your behavior. [George: Wow] It just says, I'm the messenger, dude, don't shoot at me. George Michaelson 48:17 A third of all updates are changing an attribute which can only be meaningful to a limited subset of BGP speakers. Geoff Huston 48:27 In v4. Less in v6. We don't use attributes as much in v6. It's a toy protocol. It's not serious. So last and not least, let's kind of move on here. You kind of think, well, this is like Brownian motion. [George: Yeah] You put a sample of water under the microscope, and all the molecules are moving all the time. No, no, no, no, no, no, no. There's a small class of bad networks that are just radiating noise as if it's going out of style, and everyone else is kind of silent. George Michaelson 48:59 So this behavior isn't equally distributed amongst all the BGP speakers. The emitters of this information cluster.
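The update categories walked through above (announcement, withdrawal, path change, attribute change) and the "snail trail" loop check can be sketched as follows. This is only an illustration under stated assumptions, not the tooling used for the measurements in the episode: the `Route` type, the `classify` and `has_loop` helpers, the tag strings, and the AS numbers (drawn from the documentation range) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Route:
    """Hypothetical per-prefix state: the AS path plus opaque attribute tags."""
    as_path: Tuple[int, ...]
    attributes: frozenset = frozenset()

def has_loop(as_path: Tuple[int, ...], local_as: int) -> bool:
    """The snail-trail check: our own AS already appears in the path."""
    return local_as in as_path

def classify(previous: Optional[Route], update: Optional[Route]) -> str:
    """Classify one update for a prefix; update=None stands for a withdrawal."""
    if update is None:
        return "withdrawal" if previous is not None else "spurious withdrawal"
    if previous is None:
        return "announcement"       # new reachability
    if update.as_path != previous.as_path:
        return "path change"        # same prefix, different snail trail
    if update.attributes != previous.attributes:
        return "attribute change"   # same path, different adornment
    return "duplicate"              # nothing changed at all

old = Route(as_path=(64500, 64501, 64510))
print(classify(None, old))
print(classify(old, Route(as_path=(64500, 64502, 64510))))
print(classify(old, Route(as_path=old.as_path, attributes=frozenset({"tag-1"}))))
print(classify(old, None))
print(has_loop(old.as_path, local_as=64501))
```

Comparing the path before the attributes means an update that changes both is counted as a path change; that ordering is a choice made for this sketch.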
Geoff Huston 49:08 I said there were 77,000 ASes in v4, but only 24,000 contributed to updates as origins in that 24 hours. The other 54-odd thousand were quiet as mice, didn't do a thing. And you go, well, okay. The top 10 ASes in v4, the noisiest, accounted for 14% of the total update volume. And the biggest one was the university academic and research network in Mexico. 4% of all the updates I received concerned that university network. I'm in Australia. It makes no difference to me, but I'm hearing it in great detail. George Michaelson 49:50 I would guess that if they're not a complete stub, they're pretty stubby. They might have a small number of BGP speakers downstream, but they're basically dealing with their own academic network, right? Geoff Huston 50:00 But they have a lot of networks inside them, and they have a number of transit choices: ASes 174, 1299, 2914, and 6453. And what they do is, regularly through the day, they take their 12,067 v4 prefixes and move them to use one of the other transits. As soon as they do that, 12,067 updates, ka-bang. Then they do it again. Ka-bang. And it's kind of, every time, you update 12,000 prefixes. George Michaelson 50:33 Oh, wow. Geoff Huston 50:33 Thanks, everyone. George Michaelson 50:34 Could we use the attributes? NOTHANKS. And sort of, this is something that nobody else has to see. Geoff Huston 50:41 Well, you think this is bad. In v6, the 10 most active origin AS networks accounted for one half of the v6 updates. George Michaelson 50:51 Half come from 10 speakers. Geoff Huston 50:54 And 20% came from one AS, 151194. Thank you very much. George Michaelson 51:00 Mate, mate, get on the phone, make a phone call and say, what are you doing? Geoff Huston 51:07 Don't. Just don't. Stop it. Whatever you're doing is wrong. And this is my earlier observation.
There's no intelligence test, no competency test, to enter the BGP network. As you know, you find all kinds of stuff out there from all kinds of folk who have no idea of the consequences of what they're doing. So, you know, yes, it's heavily skewed. [George: wow] And a small number of folk are being sort of massively noisy citizens out there. And in some ways, you'd really love to scream at them, saying, stop it. But when we tried that, route flap damping, we kind of shot our own foot off as well. And in the closing sort of bits of this, what have we learned about this day on the Internet? Then it's pretty clear that BGP does one thing brilliantly, one task, I mean, really, really well. It's the best flooding protocol we know. If you want everybody on the network to know something, put BGP to work, and it will tell everyone really quickly, really efficiently. George Michaelson 52:10 Very good at telling people, but is what's being told functionally useful? Geoff Huston 52:16 Ah, that's the next thing. But if you ever wanted to flood something, RPKI people in particular, go use BGP, because any other attempt at flooding is an incredible disaster. I'm going to poll you until it changes? Oh, go away, that's just thrashing. Oh, you're going to tell me, I'm going to tell you a billion things about, you know, when something changes? That doesn't work either. BGP is really good at doing one thing, and it's brilliant at it. No wonder we all use it. But everything else, it does really badly, really, really badly. Again, let me talk about our little network in Mexico that's busy flapping around with four providers. And what it's really saying is, look, if you can get to any one of those four, they know how to get to me. So I actually don't need to tell anyone else in the world about my connectivity. Those four folk need to know that they're my upstreams, and what everyone else should know is: ignore me. Just don't propagate out changes. How do you say that?
I've just told you in English, but that's not telling you in BGP. I've told you a policy, but I can't express it in the language of BGP routing. George Michaelson 53:28 You would think something like NOPEER or NOEXPORT had some quality that would allow the upstreams to say, I'm not passing this on. Geoff Huston 53:37 Well, NOEXPORT is pretty brutal. I'm going to tell you something, and you can't tell anyone else. And that's not what I said. It's more absolute than I wanted. I want to tell some people, and I want those folk to know it within a certain context, but I don't want them to shout it at the world. So NOEXPORT was kind of too restrictive. NOPEER was kind of better. It was an attempt, I think it's about 20 years ago, to say, I'm going to make some noise and I'm going to pay you for it. So if I'm your customer, I'm going to pay you extra and send you a whole bunch of routing noise. And here's the money to compensate. But as soon as my money runs out, across a $0 peering relationship or to a downstream provider's customer I'm not paying, don't send it on. It's irrelevant. NOPEER had a better hope, but no one took it up. No one uses it. It's not there. So, you know, even some attempts to try and create a rudimentary language of address policy have failed. [George: Wow] Transport policy has failed. In BGP, we can't contain the blast radius, you know, tell your neighbors, tell everyone in this concept of an area. Is that a geography? No, it's a network topology. How do you describe that with a color? You know, at this point you're scratching around going, oh, I don't know what I'm saying. George Michaelson 54:57 We don't have a language in BGP to help it make this kind of scattergun blast radius of a consequence reduce. It's enormously good at shouting it to the world, great at flood fill, but not necessarily good at limiting. Geoff Huston 55:14 On an ordinary day where absolutely nothing happened, no circuit outage, no cyber attack, no nothing.
We still got a million updates, and every BGP speaker in the world was processing a million updates. It's kind of bizarre. Now, the thing is, why don't we fix it? And you know, surely we're clever enough, we've had enough experience to say, well, okay, let's attack this. Let's go and look back again at the BGP policy landscape and try and fix it. Now, there's a certain amount of cost versus benefit going on, and the real question is not, oh my God, that's a problem, we have to fix it, because that's kind of maniacal. That's the obsessive compulsive element of the IETF that says, oh my God. George Michaelson 55:57 It doesn't look perfect. Therefore we should make it more perfecter. Geoff Huston 56:01 Press the panic button and make it perfecter. Absolutely. You know, at some point you learn to live with stuff, because the cost of finding a solution is way higher than just simply living with it and paying the cost of the inefficiency that's there. So BGP at a million updates a day is not causing meltdowns. It's not putting all the routing systems in danger of Internet collapse. I'm like, it's okay. We can live with it. George Michaelson 56:29 This is the tolerated level of noise and chatter that achieves the outcome we want, which is efficient routing. Geoff Huston 56:36 Now occasionally folks reach for the big red button and press it. Oh my God, the routing system's going to melt. Movie at 11, blah, blah, blah. And if you happen to look at RFC 4984 from 20 years ago, the 2006 IAB Workshop on Routing and Addressing, you'll see the panic button being pressed. It was sort of all, routing is growing faster than Moore's law. It's all going to die very, very soon. [George: Yeah] We have to do something. Dire predictions, etc. But the reality has been a bit more mundane than that. In general, those predictions haven't eventuated, and we've found cost effective solutions that don't require rebuilding the inter domain routing system from the ground up.
Now the other part of this activity is the bells and whistles department of the IETF, the complexity factory. I want colors. I want VPNs. I want multi homing. I want service differentiation. I want, I want, I want, I want. [George: Yeah] But the issue is, unless a critical enough number of people are prepared to pay money for it, the vendor goes, [George: not happening] with it, dude, not happening. We're not all going to do this. And a solution that works just for Geoff's routers, you know, in Geoff's networks, is not a global solution. It's not even a solution at all. It isn't there. So the threshold to make everyone do something is actually quite high, and we tend to do lowest common denominator featureism. George Michaelson 58:03 The bare minimum necessary to make it work. Geoff Huston 58:06 That everyone's willing to pay for. And if everyone isn't willing to pay for it, guess what? It doesn't happen. An insufficient number of folk want it. Quality of service routing ain't gonna happen. Integrated routing security ain't gonna happen. What do you mean? We have RPKI? No, you don't. You have managed filter lists. Oh, is that secure routing? No, it's managed filter lists. Why? George Michaelson 58:30 It's what you've got. Geoff Huston 58:31 It's what you're willing to pay for, dudes. That's all you get. If you want something better, you know, more complex, more difficult to operate, be prepared to pay a huge price. Are you prepared to do that? No, you're not, because all the investment's moved up in the stack into applications and content. [George: Yeah] So oddly enough, after this day, what you get left with is that the day looks much the same as all the other days over the last 40 years, all of them. It's a not special day, because every day is like that in BGP. There's this sort of tolerable amount of inefficiency and overhead, and it's tolerable because it's not causing meltdown.
Yeah, my router is doing a whole bunch of work for networks in China and Mexico that have not changed my local forwarding decision one iota. [George: Yeah] This was all just, you know, exercising the processor, because it had nothing better to do that microsecond. But it doesn't matter. I'm willing to pay that price because trying to make it different will cost so much more. So it's a distance flooding protocol algorithm that does simple best path selection. It's fantastic. For everything else, it's pretty lousy. It's really bad, but it's okay. It's okay because it's not affecting its primary goal. [George: Yeah] It's good enough. And so BGP, I wouldn't describe in superlative terms as the best ever routing protocol known to mankind. It's not. George Michaelson 59:55 It's the good enough protocol. Geoff Huston 59:56 It's the mediocre, good enough protocol that does its job tolerably well, and that we understand enough of that we haven't blown it up yet. And the good news: we're unlikely to blow it up tomorrow either. George Michaelson 1:00:08 If we don't change anything. Geoff Huston 1:00:10 Keep your hands off that button, George. Yes, don't change anything. George Michaelson 1:00:15 Geoff, that's been fascinating. Thank you. Geoff Huston 1:00:17 It's been a pleasure, George. And yes, I look forward to our next conversation. George Michaelson 1:00:23 If you've got a story or research to share here on PING, why not get in contact by email to ping@apnic.net or via the APNIC social media channels. Also remember the measurement@apnic.net mailing list on Orbit is there to discuss and share relevant collaborative opportunities, grants and funding opportunities, jobs and graduate placings, or to seek feedback from the community on your own measurement projects. Be sure to check out the APNIC website for all your resource and community needs. Until next time you