Geoff Huston 0:00 Well, that was always a thought, always in the design of packet-based networks, because once you drop the idea that the network is the arbiter, the rationer, the police of using the network, then how do you signal to a sender, "Whoa, buddy, that's unfair"? And so you've got to actually drop further back into going: I have this network full of senders and receivers, and some senders are receivers at the same time, you know, full of a bunch of people producing packets and consuming packets. What am I trying to achieve? Does everyone get 64 kilobits per second of clear channel? Of course not. You get what you need. But there's a couple of overriding tenets. Firstly, everyone should be efficient. What does efficiency mean? Don't leave the network idle if you have things to send. George Michaelson 1:03 You're listening to Ping, a podcast by APNIC discussing all things related to measuring the Internet. I'm your host, George Michaelson. This time I'm talking to Geoff Huston from APNIC Labs again in his regular monthly spot on Ping. Geoff and I talked about his recent trip to AUSNOG in Melbourne, and a talk he heard there about optimizing throughput and fair shares of a link in the modern fiber-optic-dominated high speed network of routers and switches. He's been looking at TCP flow control models for some time and recently testing BBR in a variety of ways, and the talk reminded him of an older IP-layer model of Explicit Congestion Notification, or ECN. ECN is baked into the IP header and returns signals inside TCP, but it isn't proving viable in the wild. The idea of sending back a signal "you're pushing too hard" isn't exactly new, but as Geoff has learned, the mechanism isn't working out as well as we might hope, and so protocol tweaks like BBR, which is now in its third iteration, have to try and do a better job of both bandwidth estimation and managing the competition for space in order to be fair. Geoff, welcome back to Ping. 
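As a concrete gloss on the ECN mechanism the intro refers to: ECN occupies two bits of the IP header, and a congested router that supports it can mark a packet instead of dropping it; the TCP receiver then echoes the mark back to the sender. A minimal sketch using the RFC 3168 codepoints (the router logic below is a toy of my own, not anything from the episode):

```python
# RFC 3168 ECN codepoints: the two low-order bits of the IP TOS /
# Traffic Class byte. The forward() router model is illustrative only.

NOT_ECT = 0b00  # sender does not support ECN
ECT_1   = 0b01  # ECN-Capable Transport
ECT_0   = 0b10  # ECN-Capable Transport
CE      = 0b11  # Congestion Experienced, set by a congested router

def forward(ecn_bits: int, queue_is_filling: bool) -> tuple[int, bool]:
    """Return (new_ecn_bits, dropped) for one packet at a congested hop."""
    if not queue_is_filling:
        return ecn_bits, False
    if ecn_bits in (ECT_0, ECT_1):
        # Mark instead of dropping: the TCP receiver will echo this
        # back to the sender via the ECE flag on its ACKs.
        return CE, False
    return ecn_bits, True       # non-ECN traffic still gets dropped

assert forward(ECT_0, True) == (CE, False)
assert forward(NOT_ECT, True) == (NOT_ECT, True)
```

In RFC 3168's design the receiver keeps setting ECE on its ACKs until the sender confirms it has slowed down with the CWR flag, and it is this end-to-end signaling loop, traversing middleboxes that may strip or mangle the bits, that has proved fragile in the wild.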
What should we talk about this time? Geoff Huston 2:28 Well, George, I've just come back from the annual Australian network operators group shindig down in Melbourne at the start of September, and it was a great couple of days. And you know, if anyone's cruising around Australia at that sort of time of the year and has nothing better to do for a couple of days, I'd certainly recommend going to drop in on AUSNOG. It is great. George Michaelson 2:49 Oh, you've got to be quick. Tickets go like that. [click] They've gone in a flash. Geoff Huston 2:54 They've been sold out for the last 18 years, which is amazing. Just sold out. Very popular. Anyway, yes. And I was talking about the evolution of transport protocols, and I kind of thought, well, you know, unlike other kinds of topics that one could talk about, like V6, why bother? Or RPKI, which offers about as much protection as an armor-plated suit made of wet lettuce leaves. You know, I thought transport protocols were a much tamer topic, without the controversy that these other ones breathe. George Michaelson 3:25 Oh, you say that, but I have a feeling it'll turn out not to be the case. But pray continue. Geoff Huston 3:31 Well, I spoke just after a gentleman from Cisco who was hawking a different kind of transport system based on, effectively, MPLS, source-based routing, bits of V6 all going by, oh, and RSVP, I think, came into it. It's sort of all the bits that are just weird, [George: right] And I'll tell you what, other than the names of the technologies, this was a talk from the 1970s. George Michaelson 3:59 And this idea is, in some air-quote sense, "better"? Geoff Huston 4:03 Well, it's also an idea that assumes, like speaking into a telephone handset (remember them?), the application is inelastic, dumb and invariant, and will not adapt to any conditions in the network. You know, I've yet to meet those applications. 
And secondly, then it's left to the network to effectively behave smoothly and consistently and not muck around with the explicit timings coming in from the application traffic. George Michaelson 4:36 Well, we've had an idea of smart edge, dumb core as a kind of concept in constructing networks for a long time. I wouldn't say it's baked into the IP protocol, but the idea was, you wouldn't have massive intelligence in the core of packet forwarding. You'd try to keep that to a minimum. Geoff Huston 4:53 Well, that was the whole thing. If you have computers mediating the application workflow, you don't need to blindly try and preserve absolute adherence to synchronicity. You know, it's not just a telephone. It's a telephone going through a computer. You quantize the voice signal, you chop it up into packets, you send it out as packets. But the receiver, also a computer, is not just a speaker. The signal firstly goes into an input queue, and the receiver reassembles, if you will, the analog signal out of the incoming packets in the local buffer and then reproduces a constant voice signal. You and I are having a fine conversation, and dear listeners, you're listening to this in real time, but there's no real time circuitry. There's no "we've carved out a channel, it's got absolutely zero jitter and high fidelity, it's exactly the same as telephony". Not at all. [George: No] You're just running packets. George Michaelson 5:53 No, this is really quite different to the longer held view from old times about the nature of communication. With the right kind of error correcting and the right amount of buffering, which, of course, implies delay, but nonetheless the right amount, it's possible to use packets that have no deterministic behavior to get remarkably fine outcomes. It can be done. Geoff Huston 6:17 Not only can it be done, it's the default of what the entire world does today. 
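The receiver-side trick Geoff describes, packets into an input queue and a steady voice signal out, is essentially a playout buffer. A toy sketch with made-up frame data (real jitter buffers, such as WebRTC's, are adaptive and also conceal losses):

```python
# Toy playout buffer: packets arrive out of order, the receiver holds
# them briefly and hands the frames to the speaker in sequence.
# Illustrative only, not an actual WebRTC jitter buffer.

def playout(arrivals):
    """arrivals: list of (seq, frame) pairs in network arrival order."""
    buffer = dict(arrivals)               # reassembly buffer, keyed by sequence
    next_seq = min(buffer)
    out = []
    while next_seq in buffer:             # drain in order; a gap would mean
        out.append(buffer.pop(next_seq))  # waiting, or concealing a loss
        next_seq += 1
    return out

# Packets 1..4 arrive jumbled; the listener still hears them in order.
assert playout([(2, "b"), (1, "a"), (4, "d"), (3, "c")]) == ["a", "b", "c", "d"]
```

The delay George mentions is exactly the time frames spend sitting in `buffer` before their turn comes: the price paid to turn non-deterministic packet arrivals back into a constant signal.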
George Michaelson 6:22 So presumably you were talking, in a sense, saying, well, all this complexity of MPLS and SRv6 and RSVP, there are other approaches. So you have an idea of a simpler approach to signaling behavior, Geoff Huston 6:37 reproducing an IP advocacy sort of line from the early 1980s. George Michaelson 6:42 Oh, you've come from the 70s to the 80s. Geoff Huston 6:45 I come from the 70s. Yeah, no, I mean, the issue was that you didn't need to put all the investment into the network to cope with dumb devices and dumb applications. If you have intelligence at the edge, you can actually manage a much simpler network based on datagram switching, where data can be lost, reordered, even replicated, and it's up to the ends to actually recreate the original stream. And as long as you have an adequate enough network, you can even recreate the implicit timing. In other words, voice and video works simply because you've got smart enough edges and enough capacity in the network. You don't have to jam the network full of expensive equipment, heaps of software, masses of complexity, and pay an enormous price to be able to talk through an IP system. George Michaelson 7:40 At one level, you've shifted where the cost is, because the fact remains, you have to do work. Work has to be done. But you're saying we have the computers anyway. We want the handsets to be smart. They have to have the processing capacity. We want the ability to do the work, and it's already implicitly available in edge devices. How about we use that to achieve an outcome? And the second thing you said is enough bandwidth. So it isn't as simple as "we can fit this down a thin piece of string". You do have to say, why are we constraining the world to be thin bits of string? How about we treat the world as having a lot more bandwidth than we used to think was the limit, which, by the way, is pretty straightforward to do now, isn't it? 
Geoff Huston 8:25 Well, that was the second part of the story. So the first part is, you don't need to stuff the network full of extremely expensive technology to cope with dumb applications. My phone, my smart device, wasn't doing anything until I started talking. It's not as if it's lacking processing capability. The marginal cost of having it do some work in terms of adaptation of transport is zero, whereas in the network, there's lots of things going on. You know, it's actually expensive to do this. And so the first thing is, you're spending the wrong money in the wrong place. But there's a different thing that's also happened, and I blame Moore's Law for many things. George Michaelson 9:03 You've brought Moore's Law to the table. Geoff Huston 9:06 Well, we live in a digital world. George Michaelson 9:08 I do actually think it's applicable. I think there's a quality behind things naturally improving in their computational capacity or size at that pace that has fundamentally altered how we think about the problem space, so I don't think you're wrong to do it. Geoff Huston 9:23 At great expense, the Australian nation had built a new fiber-based undersea cable connecting Australia to basically the US West Coast, PacRim East and PacRim West. And it was a massive cable. It supported 500 megabits per second of capacity on each cable, and it cost a billion dollars, and it was very expensive, and the network had to ration people's use. And this wasn't just one cable system, it was the entire underlying communication system, which was based on: it's a scarce and expensive resource. We, the network operators, are the gatekeepers of that resource. We will perform rationing, and we will operate this system at a price premium to make sure we're not overwhelmed. George Michaelson 10:13 Pricing was deliberately used as a mechanistic approach, way above the individual network conversation level, to put constraint on behavior. 
We're not talking about price coming into the packet; they didn't have dollar signs literally baked in. We're talking about the consequence of your behavior when a bill arrives being back pressure for the next cycle of using the resource. Geoff Huston 10:37 Right. It's what we call in economics a scarcity premium. And the operators of that system, the rationers, were quite happy to basically say, this costs a whole lot more to you than it did to us to make it, because we haven't got enough, we have to ration it out. And quite frankly, cost price is the way we're doing that distribution function. Now, over the years, Moore's Law has turned our 500 megabit per second subsea cable into what, 20 terabits? That's 20,000 thousand million bits per second. George Michaelson 11:12 Yes, 40,000 times faster than what we had. Geoff Huston 11:17 Yes. And it costs the same. In fact, it cost half the amount, about 500 million to do this, rather than a billion dollars, even in just straight dollar terms. George Michaelson 11:25 Strangely enough, the marine fuel oil that powers the ship to deliver that cable and dig the trench has not increased in cost 40,000 times. So the actual cost in terms of doing the ship side hasn't changed. Geoff Huston 11:43 Right? The comms world has magically shifted from scarcity to massive abundance, and this whole idea that I need to arm my network with hugely expensive cutting edge technology, SRv6, RSVP-TE, blah, blah, blah, is basically, I think, a con by the co-conspirators of the carriage industry, the vendors, to try and extract some money from an otherwise almost bankrupt industry that has little residual value. Comms is now so abundant you could, look, think of it as sewerage. It's a mere undistinguished commodity. Everything happens at the upper level. And that's just the simple sort of transformation that I would put down to Moore's Law and better digital signal processing. And they improve every year. 
We can jam more data down the same piece of glass, [George: yeah] And it's as simple as that, and we keep on getting better. George Michaelson 12:43 I do like this part of the story, and I think it's an important part of the story, but there's a part of me that's also saying, now hang on a minute, Geoff, let's go back and talk about some other things that you've brought to the table. The nature of communications for you and me to be talking now has subtly shifted. You are getting packets that are caused because I ultimately send packets into the network. Are you getting the same packets I put in? No, not necessarily. This is probably a mediated conversation. The intermediaries, the centrality thing that we've talked about, means that an awful lot of these packet flows are now heading over very short distances from where you and I are to do their primary task, and that's introduced a subtle shift here: the cost component of making all this stuff work isn't the congestion in the fiber. The cost congestion here might be switching bandwidth in a DC, or it might be the ground cost of building out that DC. There are still costs in this network. Geoff Huston 13:47 The voice and video that you and I are having right now, and indeed, you know, done in real time, is kind of different to the voice and possibly the video that happens in a replay. And I think we can distinguish between the two. This recording will go out onto the interwebs. You know, here's the podcast from Geoff and George having a talk, and it will get loaded into not one data center, but many. Why? Abundance. It's cheaper to do it that way than to broadcast it, or at least house it in one spot and get the world to suck from just there, [George: right] So we send it out everywhere. And quite frankly, listening in non real time is a different kind of problem. But you and I talking [George: right] And exchanging video is not a mediated thing. 
And in fact, on my browser is WebRTC, the real time communications toolkit, and on your browser is yours, [George: yes] and they are sending packets to each other. We relied on a mediator to set it up. But after that, the packets are the packets. They just go end to end, [George: right] The issue is we can tolerate a certain amount of network distortion of that packet stream and still have "this is Geoff, he looks the same and his voice sounds the same", without going, oh, blub, blub, blub, and dropping down into garble. And that next trick of making sure the network doesn't distort the implicit signaling in those packets and their timing is the next job: congestion avoidance. George Michaelson 15:21 We've got people who are in a marketing space who are looking to deploy complex solutions, RSVP, the TE models, SRv6, things like that, in order to make this problem go away in a fashion that suits them for their needs of sold goods and services. And you're heading towards saying, if we could find ways to do simpler methods of signaling "you need to modify your behavior", and if we could exploit all of the intelligence built into your system and my system, we might be able to get to a happy medium for a lot less work. That seems to be where you're going. Geoff Huston 16:00 Right. The load that my device at this end puts on the network is elastic. It can repair damage to some extent. It can adjust to the conditions of the network. It's not intolerant. So I don't need a network that carves me out an assured path. It's a bit like the road system. When I travel to the shops, I do not have a lane reserved for Geoff: clear channel, let's go to the shops, no one else intruding on my lane. You know, it doesn't work like that. And we can all laugh at that, going, well, that's ridiculous, but in the networking world, that's the goo that they're trying to sell. It's the same goo. 
George Michaelson 16:42 You and I have traveled to economies in Southeast Asia that we don't have to name, where we have stood by the side of the road and watched the local police doing exactly that, creating a clear channel for very important people to drive down that road. Geoff Huston 16:58 The problem with analogies is they often fall apart. George Michaelson 17:01 Yes, specifically, Geoff, although you are paid a lot of money, you're not paid enough to get that service. So for the context we're talking in, that doesn't happen. [Geoff: right] Baked into the IP stack that you and I are living in was ICMP. And baked into ICMP was the idea you could use an ICMP message to say, whoa, buddy, buddy, back down. Could you send less? So haven't we always had this? Geoff Huston 17:30 Well, that was always a thought, always in the design of packet-based networks, because once you drop the idea that the network is the arbiter, the rationer, the police of using the network, then how do you signal to a sender, "Whoa, buddy, that's unfair"? And so you've got to actually drop further back into going: I have this network full of senders and receivers, and some senders are receivers at the same time, you know, full of a bunch of people producing packets and consuming packets. What am I trying to achieve? Does everyone get 64 kilobits per second of clear channel? Of course not. You get what you need. But there's a couple of overriding tenets. Firstly, everyone should be efficient. What does efficiency mean? Don't leave the network idle if you have things to send. George Michaelson 18:26 Aggregate across all the people in the system, you're seeking to have as few moments where the network isn't being used as possible. I mean, obviously, if we are humans and we're asleep and there are no machines doing their job, Geoff Huston 18:39 If there's nothing to say, idleness is fine. 
George Michaelson 18:42 If there's more things to say than capacity in the net, leaving space and time on the net not used is inefficient. Geoff Huston 18:50 Inefficient. So if there's space in the network and there are folk wanting to send, you've failed in the job. So efficiency is, you know, leave no idle spaces; if there is demand, fill it. [George: Yeah] that's the first job. The second job is actually somewhat odder, and it's everyone should be fair to everyone else. Wow. So I don't need a network to police behavior? No, you don't. But what you do need is, if you imagine that every packet has elbows, George Michaelson 19:22 sharp elbows Geoff Huston 19:24 The amount of pressure it puts on all the other packets should be the same as all the others, [George: right] So, in other words, in a system where all these packets are jostling, oddly enough, that system where every packet exerts equal pressure creates a rough sharing, [George: yeah] if you had 100 packets all competing and each exerts the same pressure as everyone else, they'll actually equilibrate and give themselves 1/100 of the common resource. Fairness. George Michaelson 19:51 If we just briefly step sideways, there have been things that have developed over time since the birth of the Internet and the birth of Ethernet, where people have said, I'm going to take a different path. So for instance, aircraft digital communications: inside the aircraft, the devices that do that are basically Ethernet switches, but they're running a version of code that says, for the purposes we have, we're going to convert this to a time division multiplex network, and: you, port number six, you have exactly these time slots to fill. And if you've got nothing to say, no one else is going to talk in that slot, because I am guaranteeing you that slot. 
But they've done that in a highly constrained world where it's wings flapping and lights flashing Geoff Huston 20:38 And keeping the machine in the air, George, George Michaelson 20:41 Keeping the machine in the air, and they've made a rationing decision that is appropriate for their needs. We're not talking about that world. This world we're in, you and I are in, is a world of totally unrelated, almost random communications. There's no rationing scheme that's been designed to guarantee that behavior. We're doing a different way of controlling use of this space with our elbows, Geoff Huston 21:06 With our elbows, because if you seriously, seriously have the money and the need to guarantee availability at all times, then the real answer is, roll your own fiber. Stop using a shared, common medium, because we can't reserve in a shared, common medium and still get efficiency. We can't. [George: Yeah] So we're talking about the public network. We're talking about the great unwashed masses, you and I, and we're talking about sharing a common resource, [George: yeah], and the objectives then are efficiency and fairness, because, you know, that's the world we've chosen to live in. George Michaelson 21:44 Yep. So I thought the elbows were ICMP. [Geoff: no] I thought ICMP, baked in, was the way I got to prod with my elbows and say, don't do that. You're saying there's a better way. Geoff Huston 21:56 Well, it was called Source Quench, and it didn't take long for folk to realize that Source Quench is not validated, George Michaelson 22:04 Not validated. Geoff Huston 22:06 I'm a sender, and I'm just sending packets into the network, and you, George, take exception to my profligate behavior, so you manufacture a synthetic ICMP Source Quench and send it to me. And when I receive this Source Quench, I have no idea that it's just George playing games with me. I think it's from a router on the path, and I'm being too profligate, so I react like I've been slapped in the face with a wet fish. 
I stop sending, I back off. But it wasn't a real signal. It was just some people trying to shut me up or otherwise playing. George Michaelson 22:43 So we need something that is simple and that's in the control mainly of people at the ends of this path, but maybe along the path a little; I think I could imagine that's useful. But it's got to be valid, and it's got to be a testable proposition, and it probably needs a modicum of signaling that, you know, you said it, and you know they got it. Geoff Huston 23:04 Well, the first reaction to this, now we've jumped from IP datagrams into the world of TCP, the Transmission Control Protocol, [George: right] And this is the kind of special protocol, I think it's the intellectual heart of the entire Internet fabric, because it's what we call a classic sliding window protocol. I send you data, but I keep a copy. And every time you receive a packet of data, you send an acknowledgement, a simple, smaller packet, same protocol number, that says: I received up to byte 450 of your stream, Geoff. I can then look at my window and go, well, okay, everything up to byte 450 has been received by George. I can delete that data from my sliding window, and then I can open the window up to compensate for the data that's gone and send you some more data. [George: Yeah] so it's a pacing protocol. Every time I get an ACK back from you that says you received data, that tells me, oddly enough, a packet has left the network. I go, a-ha, there's now room on the network for another packet. So you can think of this as a gigantic conveyor belt. [George: Yes] every time I get an ACK frame, I send a data frame back in, and the belt has precisely the same amount of data continuously. Data goes forward, ACKs come back, steady state. George Michaelson 24:29 Right. And if we look at this, both you and I have some degree of influence over the rate of play. 
If I choose not to tell you I've got things, you're not going to send when your buffer fills up, Geoff Huston 24:40 And I'm going to stop. George Michaelson 24:42 And if I have told you you're free to send things, you're also free not to send them. But you can, in principle, control the rate, because you can choose not to send. We both have a role. Geoff Huston 24:54 Once you tell me you've got it, that's it. I've forgotten anything I did. It's over. But yes, the question of how fast should I put things onto the network is not given here. We don't know, [George: yeah] So if you look then deeper into the network, you find there are two things. There's transmission, which is indeed a pure conveyor belt, [George: yeah] I put a packet in, it goes to the other end, speed of light, speed of, you know, signal through copper or whatever, and pops out the other end, totally deterministic. It might get corrupted, bit errors, radio, whatever, but on the whole, packet in, packet out. But the rate of sending is not necessarily equal to the capacity of the next circuit. So I need to adapt. And the way we do this is every single link has a buffer, a queue, so that when a packet gets to a switch and the switch says, well, dear packet, you need to go out interface three, but it's busy right now coping with another packet, you're gonna have to wait until I've cleared that packet before sending you. So it pops it in a queue. It's human behavior. You know, you want to get onto this road in your car. The road is full. Wait for your turn, buddy. It's exactly the same principle. So the network is full of queues and links. [George: Yeah] Now imagine what happens when I send a lot of data, a lot. It overwhelms the link, and then the queues start to fill, because I'm still sending, and then the queue gets full, [George: right] full, full. What do you do with the next packet? George Michaelson 26:28 You've got to throw it away, Geoff. There's nothing else to do. Geoff Huston 26:31 It's IP underneath all this. 
Every packet is an adventure. Throwing away a packet is perfectly fine. Perfectly fine. George Michaelson 26:39 You and me sending signals: I got that. And we've got queues coming in between, which means along the path between you and me, people can stack up a little pile of them and let go. And we've now arrived at: if you're one of the people stacking up a pile of packets and your stack is full, what are your options? And your options are throw something away. Geoff Huston 27:02 You've got no other choice, so it doesn't actually require a source quench or anything. You just throw it away, [George: right] That's perfectly fine, but let's think about the poor old receiver. I receive packet three, ACK. I receive packet four, ACK. I receive packet six. George Michaelson 27:18 Hang on. Geoff Huston 27:19 Oh, hang on, hang on, where's five? I'm going to wait almost no time, it might be that packets have been reordered, but pretty soon I'm going to say, hey, Geoff, I'm not going to ACK packet six, [George: yeah], because I'm missing five. So tell you what, bud, I'm going to do a duplicate acknowledgement of four, saying I've got up to four, but you cannot get rid of five because, you know, I haven't got it, [George: yeah] now that duplicate ACK is an unambiguous signal of loss, and it tells me two things that are interesting. One, I need to resend packet five. But the next one is a bit of a leap, but it's true: I'm going to assume I was sending too fast. George Michaelson 28:03 So it actually requires two behaviors in you. You said with the window, when you've received an ACK, the past is gone and you forget about it. You've just slightly modified that: the past is gone, but remember the last thing you had ACKed, so that when you open your window and send stuff, if you see a repeat ACK against the last thing you sent, you've been told loss has taken place. You have to remember a little bit more. You can't forget all of the past. 
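The receiver logic just described, cumulatively ACK whatever arrives in order and repeat the last ACK when a gap appears, fits in a few lines. A sketch where packet numbers stand in for TCP's byte sequence numbers (an illustration, not a TCP stack):

```python
# Cumulative ACKs and duplicate ACKs as a loss signal.

def acks_for(arrivals, already_acked=0):
    """Yield the cumulative ACK the receiver sends for each arriving packet."""
    highest = already_acked       # highest packet received with no gaps before it
    received = set()
    acks = []
    for pkt in arrivals:
        received.add(pkt)
        while highest + 1 in received:  # extend the in-order run
            highest += 1
        acks.append(highest)            # repeats (a duplicate ACK) if a gap remains
    return acks

# Geoff's example: packets 3 and 4 arrive, 5 is lost, 6 arrives.
# The ACK for 6's arrival re-ACKs 4: the sender's unambiguous loss signal.
assert acks_for([3, 4, 6], already_acked=2) == [3, 4, 4]
```

Once the retransmitted packet 5 does arrive, the cumulative ACK jumps straight past it to 6, which is why the sender can forget everything below the ACKed point again.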
Geoff Huston 28:32 Well, the number is actually a sequence number, and it's the number of the last acknowledged byte. And so the calculation is really easy. The first packet in the queue has a byte number. If the ACK value is less than that number, it's a duplicate ACK, [George: right] So I'm not dealing with packet numbers, 1, 2, 3, 4, 5. Oddly enough, TCP is a streaming protocol, and you actually number the bytes, which is its own problem, and it's all about high speed, and let's not go there because, you know, we haven't got time today. But you know, this protocol is kind of magic insofar as, if I interpret duplicate ACKs as I'm going too fast, [George: yeah] then I need to stop sending to let the queues drain. Oh, good. How long should I stop sending for? Oh, I don't know, as long as the queue is big. Oh, now there's a clue. What if the queue held the same number of packets as the link that it drives? So if the link could contain three packets, let's make the queue contain just three packets, George Michaelson 29:39 right? But that's magic knowledge, Geoff, Geoff Huston 29:41 No, no, this is the design of networking. This is the bit that fills in the gap of what do you do when you get a duplicate ACK: we build the networks such that every link buffer is equal, in number of packets, to the size of the link that it's driving. George Michaelson 30:01 So every party at either end of a point to point link makes its buffer equal to the number of packets that can legally exist in flight between those points. Then by definition, if something goes wrong and you have to answer the question, how big is the queue? You know, Geoff Huston 30:25 I know. It's the same as the amount of data I can send in one round trip time interval, the time it takes to send a packet to you and an ACK back. Wow. 
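The dimensioning rule here, a queue that holds the same amount of data as the link carries across one round trip, is the bandwidth-delay product. A sketch of the arithmetic, with hypothetical link numbers:

```python
# Bandwidth-delay product: the classic rule sizes the queue to hold the
# same amount of data as the link carries in one round trip.
# Example figures below are hypothetical, chosen only to show the sums.

def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bytes in flight on a link of the given bit rate over one RTT."""
    return bandwidth_bps * rtt_ms // (8 * 1000)   # bits/s x ms -> bytes

# A 10 Gb/s path with a 40 ms round trip holds 50 MB in flight,
# so the classic rule sizes its buffer at 50 MB too.
assert bdp_bytes(10_000_000_000, 40) == 50_000_000
```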
George Michaelson 30:37 So am I right in thinking that this actually, really, really, is the laws of physics coming up, Geoff? That if we've got a length of fiber, Geoff Huston 30:45 no, no, no, it's not. George Michaelson 30:46 If we've got a length of fiber between you and me that's a foot long, we can store less data in it than if that is 1000 kilometers long. Geoff Huston 30:55 Yes, of course, that's physics. Why I said no will get back to buffer dimensioning in a second. But the whole issue is, I now know how long to wait. I wait for that so-called round trip time, because every time I send you a packet and I get back an ACK, I've got my good clock. We've all got clocks. Tick, tick, tick, tick, tick: you're 22 milliseconds away. So when I get a loss signal, I know how long I should back off for to let the queues drain. [George: Yeah]. Because if I don't send any more, then basically what's in the queues will flow outwards to you. And if I resume at the end of one round trip time, guess what, you will receive a constant flow without interruption. You will see no idle time. George Michaelson 31:44 So you know two things. You know the byte number last ACKed, and you know the round trip time between those points, and you pace your behavior so that if you ever get a loss, you pause for the round trip time and then carry on Geoff Huston 31:59 at a lower speed. George Michaelson 32:01 Ah, three things, Geoff Huston 32:02 because you just built the queue, so obviously you were sending too much, buddy. So I do two things. I wait to let the queues drain. And now the next trick: I'm going to send at half the rate. Why half? Because the queue size is where the other half of the packets lived. And so it magically forms up that if you just think of this as one connection through a multi hop series of links with queues, then this just works. I keep on going up to effectively twice the capacity of the path. I fill up the queues. I get a drop. 
I wait for a round trip time. Queues go away. I start sending at half the rate. The next packet will fly straight at you because the queues have been drained. Oddly enough, efficiency: 100%. George Michaelson 32:53 Well, it sounds like there are periods when you're not using the link, and you have to assume someone else fills that hole. Geoff Huston 33:02 Oh, no, no, I haven't talked about, you know, "Hell is other people", as Jean-Paul Sartre said. I haven't talked about other people yet. It's just you and I on an otherwise idle network. And this is the theory of congestion flow control. [George: right] Now we introduce the hell of other people. And this makes life fascinating, because I'm not the only person contributing to that queue. What should I do? There was a theory, and it wasn't a very strong one, but it was a theory that with multiple senders, each with different round trip times, all going through the same link, if you all use the same algorithm, two things would happen. Firstly, the queuing behavior would be most influenced by the biggest of the senders. Secondly, if everyone used the same flow algorithms, you would each get 1/Nth of the share of that link. George Michaelson 34:01 But those two statements are in conflict, because if it's most influenced by the biggest, [Geoff: yeah], how can there even be a biggest when you've come down to having 1/N? Doesn't that imply that some amount of the link isn't being used? Geoff Huston 34:16 It wasn't the best of theories, right? And after about 30 years of staring at networks and queues, we're kind of coming around to a new theory, [George: right] And it's a fascinating theory because it's actually helpful. Don't forget, the faster networks get, the more expensive the memory in that queue is. If I'm driving a terabit per second link, I need to pull a terabit per second out of that memory bank. 
What manufacturer makes memory that can deliver a sustained terabit of data? George Michaelson 34:50 That's not cheap memory. Geoff Huston 34:51 Trick question: no, it's not even expensive memory. It does not exist. George Michaelson 34:56 So we're no longer bound by the speed of the link. We're bound by the speed of getting things into and out of the buffer. Geoff Huston 35:03 Maybe it's the wrong theory. You see, that buffer memory is the most expensive memory in the router. It's the most expensive memory you can buy. We use every parallelism trick in the book to try and make it scale up in speed, but it's really, really difficult, because speed has not changed. Moore's law is not about doubling the speed, it's doubling the chip density. Clock speeds have been static for years. But all of a sudden, these folks staring at queues came up with a fascinating answer, and it says, you know, it doesn't work like that, dude. The buffer size is not just the bandwidth times the delay. The buffer size is the bandwidth times the delay divided by the square root of the number of sessions. George Michaelson 35:47 I hate it when people bring square roots to the table. Geoff Huston 35:51 I know. Let's say a one terabit per second circuit, and it has 100 milliseconds of delay. Simple stuff: 12.5 gigabytes of buffer, right? [George: Yeah], but let's say that same terabit circuit carries 100,000 flows. I don't need 12.5 gigs of memory. Do you know how much I need? 40 megs. George Michaelson 36:12 Because you divide the initial ... Geoff Huston 36:15 by the square root of 100,000. George Michaelson 36:18 If you have more people involved, the amount of buffer goes down. Geoff Huston 36:22 Savagely goes down. It's kind of what saved our bacon.
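The arithmetic Geoff runs through here can be checked directly. This is a sketch of the classic bandwidth-delay product rule and the square-root refinement he describes; the function name is mine, not from any vendor's tooling:

```python
import math

def buffer_bytes(link_bps: float, rtt_seconds: float, n_flows: int = 1) -> float:
    """Buffer needed at the bottleneck: bandwidth times delay,
    divided by the square root of the number of concurrent flows."""
    bits = link_bps * rtt_seconds / math.sqrt(n_flows)
    return bits / 8  # convert bits to bytes

# A 1 Tbps link with 100 ms of delay and a single flow:
print(buffer_bytes(1e12, 0.1))           # about 12.5 GB, as in the conversation
# The same link carrying 100,000 flows:
print(buffer_bytes(1e12, 0.1, 100_000))  # roughly 40 MB
```

Dividing by the square root of 100,000 (about 316) is what collapses 12.5 gigabytes down to the 40 megabytes Geoff quotes.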
George Michaelson 36:27 So I'm kind of intuiting that what happens as a consequence is the buffer fills up, and more people get a signal, you just had a loss event, and more people have to go through some kind of backing down, and therefore more people balance their traffic out to be at that mythical 1/N, and at that point, everybody's pacing to match the buffer. This is a result that is experimental, not analytical. It's what we've observed. It's been seen. Geoff Huston 37:00 It's been seen, and seen a lot, and it's kind of, in some ways, intuitively obvious, particularly with that bufferbloat topic that circulated for years: that networks were continuing to over-dimension buffers. Why? Because I can sell customers really expensive routers with a huge amount of high speed memory that is actually making life worse, not better. George Michaelson 37:21 Well, it looks great in the showroom when there's only you and the other end. It's when you deploy it in real life with hundreds of thousands of users that you start to find out it didn't behave the way you thought. Geoff Huston 37:32 So let's take that 40 megabytes. How much memory can I put in an ASIC that also does switching? So I've now got a chip that's divided into two bits: let's say 40% of the chip real estate I'll use for my switching fabric, the cross switching, right? [George: Yeah] And I've got 60% of the real estate that I can use for memory. [George: Yeah] Now, in general, I can do around about 40 to 60 megs of memory on a single chip. George Michaelson 38:01 This is special, incredibly high speed memory, because with ordinary memory, the densities have reached the point that you could fit gigabytes, terabytes on the chip. Geoff Huston 38:11 No, I don't want that. I want really, really high speed. So I've now got a terabit per second switch with 40 megs of buffer. And as long as I'm in the middle of a really busy setup with a huge number of flows, it will just work.
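The loss-driven behaviour described earlier, fill the queue until a drop, pause a round trip, resume at half rate, is TCP's classic additive-increase, multiplicative-decrease loop. A toy sketch, where the link capacity, queue limit, and single-queue model are all invented purely for illustration:

```python
# Toy AIMD sender against a single bottleneck queue.
# Capacities and the queue model are invented for illustration only.
LINK_CAPACITY = 100   # packets the link can forward per RTT
QUEUE_LIMIT = 100     # packets of buffer at the bottleneck

def simulate(rtts: int):
    cwnd = 1.0        # congestion window, in packets sent per RTT
    queue = 0.0
    history = []
    for _ in range(rtts):
        # Whatever exceeds link capacity this RTT sits in the queue.
        queue = max(0.0, queue + cwnd - LINK_CAPACITY)
        if queue > QUEUE_LIMIT:
            # Drop: halve the rate and let the queue drain for an RTT.
            queue = 0.0
            cwnd = cwnd / 2
        else:
            cwnd += 1   # additive increase: one more packet per RTT
        history.append(cwnd)
    return history

h = simulate(1000)
# The window saws between the drop point and half of it,
# the "concertina" of faster, slower, faster, slower.
print(max(h), min(h[500:]))
```

Each sender runs this sawtooth independently; the surprising aggregate result Geoff describes is that with many such flows desynchronized across one link, the buffer needed shrinks with the square root of their number.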
George Michaelson 38:31 You need the high number of flows? Geoff Huston 38:34 So let's say you're Amazon and you're designing your switch fabric (this was Lincoln Dale's presentation at AUSNOG this year). And let's say that you didn't want to buy vendor equipment, because it contains 60 million lines of code that you're never going to use. Let's say you want to trim it down, automate the hell out of it, and just switch. [George: right] Well, you're inevitably led to: I'm going to do this on one chip, I'm going to use about 64 megs of memory in the chip buffer, and all I'm going to do is switch. Wow, this works. George Michaelson 39:05 Get rid of all that smarts, all that extra technology, all those complex additional features. Just do packets in, packets out, drop when the buffer is full. Geoff Huston 39:14 They're not the only ones who have reached this conclusion, but you know, you can see now how we scale, how we actually make these enormous systems work. And it's kind of the initial theory, you know, this concertina-like behavior of TCP, faster, slower, faster, slower. And then if you put them all together, the amount of buffer demand starts to drop, which actually means you can squeeze a huge number of sessions onto the one switching fabric. And oddly enough, the more sessions you can squeeze, the lower the aggregate memory demand. Wow, this is one case where practice actually runs in your favor. It's a stunning answer, isn't it? George Michaelson 39:54 So that's very unexpected, Geoff, because I thought we'd be heading to a middle ground where you're saying these complex protocols like RSVP don't do the trick, but if we do just enough additional signaling, we can get the benefit. But you've actually said just stick with straight TCP, change your expectations of behavior, and remove all the complexity from the switching components. It turns out things work better than you thought. Geoff Huston: I'm going to rephrase an old adage.
I think it's from Andrew Odlyzko, but I think I was guilty of saying the same thing as well: there is no problem that more bandwidth doesn't fix. You just add more bandwidth. It's cheaper and simpler. If the real problem was I don't have enough network to carry the traffic, stop trying to ration, stop trying to condition, stop trying to color the packets red, green and purple. You're wasting your time. Bring up another color, add bandwidth, because you can do that. Everything else works in your favor, because then you can get back to a very simple task. And what about TCP? Well, fascinating. Do I actually need to rely on loss, or can I try and sense the onset of queuing? That is where we are at the moment in research. [George: right] And it's kind of, well, loss is a bit catastrophic. It's like driving your car faster and faster and faster until you have a catastrophic smash. Oops. What about if I keep on driving faster until I nudge the car in front of me? No damage yet. No damage. Just nudge, and then back off. Dear listener, please do not drive like this in any economy, anywhere on the planet. George Michaelson 41:35 When Geoff is on the road, stay clear. Geoff Huston: Yes, same thing. But you know, you think about what's going on there. If I can just do this when queuing starts, not when it collapses, then, oddly enough, the winner is not the queue. The winner is the TCP session, because I waste no time in repairing packet loss. George Michaelson 41:56 Right. If you can target knowing when you're using the appropriate amount of available buffer, then you don't incur the cost of a loss to be there. But that does beg the question: couldn't we have the simplest, most lightweight of signals to flip down the line, saying you're getting close? Geoff Huston 42:14 And this was an idea from the early 1980s, from Digital Equipment. It was called Explicit Congestion Notification. George Michaelson 42:23 ECN. Geoff Huston 42:24 ECN, an otherwise unused bit in the IP header.
And when you get a packet and try to put it on the queue, if the queue is looking pretty full, just mark the packet. And what happens at the receiver of that marked packet? Because you're running TCP, there's a backward flow of ACKs. Take that signal, there was congestion on the inward path, and send it back as an I-saw-a-congestion-signal in the TCP ACK. George Michaelson 42:52 Wait, wait. You just... if I can use a horrid legalism, didn't you just do a layer violation? You put a signal in an IP header and you acknowledged it in a TCP transaction. Geoff Huston 43:08 Yeah, George Michaelson 43:09 that's weird. Geoff Huston 43:10 Yeah, I did. Well, it was kind of, if you want to know when queuing is starting, why not ask the router where the queue was starting? In fact, why not get the router to tell you? George Michaelson 43:21 I kind of like this, Geoff, thinking about this, because you said earlier ICMP couldn't cut it, because anyone along the path could stuff one in. But you've just inserted into this where you get the IP signal, you're going too fast, you get it before you've incurred a loss consequence, and you signal back in TCP: somebody said something, there's a bit of acknowledgement that this event took place. Is this harder to forge? Geoff Huston 43:47 Well, actually, it is hard to forge, and it's been around for years. It was an experimental RFC, then a standards track RFC, and a few folk have shown a lot of interest in it. Apple and Comcast have started a thing they call L4S: low latency, low loss, scalable throughput. [George: right] Problem is, it's a specialized solution for a specialized environment. The world doesn't do this. At APNIC, we started measuring ECN, and you know, 2 to 3% of users sit behind equipment that does ECN. It's not there.
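The layer interplay George calls out is real: the congestion-experienced mark lives in two bits of the IP header, and the echo travels back in TCP flags. A sketch of the RFC 3168 codepoints and the marking decision, where the single-threshold router policy is invented for illustration (real AQMs differ, which is part of the choice-point problem Geoff raises later):

```python
# RFC 3168 ECN codepoints: the low two bits of the IP TOS / Traffic Class byte.
NOT_ECT = 0b00  # sender does not support ECN
ECT_1   = 0b01  # ECN-Capable Transport
ECT_0   = 0b10  # ECN-Capable Transport
CE      = 0b11  # Congestion Experienced, set by a router

def router_mark(ecn_bits: int, queue_depth: int, threshold: int) -> int:
    """Instead of dropping, an ECN-aware router remarks an ECN-capable
    packet to CE when its queue is looking full. The threshold policy
    here is invented; when and where to mark is exactly the choice
    point that never got standardized behaviour in the wild."""
    if queue_depth > threshold and ecn_bits in (ECT_0, ECT_1):
        return CE
    return ecn_bits

def receiver_echo(ecn_bits: int) -> bool:
    """A TCP receiver that sees CE in the IP header sets the ECE flag
    in its ACKs, telling the sender to slow down as if a loss occurred:
    the layer violation George is pointing at."""
    return ecn_bits == CE

marked = router_mark(ECT_0, queue_depth=90, threshold=50)
print(marked == CE, receiver_echo(marked))  # True True
```

Note that a NOT_ECT packet can never be remarked; a congested router has no option but to drop it, which is why ECN only helps the flows that negotiate it end to end.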
George Michaelson 44:27 So if I go to another story we've covered on PING: other protocols have observed the behavior that when you include fields in headers of packets to say we think this might be useful later, there's a thing that's been called greasing. You need to periodically change the value you send, because otherwise intermediate systems go, that field is zero; if it's not zero, I'm throwing the packet away. Geoff Huston 44:53 Or I'm making the field zero. And ECN suffers from that: your marked IP packet gets reset. George Michaelson 44:58 Because we didn't do greasing down in the IP layer when we said how to do IP, these bit fields can't reliably be sent in the public net. But if we're talking me and my fabric, I can run systems that do this. And if we're talking me to a CDN, they could run systems that do this. Geoff Huston 45:17 If you're in a closed shop, go for it. If you're out on the big public Internet, it doesn't work. So how else do you do it? Well, I get back to BBR, the bottleneck bandwidth protocol, which is fascinating. How do I know when queues are forming? Ah, I don't get out a little hammer. I get out a big mallet, and for one round trip time I hit the network with my mallet. What do you mean? I send 25% more packets for one RTT interval. George Michaelson 45:48 A lot more than you think you really should. Geoff Huston 45:51 25% more than I was sending. [George: Yeah] And it's kind of, did you feel that, network? How do I know if you felt it? I look at the delay. Because if those 25% extra packets ended up in the queue, then for that round trip time, the measured round trip time will get a lot longer. And if that happens, I know I was already at the sweet spot. I was already at the point where queues were starting to form, because in the one interval when I dramatically went overboard, I saw an extension of the time to get the packets through, because queuing was happening.
And don't forget, in the next round trip time, send 25% less to drain the damage you just did. George Michaelson 46:37 So you know that you were close to the sweet spot? It might have been a little ahead of you, but you don't think it's radically ahead of you. Geoff Huston 46:45 Whereas, if there was no change in the round trip time, I can go 25% faster immediately. It's a very aggressive protocol. George Michaelson 46:55 It sounds like it has a risk of being greedy against other people's behavior. Geoff Huston 46:59 Again, we get back to the two things I was talking about at the start of this: efficiency and fairness. George Michaelson 47:05 Fairness. Geoff Huston 47:06 BBR does this trick once every eight round trip times. So if there's idle capacity, it won't be idle for much longer. At some point the mallet will come out, within eight round trip times, and you know, BBR will hit the network, go, can you take more traffic? So it is extremely efficient in finding the ceiling. Is it fair? Ooh. When you've got other BBR sessions, it's kind of fair, but it's not very fair. One session might end up winning, the other one losing, and you can't tell. Is it fair to all the other things that are still out there doing loss-based congestion control? George Michaelson 47:48 No. Geoff Huston 47:48 Oh, yes, exactly. It kind of overwhelms them, because it's more aggressive. [George: right] Your elbows are longer, a lot sharper, and they reach further. You tend to put undue pressure on everyone else. And the designers of BBR have gone through BBR version 2, which was kind of too wimpy. It sort of cut off the aggressive parts, and in the end it deferred all the bandwidth to the loss-based systems, and didn't end up with much favor. George Michaelson 48:17 So again, this is experimentally driven, [Geoff: Oh, yeah] and we tune this, right? BBR's hunting for a magic point.
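The probing Geoff describes is a gain cycle: one round trip in eight at 25% above the estimated bottleneck rate, the next 25% below to drain what that queued, and a look at whether the measured round trip time stretched. A toy sketch of that decision logic, where the gain values and eight-RTT cycle come from the conversation, but the function shapes and the 5% stretch tolerance are my inventions, not Google's BBR implementation:

```python
PROBE_UP = 1.25    # one RTT at 25% above the estimated bandwidth
DRAIN    = 0.75    # the next RTT at 25% below, to drain what we queued
CRUISE   = 1.0
CYCLE = [PROBE_UP, DRAIN] + [CRUISE] * 6   # an eight-round-trip cycle

def next_rate(est_bw: float, rtt_index: int) -> float:
    """Pacing rate for this RTT, following the eight-RTT gain cycle."""
    return est_bw * CYCLE[rtt_index % len(CYCLE)]

def update_estimate(est_bw: float, base_rtt: float, measured_rtt: float) -> float:
    """After a probe RTT: if the measured round trip stretched, the extra
    packets sat in a queue, so we were already at the sweet spot. If it
    didn't stretch, there was headroom, and we claim it."""
    if measured_rtt > base_rtt * 1.05:   # 5% tolerance: an invented threshold
        return est_bw                     # queues formed: hold the estimate
    return est_bw * PROBE_UP              # no queuing: raise the estimate
```

A flow whose probe comes back with no stretch in the round trip time ratchets its estimate up by 25% per cycle, which is why the protocol is so efficient at finding the ceiling, and why its elbows land so hard on loss-based flows sharing the link.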
And that point is: be a little more aggressive to test if there's room, but do a better job of fairness when you decide you need to coexist. That's really quite an interesting outcome, Geoff. That's going to an interesting place. Geoff Huston 48:39 Well, that's BBR v3, which is still being worked on. But, you know, tweaking is happening. From a purely selfish perspective, I really like BBR v1, but, you know, it's like driving a bulldozer: everyone else clears out because they've got no choice. So it's kind of fascinating, George, that the answer is not to put more stuff in the network to make the network operate fairly and efficiently. It's actually to be clever in the endpoints and treat the network as a very simple, very dumb set of switches and queues. And to my mind, that's been the entire magic of the Internet protocol for the last 30-odd years. And what do you say to SRv6 and RSVP and TE and so on? You might have a use case in your private world. That's fine, [George: yeah], but if you try and think this is good for everyone, [George: no] I think that's delusional. Yeah. It's just not true. George Michaelson 49:36 The ECN story in this is perhaps a little unfortunate. If we'd had the concept of greasing, and if we had a clear channel signal for something mechanistically simple, like ECN, I personally think that might have been nice. But, you know, wanting a rainbow pony doesn't get you one, right? This is the world we live in. Geoff Huston 49:56 Do you do the ECN marking when the queue is full? When the queue is empty? When the queue is half full? And I'll tell you what, if there's a choice, everyone will have a different choice point. What does ECN marking actually mean? Should I react quickly? Do I have time?
And so all those uncertainties, I think, sort of came with ECN. It was an interesting approach 20-odd years ago, and it's interesting in a closed world, if you can define the space absolutely, [George: yeah], but out there on the big public Internet, no, no. Push it out to the edges. Make the end TCP protocol engines do all the work, and they're quite capable of doing a really good job. George Michaelson 50:36 Yeah. I think it sounds like people should maybe tune into the recording of Lincoln's talk at AUSNOG and get a handle on how Amazon's doing this. Geoff Huston 50:44 Assuming they recorded it. No promises here, but you might want to listen to some of the talks about high speed data center switching, and there's a few going around right now about, you know, what does speed mean, and how do we achieve it? And part of this is we've got so specialized these days, in those data centers, that you can't just use commodity rack equipment. You really are saying, if I want terabits running through this equipment, I need to go very close to customized solutions. George Michaelson 51:14 But those customized solutions, it turns out, are significantly simpler in some senses. Geoff Huston 51:22 Oh yes, it's like a racing car. You take out all the bits of comfort, and there's nothing left: I switch and I have memory. What else do you do? I don't know. There's nothing else to do. George Michaelson 51:32 Yeah. Geoff, that's been fascinating. Thank you. Geoff Huston 51:35 It's a pleasure, George, and I hope the listener has found that interesting. You can find a write-up of my work trying to measure ECN on the APNIC blog. I think it'll be out very soon, certainly by the time this is aired, about our efforts to try and capture ECN in the wild and what it might mean. George Michaelson 51:52 We'll make sure that references are put into the blog that goes with the podcast. Thank you, Geoff. Geoff Huston 51:56 Thank you, George.
George Michaelson 51:59 If you've got a story or research to share here on PING, why not get in contact by email to ping@apnic.net or via the APNIC social media channels. Also remember the measurement@apnic.net mailing list on Orbit is there to discuss and share relevant collaborative opportunities, grants and funding opportunities, jobs and graduate placements, or to seek feedback from the community on your own measurement projects. Be sure to check out the APNIC website for all your resource and community needs. Until next time.