Geoff Huston 0:00 Well, that was always a thought, always in the design of packet-based networks, because once you drop the idea that the network is the arbiter, the rationer, the police of using the network, then how do you signal to a sender, "Whoa, buddy, that's unfair"? And so you've got to actually drop further back into going: I have this network full of senders and receivers, and some senders are receivers at the same time, you know, full of a bunch of people producing packets and consuming packets. What am I trying to achieve? Does everyone get 64 kilobits per second of clear channel? Of course not. You get what you need. But there's a couple of overriding tenets. Firstly, everyone should be efficient. What does efficiency mean? Don't leave the network idle if you have things to send. George Michaelson 1:03 You're listening to Ping, a podcast by APNIC discussing all things related to measuring the Internet. I'm your host, George Michaelson. This time I'm talking to Geoff Huston from APNIC Labs again in his regular monthly spot on Ping. Geoff and I talked about his recent trip to AUSNOG in Melbourne, and a talk he heard there about optimizing throughput and fair shares of a link in the modern fiber-optic-dominated high speed network of routers and switches. He's been looking at TCP flow control models for some time and recently testing BBR in a variety of ways, and the talk reminded him of an older IP-layer model of Explicit Congestion Notification, or ECN. ECN is baked into the IP header and returns signals inside TCP, but it isn't proving viable in the wild. The idea of sending back a signal "you're pushing too hard" isn't exactly new, but as Geoff has learned, the mechanism isn't working out as well as we might hope, and so protocol tweaks like BBR, which is now in its third iteration, have to try and do a better job of both bandwidth estimation and managing the competition for space in order to be fair. Geoff, welcome back to Ping. 
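As a concrete gloss on the ECN mechanism the intro refers to: ECN occupies two bits of the IP header, and a congested router that supports it can mark a packet instead of dropping it; the TCP receiver then echoes the mark back to the sender. A minimal sketch using the RFC 3168 codepoints (the router logic below is a toy of my own, not anything from the episode):

```python
# RFC 3168 ECN codepoints: the two low-order bits of the IP TOS /
# Traffic Class byte. The forward() router model is illustrative only.

NOT_ECT = 0b00  # sender does not support ECN
ECT_1   = 0b01  # ECN-Capable Transport
ECT_0   = 0b10  # ECN-Capable Transport
CE      = 0b11  # Congestion Experienced, set by a congested router

def forward(ecn_bits: int, queue_is_filling: bool) -> tuple[int, bool]:
    """Return (new_ecn_bits, dropped) for one packet at a congested hop."""
    if not queue_is_filling:
        return ecn_bits, False
    if ecn_bits in (ECT_0, ECT_1):
        # Mark instead of dropping: the TCP receiver will echo this
        # back to the sender via the ECE flag on its ACKs.
        return CE, False
    return ecn_bits, True       # non-ECN traffic still gets dropped

assert forward(ECT_0, True) == (CE, False)
assert forward(NOT_ECT, True) == (NOT_ECT, True)
```

In RFC 3168's design the receiver keeps setting ECE on its ACKs until the sender confirms it has slowed down with the CWR flag, and it is this end-to-end signaling loop, traversing middleboxes that may strip or mangle the bits, that has proved fragile in the wild.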
What should we talk about this time? Geoff Huston 2:28 Well, George, I've just come back from the annual Australian network operators group shindig down in Melbourne at the start of September, and it was a great couple of days. And you know, if anyone's cruising around Australia at that sort of time of the year and has nothing better to do for a couple of days, I'd certainly recommend going to drop in on AUSNOG. It is great. George Michaelson 2:49 Oh, you've got to be quick. Tickets go like that. [click] They've gone in a flash. Geoff Huston 2:54 They've been sold out for the last 18 years, which is amazing. Just sold out. Very popular. Anyway, yes. And I was talking about the evolution of transport protocols, and I kind of thought, well, you know, unlike other kinds of topics that one could talk about, like V6, why bother? Or RPKI, which offers about as much protection as an armor-plated suit made of wet lettuce leaves. You know, I thought transport protocols were a much tamer topic, without the controversy that these other ones breathe. George Michaelson 3:25 Oh, you say that, but I have a feeling it'll turn out not to be the case. But pray continue. Geoff Huston 3:31 Well, I spoke just after a gentleman from Cisco who was hawking a different kind of transport system based on, effectively, MPLS, source-based routing, bits of V6 all going by, oh, and RSVP, I think, came into it. It's sort of all the bits that are just weird, [George: right] And I'll tell you what, other than the names of the technologies, this was a talk from the 1970s. George Michaelson 3:59 And this idea is, in some air-quote sense, "better"? Geoff Huston 4:03 Well, it's also an idea that assumes, like speaking into a telephone handset (remember them?), the application is inelastic, dumb and invariant, and will not adapt to any conditions in the network. You know, I've yet to meet those applications. 
And secondly, then it's left to the network to effectively behave smoothly and consistently and not muck around with the explicit timings coming in from the application traffic. George Michaelson 4:36 Well, we've had an idea of smart edge, dumb core as a kind of concept in constructing networks for a long time. I wouldn't say it's baked into the IP protocol, but the idea was, you wouldn't have massive intelligence in the core of packet forwarding. You'd try to keep that to a minimum. Geoff Huston 4:53 Well, that was the whole thing. If you have computers mediating the application workflow, you don't need to blindly try and preserve absolute adherence to synchronicity. You know, it's not just a telephone. It's a telephone going through a computer. You quantize the voice signal, you chop it up into packets, you send it out as packets. But the receiver, also a computer, is not just a speaker. The signal firstly goes into an input queue, and the receiver reassembles, if you will, the analog signal out of the incoming packets in the local buffer and then reproduces a constant voice signal. You and I are having a fine conversation, and dear listeners, you're listening to this in real time, but there's no real time circuitry. There's no "we've carved out a channel, it's got absolutely zero jitter and high fidelity, it's exactly the same as telephony". Not at all. [George: No] You're just running packets. George Michaelson 5:53 No, this is really quite different to the longer held view from old times about the nature of communication. With the right kind of error correcting and the right amount of buffering, which, of course, implies delay, but nonetheless the right amount, it's possible to use packets that have no deterministic behavior to get remarkably fine outcomes. It can be done. Geoff Huston 6:17 Not only can it be done, it's the default of what the entire world does today. 
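The receiver-side trick Geoff describes, packets into an input queue and a steady voice signal out, is essentially a playout buffer. A toy sketch with made-up frame data (real jitter buffers, such as WebRTC's, are adaptive and also conceal losses):

```python
# Toy playout buffer: packets arrive out of order, the receiver holds
# them briefly and hands the frames to the speaker in sequence.
# Illustrative only, not an actual WebRTC jitter buffer.

def playout(arrivals):
    """arrivals: list of (seq, frame) pairs in network arrival order."""
    buffer = dict(arrivals)               # reassembly buffer, keyed by sequence
    next_seq = min(buffer)
    out = []
    while next_seq in buffer:             # drain in order; a gap would mean
        out.append(buffer.pop(next_seq))  # waiting, or concealing a loss
        next_seq += 1
    return out

# Packets 1..4 arrive jumbled; the listener still hears them in order.
assert playout([(2, "b"), (1, "a"), (4, "d"), (3, "c")]) == ["a", "b", "c", "d"]
```

The delay George mentions is exactly the time frames spend sitting in `buffer` before their turn comes: the price paid to turn non-deterministic packet arrivals back into a constant signal.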
George Michaelson 6:22 So presumably you were talking, in a sense, saying, well, all this complexity of MPLS and SRv6 and RSVP, there are other approaches. So you have an idea of a simpler approach to signaling behavior, Geoff Huston 6:37 reproducing an IP advocacy sort of line from the early 1980s. George Michaelson 6:42 Oh, you've come from the 70s to the 80s. Geoff Huston 6:45 I come from the 70s. Yeah, no, I mean, the issue was that you didn't need to put all the investment into the network to cope with dumb devices and dumb applications. If you have intelligence at the edge, you can actually manage a much simpler network based on datagram switching, where data can be lost, reordered, even replicated, and it's up to the ends to actually recreate the original stream. And as long as you have an adequate enough network, you can even recreate the implicit timing. In other words, voice and video works simply because you've got smart enough edges and enough capacity in the network. You don't have to jam the network full of expensive equipment, heaps of software, masses of complexity, and pay an enormous price to be able to talk through an IP system. George Michaelson 7:40 At one level, you've shifted where the cost is, because the fact remains, you have to do work. Work has to be done. But you're saying we have the computers anyway. We want the handsets to be smart. They have to have the processing capacity. We want the ability to do the work, and it's already implicitly available in edge devices. How about we use that to achieve an outcome? And the second thing you said is enough bandwidth. So it isn't as simple as "we can fit this down a thin piece of string". You do have to say, why are we constraining the world to be thin bits of string? How about we treat the world as having a lot more bandwidth than we used to think was the limit, which, by the way, is pretty straightforward to do now, isn't it? 
Geoff Huston 8:25 Well, that was the second part of the story. So the first part is, you don't need to stuff the network full of extremely expensive technology to cope with dumb applications. My phone, my smart device, wasn't doing anything until I started talking. It's not as if it's lacking processing capability. The marginal cost of having it do some work in terms of adaptation of transport is zero, whereas in the network, there's lots of things going on. You know, it's actually expensive to do this. And so the first thing is, you're spending the wrong money in the wrong place. But there's a different thing that's also happened, and I blame Moore's Law for many things. George Michaelson 9:03 You've brought Moore's Law to the table. Geoff Huston 9:06 Well, we live in a digital world. George Michaelson 9:08 I do actually think it's applicable. I think there's a quality behind things naturally improving in their computational capacity or size at that pace that has fundamentally altered how we think about the problem space, so I don't think you're wrong to do it. Geoff Huston 9:23 At great expense, the Australian nation had built a new fiber-based undersea cable connecting Australia to basically the US West Coast, PacRim East and PacRim West. And it was a massive cable. It supported 500 megabits per second of capacity on each cable, and it cost a billion dollars, and it was very expensive, and the network had to ration people's use. And this wasn't just one cable system, it was the entire underlying communication system, which was based on: it's a scarce and expensive resource. We, the network operators, are the gatekeepers of that resource. We will perform rationing, and we will operate this system at a price premium to make sure we're not overwhelmed. George Michaelson 10:13 Pricing was deliberately used as a mechanistic approach, way above the individual network conversation level, to put constraint on behavior. 
We're not talking about price coming into the packet; they didn't have dollar signs literally baked in. We're talking about the consequence of your behavior when a bill arrives being back pressure for the next cycle of using the resource. Geoff Huston 10:37 Right. It's what we call in economics a scarcity premium. And the operators of that system, the rationers, were quite happy to basically say, this costs a whole lot more to you than it did to us to make it, because we haven't got enough, we have to ration it out. And quite frankly, cost price is the way we're doing that distribution function. Now, over the years, Moore's Law has turned our 500 megabit per second subsea cable into what, 20 terabits? That's 20,000 thousand million bits per second. George Michaelson 11:12 Yes, 40,000 times faster than what we had. Geoff Huston 11:17 Yes. And it costs the same. In fact, it cost half the amount, about 500 million to do this, rather than a billion dollars, even in just straight dollar terms. George Michaelson 11:25 Strangely enough, the marine fuel oil that powers the ship to deliver that cable and dig the trench has not increased in cost 40,000 times. So the actual cost in terms of doing the ship side hasn't changed. Geoff Huston 11:43 Right? The comms world has magically shifted from scarcity to massive abundance, and this whole idea that I need to arm my network with hugely expensive cutting edge technology, SRv6, RSVP-TE, blah, blah, blah, is basically, I think, a con by the co-conspirators of the carriage industry, the vendors, to try and extract some money from an otherwise almost bankrupt industry that has little residual value. Comms is now so abundant you could, look, think of it as sewerage. It's a mere undistinguished commodity. Everything happens at the upper level. And that's just the simple sort of transformation that I would put down to Moore's Law and better digital signal processing. And they improve every year. 
We can jam more data down the same piece of glass, [George: yeah] And it's as simple as that, and we keep on getting better. George Michaelson 12:43 I do like this part of the story, and I think it's an important part of the story, but there's a part of me that's also saying, now hang on a minute, Geoff, let's go back and talk about some other things that you've brought to the table. The nature of communications for you and me to be talking now has subtly shifted. You are getting packets that are caused because I ultimately send packets into the network. Are you getting the same packets I put in? No, not necessarily. This is probably a mediated conversation. The intermediaries, the centrality thing that we've talked about, means that an awful lot of these packet flows are now heading over very short distances from where you and I are to do their primary task, and that's introduced a subtle shift here: the cost component of making all this stuff work isn't the congestion in the fiber. The cost congestion here might be switching bandwidth in a DC, or it might be the ground cost of building out that DC. There are still costs in this network. Geoff Huston 13:47 The voice and video that you and I are having right now, and indeed, you know, done in real time, is kind of different to the voice and possibly the video that happens in a replay. And I think we can distinguish between the two. This recording will go out onto the interwebs. You know, here's the podcast from Geoff and George having a talk, and it will get loaded into not one data center, but many. Why? Abundance. It's cheaper to do it that way than to broadcast it, or at least house it in one spot and get the world to suck from just there, [George: right] So we send it out everywhere. And quite frankly, listening in non real time is a different kind of problem. But you and I talking [George: right] And exchanging video is not a mediated thing. 
And in fact, on my browser is WebRTC, the real time communications toolkit, and on your browser is yours, [George: yes] and they are sending packets to each other. We relied on a mediator to set it up. But after that, the packets are the packets. They just go end to end, [George: right] The issue is we can tolerate a certain amount of network distortion of that packet stream and still have "this is Geoff, he looks the same and his voice sounds the same", without going, oh, blub, blub, blub, and dropping down into garble. And that next trick of making sure the network doesn't distort the implicit signaling in those packets and their timing is the next job: congestion avoidance. George Michaelson 15:21 We've got people who are in a marketing space who are looking to deploy complex solutions, RSVP, the TE models, SRv6, things like that, in order to make this problem go away in a fashion that suits them for their needs of sold goods and services. And you're heading towards saying, if we could find ways to do simpler methods of signaling "you need to modify your behavior", and if we could exploit all of the intelligence built into your system and my system, we might be able to get to a happy medium for a lot less work. That seems to be where you're going. Geoff Huston 16:00 Right. The load that my device at this end puts on the network is elastic. It can repair damage to some extent. It can adjust to the conditions of the network. It's not intolerant. So I don't need a network that carves me out an assured path. It's a bit like the road system. When I travel to the shops, I do not have a lane reserved for Geoff: clear channel, let's go to the shops, no one else intruding on my lane. You know, it doesn't work like that. And we can all laugh at that, going, well, that's ridiculous, but in the networking world, that's the goo that they're trying to sell. It's the same goo. 
George Michaelson 16:42 You and I have traveled to economies in Southeast Asia that we don't have to name, where we have stood by the side of the road and watched the local police doing exactly that, creating a clear channel for very important people to drive down that road. Geoff Huston 16:58 The problem with analogies is they often fall apart. George Michaelson 17:01 Yes, specifically, Geoff, although you are paid a lot of money, you're not paid enough to get that service. So for the context we're talking in, that doesn't happen. [Geoff: right] Baked into the IP stack that you and I are living in was ICMP. And baked into ICMP was the idea you could use an ICMP message to say, whoa, buddy, buddy, back down. Could you send less? So haven't we always had this? Geoff Huston 17:30 Well, that was always a thought, always in the design of packet-based networks, because once you drop the idea that the network is the arbiter, the rationer, the police of using the network, then how do you signal to a sender, "Whoa, buddy, that's unfair"? And so you've got to actually drop further back into going: I have this network full of senders and receivers, and some senders are receivers at the same time, you know, full of a bunch of people producing packets and consuming packets. What am I trying to achieve? Does everyone get 64 kilobits per second of clear channel? Of course not. You get what you need. But there's a couple of overriding tenets. Firstly, everyone should be efficient. What does efficiency mean? Don't leave the network idle if you have things to send. George Michaelson 18:26 Aggregate across all the people in the system, you're seeking to have as few moments where the network isn't being used as possible. I mean, obviously, if we are humans and we're asleep and there are no machines doing their job, Geoff Huston 18:39 If there's nothing to say, idleness is fine. 
George Michaelson 18:42 If there's more things to say than capacity in the net, leaving space and time on the net not used is inefficient. Geoff Huston 18:50 Inefficient. So if there's space in the network and there are folk wanting to send, you've failed in the job. So efficiency is, you know, leave no idle spaces; if there is demand, fill it. [George: Yeah] that's the first job. The second job is actually somewhat odder, and it's everyone should be fair to everyone else. Wow. So I don't need a network to police behavior? No, you don't. But what you do need is, if you imagine that every packet has elbows, George Michaelson 19:22 sharp elbows Geoff Huston 19:24 The amount of pressure it puts on all the other packets should be the same as all the others, [George: right] So, in other words, in a system where all these packets are jostling, oddly enough, that system where every packet exerts equal pressure creates a rough sharing, [George: yeah] if you had 100 packets all competing and each exerts the same pressure as everyone else, they'll actually equilibrate and give themselves 1/100 of the common resource. Fairness. George Michaelson 19:51 If we just briefly step sideways, there have been things that have developed over time since the birth of the Internet and the birth of Ethernet, where people have said, I'm going to take a different path. So for instance, aircraft digital communications: inside the aircraft, the devices that do that are basically Ethernet switches, but they're running a version of code that says, for the purposes we have, we're going to convert this to a time division multiplex network, and: you, port number six, you have exactly these time slots to fill. And if you've got nothing to say, no one else is going to talk in that slot, because I am guaranteeing you that slot. 
But they've done that in a highly constrained world where it's wings flapping and lights flashing Geoff Huston 20:38 And keeping the machine in the air, George, George Michaelson 20:41 Keeping the machine in the air, and they've made a rationing decision that is appropriate for their needs. We're not talking about that world. This world we're in, you and I are in, is a world of totally unrelated, almost random communications. There's no rationing scheme that's been designed to guarantee that behavior. We're doing a different way of controlling use of this space with our elbows, Geoff Huston 21:06 With our elbows, because if you seriously, seriously have the money and the need to guarantee availability at all times, then the real answer is, roll your own fiber. Stop using a shared, common medium, because we can't reserve in a shared, common medium and still get efficiency. We can't. [George: Yeah] So we're talking about the public network. We're talking about the great unwashed masses, you and I, and we're talking about sharing a common resource, [George: yeah], and the objectives then are efficiency and fairness, because, you know, that's the world we've chosen to live in. George Michaelson 21:44 Yep. So I thought the elbows were ICMP. [Geoff: no] I thought ICMP, baked in, was the way I got to prod with my elbows and say, don't do that. You're saying there's a better way. Geoff Huston 21:56 Well, it was called Source Quench, and it didn't take long for folk to realize that Source Quench is not validated, George Michaelson 22:04 Not validated. Geoff Huston 22:06 I'm a sender, and I'm just sending packets into the network, and you, George, take exception to my profligate behavior, so you manufacture a synthetic ICMP Source Quench and send it to me. And when I receive this Source Quench, I have no idea that it's just George playing games with me. I think it's from a router on the path, and I'm being too profligate, so I react like I've been slapped in the face with a wet fish. 
I stop sending, I back off. But it wasn't a real signal. It was just some people trying to shut me up or otherwise playing. George Michaelson 22:43 So we need something that is simple and that's in the control mainly of people at the ends of this path, but maybe along the path a little; I think I could imagine that's useful. But it's got to be valid, and it's got to be a testable proposition, and it probably needs a modicum of signaling that, you know, you said it, and you know they got it. Geoff Huston 23:04 Well, the first reaction to this, now we've jumped from IP datagrams into the world of TCP, the Transmission Control Protocol, [George: right] And this is the kind of special protocol, I think it's the intellectual heart of the entire Internet fabric, because it's what we call a classic sliding window protocol. I send you data, but I keep a copy. And every time you receive a packet of data, you send an acknowledgement, a simple, smaller packet, same protocol number, that says: I received up to byte 450 of your stream, Geoff. I can then look at my window and go, well, okay, everything up to byte 450 has been received by George. I can delete that data from my sliding window, and then I can open the window up to compensate for the data that's gone and send you some more data. [George: Yeah] so it's a pacing protocol. Every time I get an ACK back from you that says you received data, that tells me, oddly enough, a packet has left the network. I go, a-ha, there's now room on the network for another packet. So you can think of this as a gigantic conveyor belt. [George: Yes] every time I get an ACK frame, I send a data frame back in, and the belt has precisely the same amount of data continuously. Data goes forward, ACKs come back, steady state. George Michaelson 24:29 Right. And if we look at this, both you and I have some degree of influence over the rate of play. 
If I choose not to tell you I've got things, you're not going to send when your buffer fills up, Geoff Huston 24:40 And I'm going to stop. George Michaelson 24:42 And if I have told you you're free to send things, you're also free not to send them. But you can, in principle, control the rate, because you can choose not to send. We both have a role. Geoff Huston 24:54 Once you tell me you've got it, that's it. I've forgotten anything I did. It's over. But yes, the question of how fast should I put things onto the network is not given here. We don't know, [George: yeah] So if you look then deeper into the network, you find there are two things. There's transmission, which is indeed a pure conveyor belt, [George: yeah] I put a packet in, it goes to the other end, speed of light, speed of, you know, signal through copper or whatever, and pops out the other end, totally deterministic. It might get corrupted, bit errors, radio, whatever, but on the whole, packet in, packet out. But the rate of sending is not necessarily equal to the capacity of the next circuit. So I need to adapt. And the way we do this is every single link has a buffer, a queue, so that when a packet gets to a switch and the switch says, well, dear packet, you need to go out interface three, but it's busy right now coping with another packet, you're gonna have to wait until I've cleared that packet before sending you. So it pops it in a queue. It's human behavior. You know, you want to get onto this road in your car. The road is full. Wait for your turn, buddy. It's exactly the same principle. So the network is full of queues and links. [George: Yeah] Now imagine what happens when I send a lot of data, a lot. It overwhelms the link, and then the queues start to fill, because I'm still sending, and then the queue gets full, [George: right] full, full. What do you do with the next packet? George Michaelson 26:28 You've got to throw it away, Geoff. There's nothing else to do. Geoff Huston 26:31 It's IP underneath all this. 
Every packet is an adventure. Throwing away a packet is perfectly fine. Perfectly fine. George Michaelson 26:39 You and me sending signals: I got that. And we've got queues coming in between, which means along the path between you and me, people can stack up a little pile of them and let go. And we've now arrived at: if you're one of the people stacking up a pile of packets and your stack is full, what are your options? And your options are throw something away. Geoff Huston 27:02 You've got no other choice, so it doesn't actually require a source quench or anything. You just throw it away, [George: right] That's perfectly fine, but let's think about the poor old receiver. I receive packet three, ACK. I receive packet four, ACK. I receive packet six. George Michaelson 27:18 Hang on. Geoff Huston 27:19 Oh, hang on, hang on, where's five? I'm going to wait almost no time, it might be that packets have been reordered, but pretty soon I'm going to say, hey, Geoff, I'm not going to ACK packet six, [George: yeah], because I'm missing five. So tell you what, bud, I'm going to do a duplicate acknowledgement of four, saying I've got up to four, but you cannot get rid of five because, you know, I haven't got it, [George: yeah] now that duplicate ACK is an unambiguous signal of loss, and it tells me two things that are interesting. One, I need to resend packet five. But the next one is a bit of a leap, but it's true: I'm going to assume I was sending too fast. George Michaelson 28:03 So it actually requires two behaviors in you. You said with the window, when you've received an ACK, the past is gone and you forget about it. You've just slightly modified that: the past is gone, but remember the last thing you had ACKed, so that when you open your window and send stuff, if you see a repeat ACK against the last thing you sent, you've been told loss has taken place. You have to remember a little bit more. You can't forget all of the past. 
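The receiver logic just described, cumulatively ACK whatever arrives in order and repeat the last ACK when a gap appears, fits in a few lines. A sketch where packet numbers stand in for TCP's byte sequence numbers (an illustration, not a TCP stack):

```python
# Cumulative ACKs and duplicate ACKs as a loss signal.

def acks_for(arrivals, already_acked=0):
    """Yield the cumulative ACK the receiver sends for each arriving packet."""
    highest = already_acked       # highest packet received with no gaps before it
    received = set()
    acks = []
    for pkt in arrivals:
        received.add(pkt)
        while highest + 1 in received:  # extend the in-order run
            highest += 1
        acks.append(highest)            # repeats (a duplicate ACK) if a gap remains
    return acks

# Geoff's example: packets 3 and 4 arrive, 5 is lost, 6 arrives.
# The ACK for 6's arrival re-ACKs 4: the sender's unambiguous loss signal.
assert acks_for([3, 4, 6], already_acked=2) == [3, 4, 4]
```

Once the retransmitted packet 5 does arrive, the cumulative ACK jumps straight past it to 6, which is why the sender can forget everything below the ACKed point again.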
Geoff Huston 28:32 Well, the number is actually a sequence number, and it's the number of the last acknowledged byte. And so the calculation is really easy. The first packet in the queue has a byte number. If the ACK value is less than that number, it's a duplicate ACK, [George: right] So I'm not dealing with packet numbers, 1, 2, 3, 4, 5. Oddly enough, TCP is a streaming protocol, and you actually number the bytes, which is its own problem, and it's all about high speed, and let's not go there because, you know, we haven't got time today. But you know, this protocol is kind of magic insofar as, if I interpret duplicate ACKs as I'm going too fast, [George: yeah] then I need to stop sending to let the queues drain. Oh, good. How long should I stop sending for? Oh, I don't know, as long as the queue is big. Oh, now there's a clue. What if the queue held the same number of packets as the link that it drives? So if the link could contain three packets, let's make the queue contain just three packets, George Michaelson 29:39 right? But that's magic knowledge, Geoff, Geoff Huston 29:41 No, no, this is the design of networking. This is the bit that fills in the gap of what do you do when you get a duplicate ACK: we build the networks such that every link buffer is equal, in number of packets, to the size of the link that it's driving. George Michaelson 30:01 So every party at either end of a point to point link makes its buffer equal to the number of packets that can legally exist in flight between those points. Then by definition, if something goes wrong and you have to answer the question, how big is the queue? You know, Geoff Huston 30:25 I know. It's the same as the amount of data I can send in one round trip time interval, the time it takes to send a packet to you and an ACK back. Wow. 
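The dimensioning rule here, a queue that holds the same amount of data as the link carries across one round trip, is the bandwidth-delay product. A sketch of the arithmetic, with hypothetical link numbers:

```python
# Bandwidth-delay product: the classic rule sizes the queue to hold the
# same amount of data as the link carries in one round trip.
# Example figures below are hypothetical, chosen only to show the sums.

def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bytes in flight on a link of the given bit rate over one RTT."""
    return bandwidth_bps * rtt_ms // (8 * 1000)   # bits/s x ms -> bytes

# A 10 Gb/s path with a 40 ms round trip holds 50 MB in flight,
# so the classic rule sizes its buffer at 50 MB too.
assert bdp_bytes(10_000_000_000, 40) == 50_000_000
```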
George Michaelson 30:37 So am I right in thinking that this actually, really, really, is the laws of physics coming up, Geoff? That if we've got a length of fiber, Geoff Huston 30:45 no, no, no, it's not. George Michaelson 30:46 If we've got a length of fiber between you and me that's a foot long, we can store less data in it than if that is 1000 kilometers long. Geoff Huston 30:55 Yes, of course, that's physics. Why I said no will get back to buffer dimensioning in a second. But the whole issue is, I now know how long to wait. I wait for that so-called round trip time, because every time I send you a packet and I get back an ACK, I've got my good clock. We've all got clocks. Tick, tick, tick, tick, tick: you're 22 milliseconds away. So when I get a loss signal, I know how long I should back off for to let the queues drain. [George: Yeah]. Because if I don't send any more, then basically what's in the queues will flow outwards to you. And if I resume at the end of one round trip time, guess what, you will receive a constant flow without interruption. You will see no idle time. George Michaelson 31:44 So you know two things. You know the byte number last ACKed, and you know the round trip time between those points, and you pace your behavior so that if you ever get a loss, you pause for the round trip time and then carry on Geoff Huston 31:59 at a lower speed. George Michaelson 32:01 Ah, three things, Geoff Huston 32:02 because you just built the queue, so obviously you were sending too much, buddy. So I do two things. I wait to let the queues drain. And now the next trick: I'm going to send at half the rate. Why half? Because the queue size is where the other half of the packets lived. And so it magically forms up that if you just think of this as one connection through a multi hop series of links with queues, then this just works. I keep on going up to effectively twice the capacity of the path. I fill up the queues. I get a drop. 
I wait for a round trip time. Queues go away. I start sending at half the rate. The next packet will fly straight at you because the queues have been drained. Oddly enough, efficiency: 100%. George Michaelson 32:53 Well, it sounds like there are periods when you're not using the link, and you have to assume someone else fills that hole. Geoff Huston 33:02 Oh, no, no, I haven't talked about, you know, "Hell is other people", as Jean-Paul Sartre said. I haven't talked about other people yet. It's just you and I on an otherwise idle network. And this is the theory of congestion flow control. [George: right] Now we introduce the hell of other people. And this makes life fascinating, because I'm not the only person contributing to that queue. What should I do? There was a theory, and it wasn't a very strong one, but it was a theory that with multiple senders, each with different round trip times, all going through the same link, if you all use the same algorithm, two things would happen. Firstly, the queuing behavior would be most influenced by the biggest of the senders. Secondly, if everyone used the same flow algorithms, you would each get 1/Nth of the share of that link. George Michaelson 34:01 But those two statements are in conflict, because if it's most influenced by the biggest, [Geoff: yeah], how can there even be a biggest when you've come down to having 1/N? Doesn't that imply that some amount of the link isn't being used? Geoff Huston 34:16 It wasn't the best of theories, right? And after about 30 years of staring at networks and queues, we're kind of coming around to a new theory, [George: right] And it's a fascinating theory because it's actually helpful. Don't forget, the faster networks get, the more expensive the memory in that queue is. If I'm driving a terabit per second link, I need to pull a terabit per second out of that memory bank. 
What manufacturer makes memory that can deliver a sustained terabit of data? George Michaelson 34:50 That's not cheap memory. Geoff Huston 34:51 Trick question: no, it's not even expensive memory. It does not exist. George Michaelson 34:56 So we're no longer bound by the speed of the link. We're bound by the speed of getting things into and out of the buffer. Geoff Huston 35:03 Maybe it's the wrong theory. You see, that buffer memory is the most expensive memory in the router. It's the most expensive memory you can buy. We use every parallelism trick in the book to try and make it scale up in speed, but it's really, really difficult, because speed has not changed. Moore's law is not about doubling the speed, it's doubling the chip density. Clock speeds have been static for years. But all of a sudden, these folks staring at queues came up with a fascinating answer, and it says, you know, it doesn't work like that, dude. The buffer size is not just the bandwidth times the delay. The buffer size is the bandwidth times the delay divided by the square root of the number of sessions. George Michaelson 35:47 I hate it when people bring square roots to the table. Geoff Huston 35:51 I know. Let's say a one terabit per second circuit, and it has 100 milliseconds of delay. Simple stuff: 12.5 gigabytes of buffer, right? [George: Yeah], but let's say that same terabit circuit carries 100,000 flows. I don't need 12.5 gigs of memory. Do you know how much I need? 40 megs. George Michaelson 36:12 Because you divide the initial ... Geoff Huston 36:15 by the square root of 100,000. George Michaelson 36:18 If you have more people involved, the amount of buffer goes down. Geoff Huston 36:22 Savagely goes down. It's kind of what saved our bacon.
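The arithmetic Geoff runs through here can be checked directly. This is a sketch of the classic bandwidth-delay product rule and the square-root refinement he describes; the function name is mine, not from any vendor's tooling:

```python
import math

def buffer_bytes(link_bps: float, rtt_seconds: float, n_flows: int = 1) -> float:
    """Buffer needed at the bottleneck: bandwidth times delay,
    divided by the square root of the number of concurrent flows."""
    bits = link_bps * rtt_seconds / math.sqrt(n_flows)
    return bits / 8  # convert bits to bytes

# A 1 Tbps link with 100 ms of delay and a single flow:
print(buffer_bytes(1e12, 0.1))           # about 12.5 GB, as in the conversation
# The same link carrying 100,000 flows:
print(buffer_bytes(1e12, 0.1, 100_000))  # roughly 40 MB
```

Dividing by the square root of 100,000 (about 316) is what collapses 12.5 gigabytes down to the 40 megabytes Geoff quotes.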
George Michaelson 36:27 So I'm kind of intuiting that what happens as a consequence is the buffer fills up, and more people get a signal, you just had a loss event, and more people have to go through some kind of backing down, and therefore more people balance their traffic out to be at that mythical 1/N, and at that point, everybody's pacing to match the buffer. This is a result that is experimental, not analytical. It's what we've observed. It's been seen. Geoff Huston 37:00 It's been seen, and seen a lot, and it's kind of, in some ways, intuitively obvious, particularly with that bufferbloat topic that circulated for years: that networks were continuing to over-dimension buffers. Why? Because I can sell customers really expensive routers with a huge amount of high speed memory that is actually making life worse, not better. George Michaelson 37:21 Well, it looks great in the showroom when there's only you and the other end. It's when you deploy it in real life with hundreds of thousands of users that you start to find out it didn't behave the way you thought. Geoff Huston 37:32 So let's take that 40 megabytes. How much memory can I put in an ASIC that also does switching? So I've now got a chip that's divided into two bits: let's say 40% of the chip real estate I'll use for my switching fabric, the cross switching, right? [George: Yeah] And I've got 60% of the real estate that I can use for memory. [George: Yeah] Now, in general, I can do around about 40 to 60 megs of memory on a single chip. George Michaelson 38:01 This is special, incredibly high speed memory, because with ordinary memory, the densities have reached the point that you could fit gigabytes, terabytes on the chip. Geoff Huston 38:11 No, I don't want that. I want really, really high speed. So I've now got a terabit per second switch with 40 megs of buffer. And as long as I'm in the middle of a really busy setup with a huge number of flows, it will just work.
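The loss-driven behaviour described earlier, fill the queue until a drop, pause a round trip, resume at half rate, is TCP's classic additive-increase, multiplicative-decrease loop. A toy sketch, where the link capacity, queue limit, and single-queue model are all invented purely for illustration:

```python
# Toy AIMD sender against a single bottleneck queue.
# Capacities and the queue model are invented for illustration only.
LINK_CAPACITY = 100   # packets the link can forward per RTT
QUEUE_LIMIT = 100     # packets of buffer at the bottleneck

def simulate(rtts: int):
    cwnd = 1.0        # congestion window, in packets sent per RTT
    queue = 0.0
    history = []
    for _ in range(rtts):
        # Whatever exceeds link capacity this RTT sits in the queue.
        queue = max(0.0, queue + cwnd - LINK_CAPACITY)
        if queue > QUEUE_LIMIT:
            # Drop: halve the rate and let the queue drain for an RTT.
            queue = 0.0
            cwnd = cwnd / 2
        else:
            cwnd += 1   # additive increase: one more packet per RTT
        history.append(cwnd)
    return history

h = simulate(1000)
# The window saws between the drop point and half of it,
# the "concertina" of faster, slower, faster, slower.
print(max(h), min(h[500:]))
```

Each sender runs this sawtooth independently; the surprising aggregate result Geoff describes is that with many such flows desynchronized across one link, the buffer needed shrinks with the square root of their number.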
George Michaelson 38:31 You need the high number of flows? Geoff Huston 38:34 So let's say you're Amazon and you're designing your switch fabric (this was Lincoln Dale's presentation at AUSNOG this year). And let's say that you didn't want to buy vendor equipment, because it contains 60 million lines of code that you're never going to use. Let's say you want to trim it down, automate the hell out of it, and just switch. [George: right] Well, you're inevitably led to: I'm going to do this on one chip, I'm going to use about 64 megs of memory in the chip buffer, and all I'm going to do is switch. Wow, this works. George Michaelson 39:05 Get rid of all that smarts, all that extra technology, all those complex additional features. Just do packets in, packets out, drop when the buffer is full. Geoff Huston 39:14 They're not the only ones who have reached this conclusion, but you know, you can see now how we scale, how we actually make these enormous systems work. And it's kind of the initial theory, you know, this concertina-like behavior of TCP, faster, slower, faster, slower. And then if you put them all together, the amount of buffer demand starts to drop, which actually means you can squeeze a huge number of sessions onto the one switching fabric. And oddly enough, the more sessions you can squeeze, the lower the aggregate memory demand. Wow, this is one case where practice actually runs in your favor. It's a stunning answer, isn't it? George Michaelson 39:54 So that's very unexpected, Geoff, because I thought we'd be heading to a middle ground where you're saying these complex protocols like RSVP don't do the trick, but if we do just enough additional signaling, we can get the benefit. But you've actually said just stick with straight TCP, change your expectations of behavior, and remove all the complexity from the switching components. It turns out things work better than you thought. Geoff Huston: I'm going to rephrase an old adage.
I think it's from Andrew Odlyzko, but I think I was guilty of saying the same thing as well: there is no problem that more bandwidth doesn't fix. You just add more bandwidth. It's cheaper and simpler. If the real problem was I don't have enough network to carry the traffic, stop trying to ration, stop trying to condition, stop trying to color the packets red, green and purple. You're wasting your time. Bring up another color, add bandwidth, because you can do that. Everything else works in your favor, because then you can get back to a very simple task. And what about TCP? Well, fascinating. Do I actually need to rely on loss, or can I try and sense the onset of queuing? That is where we are at the moment in research. [George: right] And it's kind of, well, loss is a bit catastrophic. It's like driving your car faster and faster and faster until you have a catastrophic smash. Oops. What about if I keep on driving faster until I nudge the car in front of me? No damage yet. No damage. Just nudge, and then back off. Dear listener, please do not drive like this in any economy, anywhere on the planet. George Michaelson 41:35 When Geoff is on the road, stay clear. Geoff Huston: Yes, same thing. But you know, you think about what's going on there. If I can just do this when queuing starts, not when it collapses, then, oddly enough, the winner is not the queue. The winner is the TCP session, because I waste no time in repairing packet loss. George Michaelson 41:56 Right. If you can target knowing when you're using the appropriate amount of available buffer, then you don't incur the cost of a loss to be there. But that does beg the question: couldn't we have the simplest, most lightweight of signals to flip down the line, saying you're getting close? Geoff Huston 42:14 And this was an idea from the early 1980s, from Digital Equipment. It was called Explicit Congestion Notification. George Michaelson 42:23 ECN. Geoff Huston 42:24 ECN, an otherwise unused bit in the IP header.
And when you get a packet and try to put it on the queue, if the queue is looking pretty full, just mark the packet. And what happens at the receiver of that marked packet? Because you're running TCP, there's a backward flow of ACKs. Take that signal, there was congestion on the inward path, and send it back as an I-saw-a-congestion-signal in the TCP ACK. George Michaelson 42:52 Wait, wait. You just... if I can use a horrid legalism, didn't you just do a layer violation? You put a signal in an IP header and you acknowledged it in a TCP transaction. Geoff Huston 43:08 Yeah, George Michaelson 43:09 that's weird. Geoff Huston 43:10 Yeah, I did. Well, it was kind of, if you want to know when queuing is starting, why not ask the router where the queue was starting? In fact, why not get the router to tell you? George Michaelson 43:21 I kind of like this, Geoff, thinking about this, because you said earlier ICMP couldn't cut it, because anyone along the path could stuff one in. But you've just inserted into this where you get the IP signal, you're going too fast, you get it before you've incurred a loss consequence, and you signal back in TCP: somebody said something, there's a bit of acknowledgement that this event took place. Is this harder to forge? Geoff Huston 43:47 Well, actually, it is hard to forge, and it's been around for years. It was an experimental RFC, then a standards track RFC, and a few folk have shown a lot of interest in it. Apple and Comcast have started a thing they call L4S: low latency, low loss, scalable throughput. [George: right] Problem is, it's a specialized solution for a specialized environment. The world doesn't do this. At APNIC, we started measuring ECN, and you know, 2 to 3% of users sit behind equipment that does ECN. It's not there.
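The layer interplay George calls out is real: the congestion-experienced mark lives in two bits of the IP header, and the echo travels back in TCP flags. A sketch of the RFC 3168 codepoints and the marking decision, where the single-threshold router policy is invented for illustration (real AQMs differ, which is part of the choice-point problem Geoff raises later):

```python
# RFC 3168 ECN codepoints: the low two bits of the IP TOS / Traffic Class byte.
NOT_ECT = 0b00  # sender does not support ECN
ECT_1   = 0b01  # ECN-Capable Transport
ECT_0   = 0b10  # ECN-Capable Transport
CE      = 0b11  # Congestion Experienced, set by a router

def router_mark(ecn_bits: int, queue_depth: int, threshold: int) -> int:
    """Instead of dropping, an ECN-aware router remarks an ECN-capable
    packet to CE when its queue is looking full. The threshold policy
    here is invented; when and where to mark is exactly the choice
    point that never got standardized behaviour in the wild."""
    if queue_depth > threshold and ecn_bits in (ECT_0, ECT_1):
        return CE
    return ecn_bits

def receiver_echo(ecn_bits: int) -> bool:
    """A TCP receiver that sees CE in the IP header sets the ECE flag
    in its ACKs, telling the sender to slow down as if a loss occurred:
    the layer violation George is pointing at."""
    return ecn_bits == CE

marked = router_mark(ECT_0, queue_depth=90, threshold=50)
print(marked == CE, receiver_echo(marked))  # True True
```

Note that a NOT_ECT packet can never be remarked; a congested router has no option but to drop it, which is why ECN only helps the flows that negotiate it end to end.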
George Michaelson 44:27 So if I go to another story we've covered on PING: other protocols have observed the behavior that when you include fields in headers of packets to say we think this might be useful later, there's a thing that's been called greasing. You need to periodically change the value you send, because otherwise intermediate systems go, that field is zero; if it's not zero, I'm throwing the packet away. Geoff Huston 44:53 Or I'm making the field zero. And ECN suffers from that: your marked IP packet gets reset. George Michaelson 44:58 Because we didn't do greasing down in the IP layer when we said how to do IP, these bit fields can't reliably be sent in the public net. But if we're talking me and my fabric, I can run systems that do this. And if we're talking me to a CDN, they could run systems that do this. Geoff Huston 45:17 If you're in a closed shop, go for it. If you're out on the big public Internet, it doesn't work. So how else do you do it? Well, I get back to BBR, the bottleneck bandwidth protocol, which is fascinating. How do I know when queues are forming? Ah, I don't get out a little hammer. I get out a big mallet, and for one round trip time I hit the network with my mallet. What do you mean? I send 25% more packets for one RTT interval. George Michaelson 45:48 A lot more than you think you really should. Geoff Huston 45:51 25% more than I was sending. [George: Yeah] And it's kind of, did you feel that, network? How do I know if you felt it? I look at the delay. Because if those 25% extra packets ended up in the queue, then for that round trip time, the measured round trip time will get a lot longer. And if that happens, I know I was already at the sweet spot. I was already at the point where queues were starting to form, because in the one interval when I dramatically went overboard, I saw an extension of the time to get the packets through, because queuing was happening.
And don't forget, in the next round trip time, send 25% less to drain the damage you just did. George Michaelson 46:37 So you know that you were close to the sweet spot? It might have been a little ahead of you, but you don't think it's radically ahead of you. Geoff Huston 46:45 Whereas, if there was no change in the round trip time, I can go 25% faster immediately. It's a very aggressive protocol. George Michaelson 46:55 It sounds like it has a risk of being greedy against other people's behavior. Geoff Huston 46:59 Again, we get back to the two things I was talking about at the start of this: efficiency and fairness. George Michaelson 47:05 Fairness. Geoff Huston 47:06 BBR does this trick once every eight round trip times. So if there's idle capacity, it won't be idle for much longer. At some point the mallet will come out, within eight round trip times, and you know, BBR will hit the network, go, can you take more traffic? So it is extremely efficient in finding the ceiling. Is it fair? Ooh. When you've got other BBR sessions, it's kind of fair, but it's not very fair. One session might end up winning, the other one losing, and you can't tell. Is it fair to all the other things that are still out there doing loss-based congestion control? George Michaelson 47:48 No. Geoff Huston 47:48 Oh, yes, exactly. It kind of overwhelms them, because it's more aggressive. [George: right] Your elbows are longer, a lot sharper, and they reach further. You tend to put undue pressure on everyone else. And the designers of BBR have gone through BBR version 2, which was kind of too wimpy. It sort of cut off the aggressive parts, and in the end it deferred all the bandwidth to the loss-based systems, and didn't end up with much favor. George Michaelson 48:17 So again, this is experimentally driven, [Geoff: Oh, yeah] and we tune this, right? BBR's hunting for a magic point.
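The probing Geoff describes is a gain cycle: one round trip in eight at 25% above the estimated bottleneck rate, the next 25% below to drain what that queued, and a look at whether the measured round trip time stretched. A toy sketch of that decision logic, where the gain values and eight-RTT cycle come from the conversation, but the function shapes and the 5% stretch tolerance are my inventions, not Google's BBR implementation:

```python
PROBE_UP = 1.25    # one RTT at 25% above the estimated bandwidth
DRAIN    = 0.75    # the next RTT at 25% below, to drain what we queued
CRUISE   = 1.0
CYCLE = [PROBE_UP, DRAIN] + [CRUISE] * 6   # an eight-round-trip cycle

def next_rate(est_bw: float, rtt_index: int) -> float:
    """Pacing rate for this RTT, following the eight-RTT gain cycle."""
    return est_bw * CYCLE[rtt_index % len(CYCLE)]

def update_estimate(est_bw: float, base_rtt: float, measured_rtt: float) -> float:
    """After a probe RTT: if the measured round trip stretched, the extra
    packets sat in a queue, so we were already at the sweet spot. If it
    didn't stretch, there was headroom, and we claim it."""
    if measured_rtt > base_rtt * 1.05:   # 5% tolerance: an invented threshold
        return est_bw                     # queues formed: hold the estimate
    return est_bw * PROBE_UP              # no queuing: raise the estimate
```

A flow whose probe comes back with no stretch in the round trip time ratchets its estimate up by 25% per cycle, which is why the protocol is so efficient at finding the ceiling, and why its elbows land so hard on loss-based flows sharing the link.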
And that point is: be a little more aggressive to test if there's room, but do a better job of fairness when you decide you need to coexist. That's really quite an interesting outcome, Geoff. That's going to an interesting place. Geoff Huston 48:39 Well, that's BBR v3, which is still being worked on. But, you know, tweaking is happening. From a purely selfish perspective, I really like BBR v1, but, you know, it's like driving a bulldozer: everyone else clears out because they've got no choice. So it's kind of fascinating, George, that the answer is not to put more stuff in the network to make the network operate fairly and efficiently. It's actually to be clever in the endpoints and treat the network as a very simple, very dumb set of switches and queues. And to my mind, that's been the entire magic of the Internet protocol for the last 30-odd years. And what do you say to SRv6 and RSVP and TE and so on? You might have a use case in your private world. That's fine, [George: yeah], but if you try and think this is good for everyone, [George: no] I think that's delusional. Yeah. It's just not true. George Michaelson 49:36 The ECN story in this is perhaps a little unfortunate. If we'd had the concept of greasing, and if we had a clear channel signal for something mechanistically simple, like ECN, I personally think that might have been nice. But, you know, wanting a rainbow pony doesn't get you one, right? This is the world we live in. Geoff Huston 49:56 Do you do the ECN marking when the queue is full? When the queue is empty? When the queue is half full? And I'll tell you what, if there's a choice, everyone will have a different choice point. What does ECN marking actually mean? Should I react quickly? Do I have time?
And so all those uncertainties, I think, sort of came with ECN. It was an interesting approach 20-odd years ago, and it's interesting in a closed world, if you can define the space absolutely, [George: yeah], but out there on the big public Internet, no, no. Push it out to the edges. Make the end TCP protocol engines do all the work, and they're quite capable of doing a really good job. George Michaelson 50:36 Yeah. I think it sounds like people should maybe tune into the recording of Lincoln's talk at AUSNOG and get a handle on how Amazon's doing this. Geoff Huston 50:44 Assuming they recorded it. No promises here, but you might want to listen to some of the talks about high speed data center switching, and there's a few going around right now about, you know, what does speed mean, and how do we achieve it? And part of this is we've got so specialized these days, in those data centers, that you can't just use commodity rack equipment. You really are saying, if I want terabits running through this equipment, I need to go very close to customized solutions. George Michaelson 51:14 But those customized solutions, it turns out, are significantly simpler in some senses. Geoff Huston 51:22 Oh yes, it's like a racing car. You take out all the bits of comfort, and there's nothing left: I switch and I have memory. What else do you do? I don't know. There's nothing else to do. George Michaelson 51:32 Yeah. Geoff, that's been fascinating. Thank you. Geoff Huston 51:35 It's a pleasure, George, and I hope the listener has found that interesting. You can find a write-up of my work trying to measure ECN on the APNIC blog. I think it'll be out very soon, certainly by the time this is aired, about our efforts to try and capture ECN in the wild and what it might mean. George Michaelson 51:52 We'll make sure that references are put into the blog that goes with the podcast. Thank you, Geoff. Geoff Huston 51:56 Thank you, George.
George Michaelson 51:59 If you've got a story or research to share here on PING, why not get in contact by email to ping@apnic.net or via the APNIC social media channels. Also remember the measurement@apnic.net mailing list on Orbit is there to discuss and share relevant collaborative opportunities, grants and funding opportunities, jobs and graduate placements, or to seek feedback from the community on your own measurement projects. Be sure to check out the APNIC website for all your resource and community needs. Until next time.