Geoff Huston 0:00 You know, it's often said in places in the world where you have to toot your horn to stop accidents, that you know, the person who has an accident wasn't tooting often enough and loud enough, and that was the problem. And the problem with this bit bonding wasn't the fact that there were engineers down in the voice systems and so on doing things on the circuit. It's the fact that you didn't cross all your toes and all your fingers enough and weren't genuflecting often enough. It's all your fault. Equally silly in terms of attribution engineering. But you're right. You know, trying to fit a more stringent application on top of a system engineered for something completely different was always going to be difficult. But the point was, the only way you got better speed out of this system, at least at that level, scunging voice circuits, was to try and sort of sticky tape the voice circuits together. George Michaelson 0:56 You're listening to Ping, a podcast by APNIC, discussing all things related to measuring the Internet. I'm your host, George Michaelson. This time, I'm talking to Geoff Huston from APNIC Labs again in his regular monthly spot on Ping. Geoff and I talked about making things go faster than the visible bandwidth of a single link. How is that even possible? We've been using the same basic techniques to manage this problem since the dawn of digital communication. We send different bits of the data stream down different links and reconstruct it at the other side. When you come down to it, there isn't really a good alternative, but this introduces a problem of sequencing. What do you do when the bits arrive out of order, or worse yet, not at all? Different protocols at different layers in the protocol stack have now come to the fore in this engineering problem. Geoff reviews some of the past history and the kinds of problems we're now hitting as the underlying network gets faster and faster. Geoff, welcome back to Ping.
What should we talk about this time? Geoff Huston 2:03 Oh, hi, George. Look, it's good to be back today. I actually want to talk about, how can you make things go faster than the components underneath them? George Michaelson 2:12 Throw them harder. Geoff Huston 2:13 Yeah, blow at it harder. How do you kind of extract greater performance than any individual element can? And this is not a problem that's unique to networking. I remember in the days of mainframe computers, which, you know, there are still a few of them around, I guess, and the way they kind of solved the problem of, how do you make a bigger, faster computer, was kind of put two of them together. Double the speed? That's not linear speed, but it certainly doubled the throughput. George Michaelson 2:43 Do you remember Seymour Cray's comment about people who were ganging up processors? He said, would you rather have two oxen or a million chickens pulling your plow? So there's kind of history here that scaling up by doing more at the same time, there's some skepticism. Geoff Huston 2:59 I have news for Seymour, I really do. And the answer is a retrospective hi from a million-chicken world, because, you know, we went there. George Michaelson 3:10 Let's put it back in the network context, though. What about a network and the limit of speed that you've got available? What is it here that's allowing you to get faster than you think you can? Geoff Huston 3:21 Well, it's exactly the same approach as with computing, that, in essence, if you're trying to get more jobs through a computer, and not necessarily this job to go faster, but just more jobs per hour, then assigning more computers to the problem, more processors, etc., going parallel, actually solves your problem. Now, in some ways, it's not the most efficient solution. Typically, when you scale up, you want benefits that are way in excess of linear, and simply adding more, plus one plus one plus one, is more of a linear problem.
It doesn't make the unit costs any cheaper, but it does solve your problem. Now, we've encountered this in networking probably from the year dot. And by dot, I mean, you know, somewhere around the late 70s, early 80s, when we started building computer networks leeching parasitically on the back of telephone networks. And at the time, the telephone networks were just going through a tranche of digitalization, in that rather than carrying analog sort of streams of signal for each voice conversation, we actually digitized them. And inside the telephone network, if you could peel apart all these virtual switching layers, what you found is that each conversation was actually a stream of bits, right? Yeah. And the way the human voice works, most human voices, is that, as long as kind of the basic aim is intelligibility and clarity, you can digitize a human voice stream by doing, oh, 8000 samples a second. George Michaelson 4:58 Really quite a small number. When you think about it, it's not a massive amount of sampling, is it? Geoff Huston 5:04 Well, George, I don't know about your soprano voice, but mine wouldn't even make eight kilohertz. George Michaelson 5:11 Yeah. The thing is, both of us worked at the tail end of this process in telecommunications-related roles. I was working for a minor player, while you worked for the major player, the former monopoly telco, and we actually exploited a niche in the market of ideas here. You, the big guys, you were meant to use the full width of a voice channel to send a message between two points long distance, because otherwise there would be too much quality loss, this weird idea called the QDU. And we, as the on-top carrier, we were allowed to compress and send either two or four voices down the same channel. And it was a really funny business working out, oh, the quality was not good at the four-channel end of things. Geoff Huston 5:58 Yes, as I was getting to: 8000 samples a second.
And in the original spec, each sample effectively said, I'll give you one of 256 different volume levels. And so every 8000th of a second, it sent an eight-bit value, which was the volume at that point. And that would actually accurately reproduce the signal of anything up to four kilohertz, and your best soprano voice and mine can't go above about 3.9 kilohertz. So it kind of worked, right? But the issue is, yes, you can, with enough cleverness in the way you encode it and decode it, cram more in. And as soon as sort of this standard came along, then along came seven bits at 8000 hertz, probably enough, used by the Americans, 56 kilobits. And then came four bits at 8000 hertz, which is two-to-one compression. And it even got further than this, actually getting down to two bits at 8000 hertz, which is one of the higher compression algorithms, and they used that in the mobile phone industry, and just sort of squished the voice down to the point where, as you say, it was pretty heavily distorted, [George: yeah] but by God, it made use of the network. Okay? George Michaelson 7:18 We stopped talking about QDUs very rapidly in this world. We stopped actually having some idea of the amount of nastiness in the voice call we would tolerate, because we realized people would actually tolerate a hell of a lot more nastiness for a cheaper call. Geoff Huston 7:36 The telco benchmark, I think, was 13 QDUs for any conversation, including international, which was a bit of a joke, really. But I come back to our story. You see, we started networking as an overlay across basic circuitry, which was, I'll strap up A to B and I'll nail up a voice call, because we are a voice switching network. And if you lived in America, that was the 56 kilobit per second line: seven bits and 8000 cycles a second.
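The codec arithmetic Geoff is walking through is just sample rate times bits per sample. A minimal sketch of the rates he lists (the function name is ours, not anything from a telephony standard):

```python
# Voice channel bit rates implied by sample rate x bits per sample.
def bit_rate_bps(samples_per_second: int, bits_per_sample: int) -> int:
    """Raw PCM bit rate: samples per second times bits per sample."""
    return samples_per_second * bits_per_sample

# 8 bits at 8000 samples/sec: the classic 64 kbit/s voice channel.
assert bit_rate_bps(8000, 8) == 64_000
# 7 bits at 8000: the American 56 kbit/s channel mentioned above.
assert bit_rate_bps(8000, 7) == 56_000
# 4 bits at 8000: 32 kbit/s, two-to-one compression.
assert bit_rate_bps(8000, 4) == 32_000
# 2 bits at 8000: 16 kbit/s, the heavily compressed mobile case.
assert bit_rate_bps(8000, 2) == 16_000
```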
Oddly enough, because of framing and signaling, in the sort of European slash British oriented world, and that included Australia, you got 48 kilobits, because they stole two bits on each frame for signaling and everything else. So as fast as you could go from A to B was 48 kilobits. Interesting. I want to go faster. Fascinating. What are we going to do? Now, I can't give you a single 96 kilobit per second circuit, because I don't have one. I can give you two 48 kilobit per second circuits. Great. Okay, what can I do with them? George Michaelson 8:45 How do you manage having two things that are kind of half what you want? Can you use them in a way to make it look like one thing? Geoff Huston 8:55 Well, that was the first kind of approach, and it persisted for quite some time in the industry, but only really worked at the kilobit per second rate. It was called bonding, where you actually look at these two circuits and you assume, fingers crossed, toes crossed, you know, face in the appropriate direction and genuflect many times, that both circuits will remain absolutely stable. And then you go with your bits, 1 0 1 0, your A B, A B, A B, A B, and hope the other end locks in and that it sort of puts the bits back together in precisely the right order. George Michaelson 9:32 Yeah, because an A A, B B moment isn't doing you any favors whatsoever here, right? You need to be able to reliably sequence everything you route down this. Geoff Huston 9:43 Yes, I remember we did a link from Melbourne to Perth for about a year and a half on bonded 64k circuits. We managed to get as high as 256 kilobits per second, but beyond that, it's not going to go. So bit-level bonding kind of works, but the underlying fabric needs to be absolutely rock steady, because otherwise the receiver is kind of clueless. George Michaelson 10:06 Yeah, and you have to also remember, this is technology that was never, ever constructed and built out to be clean for digital. It was built for voice.
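Bit-level bonding as Geoff describes it is a round-robin stripe of the bit stream across circuits, with reassembly that only works while every circuit stays in lockstep. A toy sketch (all names are ours), assuming perfectly stable lanes:

```python
def bond_split(bits, n_circuits=2):
    """Stripe a bit stream round-robin (A, B, A, B, ...) across circuits."""
    lanes = [[] for _ in range(n_circuits)]
    for i, b in enumerate(bits):
        lanes[i % n_circuits].append(b)
    return lanes

def bond_merge(lanes):
    """Receiver reassembles by reading one bit from each circuit in turn.
    This only works while the circuits stay in lockstep: any slip and the
    receiver has no way to tell A-bits from B-bits."""
    out = []
    longest = max(len(lane) for lane in lanes)
    for i in range(longest):
        for lane in lanes:
            if i < len(lane):
                out.append(lane[i])
    return out

stream = [1, 0, 1, 1, 0, 0, 1, 0]
assert bond_merge(bond_split(stream)) == stream  # lockstep holds: bits come back in order
```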
It was built for analog signaling. And an engineer in a switching fabric could have reasons that he needs to randomly put another 200 meters of cable between two points in this system, for a number of reasons. These two things might no longer actually have the same imaginary length between A and B. There are any number of reasons they could get out of sync with each other. Geoff Huston 10:37 You know, it's often said in places in the world where you have to toot your horn to stop accidents that, you know, the person who has an accident wasn't tooting often enough and loud enough, and that was the problem. And the problem with this bit bonding wasn't the fact that there were engineers down in the voice systems and so on, doing things on the circuits. It's the fact that you didn't cross all your toes and all your fingers enough and weren't genuflecting enough. Yeah, it's all your fault. Equally silly in terms of attribution engineering. But you're right. You know, trying to fit a more stringent application on top of a system engineered for something completely different was always going to be difficult. But the point was, the only way you got better speed out of this system, at least at that level, scunging voice circuits, was to try and sort of sticky tape the voice circuits together. George Michaelson 11:27 Yeah, use them in a way that, although you absolutely knew at some lower level these are four discretely different things, you treated them, through infrastructure costs that you had to wear at both ends, as one fatter pipe, four times the width, probably losing a bit of overhead managing this thing, so maybe three and a bit times better. But Geoff Huston 11:49 yeah, you can actually get all four if you use the technique in a router rather than in a bit-level driver. So instead of down at the line driver circuit, where you're going A B, A B, Christ, I hope this is working, A, B, was it A, or was it B?
A, B. Instead, up at the router, you have interface A and interface B, and your currency is now packets, [George: right]. And so you can take these individual packets that you're switching through the router, and a naive person, a very naive person, and silly, as it turns out, might simply go, you're all destined to the same next hop. You're all going to go to the B end, whatever your ultimate destination would be. Packet A, interface one; packet B, interface two; 1 2 1 2 1 2. And because the maximum size of packets is actually pretty low, 1500 bytes or so, the two systems won't get out of order that much. You know, you won't be transmitting a huge packet for three hours down one line while the other line is taking all the load. You won't be doing that. When you do sort of packet-level, you know, alternate switching, things don't get out of order too much, but they do get out of order, right? George Michaelson 13:00 So there is a cost consequence emerging here. Geoff Huston 13:04 Well, let's think about now going up a level in the protocol stack to our good old friend TCP. You see, TCP is the piece of magic. So far, when I'm doing A B, A B, packets might get out of order. I have violated none of the rules of IP. You know the golden rule? It's a datagram network. A, I'm allowed to drop packets. What? Of course I am, every packet is an adventure. B, I'm allowed to reorder them. Really? Yes, yep, the datagram. George Michaelson 13:33 No guarantee of order of delivery, if you're just at the IP layer. Geoff Huston 13:37 Yeah, no guarantee they won't get duplicated. All kinds of things happen at IP, and all of them are legal. All of them. It's TCP's job to try and make sense and say, here's an ordered sequence of packets, 1 2 3 4 5 6, and present them to the application correctly. It's the engines at either end of this network connection that have all the work to do, right?
So they're sitting there analyzing individual TCP packets that come to them, going, I guess you're the next one in order. That's good. I'll acknowledge it. Tick, send back an ACK. You know, I've got this packet. And that's how you kind of reassemble things. And when you go 1 2 4 5 6, it's going, oh, I'm missing number three. Hello, hello. The last good packet I got was packet two. Let's resume at packet two and press on, which seems a bit useless, but it kind of works, right? George Michaelson 14:32 Well, it kind of works, but it incurs this problem that it's almost like you throw away all the subsequent ACKs. If only you could come up with an engine that said, I'm going to hang on to things a little bit longer and maybe ask just for the hole to be filled in, you could do a little better. But that means you've got to hang on to things, you're adding delay. Geoff Huston 14:52 Well, there's an odd part about all of this. So in TCP, an out-of-order packet, a packet that kind of isn't the next one in sequence, is treated as, well, I'll hang on to it, but I'm going to send you back an indication that there are lost packets. So when I receive packets 1 2 4 5 6, the real information I need to convey back to you is, I lost it at packet three. I was expecting three. I got two, but I didn't get three. Whatever else you've sent, I'm missing three. So what I do is, for 2 4 5 6, I keep ACKing two. The ACK stream goes: one, ACK; two, ACK. When I receive four, I go, two, ACK. When I receive five, I go, two, ACK. When I receive six, I'm shouting at you. Doesn't matter. Well, it does. The theory is that you, because you keep a copy of everything you sent until it gets ACKed, realize eventually, because, you know, I'm glossing over a bit at this point, that three hasn't been ACKed. So you move back your send pointer and you send three.
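The cumulative ACK behavior Geoff walks through, where a hole in the sequence produces a run of duplicate ACKs, can be sketched as a toy receiver (names are ours, not real TCP code):

```python
def cumulative_acks(arrivals):
    """For each arriving segment number, emit the cumulative ACK a classic
    TCP receiver would send: the highest in-order segment seen so far.
    Out-of-order segments are buffered; a hole produces duplicate ACKs."""
    buffered = set()
    highest_in_order = 0
    acks = []
    for seg in arrivals:
        buffered.add(seg)
        # advance the in-order pointer across any now-contiguous segments
        while highest_in_order + 1 in buffered:
            highest_in_order += 1
        acks.append(highest_in_order)
    return acks

# Segments 1 2 4 5 6 arrive: the ACK stream is 1, 2, 2, 2, 2 --
# three duplicate ACKs of 2 signal that 3 is missing.
assert cumulative_acks([1, 2, 4, 5, 6]) == [1, 2, 2, 2, 2]
# Retransmitting just 3 fills the hole and the ACK jumps straight to 6.
assert cumulative_acks([1, 2, 4, 5, 6, 3]) == [1, 2, 2, 2, 2, 6]
```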
Now, if you just send three and nothing else, in this sort of theoretical model, I put three in the hole that I have, [George: yeah], and then I notice that I've received three, and I've got four, and I've got five, and I've got six: ACK six. George Michaelson 16:19 Right, so you can fill in the gap at the cost of keeping enough buffer to hold those packets, right, and there's a bit of delay here. Facing upwards, you can't pass things on in clean conscience till that hole has been filled. Geoff Huston 16:33 Right. So you need a fair deal of knowledge and timing sensitivity that, you know, quite frankly, doesn't exist very much. So out-of-order packets are a nightmare. TCP gets horrendously confused. They cost, because out-of-order packets are assumed to be lost packets, and in the worst case you go back to the loss point and send everything again. George Michaelson 16:53 Yep. So I figure, Geoff, based on your line of reasoning, that where we're going with this is, well, if you think you get out-of-order packets when you've got one thing underneath you, what do you think happens when you've got two or three or four, right? Is that where we're going? Geoff Huston 17:10 Well, that's part of it. The problem gets magnificently worse for TCP, right? Magnificently worse, because once you get three duplicate ACKs in TCP, the conventional TCP, the TCP of the biblical age, says, ah, that's catastrophic. Shut down everything. Start afresh. Let's do a slow start again, from one packet. No, no, no, not anymore. Close that stuff down. Shut down, because I've just had three duplicate ACKs. So this is a disaster. One packet again. George Michaelson 17:44 Three in a row. Hello. That's not good. Geoff Huston 17:47 So you were ganging together four lines, and if you lost one of the packets on one of the lines, you guaranteed you were going to get three duplicate ACKs, if you're just doing it that way.
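The "biblical age" sender reaction Geoff describes, treating three duplicate ACKs as catastrophe, might be sketched roughly like this: a Tahoe-flavored caricature, not any real stack's code, with all names ours:

```python
def react_to_acks(acks, cwnd: int):
    """Old-school reaction as Geoff describes it: count duplicate ACKs,
    and on the third duplicate treat it as loss -- retransmit the missing
    segment and collapse the congestion window for a fresh slow start."""
    dup_count = 0
    last_ack = None
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:
                # the segment after the duplicated ACK is the missing one
                return {"retransmit": ack + 1, "cwnd": 1}
        else:
            dup_count = 0
            last_ack = ack
    return {"retransmit": None, "cwnd": cwnd}

# The ACK stream 1, 2, 2, 2, 2 carries three duplicates of 2:
# segment 3 is retransmitted and the window collapses to one.
assert react_to_acks([1, 2, 2, 2, 2], cwnd=10) == {"retransmit": 3, "cwnd": 1}
```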
So you think about this, and you think, something's missing, and this is not working for equal cost multi path. What can we do? George Michaelson 18:03 Equal cost multi path? That's kind of the name for the situation when, you know, you've got four things, and you think they're pretty much the same weight, and you think they're the same delay, and you want to treat them as splits of something you're going to gang up together to make a fatter unit to get between two places. Geoff Huston 18:22 I've got four 64k circuits, and I want to mimic a quarter of a meg. I've got four ten gig circuits, and I want to mimic 40 gig. Yeah, same problem, just 1000 times faster, but same problem. George Michaelson 18:34 So, equal cost multi path. That's a great phrase. I'm going to hang on to that. Geoff Huston 18:38 Well, fair enough. And the issue is, what could you do about it? Now, before I talk about going on with TCP, I just want to talk for a second about that evil word fragmentation. IP fragmentation. George Michaelson 18:53 Oh, we have talked about that. We have talked about that so many times, particularly about IPv6. Geoff Huston 19:01 Yes, I was gonna say evil, particularly if you've got your v6 glasses on your head. Yes, very evil. But oddly enough, it is actually amazingly resilient for packet reordering. George Michaelson 19:14 Wait, run that one again. Frags are bad, but, but Geoff Huston 19:19 when I effectively take a set of frags of one packet. So I have a single packet, and I apply the fragmentation dicer, and I slice and dice this packet, and then I send them to you in purely random order. No attempt to do any kind of ordering, none. And as long as I send them continuously, I sort of push it out through all my paths simultaneously. Yeah, the other end goes, yummy, yummy, yummy, not a problem, urgle urgle urgle, and up a layer is the completed packet.
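The reason shuffled fragments reassemble cleanly is the point Geoff makes next: each fragment carries its position within the original packet. A toy version (a plain byte offset stands in for IP's fragment offset field, which is counted in 8-byte units; all names are ours):

```python
import random

def fragment(payload: bytes, size: int):
    """Dice a packet into (offset, chunk) fragments. Real IP fragments
    carry an offset plus a more-fragments flag; a plain byte offset is
    enough for this sketch."""
    return [(off, payload[off:off + size]) for off in range(0, len(payload), size)]

def reassemble(frags, total_len: int):
    """Arrival order is irrelevant: each fragment says where it fits."""
    buf = bytearray(total_len)
    for off, chunk in frags:
        buf[off:off + len(chunk)] = chunk
    return bytes(buf)

packet = bytes(range(100))
frags = fragment(packet, 16)
random.shuffle(frags)          # send them in purely random order
assert reassemble(frags, len(packet)) == packet
```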
George Michaelson 19:53 So it'll de-duplicate them, it'll reorder them, and it'll pass up a valid packet, if you just do your worst and chuck the fragments out there, right? Geoff Huston 20:03 Because there's a piece of information in an IP fragment that isn't necessarily there in a TCP segment. And it's actually, if you will, the fragment's address within the packet that you just sliced and diced. Hi, I'm fragment number three. I'm fragment number seven. I'm fragment number two. George Michaelson 20:22 The fragment knows where it fits in the entire packet. You know, you're building up a picture of what you've got and what you're missing. Geoff Huston 20:29 Yes, and if you actually had used that kind of sequencing addressing in TCP, TCP would be equally resilient. But we didn't do that. We used a single sequence counter, which doesn't give you much information at all, whereas IP fragments actually use two: I'm fragment number n of m. Fascinating. So one of the ways to get around this is to fragment everything. George Michaelson 20:56 I don't think we're really going there. Geoff Huston 21:00 I'm not, I'm not. The other way of doing this, actually, is to make, I suppose, an assumption which the purists of the world, and I think the last purist died in about 1980, the purists of the world would say is a layer violation. You reach in and look at the header of the transport session. Oh, look, there's a source address and a destination address, that's in IP; there's a source port and a destination port, no, that's a TCP or UDP construct; and there's a protocol number, I'm TCP or I'm UDP. So the protocol number, the two port numbers and the two IP addresses are actually a unique signature of a session. [George: Yeah]. So if you and I are having two conversations simultaneously, one of those values would be different, [George: right] probably one of the port numbers.
So George Michaelson 21:51 the packets arrive at the air quotes same place, because the same source and destination address are on it. And it's possible one of the port numbers is the same port number, but one of the other ones is going to be different, because someone has said, give me a random port to send some stuff, and you get a different random number. Geoff Huston 22:09 So let's say I take this value now. That's two addresses, 32 bits each, that's 64 bits; two port numbers, that's another 32 bits; 64 plus 32, plus eight for the protocol number, I can't add this stuff this evening, yep, 104 bits. And I hash it. Let's say I've got four lines: zero, one, two, and three. I hash this into a value between zero and three. George Michaelson 22:38 We're going to have to start calling this podcast the hash podcast, because you were talking about hashes in the context of NSEC3 DNS last time, Geoff. Hashes are unbelievably powerful and useful, aren't they? Geoff Huston 22:52 We should pay mathematicians more money. They're brilliant people, the lifeblood of civilization. I speak as a reformed mathematician. George Michaelson 23:01 I was just about to say, what is your degree classification? So you take the five tuple values, you get a 104-bit number, and you generate a hash from it. Geoff Huston 23:10 And the same set of values in that 104 bits will always give you the same hash, always, right? But a random selection of ports and, you know, destination addresses, etc., will give you a pretty good distribution across your, you know, zero to three, across your four ports. So what this means, oddly enough, is, if you can do this quickly, and it's really easy to do, take 104 bits and hash them, return back basically two bits, zero through three, or however many I want, and that's the selection of which interface to use. Every session will go down the same interface all the time.
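The five-tuple hashing Geoff describes can be sketched like this. SHA-256 is just a convenient stand-in for whatever hash a real router implements in silicon, and all the names and addresses are ours:

```python
import hashlib

def pick_interface(src_ip, dst_ip, src_port, dst_port, proto, n_links=4):
    """Hash the five-tuple down to a link index. The same session always
    hashes to the same link, so its packets never get reordered across links."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_links

# The same session always lands on the same link...
a = pick_interface("192.0.2.1", "198.51.100.7", 40001, 443, "tcp")
assert a == pick_interface("192.0.2.1", "198.51.100.7", 40001, 443, "tcp")
assert 0 <= a < 4

# ...while many sessions with varying source ports spread across the links.
links = {pick_interface("192.0.2.1", "198.51.100.7", p, 443, "tcp")
         for p in range(40000, 40100)}
assert len(links) == 4
```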
George Michaelson 23:49 Wait, wait, wait, wait. So if I'm trying to load balance things happening through me, and I've got a cheap way of making two bits, four values in two bits pretty much cover an even distribution of 00 01 10 11, and the other thing you just said, the other part of this deal, is that if I am a particular pattern of source, destination, random port, specific port, I always wind up in the same two-bit value, which means, if you're using that to address the port, I will always go down the same line, always. Geoff Huston 24:28 Right. So if I've got equal cost multi path across four 10 gig paths, a single session will only ever see a 10 gig path. But if I've got a whole bunch of traffic going down these four by 10 gig circuits, which are equal cost multi path, a bunch of traffic, the aggregate throughput I can push up to 40 gig. George Michaelson 24:49 You're moving the goal posts a little. I mean, not in a bad way. The net outcome is beneficial. You make efficient use of all the bandwidth you've got available. But no individual person, for one TCP session, is getting 40 gig. That ain't gonna happen. Geoff Huston 25:07 That's a very magic phrase you just said there, George: no individual TCP session can get more than one of the individual paths in my path group. George Michaelson 25:18 On the other hand, you are using your four links absolutely as efficiently as possible, even spread of traffic, very few wasted moments. It's quite simple for TCP to reconstruct what's gone on down that pipe. Geoff Huston 25:33 Well, it's in full sequence, no out-of-order packets. So for a little bit of layer violation, little bit of layer violation, routers are now looking at TCP headers, whoops, and UDP, a little bit of that, I get back an amazing, an amazing amount of payback. You know, how do we do these days, 800 gig circuitry? George Michaelson 25:53 Well, you can't buy 800 gig things off the counter.
Can you? You have to buy multiples and gang them up together somehow. Geoff Huston 26:01 You just said it. That's how we do it. And so a lot of this actually relies on this ability at the network layer to look inside the end-to-end transport layer and go, look, I don't want to be a meanie here. I don't want to be a bad person. I will try and keep sessions together so that I don't give you a flurry of out-of-order packets, and then we're cool, aren't we? And the answer is, well, basically, apart from a few folk bleating in the corner down the back of the room, going, no, I'm not happy. George Michaelson 26:30 I'm not happy. I wanted all of it. Geoff Huston 26:33 I wanted all of it. George Michaelson 26:34 Can you do better? I mean, that's the golden question, right? Is there a moment out there where we could do better? Geoff Huston 26:41 Well, this is the issue of where do you want to do it, and by where, I mean network, transport, application. Of course you can do better. You've just got to look at the right spot. And in this particular case, if you're using TCP, if you're using TCP and you really want to get much greater bandwidth across some of these scenarios, you use a very elegant, and I'll call it a hack, but it's actually an elegant piece of engineering, called multi path TCP. George Michaelson 27:10 So we were on equal cost multi path IP, and we are now on multi path, Geoff Huston 27:17 multi path IP, George Michaelson 27:19 multi path TCP, Geoff Huston 27:21 right, where I establish multiple TCP sessions between the same two end points. Now, they'll have the same source and destination IP, whoopee doo, but they'll have a different port number in those port pairs. That's how you distinguish them. So I get a session to you with port one, another session with port two, another with port three, oh, port 128, who cares? George Michaelson 27:45 Yeah, it does not have to be that the port numbers are in a sequence; that is not part of the magic sauce here.
The point is that there are four things each of us knows is the other party. And if we talk into them, the other party gets what comes out. But Geoff, these are TCP sessions, so each of them is a reliable stream of bits, and I've now got to divide what I want to do into those four things, right? I mean, am I having to consciously chunk my stuff up into this? How do I do this? Geoff Huston 28:16 Yes, you are going to chunk your stuff up, and you're going to assign a chunk down a path. And so you've got independent TCP control sessions, and your data, you take it, you chunk it up, and you assign a chunk to a single path. And that works actually better than you thought, much better than you ever possibly could believe. Why? Ah, there's this thing about TCP and friendliness. You see, what TCP tries to do when there are multiple independent TCP sessions, let's say there are 20 of them, is to equilibrate amongst them such that each independent TCP session gets its fair share, 1/20 of the bandwidth. George Michaelson 28:58 Yeah. Now there's a twist here. We've talked a few times on Ping about how that really depends on all the TCPs having a similar idea of what fair means. But let's push that to one side, that they all agree about what fair means. They do a fair shares div between them. Geoff Huston 29:18 And with my single path TCP, I'll get one nth of the common network resource, tops. [George: Yeah] With two TCP sessions, I'll get, hmm, two nths. George Michaelson 29:31 Because for the network, it doesn't know that these two things correlate. As far as it's concerned, it's got to work with your TCP to make fair shares happen, so you get two of the units. Geoff Huston 29:43 And in fact, it's not a conversation between me and the network. TCP fair sharing is a conversation between me and all the other TCP sessions. So if I can exert more pressure on the network, I get more value. And by having, oh, 20, 100, 1000 multiple TCP sessions, I'm the bully on the block.
I'm pushing everything else to one side. So not only does it kind of get me around this issue of I can only go the speed of a single path in a multi path environment, I can actually dominate the multi path environment, woo hoo, and there's no way the network can actually arbitrate that without introducing more complexity. George Michaelson 30:28 Right. But I've now got complexity. So I want to send two gigabytes of buffer to you, and I've constructed four TCP sessions between you and me. What tricks have we got on the shelf that will allow us to divvy these up amongst the TCP sessions? Do we do a version, before pushing it out to TCP, of that trick you were doing, hashing on values and using the bottom bits as a distribution across the paths? How do I manage this, Geoff? I've got to do it. Geoff Huston 30:59 Oh, now you can actually do this in much the same way as BitTorrent works. A file is just a sequence of blocks on a disk, and let's think of it as a queue of transfer requests. Each one is a standard size, whatever the block size of the file system might be. Let's say it's a kilobyte, so there's a queue of one kilobyte requests. Now, I have 10 multipath TCP sessions. So I assign the first 10 blocks to the first 10 TCP sessions. Off we go, yep, [George: yep]. The first one to finish its block gets the next block. The second one to finish gets the next block. And so I don't pre-determine which block goes through which path. [George: right] If one path is awesomely fast, it gets more blocks. And if one path is awesomely slow, it gets far fewer blocks. George Michaelson 31:49 Right. And I've sat watching the visual display of how your blocks are moving in a BitTorrent type situation, I use it to download ISO images when I'm doing operating system upgrades, and it's kind of a weird scatter gun of blocks being sent. It's not a linear order. So it fits the file model, Geoff, on the presumption I need all of the file to get my job done.
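The BitTorrent-style scheduler Geoff describes, where whichever path finishes its block first gets the next one, can be sketched as a small simulation. The blocks-per-second speed model and all the names here are ours:

```python
import heapq

def schedule_blocks(n_blocks: int, path_speeds):
    """Assign file blocks to paths: whichever path finishes its current
    block first gets the next one. path_speeds are blocks per second;
    returns how many blocks each path ends up carrying."""
    # heap of (time this path next becomes free, path index)
    free_at = [(0.0, i) for i in range(len(path_speeds))]
    heapq.heapify(free_at)
    carried = [0] * len(path_speeds)
    for _ in range(n_blocks):
        t, i = heapq.heappop(free_at)
        carried[i] += 1
        heapq.heappush(free_at, (t + 1.0 / path_speeds[i], i))
    return carried

# A path ten times faster than its partner ends up carrying roughly
# ten times the blocks, with no block pre-assignment needed.
fast, slow = schedule_blocks(110, [10.0, 1.0])
assert fast > slow
assert fast + slow == 110
```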
But there are use cases in data transfer where this isn't necessarily a brilliant fit. Sometimes you want to work on partial content, and you've now got to back off and think, okay, I've got limits on exactly what partial means here, because if you want to work on the first 10 gig of this file, I can't randomly send all of the back of it. I've got to know that that's what you want to do. So there is a bit of kind of application awareness complexity that comes in here. Geoff Huston 32:40 Right. But it certainly solves a problem, doesn't it? George Michaelson 32:43 Oh, my goodness, I'm getting this file through 10 times faster. Geoff Huston 32:47 Exactly, for a particular class of problem. By utilizing the application layer, you can put more pressure on the network layer to actually improve your position against everyone else. Mind how you do it, but yes, improve your position. George Michaelson 33:02 Is this a theoretical model, or is this something people actually went out and coded? Geoff Huston 33:06 Oh, people coded it. Multi path TCP is out there in the RFC standards. There are implementations. I dare say, if you fossick around, you'll find it, certainly for things like Linux. It was said that Apple implemented it for Siri, and people looking at Siri with a packet tracer actually said, yes, Siri was implemented using multi path TCP. Bizarre. George Michaelson 33:28 I think I remember a conversation with you about how quickly Apple were able to ship downloads to people, and it looked like, if your handset, your phone, knew it had both WiFi and cellular, they were doing some things to use both channels simultaneously to fetch those assets. Geoff Huston 33:44 And you're willing to, if you will, tell the device that the cost was the same in dollars. [George: Yeah] Use them both. And as soon as you could say that, use them both.
I can actually use the WiFi and the cellular data connections and run the two together, and whatever's faster just gets more data, but I can use them both at the same time. It's quite clever. George Michaelson 34:05 Wow. So this is not a theoretical thing. This is out there in the world as a real way of using more bandwidth underneath. Geoff Huston 34:14 It didn't get an awful lot of traction. Siri was almost the only app that used it in the Apple ecosystem at the time. I'm not sure it's expanded any further, but it's sort of an interesting case in point. But I want to progress the story. George Michaelson 34:26 There's more. Geoff Huston 34:28 Oh god, there's more. You see, the next thing to come along, and it's now been 10 years, is QUIC, right? George Michaelson 34:34 QUIC being another mechanism to get reliable bytes between things. But it's kind of not TCP, is it? Geoff Huston 34:41 Well, it's TCP hidden inside UDP through encryption. So what you actually see is just a set of UDP packets. You can't see the TCP control systems. It actually has support for multiple TCP-like streams, but it's all inside one single UDP flow state, you know, right? George Michaelson 35:06 Yeah, but UDP is not TCP, Geoff. Geoff Huston 35:09 How do you load balance QUIC? George Michaelson 35:10 Well, UDP is not TCP, so you can't do the trick of using packet tricks in TCP-specific ways to reliably steer the same TCP session down the same path. All the TCPs are inside this UDP. So yes, Geoff, how do you make this fly? Geoff Huston 35:28 It's really hard, because all the embedded TCP sessions inside your QUIC session fate-share. All of them fate-share down one path. George Michaelson 35:36 This isn't sounding good.
Geoff Huston 35:38 Well, it's kind of a bit of a step backwards, because again, many things are trade-offs here, and by deliberately not exposing your structured flow information to the network, the network just cannot compensate for multi-pathing at the network level. It can't. All the packets with the same UDP header set go down the same single path. You go, well, I can live with that. The answer is: can you? George Michaelson 36:06 Sounds like we've just decimated the available bandwidth for me as a consumer. A good outcome might be that all these UDP flows are equipoising, the network providers balancing them over links, but I'm not seeing the benefit as me, as an individual. Geoff Huston 36:22 That's kind of where it's heading, isn't it? And it is a bit of a to-and-fro between what you expose to the network and what you get as a benefit back from the network. But the issue is, quite frankly, you and I aren't given a say. George Michaelson 36:36 Hang on. What do you mean? Geoff Huston 36:38 Your browser, statistically, is made by Google and called Chrome. George Michaelson 36:42 Yes, that's true. Geoff Huston 36:43 It's making decisions on your behalf along these very trade-offs, deciding what's best for you. And you don't get a say. You don't get anything of that. George Michaelson 36:53 Right. So there is a sort of fantasy that we have a degree of control over these decision logics on how things are done. If you are a technologist, you can probably write systems to exploit this. But in the general, ordinary case of applications on phones, tablets, laptops like this, it's not my call. It's some intermediary writing code, deciding how they're going to package it, and if they decided QUIC is the one, I'm not getting any multi-path outcomes. From what you're saying, there isn't a wait-there's-more moment here, is there, Geoff? Geoff Huston 37:24 Not down this particular path, no.
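The fate-sharing Geoff describes falls out of how ECMP-style load balancing works: a router hashes the packet's 5-tuple (source address, destination address, protocol, source port, destination port) to pick an outgoing link. Every packet of a single UDP (QUIC) flow carries the same 5-tuple and so lands on the same link, while parallel TCP sessions with distinct source ports can spread out. A hedged sketch, using an illustrative hash rather than any real router's algorithm:

```python
# Toy ECMP-style load balancing: hash the 5-tuple and use the result
# to pick an outgoing link. Illustrative only -- real routers use
# vendor-specific hardware hashes, not SHA-256.
import hashlib

def pick_link(src, dst, proto, sport, dport, num_links):
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# One QUIC session: every packet carries the same UDP 5-tuple, so every
# packet hashes to the same link -- all its internal streams fate-share.
quic_links = {pick_link("10.0.0.1", "192.0.2.1", "udp", 50000, 443, 4)
              for _ in range(1000)}
print(quic_links)  # one element: the single link the whole flow uses

# Four parallel TCP sessions: four distinct source ports, so the hash
# can place them on different links across the parallel paths.
tcp_links = {pick_link("10.0.0.1", "192.0.2.1", "tcp", sport, 443, 4)
             for sport in range(50000, 50004)}
print(tcp_links)
```

The addresses and ports are made up; the point is structural: one flow, one hash, one path — which is exactly why everything inside a QUIC session fate-shares.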
As I said, I think this is the trade-off we've arrived at. We've decided we can't keep on increasing the clock. Clock speeds haven't gone any faster for decades now, and so the way we get more capacity out of the network is by going seriously parallel. Seriously. The other part of this is also a look at how we do routing protocols, because that's the other problem inside all this space. The conventional view of a routing protocol is that there is one single best path, [George: yeah]. So imagine, you know, peak hour, and everyone's going back to their house. They all live in the same spot for some reason. So what routing says is, you're all taking freeway 101. But there are other ways to get there? No, says routing: there is one best path, that is the best path, even though it's clogged, it's full. George Michaelson 38:17 Do you remember the time you, me and Paul were going to the IETF in San [Geoff: San Diego], San Diego, and Paul said, let's drive, then he couldn't go. We wound up with a car on I-405 in rush hour. And once you're on it, you can't get off it. You are basically on that road, crawling at 10 miles an hour until you get to San Diego. What a road trip. Geoff Huston 38:41 Well, that's what routing does to packets on networks. They don't have any feedback as to what's the best path. There's no Google Maps view. There's no outer level that says, that road is completely congested, take another path. Routing can't do that. George Michaelson 38:59 Current routing, the way we use it right now, BGP routing, doesn't do that. Geoff Huston 39:04 Oh, every attempt to try and fold in that kind of load feedback, and I'm going way, way back to the mid-80s with the HELLO routing protocol and similar, where you tried to include a factor of loading and efficiency, and give you a routing system whose path was the best path on the day, at the time, that second, [George: Yeah], the problem is feedback. Hey, everyone use 102. Oh shit.
Everyone's using 102! Hey, everyone use 101! And so it goes: 102, 101, 102, 101. George Michaelson 39:42 And we do actually sometimes see in BGP evidence of people who are using some kind of weird traffic management framework. So between ten and two they do pattern A, and then between two and four they do pattern B, as if someone is actually changing the levers, cranking it to try and make a marginal benefit on it. Geoff Huston 40:02 It's been observed many times that amongst the lists of the most dynamic BGP updaters, the folk who spend all their time updating their BGP routes, a lot of it is attributable to the so-called route optimizers, [George: yep]. Unless you've got your feedback loop tuned really well, and most of them don't, as soon as you push traffic one way, that creates congestion, and that then means you push it another way, and that creates congestion there, and so on. And so nothing is stable. George Michaelson 40:31 So we've had this beautiful conversation about the ways you can multiplex up multiple things to get an efficiency gain and achieve something close to the best possible bandwidth between two points, and you're now saying, because of the way we do BGP and the idea of best path as a single thing, we can't do that in the routing plane. We've got no trick in the armory to make routing share load across two links that might otherwise look like really good choices, because we select one of them as best path. Well, what are we going to do? Geoff Huston 41:05 I actually think the answer lies in this area of QUIC and TCP multipath. And I suppose the observation is a pretty simple one. Think of the amount of silicon processing per packet in my end device. My laptop is a supercomputer. It has an enormous amount of processing capability, and its packet rate is not that high. You know, I can do a lot of clever tricks at this end with my traffic, particularly if I've got choices in the way I can mark things.
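The 101/102 flip-flop can be shown with a toy simulation: if everyone re-routes each round to whichever path looked least loaded in the previous round, the load oscillates forever and never settles. The numbers are invented for illustration; this is the feedback instability being described, not any real routing protocol.

```python
# Toy model of load-feedback oscillation: every round, all traffic
# moves to whichever of two paths looked least loaded at the end of
# the previous round. The "best path" choice chases stale information.
def route_rounds(total_traffic, rounds):
    load = [total_traffic, 0]          # everyone starts on path 0
    history = []
    for _ in range(rounds):
        best = load.index(min(load))   # best path by yesterday's load
        load = [0, 0]
        load[best] = total_traffic     # everyone piles onto it at once
        history.append(best)
    return history

print(route_rounds(100, 8))  # → [1, 0, 1, 0, 1, 0, 1, 0]
```

Because the feedback is a round late, the "optimization" is permanently out of phase with reality — the 102, 101, 102, 101 pattern.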
So if I establish 20 different sessions using, you know, Multipath TCP, I can play with stuff, because I have the processing power to do so. What about inside the network? What about in a router that has, I don't know, forty 800-gig channels connected to it? You can't breathe. There's no oxygen. You are just panicking. The packet rate is so high you've got about two cycles per packet, if you're lucky, to get rid of the bloody thing. You can't be clever in the core of today's networks. You just can't. George Michaelson 42:06 It's not the place to put these kinds of complexities into the story, is it? Geoff Huston 42:10 Right. Your silicon is barely keeping, you know, on par with fiber, and so all we do is just chuck in the fiber, parallelize it, chuck in dumb silicon, and do basic routing tricks with ECMP. That's all you've got. And then basically hand the problem off to the edges of the network, saying, if you really want speed out of this, you need to be clever. The network isn't going to do it for you. George Michaelson 42:33 But you're kind of hinting at the possibility that, with a little bit of knowledge about path diversity exposed to an application, and for the right kind of traffic behavior, I could chunk what I want to fetch into four different IP sources, which I know lie on diverse paths. And I could play this game a bit like BitTorrent does with multiple sources for the hashes, and get the benefit of discretely different paths if I'm prepared to do a little bit of work. Geoff Huston 43:03 Did we ever talk about Explicit Congestion Notification? George Michaelson 43:07 We might have talked about that recently, Geoff, yes. Geoff Huston 43:10 So here's a signal coming back from a path and a session that says, the path you've chosen is getting a bit busy, and if you continue to do this, I'm going to drop a packet. That's what the bit is saying.
And if you fold ECN in with multipath, and if you have a large, diverse collection of paths, and if the network is actually doing ECN, you can get a decent stab at this kind of highly adaptive endpoint, where it's the end host doing all of the work in trying to optimize its performance through a network that has considerable diversity. If I'm the end host, brilliant, nothing could be better. If I'm the poor benighted network operator, trying desperately hard to minimize my bills and maximize my throughput, whoops, I've just handed the keys of the car to the application, to the host. [George: yeah]. I don't feel very good about that. George Michaelson 44:07 No, it's an interesting question, who's in control of this bus at this point. It's a bit like there are many hands on the steering wheel, Geoff. Geoff Huston 44:15 Oh, I think there's only one. And as usual, I'll end it by saying everything is answered with money. The richest people on the planet these days build the end systems. You know, the Googles, the Apples of this world, the Chromes and so on. The folk at the application level of the stack are in the driver's seat. [George: Yeah]. The network operators are being hit around the face with wet fish continuously with every new shock, and it's kind of a sad life. It is a sad life, but, you know, that's the way it's panned out. We kind of put the power at the endpoint, so that's where we're living. George Michaelson 44:48 Not a bad place to be, in some ways. I'm surprised by how complex this story of doing things in parallel gets, Geoff. That's been really fascinating. Thank you. Geoff Huston 44:57 Well, thank you, and dear listener, if you've listened this long, thank you for your persistence. George Michaelson 45:03 If you've got a story or research to share here on Ping, why not get in contact by email to ping@apnic.net or via the APNIC social media channels.
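The ECN-plus-multipath idea Geoff sketches — an end host backing off paths whose packets come back congestion-marked while gently probing clean ones — might look something like this in miniature. A toy model, not a real congestion controller; the path names, rates, and constants are made up:

```python
# Sketch of an ECN-aware multipath endpoint: keep a send rate per
# path, halve the rate on any path whose acks carried ECN congestion
# marks, and additively probe paths that came back clean. This is the
# classic AIMD shape applied per-path, purely for illustration.
def adjust_rates(rates, ecn_marked):
    """rates: {path: rate}; ecn_marked: paths that signalled ECN."""
    new_rates = {}
    for path, rate in rates.items():
        if path in ecn_marked:
            new_rates[path] = rate / 2      # multiplicative back-off
        else:
            new_rates[path] = rate + 1.0    # additive probe upward
    return new_rates

rates = {"path_a": 10.0, "path_b": 10.0}
rates = adjust_rates(rates, ecn_marked={"path_a"})  # path_a congested
rates = adjust_rates(rates, ecn_marked={"path_a"})  # still congested
print(rates)  # → {'path_a': 2.5, 'path_b': 12.0}
```

Traffic drains away from the marked path and onto the clean one — the host, not the network, ends up doing the traffic engineering, which is exactly the shift in control the conversation turns to next.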
Also remember, the measurement@apnic.net mailing list on Orbit is there to discuss and share relevant collaborative opportunities, grants and funding opportunities, jobs and graduate placements, or to seek feedback from the community on your own measurement projects. Be sure to check out the APNIC website for all your resource and community needs. Until next time.