1 00:00:03,520 --> 00:00:05,919 Welcome to episode 414 2 00:00:05,919 --> 00:00:08,960 of the Microsoft Cloud IT Pro podcast recorded 3 00:00:08,960 --> 00:00:11,859 live on 10/31/2025. 4 00:00:12,160 --> 00:00:14,894 This is a show about Microsoft 365 and 5 00:00:14,894 --> 00:00:17,214 Azure from the perspective of IT pros and 6 00:00:17,214 --> 00:00:19,535 end users, where we discuss the topic of 7 00:00:19,535 --> 00:00:21,875 recent news and how it relates to you. 8 00:00:22,015 --> 00:00:22,515 Fortunately, 9 00:00:22,815 --> 00:00:24,815 when we went to record this, the Internet 10 00:00:24,815 --> 00:00:27,074 is back online after an AWS 11 00:00:27,454 --> 00:00:30,570 and Azure outage, both related to DNS and 12 00:00:30,570 --> 00:00:32,429 both within the last couple of weeks. 13 00:00:32,810 --> 00:00:35,289 So what better to discuss today than what 14 00:00:35,289 --> 00:00:37,710 happened, how it was resolved, and what IT 15 00:00:37,770 --> 00:00:40,685 pros should keep in mind for future resilience 16 00:00:40,744 --> 00:00:43,325 planning when it comes to your cloud infrastructure. 17 00:00:45,945 --> 00:00:48,265 So, Scott, I saw this funny meme the 18 00:00:48,265 --> 00:00:50,024 other day. I'm gonna read it to you. 19 00:00:50,024 --> 00:00:52,284 I intentionally did not read this to you 20 00:00:52,504 --> 00:00:53,004 earlier. 21 00:00:53,384 --> 00:00:55,679 So I saw this. Somebody sent it to 22 00:00:55,679 --> 00:00:57,120 me. I have to go oh, I know 23 00:00:57,120 --> 00:00:58,240 where it is. I have to go find 24 00:00:58,240 --> 00:00:59,780 it. I should have pulled it up earlier. 25 00:01:00,320 --> 00:01:02,240 And this can tie into another topic as 26 00:01:02,240 --> 00:01:04,340 well. Where is that 27 00:01:04,719 --> 00:01:05,219 message? 28 00:01:06,319 --> 00:01:08,834 Wow. Okay. Here you go, Scott.
After getting 29 00:01:08,834 --> 00:01:10,375 fired from ungrateful 30 00:01:10,754 --> 00:01:11,254 AWS, 31 00:01:12,114 --> 00:01:14,194 after an outage where my job was to 32 00:01:14,194 --> 00:01:16,694 vibe code all the DNS entries to IPv6, 33 00:01:16,754 --> 00:01:19,234 happy to announce that it's my first 34 00:01:19,234 --> 00:01:21,795 day at Azure. Azure recognizes the value of 35 00:01:21,795 --> 00:01:24,390 vibe coding IPv6 DNS, and I just 36 00:01:24,390 --> 00:01:27,109 force pushed my first 1,000,000 entries. Now off 37 00:01:27,109 --> 00:01:28,170 to grab some coffee. 38 00:01:29,510 --> 00:01:32,150 Yes. I've seen this one. The Internet. Somebody 39 00:01:32,150 --> 00:01:33,049 has been following 40 00:01:33,590 --> 00:01:35,290 at wrecked on x. 41 00:01:35,990 --> 00:01:37,750 Yes. Actually, someone sent it. I do not 42 00:01:37,750 --> 00:01:39,965 follow them, but somebody sent that because 43 00:01:40,265 --> 00:01:43,384 DNS is apparently hard as evidenced by this 44 00:01:43,384 --> 00:01:45,965 last week of both AWS and Azure. 45 00:01:46,265 --> 00:01:47,864 I guess it wasn't quite within a week. 46 00:01:47,864 --> 00:01:50,825 AWS was October 20. Azure was October 29. 47 00:01:50,825 --> 00:01:52,650 Nine days. There was a little bit of 48 00:01:52,650 --> 00:01:53,709 a spread in between, 49 00:01:54,730 --> 00:01:55,870 but it does happen. 50 00:01:56,170 --> 00:01:58,109 It's always a good reminder when 51 00:01:59,209 --> 00:02:00,750 the cloud goes down 52 00:02:01,130 --> 00:02:02,670 that it really 53 00:02:03,049 --> 00:02:05,954 is somebody else's data center someplace else. It's 54 00:02:05,954 --> 00:02:06,694 just not 55 00:02:06,995 --> 00:02:09,555 it's not your data center. These things tend 56 00:02:09,555 --> 00:02:12,375 to be far reaching.
I'm always 57 00:02:12,915 --> 00:02:13,415 amazed 58 00:02:13,955 --> 00:02:14,455 when 59 00:02:15,074 --> 00:02:18,055 Herndon goes down, so like US East Virginia 60 00:02:18,115 --> 00:02:19,014 for AWS, 61 00:02:19,715 --> 00:02:20,215 and 62 00:02:20,790 --> 00:02:23,349 50% of the Internet just goes offline. Because 63 00:02:23,349 --> 00:02:24,969 there are so many 64 00:02:25,430 --> 00:02:26,250 of the 65 00:02:26,870 --> 00:02:27,849 modern day 66 00:02:28,229 --> 00:02:29,129 SaaS services, 67 00:02:29,430 --> 00:02:31,270 like the things that you would depend on, 68 00:02:31,270 --> 00:02:33,689 like, hey, I listen to music on Spotify, 69 00:02:33,830 --> 00:02:36,294 I stream my podcast from here, I do 70 00:02:36,294 --> 00:02:38,875 my banking with like, all these different things 71 00:02:39,175 --> 00:02:41,194 are all homed out of that region. 72 00:02:41,655 --> 00:02:44,474 So when bad things happen to Herndon, 73 00:02:44,935 --> 00:02:46,715 particularly in AWS land, 74 00:02:47,495 --> 00:02:49,995 bad things tend to happen on the Internet 75 00:02:50,069 --> 00:02:51,990 for the rest of us or at least 76 00:02:51,990 --> 00:02:54,230 I think the parts of the Internet that 77 00:02:54,230 --> 00:02:57,110 folks who listen to this podcast would go 78 00:02:57,110 --> 00:02:58,790 for. So for me, like I said, that's 79 00:02:58,790 --> 00:03:00,650 things like Spotify going down, 80 00:03:01,110 --> 00:03:02,250 that is 81 00:03:03,094 --> 00:03:06,134 Reddit suddenly disappearing and going no. Yep. There 82 00:03:06,134 --> 00:03:07,354 went the body of knowledge 83 00:03:07,735 --> 00:03:09,655 that was pulling all these things out. 
And 84 00:03:09,655 --> 00:03:11,194 then in this new world 85 00:03:11,574 --> 00:03:14,294 of LLMs and everything else that are doing 86 00:03:14,294 --> 00:03:17,400 both ingested plus real time searches of these 87 00:03:17,400 --> 00:03:17,900 systems, 88 00:03:18,199 --> 00:03:21,000 like, all that stuff starts to show its 89 00:03:21,000 --> 00:03:21,500 cracks 90 00:03:21,879 --> 00:03:22,939 along the way 91 00:03:24,120 --> 00:03:27,740 as well. So the AWS one, interestingly, 92 00:03:28,040 --> 00:03:29,960 like, manifests, I think, is a little bit 93 00:03:29,960 --> 00:03:32,294 of, like, oh, this all sounds like a 94 00:03:32,294 --> 00:03:34,775 lot of DNS. My understanding was it was 95 00:03:34,775 --> 00:03:36,955 actually a problem with DynamoDB 96 00:03:37,335 --> 00:03:40,395 and kinda like load balancing with Dynamo and 97 00:03:40,534 --> 00:03:41,995 the way that they push 98 00:03:42,455 --> 00:03:42,955 configuration 99 00:03:43,495 --> 00:03:45,520 and things like that into it. But I 100 00:03:45,520 --> 00:03:46,639 could be a little bit off there. I 101 00:03:46,639 --> 00:03:48,400 didn't have a ton of time to dive 102 00:03:48,400 --> 00:03:49,060 into theirs, 103 00:03:49,360 --> 00:03:51,759 especially, like you said, with the Azure outage 104 00:03:51,759 --> 00:03:54,639 coming on October 29, just nine days later, 105 00:03:54,639 --> 00:03:55,939 and that one being 106 00:03:56,240 --> 00:03:59,574 certainly more DNS related or at least like 107 00:03:59,574 --> 00:04:01,735 a I think to the spirit of it 108 00:04:01,735 --> 00:04:03,814 being that it was Azure Front Door and 109 00:04:03,814 --> 00:04:04,314 kinda 110 00:04:04,615 --> 00:04:06,694 some of the global load balancing capabilities of 111 00:04:06,694 --> 00:04:08,855 Front Door that got out of whack due 112 00:04:08,855 --> 00:04:09,515 to a 113 00:04:10,375 --> 00:04:13,094 configuration update.
And in both cases, in both 114 00:04:13,094 --> 00:04:15,490 systems, these were configuration updates 115 00:04:15,950 --> 00:04:18,449 that kind of went a little bit sideways, 116 00:04:18,750 --> 00:04:20,830 and things got a little bit squirrely. It's 117 00:04:20,830 --> 00:04:24,689 hard. Stuff at that scale is very complicated, 118 00:04:24,830 --> 00:04:25,649 but it always 119 00:04:26,029 --> 00:04:27,889 amazes me how 120 00:04:28,865 --> 00:04:31,444 one of those configuration changes 121 00:04:31,904 --> 00:04:33,285 can take down everything 122 00:04:33,824 --> 00:04:36,384 so quickly that because we've seen it multiple 123 00:04:36,384 --> 00:04:39,425 times from multiple different cloud vendors where you 124 00:04:39,425 --> 00:04:40,865 would think they would have figured out by 125 00:04:40,865 --> 00:04:42,779 this time how they could do, like, small 126 00:04:42,779 --> 00:04:45,680 configuration changes that don't have the snowball effect, 127 00:04:45,899 --> 00:04:47,599 but yet we continue to 128 00:04:47,899 --> 00:04:50,459 see these. And, yeah, both were DNS. I 129 00:04:50,459 --> 00:04:52,620 was reading some on the AWS one too, 130 00:04:52,620 --> 00:04:54,459 and it sounds like it was it was 131 00:04:54,459 --> 00:04:54,959 Dynamo 132 00:04:55,535 --> 00:04:56,035 DB, 133 00:04:56,495 --> 00:04:57,394 but updating 134 00:04:57,774 --> 00:04:59,774 that's used to update DNS. And it was 135 00:04:59,774 --> 00:05:02,014 like two different services were trying to update 136 00:05:02,014 --> 00:05:04,415 the same DNS records tied to Dynamo DB 137 00:05:04,415 --> 00:05:06,495 and, oh, and two things are trying to 138 00:05:06,495 --> 00:05:08,654 update the same DNS record. It's like trying 139 00:05:08,654 --> 00:05:10,959 to update the same line in a file 140 00:05:11,019 --> 00:05:13,259 multiple times and SharePoint complaining that you have 141 00:05:13,259 --> 00:05:16,319 version mismatches? 
It's definitely possible to 142 00:05:17,100 --> 00:05:18,639 encounter these race conditions. 143 00:05:19,339 --> 00:05:22,399 Even small changes do have big impacts, so 144 00:05:22,459 --> 00:05:24,459 I think it's a little it's a little 145 00:05:24,459 --> 00:05:26,754 off or maybe, like, not the right color 146 00:05:26,754 --> 00:05:28,754 to say, like, oh, it's surprising when a 147 00:05:28,754 --> 00:05:30,134 little configuration change 148 00:05:30,514 --> 00:05:32,595 or, like, that a bigger configuration change goes 149 00:05:32,595 --> 00:05:34,915 out. Like, all these things go out, whether 150 00:05:34,915 --> 00:05:38,055 it's Amazon, whether it's Microsoft, whether it's Google. 151 00:05:38,459 --> 00:05:40,879 Everybody has their own deployment practices 152 00:05:41,259 --> 00:05:42,639 for safe deployments, 153 00:05:43,100 --> 00:05:45,339 for making sure that things get flighted through, 154 00:05:45,339 --> 00:05:48,139 like, multiple rings and they follow a general 155 00:05:48,139 --> 00:05:49,980 progression. You see the same thing, like, when 156 00:05:49,980 --> 00:05:51,740 a feature rolls out in SharePoint, for example. 157 00:05:51,740 --> 00:05:53,339 Right? We all know about the different rings 158 00:05:53,339 --> 00:05:55,634 that go in there with deployment rings and 159 00:05:55,634 --> 00:05:58,115 things like that. So it's the best of 160 00:05:58,115 --> 00:05:58,615 intentions. 161 00:05:59,394 --> 00:06:02,034 The interesting thing for me in the a 162 00:06:02,194 --> 00:06:03,254 AWS RCA 163 00:06:03,634 --> 00:06:05,394 was they got into some of the nitty 164 00:06:05,394 --> 00:06:07,094 gritty around how 165 00:06:07,474 --> 00:06:10,370 complex these things are with all these microservices 166 00:06:10,830 --> 00:06:13,710 that are running, talking to each other. 
So 167 00:06:13,710 --> 00:06:16,110 you'd like things are starting to manifest where 168 00:06:16,110 --> 00:06:17,970 we've built these really awesome 169 00:06:18,350 --> 00:06:20,590 machines, right, to go and manage this all 170 00:06:20,590 --> 00:06:22,430 for us and have all this underlying logic 171 00:06:22,430 --> 00:06:24,694 and all these other things into them. But 172 00:06:24,995 --> 00:06:27,875 when these, like, little subtle race conditions are 173 00:06:27,875 --> 00:06:30,055 coming through or other things are coming out 174 00:06:30,115 --> 00:06:32,834 and stuff gets out of whack, in 175 00:06:32,834 --> 00:06:35,669 the case of the Dynamo thing, these workers 176 00:06:35,729 --> 00:06:37,029 between these various microservices 177 00:06:37,410 --> 00:06:38,229 becoming desynchronized, 178 00:06:39,410 --> 00:06:40,310 bad things 179 00:06:40,769 --> 00:06:41,269 happen. 180 00:06:41,970 --> 00:06:42,470 Right? 181 00:06:42,849 --> 00:06:45,329 So I think in the AWS one just 182 00:06:45,329 --> 00:06:46,930 pulling up their RCA real quick. So they've 183 00:06:46,930 --> 00:06:49,024 got a couple components. They've got this planner 184 00:06:49,024 --> 00:06:50,805 and these enactor workers 185 00:06:51,185 --> 00:06:52,644 within DynamoDB 186 00:06:53,425 --> 00:06:57,204 that help with some distribution of traffic 187 00:06:57,345 --> 00:06:59,425 and other things via DNS, but it's a 188 00:06:59,425 --> 00:07:02,160 bunch of basically, like, internal components. I'd encourage 189 00:07:02,160 --> 00:07:03,600 somebody to go read about this. Like, if 190 00:07:03,600 --> 00:07:06,879 you're interested in, like, distributed computing, hyperscalers, all 191 00:07:06,879 --> 00:07:09,680 these things, like, it's always interesting to see 192 00:07:09,680 --> 00:07:12,639 how these things are designed.
But, you know, 193 00:07:12,639 --> 00:07:15,199 apparently, you had this one service, which is 194 00:07:15,199 --> 00:07:16,660 the DNS Enactor, 195 00:07:17,394 --> 00:07:20,055 which when it fires up, it verifies 196 00:07:20,435 --> 00:07:23,074 plan freshness, what it's supposed to do, what 197 00:07:23,074 --> 00:07:24,134 it's supposed to process, 198 00:07:24,595 --> 00:07:27,634 what updates it's or endpoints it's supposed to 199 00:07:27,634 --> 00:07:29,175 update, all those things. 200 00:07:29,620 --> 00:07:32,839 Turns out, the DNS Enactor within 201 00:07:33,139 --> 00:07:36,579 Dynamo does a very, like, sane thing in 202 00:07:36,579 --> 00:07:38,740 that it verifies the freshness of what it 203 00:07:38,740 --> 00:07:39,560 needs to do 204 00:07:39,939 --> 00:07:42,824 anytime that process starts or at the start 205 00:07:42,824 --> 00:07:45,544 of processing. But it's not doing, like, state 206 00:07:45,544 --> 00:07:47,785 management as it goes. It's always assuming that, 207 00:07:47,785 --> 00:07:49,944 hey. I spun up. This is current state. 208 00:07:49,944 --> 00:07:51,865 Let me go make some changes and then 209 00:07:51,865 --> 00:07:54,584 check again kind of thing. So you had 210 00:07:54,584 --> 00:07:57,144 these multiple Enactors that are talking to each 211 00:07:57,144 --> 00:07:57,644 other, 212 00:07:58,079 --> 00:08:00,240 and like you said, it's a contention issue. 213 00:08:00,240 --> 00:08:02,319 So by the time one spins up and 214 00:08:02,319 --> 00:08:03,919 it says, okay. Here's the plan. Here's what 215 00:08:03,919 --> 00:08:04,980 I'm gonna go do, 216 00:08:05,360 --> 00:08:07,759 and it goes and does it, well, it 217 00:08:07,759 --> 00:08:10,000 turns out that another one was spinning up 218 00:08:10,000 --> 00:08:12,000 with a potentially different plan because that is 219 00:08:12,079 --> 00:08:14,475 they haven't been in flight.
And all of 220 00:08:14,475 --> 00:08:16,314 a sudden that check that had been performed 221 00:08:16,314 --> 00:08:18,714 that was fresh was now stale, and it's 222 00:08:18,714 --> 00:08:20,495 applying a stale configuration 223 00:08:21,115 --> 00:08:23,214 and overriding what was already there, 224 00:08:23,514 --> 00:08:25,935 and that leads to a series 225 00:08:26,235 --> 00:08:27,375 of cascading 226 00:08:27,834 --> 00:08:28,334 failures. 227 00:08:29,000 --> 00:08:31,180 And for services like Dynamo, 228 00:08:31,560 --> 00:08:35,740 they're so integral to the fabric of AWS. 229 00:08:36,200 --> 00:08:38,440 So there's a bunch of other services that 230 00:08:38,440 --> 00:08:41,019 are depending on DynamoDB. So if you're 231 00:08:41,325 --> 00:08:43,804 doing compute and you're using virtual machines with 232 00:08:43,804 --> 00:08:44,304 EC2, 233 00:08:44,605 --> 00:08:46,785 you're doing functions with Lambda, 234 00:08:47,245 --> 00:08:51,404 even things like RBAC and IAM ultimately tie 235 00:08:51,404 --> 00:08:54,785 back to these database systems like Dynamo, and 236 00:08:54,925 --> 00:08:57,929 they have these, like, just really bad, no 237 00:08:57,929 --> 00:08:59,070 good, horrible days. 238 00:08:59,929 --> 00:09:01,610 The closer to home for me on my 239 00:09:01,610 --> 00:09:04,190 side, I've seen when we've had outages 240 00:09:04,809 --> 00:09:05,710 in storage 241 00:09:06,409 --> 00:09:09,924 and very similar thing, like, you'd be amazed 242 00:09:09,924 --> 00:09:12,325 at the number of services that depend on 243 00:09:12,325 --> 00:09:15,284 storage for something. Right? They publish some kind 244 00:09:15,284 --> 00:09:16,264 of state in there. 245 00:09:16,644 --> 00:09:19,044 Maybe they're not even using, like, unstructured storage.
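The stale-plan race described for the DNS Enactor can be sketched in a few lines. This is only an illustration, not how AWS implements it: `DnsRecord`, `apply_unguarded`, `apply_guarded`, the version numbers, and the IP addresses are all hypothetical names made up for the example. It assumes each plan carries an increasing version number, and it contrasts checking freshness once at startup with re-checking at the moment of the write.

```python
from dataclasses import dataclass


@dataclass
class DnsRecord:
    """Hypothetical stand-in for a DNS record set tagged with a plan version."""
    plan_version: int = 0
    targets: tuple = ()


def apply_unguarded(record, plan_version, targets):
    # Mirrors "check once at startup, then write": nothing is re-checked here.
    record.plan_version = plan_version
    record.targets = targets
    return True


def apply_guarded(record, plan_version, targets):
    # Re-validate at write time (compare-and-set style): never let an
    # older plan overwrite a newer one that landed in the meantime.
    if plan_version <= record.plan_version:
        return False
    record.plan_version = plan_version
    record.targets = targets
    return True


def run(apply):
    record = DnsRecord()
    # A slow worker picked up plan 1, passed its startup check, then stalled.
    # Meanwhile a fast worker applies the newer plan 2.
    apply(record, 2, ("10.0.0.2",))
    # The slow worker wakes up and applies its now-stale plan 1.
    apply(record, 1, ("10.0.0.1",))
    return record


print(run(apply_unguarded))  # record ends up on the stale plan
print(run(apply_guarded))    # record stays on the newer plan
```

The guarded version is one common way to close this kind of gap; whether it resembles the actual fix is not something the discussion here confirms.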
246 00:09:19,044 --> 00:09:20,725 It's not like they're storing logs or something, 247 00:09:20,725 --> 00:09:22,529 but maybe they're using, like, NoSQL 248 00:09:22,830 --> 00:09:24,690 tables or they're using queues 249 00:09:25,070 --> 00:09:27,389 or things like that along the way. So 250 00:09:27,389 --> 00:09:29,389 there's just a bunch of moving pieces. 251 00:09:29,389 --> 00:09:30,850 There's a bunch of dependencies, 252 00:09:31,950 --> 00:09:33,409 and those dependencies 253 00:09:34,110 --> 00:09:35,169 just tend to 254 00:09:35,710 --> 00:09:37,964 bleed their way out. And I think what 255 00:09:37,964 --> 00:09:39,565 we were seeing a lot more is with 256 00:09:39,565 --> 00:09:42,044 these outages, at least these last couple, these 257 00:09:42,044 --> 00:09:43,644 two most recent ones, and I think if 258 00:09:43,644 --> 00:09:45,504 we look back a couple months as well, 259 00:09:45,565 --> 00:09:48,284 the impacts are just so far reaching because 260 00:09:48,284 --> 00:09:49,904 so many customers today 261 00:09:50,389 --> 00:09:52,950 are dependent on the cloud. Like, I saw 262 00:09:52,950 --> 00:09:54,389 a lot of chatter after this one, like, 263 00:09:54,389 --> 00:09:56,549 oh, AWS went down, and then, oh, Azure 264 00:09:56,549 --> 00:09:57,830 went down, and, oh, we should all be 265 00:09:57,830 --> 00:09:59,590 multi cloud, and we should all be 266 00:09:59,590 --> 00:10:02,809 and all these things. Right? Like, sure. Absolutely. 267 00:10:03,269 --> 00:10:05,934 We should. If we had infinite money, infinite 268 00:10:05,934 --> 00:10:08,815 time, infinite skilling, all those kinds of things 269 00:10:08,815 --> 00:10:11,375 that are out there, but that's ultimately not 270 00:10:11,375 --> 00:10:13,134 the reality for a lot of us. So 271 00:10:13,134 --> 00:10:15,554 I fall back to, are these things bad? 272 00:10:15,695 --> 00:10:18,240 Yes. Do we learn from them? Also, yes.
273 00:10:18,240 --> 00:10:20,240 Like like this particular race condition in the 274 00:10:20,240 --> 00:10:21,139 case of AWS, 275 00:10:21,759 --> 00:10:24,100 the thing that happened in Azure, they happened. 276 00:10:24,480 --> 00:10:26,960 They should not happen again because we learn 277 00:10:26,960 --> 00:10:28,720 from them, we implement those changes, and we 278 00:10:28,720 --> 00:10:30,945 go forward. And as bad as it is 279 00:10:30,945 --> 00:10:33,825 to have half the Internet go down, well, 280 00:10:33,825 --> 00:10:35,825 half the Internet was down. It wasn't just 281 00:10:35,825 --> 00:10:38,004 you. It was everybody else. And 282 00:10:38,705 --> 00:10:42,225 the fix also wasn't on you. The fix 283 00:10:42,225 --> 00:10:44,420 was on somebody else. Right? So while all 284 00:10:44,420 --> 00:10:46,980 those servers were catching fire, while everything's spinning 285 00:10:46,980 --> 00:10:49,460 back up and there's just this big retry 286 00:10:49,460 --> 00:10:51,700 storm going on and network links are getting 287 00:10:51,700 --> 00:10:53,860 overloaded and CPU and memory and all these 288 00:10:53,860 --> 00:10:55,735 things are going down, like, as bad as 289 00:10:55,735 --> 00:10:57,415 it sounds to say it, it was somebody 290 00:10:57,415 --> 00:10:58,634 else's problem to fix. 291 00:10:59,495 --> 00:11:02,215 It wasn't our problem to fix. So I'm 292 00:11:02,215 --> 00:11:04,774 still reminded of that part, like and very 293 00:11:04,774 --> 00:11:07,195 mindful that, like, when these things do happen, 294 00:11:08,000 --> 00:11:08,980 yes, they're bad. 295 00:11:09,360 --> 00:11:11,840 Clearly, they can be very severe and go 296 00:11:11,840 --> 00:11:14,159 out there and have some some crazy kind 297 00:11:14,159 --> 00:11:16,100 of impact. 
But at the same time, 298 00:11:16,639 --> 00:11:18,960 while you're maybe up all night trying to 299 00:11:18,960 --> 00:11:20,899 inform your customers or 300 00:11:21,324 --> 00:11:22,684 you're kind of running around trying to figure 301 00:11:22,684 --> 00:11:25,485 out what's going on, ultimately, that responsibility sits 302 00:11:25,485 --> 00:11:26,464 with somebody else 303 00:11:27,164 --> 00:11:29,245 to make sure that it is ultimately where 304 00:11:29,245 --> 00:11:30,924 it needs to be and that it's back 305 00:11:30,924 --> 00:11:33,565 up and it's running. And I think, like 306 00:11:33,565 --> 00:11:35,504 I said, like, these things happen. 307 00:11:36,044 --> 00:11:37,105 We're talking like 308 00:11:37,589 --> 00:11:39,450 these massive distributed systems. 309 00:11:39,909 --> 00:11:42,709 They're built by the best engineers that are 310 00:11:42,709 --> 00:11:44,169 out there, and 311 00:11:44,549 --> 00:11:46,629 they still have these issues even with testing, 312 00:11:46,629 --> 00:11:48,730 things like that, but they will get hardened. 313 00:11:49,029 --> 00:11:51,190 These are just battles in the war. They 314 00:11:51,190 --> 00:11:52,089 make these systems 315 00:11:52,504 --> 00:11:54,605 more resilient at the end of the day. 316 00:11:54,825 --> 00:11:57,725 Everybody learns from these. Like the AWS outage, 317 00:11:57,865 --> 00:11:59,784 I can guarantee you folks in Azure learn 318 00:11:59,784 --> 00:12:02,105 from. The Azure outage, I can guarantee you 319 00:12:02,105 --> 00:12:04,904 folks at AWS and Google and competitors are 320 00:12:04,904 --> 00:12:07,049 also learning from as well as we're all 321 00:12:07,049 --> 00:12:10,090 publishing these RCAs and getting things out there 322 00:12:10,090 --> 00:12:12,809 and kinda talking about what broke, what we're 323 00:12:12,809 --> 00:12:14,410 doing to make it better, how we're fixing 324 00:12:14,410 --> 00:12:15,929 it. Yeah. 
And even the whole multi cloud 325 00:12:15,929 --> 00:12:17,769 thing doesn't always work. Like, I was looking 326 00:12:17,769 --> 00:12:19,529 at the AWS and the Azure one, and 327 00:12:19,529 --> 00:12:21,514 under both of them, Starbucks went down. So 328 00:12:21,514 --> 00:12:23,674 it's like Yes. In that case, multi cloud 329 00:12:23,674 --> 00:12:26,235 didn't even help. Like, Starbucks crashed with AWS. 330 00:12:26,235 --> 00:12:28,394 They crashed with Azure. It is what it 331 00:12:28,394 --> 00:12:30,634 is. And the Azure one too, like, you 332 00:12:30,634 --> 00:12:32,735 mentioned the network storm, and I think that's 333 00:12:32,794 --> 00:12:34,154 some of it. We talked about how a 334 00:12:34,154 --> 00:12:36,829 small change can trigger a widespread effect. 335 00:12:37,370 --> 00:12:39,230 Looking at the Azure outage, 336 00:12:39,690 --> 00:12:41,850 that one was a little bit more that 337 00:12:41,850 --> 00:12:43,529 way where there was a change that was 338 00:12:43,529 --> 00:12:47,309 applied to Front Door configuration change, and 339 00:12:47,690 --> 00:12:49,735 it caused a few of the Front Door 340 00:12:49,735 --> 00:12:52,295 nodes to fail. And then everything starts failing 341 00:12:52,295 --> 00:12:54,774 over to working ones, but the working ones 342 00:12:54,774 --> 00:12:57,894 don't handle all the failovers, and then they 343 00:12:57,894 --> 00:12:59,115 start failing, and 344 00:12:59,495 --> 00:13:01,860 it just snowballs from there where it wasn't 345 00:13:01,860 --> 00:13:03,940 like to your point, people didn't just go 346 00:13:03,940 --> 00:13:05,779 apply everything to all the front doors at 347 00:13:05,779 --> 00:13:08,440 once, but one cascaded to another. 348 00:13:12,339 --> 00:13:14,475 Do you feel overwhelmed by trying to manage 349 00:13:14,475 --> 00:13:16,794 your Office 365 environment?
Are you 350 00:13:16,794 --> 00:13:20,095 facing unexpected issues that disrupt your company's productivity? 351 00:13:20,315 --> 00:13:22,315 Intelligink is here to help. Much like you 352 00:13:22,315 --> 00:13:24,154 take your car to the mechanic that has 353 00:13:24,154 --> 00:13:26,315 specialized knowledge on how to best keep your 354 00:13:26,315 --> 00:13:29,350 car running, Intelligink helps you with your Microsoft 355 00:13:29,350 --> 00:13:31,690 cloud environment because that's their expertise. 356 00:13:31,990 --> 00:13:34,310 Intelligink keeps up with the latest updates in 357 00:13:34,310 --> 00:13:36,470 the Microsoft cloud to help keep your business 358 00:13:36,470 --> 00:13:38,710 running smoothly and ahead of the curve. Whether 359 00:13:38,710 --> 00:13:40,790 you are a small organization with just a 360 00:13:40,790 --> 00:13:43,184 few users up to an organization of several 361 00:13:43,184 --> 00:13:45,985 thousand employees, they want to partner with you 362 00:13:45,985 --> 00:13:49,365 to implement and administer your Microsoft cloud technology. 363 00:13:49,985 --> 00:13:53,605 Visit them at intelligink.com/podcast. 364 00:13:53,904 --> 00:14:00,529 That's intelligink.com/podcast 365 00:14:00,910 --> 00:14:03,070 for more information or to schedule a thirty 366 00:14:03,070 --> 00:14:05,169 minute call to get started with them today. 367 00:14:05,389 --> 00:14:08,750 Remember, Intelligink focuses on the Microsoft cloud so 368 00:14:08,750 --> 00:14:10,485 you can focus on your business. 369 00:14:12,644 --> 00:14:14,964 It wasn't our problem to fix, but I also 370 00:14:14,964 --> 00:14:17,284 caused a cascading failure this week, Scott. Unless 371 00:14:17,284 --> 00:14:18,964 you wanna talk more about AWS and Azure 372 00:14:18,964 --> 00:14:21,204 failures. We should talk about the front door 373 00:14:21,204 --> 00:14:24,004 one really quick. Alright.
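The Front Door snowball mentioned before the break, where a few nodes fail and the survivors can't absorb the failover traffic, can be sketched as a toy simulation. Everything here is an assumption for illustration: `simulate_cascade`, the capacity numbers, and the idea that load spreads evenly across healthy nodes are made up and say nothing about how Azure Front Door actually balances traffic.

```python
def simulate_cascade(capacities, total_load, initially_failed):
    """Toy model: load is spread evenly over healthy nodes, and any node
    whose share exceeds its capacity fails in the next round.

    Returns (healthy_nodes_left, rounds_of_additional_failures).
    """
    # Assume the first `initially_failed` nodes are the ones taken out
    # by the bad configuration change.
    healthy = list(capacities[initially_failed:])
    rounds = 0
    while healthy:
        share = total_load / len(healthy)
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # everyone can carry their share, system is stable
        healthy = survivors
        rounds += 1
    return len(healthy), rounds


# Hypothetical fleet of six nodes with uneven headroom.
fleet = [100, 100, 120, 130, 150, 200]

print(simulate_cascade(fleet, 500, 1))  # losing one node is absorbed
print(simulate_cascade(fleet, 500, 2))  # losing two tips the rest over
```

In this made-up fleet, losing one node is fine, but losing two pushes the per-node share past what the smallest survivor can take, and each failure raises the share for whoever is left.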
And I think I 374 00:14:24,004 --> 00:14:27,540 just wanna take this opportunity maybe as someone 375 00:14:27,540 --> 00:14:29,700 who's a little bit closer to the lingo 376 00:14:29,700 --> 00:14:32,420 that's used internally around these things to clarify 377 00:14:32,420 --> 00:14:34,420 some things. Yep. I saw a thread on 378 00:14:34,420 --> 00:14:38,580 Reddit that was diving into the front door 379 00:14:38,580 --> 00:14:41,384 outage. And if you go read the RCA 380 00:14:42,084 --> 00:14:44,084 that comes out, like, I'll go with this 381 00:14:44,084 --> 00:14:45,924 first sentence in the what went wrong and 382 00:14:45,924 --> 00:14:51,044 why. An inadvertent tenant configuration change within Azure 383 00:14:51,044 --> 00:14:52,745 Front Door triggered a widespread 384 00:14:53,110 --> 00:14:56,009 service disruption affecting both Microsoft services 385 00:14:56,470 --> 00:14:57,769 and customer applications 386 00:14:58,230 --> 00:15:01,269 dependent on Azure Front Door for global content 387 00:15:01,269 --> 00:15:02,710 delivery. And I'm gonna go back to the 388 00:15:02,710 --> 00:15:04,570 very first part of that. An inadvertent 389 00:15:05,110 --> 00:15:07,049 tenant configuration change 390 00:15:07,455 --> 00:15:10,254 within Azure Front Door triggered a widespread service 391 00:15:10,254 --> 00:15:10,754 disruption. 
392 00:15:11,055 --> 00:15:13,375 There were folks on Reddit who were reading 393 00:15:13,375 --> 00:15:15,634 that, and they were taking that terminology 394 00:15:16,175 --> 00:15:18,355 of a tenant configuration change 395 00:15:18,894 --> 00:15:23,039 to mean that a customer tenant, like you, 396 00:15:23,039 --> 00:15:24,799 maybe you have a front door profile and 397 00:15:24,799 --> 00:15:26,100 I have a front door profile, 398 00:15:26,559 --> 00:15:28,879 that you would have the ability to push 399 00:15:28,879 --> 00:15:29,539 a configuration 400 00:15:29,840 --> 00:15:32,639 change to your front door profile that would 401 00:15:32,639 --> 00:15:34,799 take down the whole system. That cascaded to 402 00:15:34,799 --> 00:15:37,504 everything? Yeah. I coulda told you that from 403 00:15:37,504 --> 00:15:39,825 externally. Right? But I can see where that 404 00:15:39,825 --> 00:15:43,684 language tenant is used so broadly. So broadly. 405 00:15:44,785 --> 00:15:46,785 Yeah. Familiar with it could take that. So 406 00:15:46,785 --> 00:15:48,384 I just wanted to maybe provide a little 407 00:15:48,384 --> 00:15:50,544 bit of clarification there. So when we say 408 00:15:50,544 --> 00:15:51,830 tenant in 409 00:15:52,610 --> 00:15:53,670 this respect, 410 00:15:54,290 --> 00:15:57,190 really what we're saying is service tenant 411 00:15:57,490 --> 00:16:00,610 or maybe tenant that the service itself is 412 00:16:00,610 --> 00:16:02,850 hosted on. So may maybe another word for 413 00:16:02,850 --> 00:16:05,144 tenant here would be scale unit. Like, what 414 00:16:05,144 --> 00:16:07,485 are the scale units that host Front Door 415 00:16:07,544 --> 00:16:08,044 versus 416 00:16:08,504 --> 00:16:09,644 what are the actual 417 00:16:10,184 --> 00:16:11,945 customer tenants and things that are out there? 
418 00:16:11,945 --> 00:16:14,365 And I think the confusion for this one 419 00:16:14,584 --> 00:16:17,464 was maybe a little bit further born out 420 00:16:17,464 --> 00:16:20,360 of the fact that the front door team 421 00:16:20,819 --> 00:16:22,120 has currently blocked 422 00:16:22,500 --> 00:16:25,940 all front door configuration changes. Oh, interesting. If 423 00:16:25,940 --> 00:16:27,459 you have a front door profile and I 424 00:16:27,459 --> 00:16:28,679 have a front door profile, 425 00:16:29,059 --> 00:16:31,779 we are blocked from making changes to those 426 00:16:31,779 --> 00:16:34,105 profiles right now. And I think this kinda 427 00:16:34,105 --> 00:16:35,644 perpetuates that thinking 428 00:16:36,105 --> 00:16:38,345 that, oh, you and I are blocked from 429 00:16:38,345 --> 00:16:40,825 making changes, and that's because I could make 430 00:16:40,825 --> 00:16:43,485 a change that's gonna impact you. And 431 00:16:43,865 --> 00:16:45,945 I don't think that's the case with this 432 00:16:45,945 --> 00:16:47,644 one. I think this is more like 433 00:16:47,970 --> 00:16:50,389 scale units, internal service things, 434 00:16:50,929 --> 00:16:53,490 all of that again. So there was a 435 00:16:53,490 --> 00:16:55,350 configuration change internally. 436 00:16:56,129 --> 00:16:58,629 That configuration change introduced 437 00:16:59,009 --> 00:17:02,414 an invalid state, very similar to those race 438 00:17:02,414 --> 00:17:04,755 conditions that we were talking about with Dynamo 439 00:17:04,815 --> 00:17:06,115 on on the other side. 
440 00:17:06,494 --> 00:17:07,474 That inconsistent 441 00:17:07,855 --> 00:17:08,355 state 442 00:17:08,894 --> 00:17:12,815 caused a whole bunch of AFD tenants or 443 00:17:12,815 --> 00:17:16,174 AFD nodes, AFD scale units, whatever we wanna 444 00:17:16,174 --> 00:17:17,660 call them, to crash, 445 00:17:17,960 --> 00:17:20,360 and on that crash, to subsequently not be 446 00:17:20,360 --> 00:17:22,059 able to load properly. 447 00:17:22,519 --> 00:17:24,440 So Azure Front Door is kind of a 448 00:17:24,440 --> 00:17:25,660 global load balancer 449 00:17:25,960 --> 00:17:28,440 and a DNS load balancer. All of a 450 00:17:28,440 --> 00:17:30,565 sudden, you started seeing all this weird stuff, 451 00:17:30,565 --> 00:17:31,465 increased latencies, 452 00:17:31,845 --> 00:17:33,545 timeouts, connection errors 453 00:17:33,845 --> 00:17:34,345 for 454 00:17:34,725 --> 00:17:37,605 every sort of downstream service that exists out 455 00:17:37,605 --> 00:17:40,725 there. So, like, in storage land, you ever 456 00:17:40,725 --> 00:17:43,619 provisioned a ZRS storage account? A ZRS storage 457 00:17:43,619 --> 00:17:46,200 account, your DNS endpoint, your public endpoint 458 00:17:46,820 --> 00:17:49,539 is a DNS CNAME that is part of 459 00:17:49,539 --> 00:17:50,680 a front door profile 460 00:17:51,140 --> 00:17:53,460 and points to a front door profile. So, 461 00:17:53,779 --> 00:17:55,140 not good. Right? Like, all of a sudden 462 00:17:55,140 --> 00:17:56,279 your ZRS 463 00:17:56,660 --> 00:17:58,855 zone-resilient thing, like, could be 464 00:17:58,855 --> 00:18:00,774 having some trouble due to lack of DNS 465 00:18:00,774 --> 00:18:01,274 resolution.
466 00:18:01,815 --> 00:18:04,534 The other one that happens in Azure land 467 00:18:04,534 --> 00:18:05,034 is 468 00:18:05,494 --> 00:18:08,474 so much of the tooling talks to API 469 00:18:08,534 --> 00:18:11,654 endpoints that are available via Front Door or 470 00:18:11,654 --> 00:18:14,119 fronted via Front Door. So you think about, 471 00:18:14,119 --> 00:18:16,440 like, management.azure.com, 472 00:18:16,440 --> 00:18:19,799 which is the RESTful API surface for all 473 00:18:19,799 --> 00:18:22,519 of Azure Resource Manager. That's behind Front Door. 474 00:18:22,519 --> 00:18:24,440 Lots of folks notice it when the portal 475 00:18:24,440 --> 00:18:26,835 goes down because you just go to, say, 476 00:18:26,835 --> 00:18:29,554 you're a public Azure customer, it doesn't 477 00:18:29,554 --> 00:18:32,034 matter if you're in the United Kingdom or 478 00:18:32,034 --> 00:18:34,034 the United States. We all just go to 479 00:18:34,034 --> 00:18:35,815 portal.azure.com, 480 00:18:35,875 --> 00:18:37,734 and we get redirected 481 00:18:38,115 --> 00:18:40,054 to the closest portal instance 482 00:18:40,509 --> 00:18:43,390 via DNS load balancing via Traffic Manager. So 483 00:18:43,390 --> 00:18:45,630 there's actually, like, regional endpoints for the portal, 484 00:18:45,630 --> 00:18:47,390 but they're all masked out because they're part 485 00:18:47,390 --> 00:18:50,190 of this resolution chain on the DNS side 486 00:18:50,190 --> 00:18:52,049 that can go a little sideways 487 00:18:52,750 --> 00:18:54,130 in the case of 488 00:18:54,644 --> 00:18:57,045 Front Door clearing out and getting to where 489 00:18:57,045 --> 00:18:58,644 it needs to be. So, 490 00:18:58,644 --> 00:19:01,065 yeah, definitely not a good look for either 491 00:19:01,845 --> 00:19:04,404 Azure or AWS on this one.
I'm very 492 00:19:04,404 --> 00:19:07,285 mindful of, like, the customer pain that's felt 493 00:19:07,285 --> 00:19:09,205 on these and the friction that comes with 494 00:19:09,205 --> 00:19:10,750 it. I think the consolation 495 00:19:11,450 --> 00:19:11,950 is, 496 00:19:12,409 --> 00:19:13,549 one, as 497 00:19:14,329 --> 00:19:15,230 folks who 498 00:19:15,769 --> 00:19:18,009 curate and look after these environments that are 499 00:19:18,009 --> 00:19:20,190 hosted in Azure and AWS, 500 00:19:20,809 --> 00:19:22,329 as much as we own the message to 501 00:19:22,329 --> 00:19:24,169 our users that, yeah, it's broken and it's 502 00:19:24,169 --> 00:19:26,105 down, at least we don't have to own 503 00:19:26,105 --> 00:19:28,264 the fix for it, which is a double-edged sword. 504 00:19:28,264 --> 00:19:29,464 I don't think many of us could 505 00:19:29,464 --> 00:19:31,625 fix it faster than the folks who built 506 00:19:31,625 --> 00:19:32,284 these things 507 00:19:32,585 --> 00:19:35,065 could anyway all along the way. But it 508 00:19:35,065 --> 00:19:36,984 does give us some stuff to go 509 00:19:36,984 --> 00:19:38,524 out and think about 510 00:19:38,960 --> 00:19:40,320 and see if we can do a little 511 00:19:40,320 --> 00:19:42,480 bit differently next time. Yeah. And while they 512 00:19:42,480 --> 00:19:43,700 do go down, 513 00:19:44,160 --> 00:19:45,539 I would say lately, 514 00:19:46,160 --> 00:19:48,339 and this was kinda the case of AWS 515 00:19:48,400 --> 00:19:50,799 and Azure, I would say, is I feel 516 00:19:50,799 --> 00:19:51,299 like 517 00:19:51,759 --> 00:19:53,460 response times and fix times 518 00:19:54,015 --> 00:19:57,154 for Azure and AWS have gotten quicker.
519 00:19:57,295 --> 00:19:58,035 Like, the 520 00:19:58,414 --> 00:20:00,414 time from when they first go down to 521 00:20:00,414 --> 00:20:01,634 when they come back online 522 00:20:02,095 --> 00:20:04,515 used to, and I guess I think back to 523 00:20:04,575 --> 00:20:06,815 several years ago where you'd see outages that 524 00:20:06,815 --> 00:20:08,894 would be, like, day-long outages, whether it 525 00:20:08,894 --> 00:20:10,099 was eight, ten, 526 00:20:10,400 --> 00:20:12,799 twelve, twenty-four hours. There have been outages 527 00:20:12,799 --> 00:20:13,859 in Azure, AWS, 528 00:20:14,559 --> 00:20:16,960 Microsoft 365, all of those. I 529 00:20:16,960 --> 00:20:19,599 feel like the recovery time when everything, like, 530 00:20:19,599 --> 00:20:22,694 to catch the issue starting to happen to 531 00:20:22,694 --> 00:20:24,694 where it's starting to resolve. Maybe it's not 532 00:20:24,694 --> 00:20:25,674 completely resolved, 533 00:20:26,134 --> 00:20:28,774 but you're not hard down for, like, eight, 534 00:20:28,774 --> 00:20:31,255 ten hours. Companies have gotten better at that, 535 00:20:31,255 --> 00:20:34,214 catching it, mitigating it, and getting things back 536 00:20:34,214 --> 00:20:36,559 up quickly or at least starting to get 537 00:20:36,559 --> 00:20:38,320 them back up quickly. That seems to have 538 00:20:38,320 --> 00:20:40,160 gotten a lot better, I would say, 539 00:20:40,160 --> 00:20:41,680 in the last few years. It goes both 540 00:20:41,680 --> 00:20:43,840 ways. When the entire Internet is down, it 541 00:20:43,840 --> 00:20:44,820 feels like forever. 542 00:20:45,279 --> 00:20:47,200 And it's not just when the entire Internet's 543 00:20:47,200 --> 00:20:48,500 down. I think there's 544 00:20:48,994 --> 00:20:51,875 economic loss that's associated with these things.
So 545 00:20:51,875 --> 00:20:53,955 I saw some estimates talking about, like, the 546 00:20:53,955 --> 00:20:56,914 AWS outage even for the, quote, unquote, brief 547 00:20:56,914 --> 00:20:58,835 period of time that it was being as 548 00:20:58,835 --> 00:21:01,315 high as, like, $500 to $600 million 549 00:21:01,315 --> 00:21:03,029 in lost revenue. Yeah. I saw some of 550 00:21:03,029 --> 00:21:05,269 those numbers too. For the companies that 551 00:21:05,269 --> 00:21:07,610 are hosted on top of it. I think 552 00:21:07,830 --> 00:21:08,970 like any 553 00:21:09,350 --> 00:21:11,430 dark cloud, like, you gotta look for the 554 00:21:11,430 --> 00:21:14,070 silver linings. It can't always be a glass half 555 00:21:14,070 --> 00:21:16,970 empty kind of thing. So I will say 556 00:21:17,134 --> 00:21:20,174 a couple of maybe, like, positive things that 557 00:21:20,174 --> 00:21:22,335 happened in both of these outages, both the 558 00:21:22,335 --> 00:21:25,134 AWS one and the Azure one. I'm seeing 559 00:21:25,134 --> 00:21:28,414 that communication's getting better. So while folks are 560 00:21:28,414 --> 00:21:30,835 still complaining that, like, oh, the status pages 561 00:21:30,894 --> 00:21:33,500 aren't updating, things like that, I do think 562 00:21:33,500 --> 00:21:35,440 the kind of proactive communication, 563 00:21:35,740 --> 00:21:37,679 like, we're finding a better balance between 564 00:21:38,059 --> 00:21:40,139 how many engineers do we put on fixing 565 00:21:40,139 --> 00:21:42,619 the problem, which I generally, I would say 566 00:21:42,619 --> 00:21:44,700 let's index towards putting everybody on it.
But 567 00:21:44,700 --> 00:21:46,139 if we put everybody on it, that's at 568 00:21:46,139 --> 00:21:47,899 the expense of being able to communicate to 569 00:21:47,899 --> 00:21:49,525 customers, because we might even be taking the 570 00:21:49,525 --> 00:21:51,525 person who can take that message and 571 00:21:51,525 --> 00:21:52,404 figure out how to get it to where 572 00:21:52,404 --> 00:21:53,924 it needs to be. So I think the 573 00:21:53,924 --> 00:21:56,884 transparent communication is getting way better. I've been really 574 00:21:56,884 --> 00:21:58,585 impressed by the 575 00:21:59,205 --> 00:22:00,265 post-incident 576 00:22:01,009 --> 00:22:03,269 reviews that have come out from both Amazon 577 00:22:03,569 --> 00:22:05,190 and Azure recently. 578 00:22:05,809 --> 00:22:07,730 They're kinda going above and beyond in the 579 00:22:07,730 --> 00:22:10,309 things that they talk about and expose. Like, 580 00:22:10,369 --> 00:22:12,369 you as a regular customer, me as a regular 581 00:22:12,369 --> 00:22:14,625 customer, we should never need to know the 582 00:22:14,625 --> 00:22:15,845 names of 583 00:22:16,304 --> 00:22:16,964 the internal 584 00:22:17,424 --> 00:22:17,924 microservices 585 00:22:18,544 --> 00:22:19,924 that are part of DynamoDB. 586 00:22:20,625 --> 00:22:22,625 And, like, we should never need to know 587 00:22:22,625 --> 00:22:23,125 about 588 00:22:23,585 --> 00:22:27,424 these things like AWS's internal planner and enactor 589 00:22:27,424 --> 00:22:27,924 workers. 590 00:22:28,230 --> 00:22:30,950 Like, alright. Great. Like, let's not worry about 591 00:22:30,950 --> 00:22:32,390 that kind of thing.
So I think you 592 00:22:32,390 --> 00:22:34,250 are seeing, like, a level of transparency 593 00:22:34,630 --> 00:22:35,929 from the hyperscalers 594 00:22:36,630 --> 00:22:37,690 that run these things 595 00:22:38,150 --> 00:22:40,329 and good transparent communication 596 00:22:40,630 --> 00:22:42,170 that's happening during the outages. 597 00:22:42,575 --> 00:22:44,095 The other thing I'll call out, like, these 598 00:22:44,095 --> 00:22:46,515 took some time in both cases to fix, 599 00:22:46,734 --> 00:22:48,835 but all those rollback procedures 600 00:22:49,454 --> 00:22:51,634 and stopping the bleeding and all that stuff, 601 00:22:51,694 --> 00:22:53,855 it worked. It's not like we're sitting here a week later 602 00:22:53,855 --> 00:22:55,855 and people are still banging their heads against 603 00:22:55,855 --> 00:22:57,295 the wall going, we don't know what the 604 00:22:57,295 --> 00:22:58,890 problem is. We don't know how to fix 605 00:22:58,890 --> 00:23:01,369 it. We don't know what changed. We don't 606 00:23:01,369 --> 00:23:03,549 know what happened. That's not the case here. 607 00:23:03,610 --> 00:23:05,690 Like, these things happened. There were, point in 608 00:23:05,690 --> 00:23:08,410 time, tons of friction, tons of pain, horrible, 609 00:23:08,410 --> 00:23:11,049 yes. But they got fixed. They got fixed 610 00:23:11,049 --> 00:23:12,830 by somebody else, and they were fixed successfully. 611 00:23:13,315 --> 00:23:14,994 And then for whatever these failure modes are, 612 00:23:14,994 --> 00:23:17,154 like I said, you can be pretty confident 613 00:23:17,154 --> 00:23:18,595 that they're not gonna happen in the future. 614 00:23:18,595 --> 00:23:20,595 Are other things gonna happen? Yes. They haven't 615 00:23:20,595 --> 00:23:22,835 been discovered yet.
But as they are, it 616 00:23:22,835 --> 00:23:25,154 all leads to more resiliency and it lends 617 00:23:25,154 --> 00:23:26,375 itself to more resiliency 618 00:23:27,289 --> 00:23:28,509 for these services. 619 00:23:29,130 --> 00:23:32,509 In some cases, I think there were mitigations 620 00:23:32,730 --> 00:23:34,490 put in place in a timely manner. So 621 00:23:34,490 --> 00:23:36,029 in the case of the Front Door outage, 622 00:23:36,329 --> 00:23:37,390 I saw that 623 00:23:37,769 --> 00:23:39,230 they actually pulled 624 00:23:39,529 --> 00:23:40,190 the portal 625 00:23:40,575 --> 00:23:42,815 out from behind Front Door. Like, they went 626 00:23:42,815 --> 00:23:45,454 and manipulated some DNS records to be able 627 00:23:45,454 --> 00:23:47,454 to give customers relief so that they could 628 00:23:47,454 --> 00:23:49,694 reach the portal without having to go through 629 00:23:49,694 --> 00:23:50,194 AFD 630 00:23:50,494 --> 00:23:51,794 and the load balancing 631 00:23:52,174 --> 00:23:52,674 mechanics 632 00:23:53,410 --> 00:23:55,170 that it brings along the way. I think 633 00:23:55,170 --> 00:23:57,490 the tooling's getting better. You're getting the ability 634 00:23:57,490 --> 00:24:00,150 in the tooling to target specific API surfaces, 635 00:24:00,369 --> 00:24:03,109 have other workarounds there, so that's all good. 636 00:24:03,569 --> 00:24:06,345 And, yeah, in general, like, sucks that it 637 00:24:06,345 --> 00:24:08,345 happened, but I'm actually, like, really happy with 638 00:24:08,345 --> 00:24:10,265 the responses here and the way they came 639 00:24:10,265 --> 00:24:12,664 out. Stuff could always go quicker. But that 640 00:24:12,664 --> 00:24:14,904 said, I think for what happened and the 641 00:24:14,904 --> 00:24:16,605 scale of both of these outages, 642 00:24:16,904 --> 00:24:20,684 stuff actually happened in a very timely way.
643 00:24:20,700 --> 00:24:21,440 And, ultimately, 644 00:24:22,380 --> 00:24:23,819 not much that I would have wanted to 645 00:24:23,819 --> 00:24:25,659 do as a customer anyway. Like, if I 646 00:24:25,659 --> 00:24:27,819 was already a multi-cloud customer and I'm 647 00:24:27,819 --> 00:24:28,319 hosting 648 00:24:29,419 --> 00:24:30,639 in AWS and Azure, 649 00:24:30,940 --> 00:24:32,299 it's not like I'm gonna go out and 650 00:24:32,299 --> 00:24:34,220 bang on the door and say, well, let's 651 00:24:34,220 --> 00:24:36,154 go put ourselves into Oracle or Google and 652 00:24:36,154 --> 00:24:38,315 get yet another cloud here. Like, that's not 653 00:24:38,315 --> 00:24:39,615 necessarily the answer 654 00:24:40,075 --> 00:24:42,474 or the thing that's going to save you. 655 00:24:42,474 --> 00:24:45,295 You're only as resilient as your least resilient 656 00:24:45,355 --> 00:24:47,355 service kind of thing still at the end 657 00:24:47,355 --> 00:24:48,015 of the day. 658 00:24:48,369 --> 00:24:49,409 I think there is a little bit of 659 00:24:49,409 --> 00:24:51,890 an opportunity for customers to go through. Maybe 660 00:24:51,890 --> 00:24:53,809 you do wanna audit your dependencies a little 661 00:24:53,809 --> 00:24:55,250 bit, like, hey, do I have to take 662 00:24:55,250 --> 00:24:56,609 a dependency on this thing? Or if I 663 00:24:56,609 --> 00:24:59,429 do, is there an alternative or a fallback 664 00:25:00,130 --> 00:25:02,849 service for me? Along the way, review your 665 00:25:02,849 --> 00:25:05,575 DR plans. So while you're not responsible, like 666 00:25:05,575 --> 00:25:07,734 I said, for fixing the servers and 667 00:25:07,734 --> 00:25:10,295 the underlying microservices that power these things, I 668 00:25:10,295 --> 00:25:11,815 think you still wanna have good ways to 669 00:25:11,815 --> 00:25:14,315 communicate to your users about what's going on.
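To put rough numbers on the "only as resilient as your least resilient service" and fallback points above, here is a small sketch. It assumes dependencies fail independently, which shared components like DNS often violate, so treat the results as optimistic. The availability figures are made-up examples, not published SLAs, and the function names are hypothetical.

```python
from math import prod
from typing import Iterable

HOURS_PER_YEAR = 8760  # non-leap year


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain where every dependency must be up."""
    return prod(availabilities)


def with_fallback(primary: float, fallback: float) -> float:
    """Availability when either the primary or the fallback being up is enough."""
    return 1 - (1 - primary) * (1 - fallback)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative figures only: three chained services at 99.9% each.
    chain = serial_availability([0.999, 0.999, 0.999])
    print(f"chain availability: {chain:.6f}")
    print(f"chain downtime: {downtime_hours_per_year(chain):.2f} h/year")
    print(f"with a 99% fallback: {with_fallback(0.999, 0.99):.5f}")
```

The chained result is never better than the weakest link and usually a bit worse, which is why an audit tends to focus on removing hard dependencies or adding a fallback rather than adding another provider on top.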
670 00:25:14,535 --> 00:25:17,095 So if you're a company that works with 671 00:25:17,095 --> 00:25:17,595 Azure 672 00:25:18,809 --> 00:25:21,130 and you have admins who are maybe more 673 00:25:21,130 --> 00:25:22,970 click ops and they're dependent on the Azure 674 00:25:22,970 --> 00:25:24,730 portal, you wanna make sure that you have, 675 00:25:24,730 --> 00:25:25,789 like, good documentation 676 00:25:26,410 --> 00:25:28,490 for your employees about what happens when the 677 00:25:28,490 --> 00:25:29,710 Azure portal is unavailable. 678 00:25:30,184 --> 00:25:31,944 What happens when the M365 679 00:25:31,944 --> 00:25:33,865 portal is unavailable? What happens when this 680 00:25:33,865 --> 00:25:35,785 service is unavailable? Just so they know what 681 00:25:35,785 --> 00:25:38,505 to do, and they've got that kinda measure 682 00:25:38,505 --> 00:25:41,244 of comfort. You also need to think about 683 00:25:41,545 --> 00:25:42,845 kinda documenting 684 00:25:43,464 --> 00:25:44,924 recovery plans and expectations 685 00:25:45,640 --> 00:25:46,619 in terms of timing. 686 00:25:47,000 --> 00:25:49,000 So what happens if my cloud provider is 687 00:25:49,000 --> 00:25:51,319 down for ten seconds? What happens if my 688 00:25:51,319 --> 00:25:53,960 cloud provider is down for ten hours? Those 689 00:25:53,960 --> 00:25:55,179 are very different scenarios. 690 00:25:55,480 --> 00:25:57,000 And the way we react, the way we 691 00:25:57,000 --> 00:25:58,619 communicate with our user bases, 692 00:25:59,005 --> 00:26:00,625 all those things are 693 00:26:01,244 --> 00:26:02,304 going to be impacted. 694 00:26:02,765 --> 00:26:04,365 I think you also do have to think, 695 00:26:04,365 --> 00:26:06,304 like, I mentioned status pages. 696 00:26:06,605 --> 00:26:10,065 Both AWS and Azure, like, the status pages 697 00:26:10,125 --> 00:26:12,224 are not the greatest things at getting updated.
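The ten-seconds-versus-ten-hours distinction above could be written down as a simple tiered runbook. This is only a sketch: the thresholds and actions in RUNBOOK are invented placeholders to replace with your own recovery expectations, and action_for is a hypothetical helper.

```python
from datetime import timedelta

# Hypothetical tiers: (minimum outage duration, documented response).
# Thresholds and wording are illustrative, sorted from shortest to longest.
RUNBOOK = [
    (timedelta(seconds=0), "rely on client retries; no announcement"),
    (timedelta(minutes=5), "post an internal note; point admins at CLI or API fallbacks"),
    (timedelta(hours=1), "notify users and leadership; start the DR checklist"),
    (timedelta(hours=8), "invoke the failover plan; send scheduled updates"),
]


def action_for(outage: timedelta) -> str:
    """Return the response for the highest tier the outage has reached."""
    chosen = RUNBOOK[0][1]
    for threshold, action in RUNBOOK:
        if outage >= threshold:
            chosen = action
    return chosen


if __name__ == "__main__":
    for duration in (timedelta(seconds=10), timedelta(hours=10)):
        print(duration, "->", action_for(duration))
```

Keeping the tiers in one small table makes it easier to review the expectations with both users and leadership before the next outage.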
698 00:26:12,284 --> 00:26:14,365 So, like, are there alternative systems that you 699 00:26:14,365 --> 00:26:16,819 wanna look at? I still see lots of 700 00:26:16,819 --> 00:26:19,380 customers using things like Downdetector and things 701 00:26:19,380 --> 00:26:22,179 like that to see when these things are 702 00:26:22,179 --> 00:26:24,899 occurring or if they have broader impact within 703 00:26:24,899 --> 00:26:26,359 geo, outside of geo, 704 00:26:26,914 --> 00:26:28,914 things like that. I think those are all 705 00:26:28,914 --> 00:26:31,394 good to stand up. And then the last 706 00:26:31,394 --> 00:26:33,954 thing I would think about is as you're 707 00:26:33,954 --> 00:26:35,554 going through and you're figuring out maybe some 708 00:26:35,554 --> 00:26:36,694 of these things around 709 00:26:37,154 --> 00:26:39,839 recovery plans, things like that, is making sure 710 00:26:39,839 --> 00:26:41,920 that you're not only setting the expectations with 711 00:26:41,920 --> 00:26:44,400 users, but also setting the expectations with your 712 00:26:44,400 --> 00:26:44,900 leadership. 713 00:26:45,359 --> 00:26:47,039 So, like, if you work for a company 714 00:26:47,039 --> 00:26:49,940 that's single cloud, multi-cloud, does your leadership 715 00:26:50,000 --> 00:26:51,140 have the right expectations 716 00:26:51,680 --> 00:26:52,180 around 717 00:26:52,720 --> 00:26:55,575 your company's dependency on the cloud? Has that 718 00:26:55,575 --> 00:26:57,815 been communicated in the right way? Does your 719 00:26:57,815 --> 00:26:58,875 leadership understand 720 00:26:59,494 --> 00:27:00,875 what they've bought into? 721 00:27:01,414 --> 00:27:03,335 Because there's the dream of the cloud: oh, 722 00:27:03,335 --> 00:27:05,494 it's somebody else's cloud, it's somebody else's problem, 723 00:27:05,494 --> 00:27:08,154 it's 100% available.
And then there's the reality 724 00:27:08,214 --> 00:27:10,809 of the cloud, which we know so far, 725 00:27:10,809 --> 00:27:13,630 no system out there is truly 100%. 726 00:27:13,690 --> 00:27:15,690 So making sure that those things are ready 727 00:27:15,690 --> 00:27:17,929 to go so that your LT can weigh 728 00:27:17,929 --> 00:27:19,929 out all those options they need to, like 729 00:27:19,929 --> 00:27:20,829 multi-cloud 730 00:27:21,289 --> 00:27:22,429 strategy options, 731 00:27:22,934 --> 00:27:26,075 ultimately understanding that whole, like, risk-reward scenario 732 00:27:26,934 --> 00:27:29,434 or maybe risk versus cost 733 00:27:29,734 --> 00:27:31,514 for things like additional resiliency 734 00:27:31,974 --> 00:27:32,634 and redundancy 735 00:27:33,494 --> 00:27:35,494 and where that all falls out for you. 736 00:27:35,494 --> 00:27:37,619 Sounds good. Well, with that, Scott, I actually 737 00:27:37,619 --> 00:27:39,640 have family waiting for me to go do 738 00:27:39,779 --> 00:27:42,100 Halloween-y stuff. So... Halloween-y stuff. You 739 00:27:42,100 --> 00:27:43,700 can... It is the day for it. At 740 00:27:43,700 --> 00:27:45,720 least the weather is nice here in Jacksonville. 741 00:27:46,100 --> 00:27:48,279 Nice and cool out there. It's a balmy 742 00:27:48,434 --> 00:27:49,955 68. Yep. I think this is the first 743 00:27:49,955 --> 00:27:52,515 year it's under, like, 80 degrees Fahrenheit for 744 00:27:52,515 --> 00:27:54,355 Halloween in a while. It's been a while 745 00:27:54,355 --> 00:27:57,075 since it's been this cool. So yes. Well, 746 00:27:57,075 --> 00:27:59,174 thanks for that. Hopefully, no more DNS 747 00:27:59,634 --> 00:28:01,259 cloud outages here for a while. Hopefully. 748 00:28:02,779 --> 00:28:04,940 Yes. It's something that nobody wants to happen. 749 00:28:04,940 --> 00:28:07,099 Nope. So go enjoy your weekend.
Enjoy the 750 00:28:07,099 --> 00:28:09,740 rest of your Friday, and we'll be back 751 00:28:09,740 --> 00:28:12,619 again in a couple of weeks. Alright. Sounds 752 00:28:12,619 --> 00:28:14,295 good. Thanks, Ben. Alright. Thanks, 753 00:28:16,295 --> 00:28:18,775 Scott. If you enjoyed the podcast, go leave 754 00:28:18,775 --> 00:28:20,934 us a five-star rating in iTunes. It 755 00:28:20,934 --> 00:28:22,615 helps to get the word out so more 756 00:28:22,615 --> 00:28:24,934 IT pros can learn about Office 365 757 00:28:24,934 --> 00:28:25,755 and Azure. 758 00:28:26,295 --> 00:28:27,894 If you have any questions you want us 759 00:28:27,894 --> 00:28:30,160 to address on the show or feedback about 760 00:28:30,160 --> 00:28:32,480 the show, feel free to reach out via 761 00:28:32,480 --> 00:28:34,660 our website, Twitter, or Facebook. 762 00:28:34,960 --> 00:28:36,880 Thanks again for listening, and have a great 763 00:28:36,880 --> 00:28:37,380 day.