Riskgaming

Why engineers are using chaos to make computers more resilient

Description

The CrowdStrike meltdown on July 19th shut down the world with one faulty patch — proving once again the interconnected fragility of global IT systems. On Tuesday this week, the company released its Root Cause Analysis as both an explanation and a mea culpa, but the wider question remains: with so much of our lives dependent on silicon and electrons, how can engineers design resilience into their code from the bottom up? And more importantly, how can we effectively test how resilient our systems actually are?

⁠Kolton Andrus⁠ is one of the experts on this subject. For years at Amazon and Netflix, he worked on designing fault-tolerant systems, building upon the nascent ideas of the field of chaos engineering, an approach that iteratively and stochastically challenges systems to test for resilience. Now, as CTO and founder of ⁠Gremlin⁠, he’s democratizing access to chaos engineering and reliability testing for everyone.

Kolton joins host ⁠Danny Crichton⁠ and Lux’s scientist-in-residence and complexity specialist ⁠Sam Arbesman⁠. Together, we talk about why resilience must start at the beginning of product design, how resilience is aligning with security as a core value of developer culture, how computer engineering is maturing as a field, and finally, why we need more technological humility about the interconnections of our global compute infrastructure.

Transcript

This is a human-generated transcript; however, it has not been verified for accuracy.

Danny Crichton:                                        

Let's focus on the triggers for this. So I mean obviously it'll be like a week or two when we finally publish this, but I mean the trigger for this conversation was really this CrowdStrike outage. So CrowdStrike sort of pushed a minor patch, at least from their perspective, to all of their computers. Tens of millions of computers globally used CrowdStrike to sort of protect [inaudible 00:00:18] antivirus and security threats. It's really a check-the-box system, so it's designed to make sure that systems are up-to-date and are sort of reliable.
                                                     

And obviously this patch had a little bit of a flaw in that it basically created an infinite loop that prevented any Windows computer from loading. So a lot of offices, airlines, utility companies basically went to blue screens of death that, in many cases, unless you were sort of prepped from an IT perspective for this sort of contingency, were basically impossible to solve without actual physical access to the machine.
                                                     

So I, a couple of days after the outage, actually went to a doctor's office and still, like, half the computers in the lobby were blue screens of death. And it was like it still has not been fixed. It was like five, six days afterwards. Delta Air Lines obviously took a lot of days to get its crew management systems online.
                                                     

And so this really triggered, I think for a lot of folks, this idea of chaos engineering, reliable systems and the challenge of these sorts of critical points of error where you are trying to use the software to create more robust systems. That is the point of CrowdStrike: to ensure that these systems are protected against malicious actors who are trying to stop them, but then it itself becomes part of the problem and the nexus. And to me that is sort of the chaos that comes into a lot of reliable systems. And Kolton, obviously you've spent, I guess, 15 years plus working on reliable systems, trying to build these sorts of tools across AWS and a bunch of other employers over the years. What were your first thoughts when you heard about the CrowdStrike outage in the news?

Kolton Andrus:                                        

Yeah, I mean it's a little bit of PTSD when you see something like this happen because your mind goes to that, "I'm in the incident, everything's going wrong. We don't quite know what's happening. We're still debugging, we're still diagnosing. What's the full impact?" That's what I've been trained to respond with is that set of thoughts. Candidly, there's a little bit of engineering hubris. I heard Windows and I was like, "Well, we run on Linux, so phew. Okay." Now maybe I'll sit back and I can take a little bit more of an observer role rather than do I need to jump in and go react immediately?

Danny Crichton:                                        

I mean obviously there's a focus on Windows, which I think in the development world, everyone sort of thinks either OS X or Linux and Unix, and sort of, we're all protected over here with reliability, but most of the world's IT still runs on Windows, whether at the server level or on end users' desktops.

Kolton Andrus:                                        

I think that's the other thing is how would you like the world's infrastructure and architecture to look and how does it actually look? And we could sit back and say, "Well, if everyone ran on Linux, we wouldn't have this problem." And maybe that's true and maybe it's not. It would show up in a different form. But truthfully, what this highlighted is that a lot of the world's infrastructure runs on Windows, a lot of very important infrastructure runs on Windows. And so regardless of how you feel about Windows, Windows is important to the world, it's important to our digital infrastructure. If Windows goes down, we as a society are going to suffer. We all have some skin in the game here, even if we're not running a Windows laptop.

Danny Crichton:                                        

Let me ask you, we're talking about, A, the technical and the social aspect here. And B, for complicated organizations, I'm thinking like airlines, like Delta, this wasn't just one system. We have this idea that there's a massive application that sort of runs the whole airline. It's really not that, right? There are dozens of subsystems. And so Delta sort of recovered its reservation system really fast. You could log into planes, everything around scheduling was fine. The issue for them was actually a subsystem, their crew scheduling software, which was independent and actually took a lot longer to get back on board. And the airline actually struggled for more than a week, thousands of flights canceled, to get crews where they were supposed to be.
                                                     

And so I was wondering from your perspective, thinking about building reliable redundant systems, how do you start to think about architecting a world in which every enterprise has dozens of individual corporate applications, all of which are intertwined but somewhat independent of each other?

Kolton Andrus:                                        

Yeah, the airlines in particular, if you follow the news, they've had a few outages in the past few years and sometimes they all rely on this third party subsystem that helps them with scheduling or reservations. If that goes down, they're unable to process things. I've worked with some of the airlines. There's a lot of... the baggage systems are running on Windows in the background doing kind of old school stuff.
                                                     

And so I think it speaks to your point, there's a lot of interconnected pieces here and sometimes we think about the shiny piece, oh, I'm interacting with their mobile app or their website, so everything's connected to that. And no, that's one team, operating one system, that ties into these other ones, and that's where you get this fan-out, this web. It looks like a ball of yarn where everything's talking to everything else.
                                                     

And I think that's the root of why we're here. You know, with three things, you can figure out all the combinations of them. With 50 things, the sun will burn out before you can figure out all the combinations of how they can fail. And so you can't just brute force it. It's hard to sit and think and reason about it because there's things you just don't know. There's the side effects that happen. It's a tough problem is the short answer. I don't have an easy answer for you here in terms of...
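To make the combinatorics concrete, here is a quick back-of-the-envelope sketch in Python (illustrative numbers only, not anything computed on the show): if each of n components can independently be healthy or failed, there are 2^n failure combinations to consider, and n! possible failure orderings if the sequence matters.

```python
# Back-of-the-envelope numbers for the combinatorial explosion described above
# (illustrative only): if each of n components can be healthy or failed,
# there are 2**n failure combinations, and n! possible failure orderings.
import math

SECONDS_PER_YEAR = 60 * 60 * 24 * 365

for n in (3, 10, 50):
    combos = 2 ** n                      # which subset of components is down
    orderings = math.factorial(n)        # the order in which they went down
    years = combos / SECONDS_PER_YEAR    # testing one combination per second
    print(f"n={n:2d}: {combos:.3g} combinations "
          f"(~{years:.3g} years at 1 test/sec), {orderings:.3g} orderings")
```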
                                                     

I think part of it is understanding the problem, understanding the importance. One of my soapboxes is, I think about this all the time, and I think most businesses think about this when things are broken. The truth is there's a win-win here. They could think about it Wednesday afternoon, once a week, and they could get most of the value without having to invest much time. But instead we kick the can down the road. We kind of wait until things boil, and then it boils over to a point like this where we lose five days of, I don't know, $5 billion was the estimate. But we lose five days of countless thousands of people's time and effort. So that sucks. But what if we all just spent 1% more time leaning into that? Could we mitigate these? Could we have fewer of these?

Danny Crichton:                                        

Now, Sam, I'm talking about problem definition. I mean, so ball of yarn is a great way to describe it, usually it's called spaghetti code, but I like the ball of yarn analogy. But you had a piece in The Atlantic, and we summarized this in an earlier episode, but just give us a really short précis of your article about complexity and software code that was triggered by the CrowdStrike outage.

Samuel Arbesman:                                      

Yeah. And the way I kind of think about any large complex technological system is they have grown beyond any one person's single understanding. And so as a result of that, there is going to be this gap between how we think the system operates and how it does actually operate. And oftentimes when we are not running these kinds of systems to improve resilience and things like that, we only discover that gap when something goes wrong, when there's a failure that reveals that issue.
                                                     

The piece kind of looked at what are the different ways in which we can begin to probe our systems and our technologies and our software to make them a little bit more reliable. Now of course, as Kolton was saying, actually making these things fully reliable is probably never going to be possible, but you can still reduce the amount of uncertainty and make these things more and more reliable.
                                                     

But I think overall it kind of requires a certain amount of almost technological humility and recognizing that these are systems that, even though they're built by human beings, we don't fully understand them and how they all work, and so therefore we need to probe them and test them and inject errors and do all these kinds of things. Almost in the way that biologists study the living world, taking that sort of humble tinkering approach, or the naturalist approach from biology, to really think about these kinds of large complex systems.

Kolton Andrus:                                        

I really love humility. I think that's one of the things when you get into this space, you learn. You learn that the likelihood of the hard drive on your computer right now failing is pretty low. But the likelihood of thousands of hard drives in a data center failing starts to become daily. When you're running tens of thousands, hundreds of thousands of servers and moving pieces and network devices, and those are all things that could fail, failure becomes common.
                                                     

And that's where you start to realize, and as you said, it's very hard to get to a point where we could say, this system is bulletproof, this system will never fail. There's some reasons why we maybe don't want that, but to get to that would require us to go find and fix every one of these little issues and it's just untenable. And so that humility to say, okay, at its heart this is a problem that is maybe intractable, maybe very difficult to solve a hundred percent.
                                                     

Now you start talking about what does good look like? What is good enough? How far do I need to go? That's the slippery slope that we get into where good enough becomes good enough for too long, and as you said, then we think we've covered the gaps and we think we're covered, and then good enough goes until something goes wrong. Then we realize, oh, we're not good.

Samuel Arbesman:                                      

Right. And when it's good enough, it kind of breeds this complacency, which is exactly what you were saying. Which is like, we think, "Oh, this thing is working out well and so therefore we don't need to be contending with all these kinds of things." And the truth is whether or not you're aware of the fact that these systems are incomprehensible and incredibly complex, we are in that world. And so it's better to be aware of that thing from the outset so you can actually plan and try to make these systems robust and resilient as opposed to just being blindsided when these things actually do happen.

Kolton Andrus:                                        

So many times I work with a customer, or we will do it with our own software. We dogfood it, we do a bunch of chaos engineering and reliability testing on our software, and the number of times where we go, "Oh, that didn't go how it was supposed to go." And I think that speaks to that mental gap. Oh, we thought we were good until we realized we weren't. And the truth is, when you go in and you start kicking the tires and checking the oil, you find things, you find things that aren't perfect or maybe could be better pretty quickly.

Samuel Arbesman:                                      

When you have these bugs and failures, not only does it kind of reveal that gap between what you know, like how you think the system works and how it does actually work, but it also can sometimes even reveal certain things about the system that you hadn't even thought about in the first place. When I read about the ways in which, I think there's this idea of soft errors, cosmic rays can actually flip bits in systems. Now of course, that really doesn't matter for your laptop or whatever, but when you have these massive data centers, not only are hard drives going to fail, but actual cosmic rays can have a meaningful impact on the reliability of a system. That's mind-boggling. And something that, going back to technological humility, is also just very humbling in kind of a very different sort of sense as well.

Danny Crichton:                                        

The framework I always enjoy using from the risk literature was Charles Perrow's Normal Accidents from the 1980s, which describes big technological systems in terms of complexity and coupling. So as you get more and more components, and as those components interact with each other, accidents don't become rare, they actually transform into something that's very normal. Which I think is what you're describing here. Which is, you have to actually build resilient systems to assume that they're going to fail. They're going to fail in different ways, they are different pieces. And yes, at scale, if you're at Google's data centers or at AWS or any of the hyperscalers, you are at the scale in which a cosmic ray will knock out a bit of RAM at one of your data centers basically every day. Gamma rays do show up in that world enough that you actually sort of have to have error correction and ways of solving that sort of problem.
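For a picture of what that error correction looks like, here is a toy sketch in Python (illustrative only; real data centers use ECC memory with Hamming-style codes rather than this scheme): keep redundant copies of a word and majority-vote a single flipped bit back out.

```python
# Toy error correction against a single flipped bit: keep three copies of a
# word and take a bitwise majority vote. Real data centers use ECC memory
# (Hamming-style codes); this is just the simplest scheme that shows the idea.

def majority_vote(a: int, b: int, c: int) -> int:
    # A bit is set in the result if it is set in at least two of the copies.
    return (a & b) | (a & c) | (b & c)

word = 0b10110100
copies = [word, word, word]

copies[1] ^= 0b00001000          # a "cosmic ray" flips one bit in one copy

recovered = majority_vote(*copies)
assert recovered == word
print(f"original={word:08b} corrupted={copies[1]:08b} recovered={recovered:08b}")
```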
                                                     

Now, what I'm curious about, Kolton, is you've been in this space for 15 years, from basically the dawn of the cloud era up until the present. And I'm curious, is this getting better? Is it getting worse? Are systems more robust? Because to be clear, the 2000s were not, in my view, a halcyon era of robustness of software. You would play video games. I was at a different age where this is how I would evaluate the quality of software in the world, but video game servers used to go down all the time. You would have the fail whale on Twitter every other day because it couldn't scale up.
                                                     

And so I'm just curious from your perspective, having been so focused on this for a decade and a half, do you feel like we're just generally getting better or worse at reliability and software?

Kolton Andrus:                                        

I think we're getting better. I mean, that's the good news. First of all, everyone's tolerance for failure was much higher. I think society's tolerance for failure was much higher. My favorite analogy is, did you ever have to download AOL over dial-up, or install the CD, and a webpage took a minute to download? No one would wait a minute for a webpage today. But back in the 90s, that was fine. That was normal. So I think in the 2010s, we were moving fast. It was the cloud. People were comfortable, "Hey, this is the new era. We're on the frontier. Things break."
                                                     

I think that's what we're going to see transition over the next five, 10 years is people are going to become less and less tolerant of that. People are going to expect everything to work. They're going to look at it like bridges and freeways and it's really our infrastructure, our digital infrastructure. They're going to expect it to work. And I would say we got a lot of work to do for that to happen.
                                                     

It's still a lot of duct tape and baling wire. It's still a lot of throwing bodies at the problem and having eyes on glass to make sure things go right. I think there's a mindset that if we're building reliable systems, if we're thinking about failure, it has to be the bulletproof-glass version, but can we build a version that enables us to do the 80/20?
                                                     

There's a lot of things we've learned in the last 10 years, a lot of good patterns that have just been built into our software. Things like circuit breakers, things like error handling and backing off and retrying and exponential backoff. Those are all kind of table stakes. Most people know about those. You're not explaining those to your common person. Now we're talking about how we isolate failure domains, how we keep failure from propagating. Those are the lessons we're teaching, and I think those are the right lessons to be teaching the next generation of engineers.
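For anyone newer to these patterns, here is a minimal sketch of two of them in Python (a generic illustration, not Gremlin's code or any particular library's API): retries with capped exponential backoff plus jitter, wrapped in a simple circuit breaker that stops calling a dependency once it keeps failing.

```python
# Minimal sketches of two resilience patterns mentioned above: retries with
# capped exponential backoff plus jitter, and a simple circuit breaker that
# stops calling a dependency once it keeps failing. Generic illustration only.
import random
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        # Closed: let requests through. Open: reject until the cool-off passes,
        # then allow one probe ("half-open") to see if the dependency recovered.
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()


def call_with_retries(fn, breaker, attempts=4, base_delay=0.1, max_delay=2.0):
    for attempt in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open: dependency assumed unhealthy")
        try:
            result = fn()
            breaker.record(success=True)
            return result
        except Exception:
            breaker.record(success=False)
            if attempt == attempts - 1:
                raise
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```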
                                                     

My one concern is, I think due to the decoupling, most of your more junior engineers, they've never run hardware, they've never logged into the box and gone and changed a sys file so it could reboot. Could they log into... could they do it? And the answer is they could, given an opportunity to learn, but are they learning? Is that part of their day-to-day? So I think there's... This is my flag to carry. My mantra is like, this is something that we've got to invest some time in, and it's not the world, but if we put in a little bit of time along the way, we can mitigate a lot of the risks.

Samuel Arbesman:                                      

Do you think it's also to a certain degree that the stakes were lower when there weren't as many things in the cloud, and so therefore people were more tolerant of this kind of thing? And now I'm kind of spinning a conjectural kind of anecdote, but maybe back in the day when few people had cars, it was okay if there were lots and lots of potholes, because it didn't really matter. But then once people started driving widely, we had to think much more about the actual infrastructure. And so is there a sense that now that so much is being built and interconnected on these systems, that the stakes are much higher and we have to be more cognizant of all these kinds of things, and we're sort of out of the adolescence of these technologies?

Kolton Andrus:                                        

I think that's a great thought. I think it makes a lot of sense. I mean, let's just look at the amount of business being done online. You go back to 2000, almost none. You know? You go to today, the vast majority. When the money's there, when it becomes important to all the people involved, the focus shifts there as well. And so yeah, the 2000s, even the early 2010s, still new, still fun. Twitter, who cares if it goes down? It's just people up there posting 140 characters of who cares what. Now it's like, Twitter goes down, oh my gosh, we lost a valuable news source for the world. We're not going to have insight into what's happening in parts of the world. I think that just shows that kind of maturation and transformation occurred.

Danny Crichton:                                        

One of the questions I have for you, I mean you're saying that the engineering discipline is getting better, and I think this is really interesting because it parallels, I think, a lot of the concerns in the cybersecurity community, where it's like, look, we have to train people to think in a security way. The idea of bifurcating an engineer who builds software from a cybersecurity engineer whose job is to clean that software up to be secure doesn't really work. If it's not secure to begin with, it's very hard to secure later, in the same way that if it's not reliable to begin with, it's very hard to patch reliability on top of software later.
                                                     

Do you think that improvement is coming from better tooling, up to and including the programming languages themselves and the libraries that sort of underpin them, the APIs? Or is that something that's more sociological, that engineers are sort of learning, both through coursework or through practice at companies, how to be better at reliability or security, whatever the case may be?

Kolton Andrus:                                        

I think the tooling is a double-edged sword. Great tooling abstracts it away from you so you don't have to understand it. And so if the tooling does what it should, great. You don't need to worry about it. It can let you do more faster, and that's kind of what we are building. We built this chaos engineering tool and it was an expert tool and you had to be smart to use it, and that was unobtainable. That required too much for your average person. So we went and added a bunch of our expertise in it. So the average person can do more with it, but it also means they may not understand the fundamentals happening there or in our case, they've got to learn those fundamentals anyway. "Oh, hey, how does the network traffic behave here? And I need to understand that." So I think the tooling plays a role, but I think it's really culture.
                                                     

I look to professions that have been around quite a bit longer: doctors, lawyers have been around for centuries. There's a lot of history to that craft. There's a lot of things that have been learned. Computer science is, what? 70, 80, 90 years old? We're still pretty young here and it feels like we're young at times. It feels like we're moving fast and breaking things without a lot of rigor, without a lot of structure, and there's a balance there, don't get me wrong. I'm not for imposed structure, but I think that structure should come from within.
                                                     

One of the things I loved about Amazon's engineering culture, as it was explained to me, was every engineer is responsible for writing performant, reliable, efficient code. Hard stop. Doesn't matter what you're doing, those are the pillars that you write your code upon. What I really enjoy about working with other folks that have that mantra is that it's important to them to do it right.
                                                     

It's not, "Oh..." To your point, it's not, "Oh, I'm going to go bolt on reliability. I'm going to get it all done and then I'll think about reliability." It's first principles. While I'm designing the system, I'm thinking about reliability, I'm thinking about security. I'm thinking about how it scales, and maybe not in the junior engineer trap of over-engineering it upfront, but I'm giving some thought to how will this develop over time. That's what I hope comes out of every one of these outages is engineers saying, "Oh, you know what? I could do better. I want to do better. I don't want this to happen."
                                                     

Maybe an unpopular opinion, here's your hot take. There's a lot of hug ops kind of going around feeling bad for the CrowdStrike team. Me personally, I felt bad for everybody that got stranded in an airport. Let's talk about who felt that pain. Yeah, that was a bad day at work, but you went home at the end of the day. These people could not go home at the end of the day. And so there's a "With great power comes great responsibility" quote here, and if you have the ability to take down a third of the world's computers, do better. Sorry.

Danny Crichton:                                        

Let me ask as a follow-up there. There's always been this debate about engineering as a discipline and as a profession. So in many countries, Canada being one of them, civil engineers, structural engineers are certified. They have licenses. You actually have to pass qualifications. You have to pass a licensing exam. In many countries, to use the word engineer actually requires that you hold a license; you can't otherwise describe yourself as an engineer. In the same way that a doctor must pass through quals and exams and USMLEs or whatever the case may be.
                                                     

In the United States, we don't have that for software engineers. If you're a civil engineer, you do get sort of inducted into the profession. If you build a bridge, there are professional responsibilities. I mean you can be charged with fraud. You can suffer serious personal penalties for failures of a bridge or a structure that you design. Yet when you are a software engineer building critical reliable systems, if those fail, there's sort of no conse... I mean you can lose your job, but there's sort of no deeper consequence beyond that.
                                                     

And I'm curious, is that part of the lack of maturation here, that we've been building bridges since Roman times and so it's just a little bit more mature and we have a better sense of what the architecture of the field is, and computer science is just young? Or is it that there's so much youth? I take it as there are so many young people going into the field all the time, changing everything and mixing it up, that bridges are stable in a way that computer code may never be.

Kolton Andrus:                                        

I guess it goes to my point earlier that this is our digital infrastructure, and it is becoming as important as our physical infrastructure. Lives will rely upon it, not just money, not just government, not just communication, but lives. And so if a life is on the line, yes, I think it's worth some additional rigor there to make sure that the right things are being done.
                                                     

You sometimes see this distinction in discussions between programmers and engineers, and those are the words I would use: programmers and engineers. Engineers I think are thinking about more than just writing code. A programmer is writing a program to get it to do something, and an engineer is thinking about how to build and architect a system so that it will maintain, so that it will withstand, so that it is able to fulfill not just its purpose, but its purpose over time. I think my caveat here: I've known some non-college-educated engineers that have taught themselves and are exceptional. And I've known some college-educated programmers that never really leaned in and learned the craft well.

Danny Crichton:                                        

Your mileage may vary.

Kolton Andrus:                                        

Your mileage may, I mean that's... So, you know, do we...
                                                     

So it comes to the second part of your question, should we have a test? Should we have a standard? Should we hold people accountable to it? Well, I think we don't know what good looks like today, so anyone trying to put that together is kind of guessing and debating.
                                                     

I've been looking at a lot of the regulations coming out of the EU and the US toward digital resiliency, and you know, I don't think they're written by engineers that build and operate distributed systems. I've read them and they're not very helpful. I think I could write a one-pager of like, "Here are the 10 things you need to do." I have. "To build a reliable system, here are the things we check, here's how we verify." I think maybe socially, politically, we need more engineers that have real-life experience guiding that kind of regulation or that kind of testing for it to be effective. Otherwise, it becomes bureaucracy, and then I think it slows us down without adding the value we're really looking for.

Danny Crichton:                                        

Sam, I mean you're writing a book on the magic of code, which in some ways, at least from my perspective, is like the polar opposite of this. This is about infrastructure and core systems, and a lot of what you're doing is talking about experimentation, the poetic web. Something that we've talked about, orthogonal to that, quite a bit over the last couple of weeks. But I'm curious, when you think about the profession of code, how much is about that experimental culture versus this being civil infrastructure that runs the world, and can those two cultures intermix within the same field?

Samuel Arbesman:                                      

That's a really good question and the way I kind of think about it is, I mean each one sort of draws from the other and also they're not as distinctive, I think as we often think. So it's not that, okay, here's all the people having all the fun, and then here's all the people building all the boring things that just make the world work.
                                                     

First of all, the actual process of building software itself, as an engineering endeavor, still requires understanding all the kind of fun and exciting and wondrous properties of software. But separate from that, they're the ones building the infrastructure that allows the people to build all the weird things that might have lower stakes or kind of be strange. And in turn, sometimes the kind of weird playful approaches can actually be used to better understand those engineered systems. And so for example, oftentimes people who are at the intersection of art and computing, they actually are really good at playing with some of these interesting systems very early on and helping figure out what are the boundaries of these systems, which can then be incorporated into some of these kind of engineering endeavors.
                                                     

And so I think there are distinctions, but I think there's kind of a nice interplay between all of this. But at the same time though, I agree that when it comes to things that matter and have high stakes, you can't just say, "Oh, we learned something. This is awesome," and kind of move on. You can, but you have to make sure that the learning allows you to prevent that kind of thing from happening again and reducing the chances of failure moving forward. Versus saying like, "Oh, there's a really cool bug in this game. That was cool and it taught me something about how the system worked." That's fun. It needs to be the precondition for actually something meaningful.
                                                     

Which I think is what Kolton was also saying. That like with chaos engineering, these things are necessary but not sufficient. You really need to make sure the community of individuals building these systems really is empowered to actually make the changes and kind of imbibes this idea that finding faults and failures is the first step and not just the last step in learning about how a system fails.

Kolton Andrus:                                        

You nailed it. You get it, Sam. That was the problem with chaos engineering. "Wow, this is fun. Wow, I'm going to go find some bugs. Wow, cool." And the end goal is not finding bugs or causing failure in an interesting way. I do think there's some of that creative aspect you talked about. We've been building and running software for a while and someone came along and said, "Let's break things in a new way." And I think the idea of breaking systems isn't new. We were doing that in various ways of testing, but it put a new spin on it. There was some new creativity brought in there that we brought back that made the core engineering better as a result. So I think it is good to have that type of experimentation.
                                                     

The problem is, and there's a little bit of a marketing problem here, chaos engineering, everyone goes, "Well, I don't want chaos. I got a bunch of chaos. You're selling me chaos. I don't need chaos, bro. I need reliability. That's what I need." In the name and in the focus, people are perhaps thinking about the wrong thing, and the end goal is boring, reliable systems that are well tested, that are well understood, where we understand the sharp edges, where we get alerted if we're concerned. We've taken care of the 80/20.
                                                     

It's not that we fully exhausted that 20%, we found everything that could happen. It's that we go through and we do this boring set of things. There's like, hey, there's 10 things everyone should do when you build a software app, do those 10 and then once your homework's done, then have dessert and go out and play and go find something fun on top of that.

Danny Crichton:                                        

So one of the questions I have for you, I mean obviously we're talking about software and we're talking about reliable systems, but this idea of chaos engineering, of going in, fiddling with the systems, pressuring them in different ways. So if there's 16 subsystems, what happens if a subsystem breaks or just disconnects, or the internet goes offline, or the power goes off? What happens next, and does the system break or is it resilient, does it heal, et cetera? How much does this only relate within the engineering world, or can it extend into the social world? I'm thinking like business, society, politics, other arenas where this idea of building more reliable systems could be applied outside of the code box.

Kolton Andrus:                                        

Yeah, you and I have chatted about this in the past. I think this concept applies well. First of all, the idea of chaos engineering at its heart is we want to go run an experiment, maybe a risky experiment that we think is going to teach us something new or help us understand the failure condition.
                                                     

I think one quick aside, a lot of engineers look at the system in the happy case, and maybe we can apply this to society. We look at when things are running well and we're like, "Oh, everything's great. That's how we want it to be." So we set our alerts and we set our expectations around that. But really, when things are wrong, it calibrates us as a society on how much pain we're willing to tolerate and when we need to step up and get involved or take action or make change. And maybe that's the operational analogy.
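In code terms, the experiment Kolton describes usually has a simple shape. Here is a generic outline in Python (just the shape of the idea, not Gremlin's API; the health check and fault injection below are placeholders): verify steady state, inject a fault, check whether the hypothesis held, and always clean up.

```python
# The general shape of a chaos experiment: confirm steady state, inject a
# fault, check whether the system still meets its promise, and always clean
# up. The probes below are placeholders, not any real tool's API.

def steady_state_ok() -> bool:
    # Replace with a real SLO probe (e.g., error rate or p99 latency check).
    return True

def inject_fault() -> None:
    print("injecting fault (e.g., 300ms of latency on calls to a dependency)")

def remove_fault() -> None:
    print("removing fault, restoring normal operation")

def run_experiment() -> bool:
    if not steady_state_ok():
        print("aborting: system was unhealthy before the experiment started")
        return False
    inject_fault()
    try:
        survived = steady_state_ok()   # did the hypothesis hold under failure?
    finally:
        remove_fault()                 # clean up even if the check blows up
    print("hypothesis held" if survived else "found a weakness to go fix")
    return survived

if __name__ == "__main__":
    run_experiment()
```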

Danny Crichton:                                        

Well, it used to be that states were the laboratories of democracy, and in some ways they're like canary deployments. You try a policy in one place, you see if it works and functions or not. If it does, you spread it to more. If it doesn't, you pull it back. CrowdStrike now says that it will use canary deployments for its patches, which, as someone pointed out, is sort of absurd because that is one of your best practices, and when you run so many computers around the world, it's sort of exceptional.
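For what a canary deployment looks like mechanically, here is a back-of-the-envelope sketch in Python (purely illustrative; the deploy, rollback, and error-rate hooks are hypothetical stand-ins for whatever a real fleet provides): push to a small slice of hosts, compare a health signal against the baseline, and widen or roll back accordingly.

```python
# An illustrative canary rollout: push a change to a small slice of hosts,
# compare a health signal against the pre-rollout baseline, and only widen
# the rollout if the canary looks healthy. The deploy/rollback/error_rate
# hooks are hypothetical stand-ins for whatever a real fleet provides.
from typing import Callable, Sequence


def canary_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    error_rate: Callable[[Sequence[str]], float],
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_regression: float = 0.005,
) -> bool:
    baseline = error_rate(hosts)
    deployed = 0
    for fraction in stages:
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[deployed:target]:
            deploy(host)
        deployed = target
        if error_rate(hosts[:deployed]) > baseline + max_regression:
            # The canary regressed: roll back every host touched so far.
            for host in hosts[:deployed]:
                rollback(host)
            return False
    return True
```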
                                                     

Let me change it to a final subject. So obviously when we think software and computer code, you're thinking binary, zeros and ones, very definite, very concrete. You write code, it does exactly what you tell it to do. But with the rise of artificial intelligence and AI-influenced development, prompt engineering, all these new categories we never even had phrases for before two years ago, all of a sudden we kind of are going from this world of, I write code, it does exactly what I think, to, I have a prompt that goes into an AI box. It's a black box. I don't really know what it's going to do. It comes back to me with variation. Sometimes it works, maybe 99.99% [inaudible 00:29:20] four nines of reliability, and then other times it gives me complete junk. How does this sort of rise of AI, when you're thinking about chaos engineering, reliable systems, distributed systems, affect some of the work that you're doing with either Gremlin or some of the theory you've been building on the intellectual side?

Kolton Andrus:                                        

Yeah. So I think first off, there's business for us to be had there, because if you build a fully AI-generated system, I'm going to recommend you heavily failure-test it to make sure you understand all the edges and it works correctly. And really, actually, this is where you need kind of a black-box testing approach. We can't ask AI to write the unit tests and the code and then believe that they're all good without review. You can't just trust without verifying there. So as you said, software does what we tell it to do. Sometimes what we don't intend, but what we tell it to do.
                                                     

I think the problem that we have with AI software is we all need to be excellent product managers if we're going to start using AI. If you start looking into these prompts, you start looking at what's generated, we see a lot of these real snippety one- or two-line examples, but were you to have it generate a project, you would essentially build a product specification that you would feed into the AI, and how good of a result you get is going to depend on how specific you are in your specification.

Samuel Arbesman:                                      

When it comes to incorporating AI, especially generative AI, into these larger systems, I'm reminded, so a friend of mine, Rohit Krishnan, he refers to these AI systems as fuzzy processors, and I feel like there's the deterministic processor that does exactly the same thing every single time, and then we have these fuzzy processors. To immediately say, okay, because this technology is new, we're going to replace all regular processors with fuzzy processors, is just a misunderstanding of what they're for. Versus saying, okay, this new technology might be good for a certain specific subset of use cases, and let's figure out how they can be useful there.
                                                     

But by and large, yeah, fuzzy processors, they're not going to be good for all the things that we've been using the traditional computing technology for. And so figuring out that balance, it's going to take a while, it's going to take some wisdom, but it's not going to be one of these things of like, oh, this is new. It's now going to replace everything. Therein lies the path of folly and hubris.

Kolton Andrus:                                         Yeah, the fuzzy bit is key there. The non-determinism, "Hey, do you want fuzzy math done on your finances?" Cause I do not. I mean, you might. Maybe it comes up in your favor, but maybe not.

Danny Crichton:                                        Well, talking about hubris, I mean, that brings us back to humility and the idea of, if you're going to code these systems as they get integrated into more and more critical settings, I mean at this point, healthcare, hospitals. I mean, once CrowdStrike went out, emergency rooms were shut down in many parts of the country. That's the scale that we're talking about. And I think as a computer programmer myself, one of the things as you're coding is you just don't... Even if you may vaguely have a sense of, an ER runs my software, I feel like there's no way, typing into a keyboard, that you have that visceral sense of, there's an operating room where this computer is located and the code that I'm writing here has a direct effect on what happens in that room. Those layers of abstraction so divorce you from what's going on in the real world.
                                                      To me, that's the hardest part of coding. And we have entire professions, like product managers, who are meant to bridge from, here's a user in the real world and here's how they actually use your software, to try to tell you how that influences your software. To me, that remains core in a way that, to go to the earlier part of our conversation around civil engineering, you build a bridge, you actually see the bridge. It's going to get built, and you're going to feel it viscerally in that way, that tactility. You just don't get that kind of tactile feel in software engineering. But I think with that, Kolton, thank you so much for joining us.

Kolton Andrus:                                         My pleasure. Thank you very much for having me. Wonderful conversation.

Danny Crichton:                                        And Sam, of course, joining again as well.

Samuel Arbesman:                                       Thank you. This was great.
