E37: Resilience engineering with Lorin Hochstein


Welcome to Oddly Influenced, a podcast about how people have applied ideas from *outside* software *to* software. Episode 37 – Resilience engineering with Lorin Hochstein

≤ transition music ≥

I like to think of myself as living a life balanced between the nitty-gritty impatience of industry and the deep thinking of academia, and it seems Lorin Hochstein does too. He’s a PhD who gave up a tenure-track professorship for industry, including the very nitty-gritty of being a senior software engineer at Netflix. It was there that he became interested in resilience engineering, and it seems to me he does a good job of applying the insights of an academic(ish) field to the realities of production software.

Our specific topic here is how to handle a complex system falling down hard and – especially – how to then prepare for the next incident. Because there’s always a next incident.

I really enjoyed this interview. It maybe stretches the theme of the podcast – resilience engineering does draw from aviation and emergency medicine, but the connections aren't as *weird* as is typical here. I don't care, though. It's interesting, and that's a good enough excuse.

≤ music ≥

Brian Marick:
Today we have Lorin Hochstein and we're gonna talk about resilience or resilience engineering. And we're gonna start from a paper called “The Theory of Graceful Extensibility: Basic Rules that Govern Adaptive Systems,” by David D. Woods, from 2018. Where we end up, who knows. So hi, Lorin.

Lorin Hochstein:
Hi, Brian.

Brian Marick:
I'd like to start with something I do, which is to summarize what I think I understand, and thus give you an opportunity to tell me what I've got wrong. I think the topic of resilience deals with a particular kind of problem. And since a lot of the examples in resilience engineering seem to come from emergency rooms, and since my wife used to be a clinician at the U of I vet school and did emergency duty, I wanna start by saying what I think resilience is *not about*. When she got emergency cases, there would be... crises, events. In the jargon, what they say is an animal is “trying to die.” And so they have to stop the animal from dying. Or on one memorable occasion, I was there and a large bovine got loose in the ward and was running around this sort of circular area. So she said, “You! Climb up on that half wall and get out of the way.” And they then managed to corral the cow and get it into a stall. All of those things are crises in some sense – they're events – but they are unexceptional. They're things that they know how to deal with. They're well within their normal competence.

A second example: one day, one of those giant tractor-trailer semi trucks with a very large number of cattle in it overturned on the highway just north of town. So they had to go and deal with it. Some of the animals had to be euthanized. They basically had to go out there and do triage. And that's not the normal thing they did. It only happened once in thirty-some-odd years. I imagine that was a crisis of a somewhat different nature.

But it seems to me that resilience... the thing that this paper seems to be most about, is not a single event, but a series of events: one after the other, after the other, after the other, that stresses the system to the point where at some point it just breaks suddenly. Is that the problem domain?

Lorin Hochstein:
Yeah, sort of. Resilience deals with cases the system's not normally designed to handle – cases we didn't really think about in advance. It's sort of at the edge of what the system is able to really deal with. So when we design our systems, we design them to handle certain cases. And when you hit a situation, a crisis, an incident that stretches you just to the boundary, just to the edge of what you can normally handle, just past that – that's what resilience is about. So crisis is a good term because crises are exceptional.

Brian Marick:
Mm-hmm.

Lorin Hochstein:
So resilience deals with cases that [systems] weren't explicitly designed to handle. This particular paper is another meta level above that. Okay, a system can handle [a crisis] once, but how do systems keep being able to do this over time? So you may at one point be able to handle an exceptional case, and maybe you then over-adapt to that case, and then the next one might break you. But some systems seem to be able to keep adapting. So resilience, part of it is about how do systems adapt to deal with situations they weren't explicitly designed to handle. And this particular paper is about sustained adaptability. How do they over time keep being able to adapt, keep being able to stretch every time they hit exceptional situations? Does that make sense?

Brian Marick:
Yeah. In the paper, he describes three things that he calls failure modes, and I'm not sure I've got them right. The first one is called decompensation. What does that mean?

Lorin Hochstein:
To understand decompensation, let's talk about compensation. Compensation happens when you have an additional stress on the system and then you do something to compensate. So in the software world… I've dealt a lot with cloud systems. You can imagine a software service that's getting an increase in load. So it starts to scale up, right? I don't know how familiar you are with auto-scaling in the cloud, but basically you can request more compute resources. That's a form of compensation: you've had some additional load on you, and you are compensating to deal with that. Decompensation is when you hit that limit and then you can no longer effectively compensate. So you can imagine an auto-scaling system that hits the max [but] it keeps getting more and more load, and it can't scale up any more. So what does it do? How does it deal with that situation? And some of them decompensate nicely, gracefully: like shed load. And some of them sort of fall over.

But once you’ve hit that limit, does the system bend? Or does it break? Decompensation is: it breaks. It hits that limit, and it stops servicing all requests, say.
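[To make the bend-or-break distinction concrete, here's a minimal sketch in Python. The names and numbers are hypothetical – this is not Netflix's code or any real autoscaler's API.]

```python
# Sketch: compensation, graceful degradation, and decompensation at a limit.
MAX_INSTANCES = 100          # auto-scaling ceiling
CAPACITY_PER_INSTANCE = 50   # concurrent requests each instance can serve

def handle_load(load: int, instances: int, sheds_load: bool) -> str:
    # Compensation: scale up while there's still headroom.
    while load > instances * CAPACITY_PER_INSTANCE and instances < MAX_INSTANCES:
        instances += 1

    capacity = instances * CAPACITY_PER_INSTANCE
    if load <= capacity:
        return f"ok: {load} requests on {instances} instances"

    # At the limit: does the system bend, or break?
    if sheds_load:
        # Graceful: reject the excess, keep serving what you can.
        return f"degraded: serving {capacity}, shedding {load - capacity}"
    # Decompensation: queues grow without bound, latency explodes,
    # and effectively nothing gets served.
    return "decompensated: overloaded, servicing no requests"

if __name__ == "__main__":
    print(handle_load(load=8000, instances=10, sheds_load=True))
    print(handle_load(load=8000, instances=10, sheds_load=False))
```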

Brian Marick:
Okay, and the second failure... Actually, before we get to that, his notion is that there is a tangled, layered system of units [but I forget their acronym]. Do you remember?

Lorin Hochstein:
UABs, units of adaptive behavior.

Brian Marick:
So you have these units. As I understand it, units in the case of, say, an emergency room might be physicians…

Lorin Hochstein:
Yep.

Brian Marick:
… nurses, interns, technicians. In the case of the overturned trailer truck, physicians and interns and such, plus also police – my wife gets kind of salty about how helpful they were *not*. Those particular units – UABs – didn't decompensate; they just stood around. A second kind of failure mode is “working at cross purposes: locally adaptive, globally maladaptive.” Can you expand on that?

Lorin Hochstein:
Sure. So here think about units as being, let's say, teams in a company. Individuals are a good example of [UABs], but you can also think of groups or organizations. Say you and I are on different teams in a software company, and we're each trying to locally optimize for something, but it's not globally optimal for the company to do things that way. This is sort of a trivial [example], but you wanna buy some piece of software, some tool that will make you more productive, that will make the company more productive. But someone else is optimizing for keeping their budget at a certain level and they say, no, I'm not gonna approve that. Think of penny-wise, pound-foolish as an example of [working at cross purposes]. We're all, each individual, trying to optimize for whatever goals we have. But really, the point of working in an organization is not to optimize your local thing, it's for the whole company to succeed.

Brian Marick:
An example comes quickly to mind, because I'm working on a script about the history of Context-Driven Testing. In the bad old days, people used to ship software – [like] a new version of Excel – once a year or so. Toward the end of a project, the development side of the house was very intent on hitting the deadline. The testing side of the house was intent on finding bugs. And that led to various kinds of maladaptive behavior, like the tendency of people to close bugs as “works on my machine”, “works as designed,” [and] so on. And in some cases, I can definitely see that sort of spiraling: as the developers were meaner to the testers, the testers were meaner to the developers, and so on and so forth.

Lorin Hochstein:
The whole DevOps movement is about how devs want to get features deployed to production and ops does not want to break production, right? And so you saw people working at cross purposes. And I think that sort of spawned that movement.

Brian Marick:
OK, and the third kind of failure, this is the one I'm not sure of, but my note says “models of adaptive capacity get stuck.”

Lorin Hochstein:
If you have ever been in an incident – there's some software service and it's down, it's not working properly – you have some theory about why it's broken. Oh, it's broken because service X broke, because I remember service X broke six months ago. I'm sure it's service X. And you get stuck, and you don't realize that no, actually your theory about what's broken is wrong. But you keep investigating that one line of inquiry instead of looking more broadly. So this is often called “fixation”, where you're convinced that the problem is in one area.

Brian Marick:
Hmm. Yeah.

Lorin Hochstein:
In the pandemic, there was this theory early on [that COVID was] transmitted through fomites – contaminated surfaces – rather than through the air. And people sort of got fixated on that being the transmission mechanism. That is a very common failure mode. People get fixated on one avenue and don't investigate more broadly about what the possible problem could be.

Brian Marick:
So the focus of this paper is on theories, I guess, of how an organization can both not decompensate now – not have a failure in the particular situation you're in – but also how it can learn to be the kind of organization that doesn't allow problems to turn into crises.

Lorin Hochstein:
Yeah, or is effective at dealing with crises. I think crises are unavoidable in some sense, but there's a question of how effectively you deal with them when they happen. And so how well positioned is an organization to deal with that crisis once it happens?

Brian Marick:
There was an interesting phrase that I highlighted: the phrase “poised to adapt.” What is he saying there?

Lorin Hochstein:
He’s saying that you have to be ready to change the way you do things when something goes wrong. You need resilience when the way you normally do things isn't working, right? And so you have to be able to improvise and do something different than the way you normally would. You don't know exactly what you're going to have to do differently. And so the ability to improvise a solution, generically, is important.

I remember at one point at Netflix where we ran into trouble with some of the servers, and we had to basically redeploy a very large number of services. There was some data that had gotten corrupted in some data feed that was being consumed. And we didn't have tooling to do that. We could redeploy one service, but the system was not designed to deploy N services, where N is a large number. We got into a room, we started a Google Sheet where we kept track of all the services that need[ed] to be redeployed. And then we farmed out [the work]. People had to contact the individual service owners to do that.

Being able to come up with a quick solution to solve that particular problem [was] an example of being ready to adapt. So we had the war room that we could call people into, and we had incident responders and stuff. So we were ready to bring people together to solve a problem even if we didn't know exactly how we [were] going to solve it.
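[A sketch of the kind of throwaway coordination tooling being described – entirely hypothetical: the `redeploy` call stands in for whatever deploy tooling exists, and in the real incident the tracking lived in a shared Google Sheet rather than a CSV file.]

```python
# Sketch only: coordinating the redeploy of N services from a shared worklist.
# redeploy() is a hypothetical stand-in; nothing here is Netflix's actual tooling.
import csv

def redeploy(service: str) -> bool:
    """Hypothetical: kick off a redeploy of one service; return success."""
    print(f"redeploying {service} ...")
    return True

def work_through(worklist_path: str) -> None:
    # worklist.csv columns: service, owner, status  (status: todo | done | failed)
    with open(worklist_path, newline="") as f:
        rows = list(csv.DictReader(f))

    for row in rows:
        if row["status"] != "todo":
            continue  # another responder already took this one
        row["status"] = "done" if redeploy(row["service"]) else "failed"

    with open(worklist_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["service", "owner", "status"])
        writer.writeheader()
        writer.writerows(rows)
```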

Brian Marick:
So you both had the infrastructure that allowed you to do that – there was a dedicated war room for such things – but you also had people who were capable of doing that. It just occurred to me: one of the things my wife said when I was talking to her earlier [about the overturned semi-trailer] was that they called up a bunch of farmers and had them bring their little cattle trailers out. Now I think of it, it was kind of like Dunkirk with all the little ships going and rescuing the soldiers. You had all these farmers coming and rescuing the cattle.

Lorin Hochstein:
Yeah, that's a great example. So when you hit these problems, often you need additional resources. Ideally, you've previously invested in them, and you have them ready to deploy. But in some cases you don't have that, and so you have to figure out how are you going to marshal resources to do this. So [yours is] an example of coming in and marshaling resources that weren't explicitly [designated] for this, but you had access to people who could bring them in. Being able to bring in resources dynamically like that to help solve a problem is a big part of resilience.

Brian Marick:
How do people get good at this?

Lorin Hochstein:
That's a great question. I would say – as a practitioner – we think people get better at it by reflecting on the crises that have happened and how we dealt with those. There's a movement, I guess you could say, called “learning from incidents” that is big on this. People who… we don't know exactly what they're going to need to be able to do, but they [have to] understand the system well enough that they can improvise effectively. Learning from incidents is a great way to do that.

Brian Marick:
Yeah, I do occasionally read the various case studies that people put up, which I think is amazingly public-spirited for people to say “Here is this horrible thing that happened to us and…” Actually, the aviation industry is pretty famous for that, are they not?

Lorin Hochstein:
They are, yeah. They're famous for detailed incident reports. And they're also very famous for reporting of near misses. There's a system where people can anonymously report anomalous things that they have seen. They gather a lot of information, not just about big things that have happened or big near misses, but even minor things. There's a lot of signals that they are collecting, qualitative signals. And they try to learn from those.

Brian Marick:
It probably shouldn't need to be said, but probably does need to be said: I think that the NTSB, the National [Transportation] Safety Board, is famously *not* about assigning blame. And… You nodded, so I take it that is correct.

Lorin Hochstein:
Yeah, I believe that's the case. In transportation, the incidents are public, right? And so they can produce reports. One of the challenges in our field is that you're not compelled to do a public writeup. Some companies will do that for their customers. Some of them, as we know, will do it publicly. But not everyone does. My former employer Netflix does not do public incident write-ups. I think you see it more in companies where their customers are software engineers. I think it's like a confidence-building thing.

Brian Marick:
Mm.

Lorin Hochstein:
Whereas the ones that are consumer facing generally tend not to do it as much.

Brian Marick:
I imagine the “hallway track” at conferences for people like you has a lot of interesting stories.

Lorin Hochstein:
Oh yeah, yeah. And there was recently the first Learning from Incidents conference. [It] happened back in February, and they had an off-the-record track. So they tried to actually bring [the hallway track] into a more traditional session where people could talk about it. The LFI folks took that from the security folks, who have done that for a long time. They've had off-the-record [tracks]. But yes, the hallway track is always where you hear a lot of the interesting stuff. People love to talk about stories, right?

Brian Marick:
So we probably cannot get you to tell inside stories of Netflix here for broadcast.

Lorin Hochstein:
Ha ha. I can say *some* things. There's some things I can't say, but I was on the central incident management team there, called Core, for a little over a year. And so I saw a few interesting ones.

Here's an example of getting stuck in a stale model. There was an incident where I was incident commander. One of the symptoms was an increase in TCP retransmits. Some people were convinced that the problem was a networking problem, and they were “let's bring in AWS.” And it was not, actually. The problem was not networking, but the symptom was you saw a networking metric go up. There was actually even a big argument at one point about that in the war room. I remember that very vividly. And that's a good example of fixation: you see a signal and you say, okay, [the] problem is *there*.

It turned out that CPU usage was really high, and it was starving the networking stack. And so that's why it was losing some things and retransmitting. Netflix does tracing of requests through our system. And because there's a lot of traffic, we sample the traces. So only 0.1% of requests are actually traced. There had been a bug and it had accidentally increased to like 100%. So…

Brian Marick:
Haha.

Lorin Hochstein:
… every request was being traced. And it was very difficult to figure out what the heck was going on, because all of a sudden CPU usage was rising across all different services and we had no idea why. That was a very interesting and challenging incident.
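[A sketch of what head-based trace sampling looks like, and how one misconfigured number produces exactly this incident. Hypothetical code, not Netflix's tracing system.]

```python
# Sketch of head-based trace sampling (hypothetical; not Netflix's code).
import random

SAMPLE_RATE = 0.001   # intended: trace 0.1% of requests
# SAMPLE_RATE = 1.0   # the bug: a bad value turns this into "trace everything",
#                     # and the tracing overhead lands on every request, everywhere

def record_trace(request_id: int) -> None:
    """Stand-in for the real work of capturing and shipping a trace (CPU, serialization, I/O)."""
    pass

def handle_request(request_id: int) -> None:
    if random.random() < SAMPLE_RATE:
        record_trace(request_id)
    # ... normal request handling ...

if __name__ == "__main__":
    for i in range(10_000):
        handle_request(i)
```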

Brian Marick:
I actually don't know much about this, but physicians famously do differential [diagnoses] where they list a set of “here are the various possible causes in some rough order,” and then they proceed to try – in a fairly systematic way – to find out which of the differentials is the most likely. Do [you have] – not rote, but methodical – procedures for doing this, based on past experience?

Lorin Hochstein:
I would say we don't have a process like that. One thing that we end up doing, though, is building dashboards that plot different signals. And those dashboards are updated over time based on previous things that have happened. Oh, in this incident, it was a, I don't know, memory leak or whatever. We should watch memory usage more closely. And so the dashboard structures are sometimes an artifact of history.

And that's a challenging thing. One of the challenges is the “fighting the last war” problem where you over-index on the thing that happened most recently. And so the better dashboards are actually more generic… they help you narrow down where the problem is. Whereas ones that are less well-designed, they're sort of a history of previous incidents. “Is it this one?” No. “Is it this signal?” No.

But I would say that generally tends to be where the knowledge is encoded: in the dashboards. I have not seen people use a process to figure out what's going on right now. I know Brendan Gregg, who is a performance engineer who used to be at Netflix – I think he's at Intel now – has documented a process called USE: [Utilization], Saturation, and Errors for troubleshooting. But that's not so much used. I would say during an incident, it's tricky because you have to like stabilize the patient as well as… You're not necessarily trying to figure out what exactly is [the] problem. You wanna get the system healthy, right? And then you can figure out exactly what's going on. That's a part of resilience engineering too. You're doing these like diagnostic interventions, but also you're doing therapeutic interventions. Even if I don't know what's going on, I need to make sure [the system] is still up and servicing customers.
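[Brendan Gregg's USE method is publicly documented; here's a sketch of walking it as a checklist. The `get_metric` function is a hypothetical stand-in for whatever monitoring system you actually query.]

```python
# Sketch of the USE method: for every resource, check Utilization,
# Saturation, and Errors. get_metric() is a hypothetical stand-in.

RESOURCES = ["cpu", "memory", "network", "disk"]

def get_metric(resource: str, dimension: str) -> float:
    """Hypothetical: return the current value of a metric from your monitoring system."""
    raise NotImplementedError("wire this up to real monitoring")

def use_checklist() -> list[str]:
    findings = []
    for resource in RESOURCES:
        utilization = get_metric(resource, "utilization")  # fraction of time busy
        saturation = get_metric(resource, "saturation")    # queued / waiting work
        errors = get_metric(resource, "errors")            # error event count
        if utilization > 0.9:
            findings.append(f"{resource}: high utilization ({utilization:.0%})")
        if saturation > 0:
            findings.append(f"{resource}: saturated (backlog {saturation})")
        if errors > 0:
            findings.append(f"{resource}: {errors:.0f} errors")
    return findings
```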

Brian Marick:
Yeah, it occurs to me: when I was first courting my wife, 35 years ago, I thought “how nice it is that if something goes wrong with my program, I can just restart it.” But you can't reboot the cow and put it into a known good state and run it forward. And that's very true of your field, right? These systems don't get restarted.

Lorin Hochstein:
No, right? We cannot cold-start the system. You cannot turn these large systems off and back on again. They're much more like organic living systems.

Brian Marick:
One other thing about the NTSB. I'm less sure of this, but they don't look for root causes per se, *the* cause of an accident, because there's never *the* cause. There is a collection of unlikely events. Is that also considered the case in resilience engineering?

Lorin Hochstein:
Yeah, that's definitely the case in resilience engineering. My Twitter handle is noRootCause because of that. It's deliberately chosen, because the perspective in resilience engineering is that it is always a collection of contributing factors. People argue about this a lot. I think it is really like a perspective, like a lens. You can always point to something and call it a root cause if you want. But you will learn more about the system, you will get better, if you recognize it as a set of contributing factors where no one of them is *the* cause. And the reason is: the goal of looking at these incidents is to get better in the future, right? And you don't know in advance which of these factors it will help to have learned about. It might be that factor A was not the root cause in this case, but it's going to be a problem next time.

One example that's big in resilience is this idea of production pressure. No one ever has enough time to do the work at the level of quality that they want to. We're always squeezed for time, because we've got to get stuff out. The company has to stay viable. You're never going to call production pressure *the* cause, but it's endemic. It's everywhere. If you just say, “Well, he screwed up here” – you know, he had a limited amount of time. If you have less time to do tasks, you make more errors. So you're never going to see production pressure if you just identify a single cause. [Production pressure] will never be *the* cause; it'll always be, “well, the person made an error here.” But if you understand the role that production pressure plays in the way people make decisions, you will get a better understanding of how incidents happen.

Brian Marick:
One of the things that the author – I've forgotten his name – emphasizes is that the pressure for optimality or efficiency pushes against what he calls graceful extensibility. So first, what is graceful extensibility?

Lorin Hochstein:
“Extensibility” actually comes from software. I mean, we build our software to be extensible, right? We can modify it over time. And old code can call new code, for example. So we want our systems to be extensible over time. And the nomenclature comes from “graceful degradation.” Graceful degradation is: when you hit your limit, you don't just fall over. For example, I don't know if you've ever used a streaming service, but it remembers where you were when you come back to a video you watched halfway through. [In Netflix], there's a bookmark service that keeps track of that. If that service fails, it shouldn't prevent you from watching your show. Graceful degradation is: it just starts at the beginning. The quality of the experience is degraded, but you don't just get an error thrown in your face.

So the idea of graceful extensibility is when a system hits the limit of what it's designed to do, it is able to extend that limit a little bit. It's able to do a little bit more than it was actually designed to handle in order to deal with the current crisis. Woods talks about extending the competence envelope, being able to handle a little more. Imagine that your system is at maximum load. You can't scale up anymore. So you make some change that you weren't really designed to handle so that you reduce the CPU usage or whatever, so you can handle a little more load. That's an example of graceful extensibility: the system's ability to change itself when it hits a limit to be able to deal with that limit.
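[The bookmark example as code – a minimal sketch with hypothetical service names, not Netflix's actual architecture: graceful degradation means a failed dependency downgrades the experience instead of failing the whole request.]

```python
# Sketch of graceful degradation (hypothetical names, not Netflix's services).
class BookmarkServiceError(Exception):
    pass

def fetch_bookmark(user_id: str, title_id: str) -> int:
    """Hypothetical call to a bookmark service; returns seconds already watched."""
    raise BookmarkServiceError("bookmark service unavailable")

def start_playback(user_id: str, title_id: str) -> int:
    try:
        resume_at = fetch_bookmark(user_id, title_id)
    except BookmarkServiceError:
        # Degraded, not broken: you lose your place, but the show still plays.
        resume_at = 0
    return resume_at

if __name__ == "__main__":
    print(f"resuming playback at {start_playback('user-1', 'title-42')}s")
```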

Brian Marick:
When you're saying system, you're talking about a socio-technical system in which the people are always? usually? the actors causing the extension, possibly using things on the technical side to give them levers they can push on to cause a little bit of extension.

Lorin Hochstein:
Yeah, that's exactly right. And that's the idea of the Units of Adaptive Behavior. They're able to do things they weren't explicitly designed to. People are very flexible, right? And so people can go in and change the way they work. They can change the system. We don't know how to build software today that can adapt that way. Maybe in the future, the software could be adaptable. But today, the units of adaptive behavior are invariably human-based because we're the only ones that can really adapt at that level.

Brian Marick:
Oh, but Large Language Models will be doing it for us any day now.

Brian Marick:
So let's see what else I have in my notes. I'm just gonna throw out phrases that I wrote down and you say something wise about them, or...

Lorin Hochstein:
Ha ha!

Brian Marick:
… I’ll just edit it out if you don't have anything wise to say. So one of the phrases that struck me was “miscalibration is the norm.”

Lorin Hochstein:
So, we all have a mental model of how the world works, how the system works. And it's always incorrect in some way. One of my favorite examples of contributing factors that I saw at Netflix was: I'm on a team. And then there's another team that deploys on Wednesdays, say, and I know they deploy on Wednesdays, and I need their code to go out before mine. So I'm going to deploy on Thursday, but actually this week, something happened and they were late. They had to deploy on a Friday. I didn't know that: I deployed on Thursday. It broke. So I was miscalibrated: my mental model of when they deploy was wrong in that particular case. Basically, we all have models of how the world works, and our models are never perfect. And they're always incorrect in some dangerous way that we don't know of in advance – until it bites us. That's sort of what “miscalibration” is about.

Brian Marick:
That gets back to “poised to adapt.” I like to use the model of tango because my wife and I used to dance tango. Tango is a dance where you are in balance, and then the leader pushes or pulls the follower off balance, and the follower has to turn that movement into a graceful transition to a new point of balance. And you just keep on doing that over and over again. The problem the follower has is that [at] any given point of balance, the leader will have, let's say, four different possibilities [for] the next push or pull. If the follower tries to anticipate what the next move is going to be, 75% of the time they're wrong. And then they've already committed to a movement – they've, well, over-fixated – so they won't be able to perform the actual move gracefully. So a hard thing for a tango follower to learn is to just let the leader move you. There's a particular look that followers have, called the tango look, which is this look of being poised. A lot of tango training, at least from the follower's point of view, is learning this sense of receptivity. And it's an explicit part of the training. Is there any kind of movement in resilience to try to teach people to have that “poise to adapt”? (That sounds very “woo”.)

Narrator:
While editing this episode, I cringed at my description of tango. I made it sound way too crude, what with all the pushing and pulling. It would be more correct to say that a good leader suggests to the follower how she might put herself off balance. A lot of the *point* of tango is subtle and two-way communication. See the show notes for a better description, taken from David Turner's /A Passion for Tango/.

The point about the need to be poised – the need for the follower to be reactive rather than proactive – still stands.

Lorin Hochstein:
I love that example, tango dancing. I think of two things. One of them is that there's a temptation to try to plan out everything. Which you can't do. So “run books” are a good example of that. “We’re gonna pre-define the problem scenarios. You can go and look them up and follow the instructions when things happen.” And we run into trouble when you have a problem that the run book was not designed to handle. You can't rely on a run book, you need to be able to improvise, you need to be able to handle anything, not [some] particular failure mode.

The other thing – and it's a very big thing in resilience and you can see it in this paper – is that you can't solve these problems alone, so you always have to coordinate with other people. Your understanding of the system is partial, mine is partial, but together we have a better understanding of everything.

Coordination is really hard. It's a very, very difficult problem for human beings, and the more you know your teammates, for example, the better you're going to be able to coordinate with them. You can see it today, especially in the distributed world we have: we don't have a war room anymore. It's all over Slack and video chat, and it's not the same. The bandwidth of communication is lower.

And so being able to coordinate effectively is very important, and it's a hard problem to solve. And all these things, because they're kind of generic, come down to practice and experience. I mean, you get better at these things by doing them more.

Brian Marick:
I should note that the leader has more of an opportunity to plan, but if you plan… “I’m going to do the following moves that are going to end up in this graceful back ocho that [she’ll do].” And as soon as you execute step one of the plan, some clown dances into the spot you wanted to go to. So learning how not to get trapped by plans seems to be kind of a consistent theme of my life and of the software industry over my lifetime.

Lorin Hochstein:
One of the problems is being able to adapt quickly. One of the problems you sometimes see is that people are not empowered to make decisions. They have to run it up the chain, and the world is moving faster than your ability to get approval for things.

Brian Marick:
Mm-hmm.

Lorin Hochstein:
And that's another example of decompensating, where you cannot keep up with events because it’s taking you too long to make changes because you have to get approval from above. But you can't just push all the initiative down and not coordinate because then you have the cross-purposes problem. You have to figure out how to solve both those problems at the same time.

Brian Marick:
I just realized that I didn't finish part two of the “pressure for optimality versus graceful extensibility”. This is kind of the old tradeoff where the more efficient you are, the less poised you are to adapt. How do you get people to understand that when they're setting budgets? And what does it mean to set a budget to be more gracefully extensible?

Lorin Hochstein:
I wish I knew the answer to that. I remember listening to a podcast by a safety guy. He was on a flight, sitting next to someone who worked at one of the shipping companies, like UPS or FedEx. And they actually fly an empty plane around the country that they can deploy if one of their planes becomes unavailable. And that is not efficient, right? That's an example of having additional capacity that you can deploy. And it's expensive.

One thing Netflix does well is they have this core team, this team of engineers, that are basically there on standby during an incident. You could use them to build software [but] they're not doing high priority software work. And that was justified.

One other interesting example from Netflix: one way to justify [slack] is a crisis that freaks the company out. You know: how do you sell insurance? Well, your house burns down and next time you'll buy insurance, right?

Netflix had an enormous outage, in 2012 on Christmas Day, when Amazon had a problem with one of its regions. I think they were out for 24 hours, and that got seared into the DNA of the company, and they invested enormously in being able to handle a regional Amazon failure. So now Netflix runs out of multiple geographical regions. They can do failover. So if there's a problem in one geographic region, they can actually move traffic to other ones. They test this out on a regular basis. Last time I checked it was like every two weeks or something like that. Um, it's really expensive. They have a lot of extra capacity. But it was justified because it was so painful when it happened before.

So the challenge is: how do you convince people to buy the insurance before the tragedy happens? And, you know, I wish I knew the answer to that because I don't. I think you have to have people high up in management who believe in this stuff. And so you have to win over minds… There's this thing in resilience called the “Law of Stretched Systems”, which says that anytime you have additional capacity, it's eventually gonna get eroded over time. It'll get used up for other things. And so there's this constant fight to preserve that capacity. And I think... The best I can think of is to make as many people aware of this as possible. But really what happens is they don't invest in it. And then there's a big outage. The organization freaks out. They invest in it more. And then there's a cycle. The cycle kind of repeats.

Brian Marick:
Well, telling stories like that helps. You mentioned falling behind [when] communicating up the chain. There are a couple of links in the paper about a financial firm that deployed the wrong software and couldn't get approval to stop trading for some ridiculously small amount of time – it was like 10 minutes, I think. But in that 10 minutes they lost $500 million. That tends to catch people's attention, I guess.

Brian Marick:
Let's see. What else do we have? We have “boundaries discovered via surprise.”

Lorin Hochstein:
“Boundaries” is a popular metaphor in this world. The original boundary metaphor I think comes from Jens Rasmussen, a safety researcher, pretty active in the 70s and 80s. And he had this model – it's become very popular – where the system is a point in a phase space. And there are three boundaries. One of the boundaries is the economic boundary, where basically if the system gets too close to that boundary, the company's not making enough money and it's going to fail. And so there's a push for efficiency away from that boundary.

The other one is the workload boundary where the closer you get to that boundary, the more people have to work. And people don't like to work too much. So there's pressure away from that boundary for people to not have excessive workload. This is the efficiency/thoroughness tradeoff, where you have more and more work, so you don't spend as much time on each individual thing.

And then the third boundary is the safety boundary. When you cross the safety boundary, a bad thing happens in the system; it's unsafe. But the [safety] boundary is invisible. You don't actually know how close you are to that boundary. And so the idea is you test it out and see how much more efficient we can get and [still be] safe. The explicit compute-resource version [would be]: how hot should you run your CPU? How big a spike in load should you expect? Things like that. You don't know what's gonna tip you over. And so you don't know exactly where to operate. So you sort of test it out.

Brian Marick:
I was just wondering... There does seem to be a lot of emphasis on things falling over and dying, whether immediately or after a period of degradation. Do people worry about – and should people worry about – gradual degradation that doesn't lead to collapse? Like: maternal death rates in the US have doubled from something like 1991 to 2011 or something like that. There's no spike. It's just a gradual, “oh, things are getting worse." Do people talk about that?

Narrator:
Let me interject here to correct what I said. It was from 1999 to 2019 that maternal mortality more than doubled in the US. No ethnic or racial group was immune. I link to the Scientific American article in the show notes.

Lorin Hochstein:
Resilience is mostly about acute problems. That's more of a chronic thing. The concern is generally where there's a chronic degradation that you don't notice until it's too late, and then it falls over. So like a memory leak is an example. Your headroom is getting smaller and smaller and smaller, but you don't necessarily notice. But then eventually it becomes a problem. So there’s an interest in that because the concern is always, well, eventually this chronic problem becomes acute. So there's interest in how do you detect when you're slipping like that?

“It’s not affecting the company”, so how much time should you spend looking at that, right? But then eventually it becomes a big problem, and then it’s too late.

There's always this tension… One thing that I found really useful when I was on the incident management team is they would look at anomalies. Okay, there was an error spike and it went away, system's fine, right? Well, what happened there?

When they would look at that, they would get practice with the tooling and they would understand, “Oh, that was because someone did a deploy, that server went down, and there was a retry.” And you actually learn a lot. So they call those “blipper doodles” when there was like a little anomaly thing: system's actually okay, but we don't understand why that happened. Do you investigate?

I tried to get people to look at those more. It's always a trade-off because there’s always a more economically valuable thing to do, but that's a way of actually building experience, expertise with the system: when you can diagnose those things.

The idea is to not just write it off as “fine”: I'm going to spend at least some amount of time looking into that. Maybe you'll want to time-box it, because sometimes you won't know, but I'm going to commit to investing some amount of time to looking at these things. I found it very valuable, and I think organizations that do that benefit a lot, but it's once again hard to justify.

Brian Marick:
Let's do one more: capacity for maneuver.

Lorin Hochstein:
Once you hit the crunch, once you've hit the crisis, how much flexibility do you have? What can you actually *do* in that situation? Here's an example. [I was] talking to someone after an incident at Netflix where they had diagnosed the problem by SSHing into the box and bringing up a REPL. They brought up a REPL in the production system – I had no idea you could even do that with that software – and [were] able to inspect the system and figure out [the problem]. You know, there are people who freak out at the idea of being able to run a REPL in production, but it actually gives you a lot of capacity for maneuver. It's a generic ability to go into the system and do things. Historically, Netflix has been good about letting software engineers SSH into the boxes. It gives them the capacity to make changes. The more you're able to actually change the system when you hit a crunch, the better off you are. The humans who are around to do that and have the expertise, plus the resources you've invested in in advance to enable them to do it – [that's] capacity for maneuver.

Brian Marick:
For people who might not be familiar with the jargon, a REPL – a read-eval-print loop – is basically an interactive command line of sorts.
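[One illustration of that kind of capability, as a sketch using only the Python standard library: a long-running process that opens an interactive REPL over its live state when it receives SIGUSR1. The signal choice and the state here are illustrative, not anyone's production setup – and the Netflix case was presumably a different stack, reached over SSH.]

```python
# Sketch: drop into a REPL inside a running process on SIGUSR1.
# Illustrative only. Run it, then from another shell: kill -USR1 <pid>
import code
import os
import signal
import time

state = {"requests_served": 0, "cache_entries": {}}

def open_repl(signum, frame):
    # An operator with access to the host gets an interactive prompt
    # holding references to the live objects.
    code.interact(banner="live REPL (Ctrl-D to resume)", local={"state": state})

signal.signal(signal.SIGUSR1, open_repl)
print(f"pid {os.getpid()}: send SIGUSR1 to open a REPL")

while True:
    state["requests_served"] += 1
    time.sleep(1)
```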

Brian Marick:
Final question. Not everyone works at Netflix. For people who work at smaller shops that nevertheless have systems that have to stay up, and that have some of the properties of a Netflix-type system – how should they think about this? Because they can't do everything Netflix does.

Lorin Hochstein:
One thing is that we're all, regardless of our size in the software world, building distributed systems that fail in weird complex ways. You don't have to be Netflix-sized. If you've deployed something in the cloud, you depend on a whole bunch of different systems that you have no control over. And you're going to see weird complex failures. Large or small, that's going to happen. I think having the learning from incidents perspective and developing expertise in your systems, that is independent of the size of the system. I think it’s worth reflecting on your incidents as they happen and understanding how people actually solved the problems, what was hard, what was confusing.

You may not be able to build multi-region failover. You may not have the money for the compute capacity and hiring extra people. But you can fight for spending the reflection time to deepen your own expertise in the system. And I actually think that will, regardless of the size of the system, make you a better engineer.

A lot of this stuff [has] two parts. There are parts that are extremely specific – skills I developed at Netflix that are going to be effectively useless to me anywhere else. But there are very general ideas about the nature of how humans interact with software and how complex systems fail that are forever things. So I think even at small companies, it's just as important – and probably easier, because you have more influence over the organization, because it's smaller.

Brian Marick:
Would you recommend engineering a Christmas Day big catastrophe that will convince everybody to give you the budget needed?

Lorin Hochstein:
What I would say to that is: you don't have to plan to have a major outage. It'll happen, I promise.

Brian Marick:
Okay.

Lorin Hochstein:
You have to be ready, right? You want to be ready when the organization has this Christmas day existential crisis thing. Say, “Look, here's the plan. Here's what we're going to do.” So much of this stuff is timing. If you can wait for it and watch [for] the opportunity, you don't have to engineer it. I promise you it'll happen. But when it does, you should be ready, and you should be able to articulate why it's worth investing resources in the future.

Brian Marick:
I think that's an excellent place to stop. That's an inspiring ending. For me, it's an inspiring ending, unless you have something even more inspiring to say.

Lorin Hochstein:
No, I think that's the best I can do.

Brian Marick:
Well, this was a very pleasant conversation, and I guess it's over now.
