The shear happened quietly. No warning light, no dramatic noise — just a subtle change in how the rig tracked on the trail, and then a soft, wrong feeling under the front wheels. By the time I stopped and got out, I was three miles past the last point where I'd had cell service, the sun was two hours from setting, and the front differential was done.

I've led incident response on production systems under active fraud attacks. I've managed unplanned RIFs overnight. I've had roadmaps dissolve in a single leadership meeting. None of it felt quite like standing on a dirt road in the eastern Cascades, watching the golden hour light work against you.

But the process, it turns out, is the same.

Stop Adding Inputs to a System You Don't Understand Yet

The first instinct when something breaks — on the trail or in production — is to act. To try things. To move. The rig felt wrong, so I wanted to keep driving and see if it sorted itself out. When a service starts returning 500s, the instinct is the same: restart it, redeploy the last build, change something.

Both are the same mistake. You're adding variables to a system you don't have a model of yet. You're making the post-mortem harder to write and the recovery harder to execute.

The philosophical discipline here is recognizing that urgency is a feeling, not a fact. Epictetus: "Men are disturbed not by the things which happen, but by the opinions about the things." The sun was going to set at the same time whether I panicked or not. The incident was going to require the same root cause analysis whether I restarted the service three times or zero times. Stop. Observe. Build a model before you move.

Inventory Before You Act

Recovery gear exists for the moment when you need it and can't think clearly enough to improvise. I had the gear. Recovery boards, a full-size spare, basic tools, water for two days, a Garmin inReach. I knew where everything was before the breakdown happened — which meant that when it happened, my hands could work while my brain was still catching up.

In incident response, this maps to runbooks. Not because runbooks solve every problem — they don't — but because they free up cognitive capacity for the parts of the problem that are actually novel. The steps you rehearsed at 2pm on a Tuesday are the ones you can still execute at 2am, when the adrenaline is running and the Slack channel has 40 people in it generating noise.
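To make that concrete, here's a minimal sketch of what a runbook can look like when it's rehearsable rather than a stale document. Everything in it is illustrative — the step names are hypothetical — and the point is the timestamped walk-through, not the specific steps:

```python
# Minimal runbook-as-code sketch: a checklist the on-call can execute
# without improvising. Step names here are hypothetical examples.
from datetime import datetime, timezone

STEPS = [
    "Declare the incident and open a dedicated channel",
    "Snapshot current state: error rates, recent deploys, traffic",
    "Freeze deploys: stop adding inputs to the system",
    "Identify available mitigation levers (flags, routing, limits)",
    "Post the first status update with a time for the next one",
]

def run_checklist(steps=STEPS):
    """Walk the checklist, timestamping each step as it's acknowledged."""
    log = []
    for i, step in enumerate(steps, 1):
        input(f"[{i}/{len(steps)}] {step} -- press Enter when done")
        log.append((datetime.now(timezone.utc).isoformat(), step))
    return log  # the timeline you'll want when writing the post-mortem

if __name__ == "__main__":
    run_checklist()
```

A checklist this short is almost embarrassing to write down. That's the test: if it's too obvious to bother recording, it's exactly what you'll forget at 2am.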

The teams I've seen struggle most in incidents are the ones whose runbooks are either nonexistent or untested. They exist as documents. They've never been walked through under pressure. The gear is in the truck but nobody knows which bag it's in.

Communicate Early, Communicate Calmly

I had satellite comms. I used them — not to declare an emergency, but to send a position and a situation report to the person who knew my route plan. Front diff issue at mile 47, likely out past dark, working the problem, here are my coordinates. That's it. No panic, no escalation beyond what the facts warranted, but also no delay.

This is the hardest part of incident response for most engineering leaders. There's a pull toward containing the information — giving yourself time to figure it out before you have to explain it. But the cost of a late escalation almost always exceeds the cost of an early one. Your stakeholders don't need you to have the answer. They need your current best model of the situation and what you're doing about it.
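If it helps to make "current best model" concrete, here's a sketch of what that first update can look like, mirroring the trail sitrep. The fields and values are invented; the shape is what matters: status, impact, action, next check-in.

```python
# Sketch of a first incident update. All field values are hypothetical.
from dataclasses import dataclass

@dataclass
class IncidentUpdate:
    summary: str      # current best model, not the final answer
    impact: str       # who is affected and how badly
    action: str       # what is being done right now
    next_update: str  # when stakeholders will hear from you again

    def render(self) -> str:
        return (
            f"STATUS: {self.summary}\n"
            f"IMPACT: {self.impact}\n"
            f"ACTION: {self.action}\n"
            f"NEXT UPDATE: {self.next_update}"
        )

print(IncidentUpdate(
    summary="Elevated 500s on checkout since 14:07 UTC; cause unknown",
    impact="~8% of checkout requests failing",
    action="Rolled traffic to the previous build; investigating in parallel",
    next_update="14:45 UTC, sooner if impact changes",
).render())
```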

Calm is contagious. So is panic. The tone of your first incident update sets the temperature of every conversation that follows.

Work the Problem in the Right Order

On the trail: stabilize, then diagnose, then recover. Don't try to fix the diff while you're still in a dangerous position. Get the rig somewhere flat. Check your situation before you check the hardware.

In production: stop the bleeding before you find the root cause. If you have a lever that limits impact — rate limiting, feature flags, traffic routing — pull it before you understand why you need it. Understanding can come second. Customers can't wait for understanding.
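As a sketch of what such a lever looks like in practice, here's a hypothetical feature-flag kill switch. The flag store and function names are invented; the point is that mitigation is one line and requires no diagnosis at all.

```python
# Hypothetical kill switch consulted before the risky code path.
FLAGS = {"recommendations_enabled": True}  # stand-in for a real flag service

def expensive_recommendation_call(user_id: str) -> list[str]:
    # Placeholder for the dependency that started failing mid-incident.
    raise TimeoutError("upstream recommendation service misbehaving")

def get_recommendations(user_id: str) -> list[str]:
    # The flag is checked before the risky path, so mitigation never
    # depends on understanding the failure first.
    if not FLAGS["recommendations_enabled"]:
        return []  # degraded but safe: no recommendations, page still loads
    return expensive_recommendation_call(user_id)

# During the incident, stopping the bleeding is one line:
FLAGS["recommendations_enabled"] = False
print(get_recommendations("u123"))  # [] -- impact limited, diagnosis can wait
```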

The instinct to fully diagnose before acting is often backwards. You have two tracks running in parallel: mitigation and investigation. They inform each other, but mitigation comes first when users are affected.

The Post-Mortem Is the Point

I made it out. Slow-rolled the last three miles with the front diff disengaged, got back to pavement as the last light died, had the rig on a flatbed an hour later. The breakdown was expensive and inconvenient and, in retrospect, completely preventable. I'd been ignoring a noise for two trips that I now recognize as the early warning sign of exactly what failed.

That's the post-mortem writing itself. Not blame — I made the call to keep running on a diff I should have inspected. Process. What did I know, when did I know it, what would a different decision at that point have changed?

The same question applies every time a system fails under you. Not who did this but what did the system make easy that should have been hard, and what made the warning signal easy to ignore? The goal isn't a clean incident history. It's an organization that gets better at reading the early signs.

There's a line often attributed to Seneca: "Every new beginning comes from some other beginning's end." The attribution is shaky, but whoever wrote it was thinking about time and mortality, and it holds for mechanical failures and production outages equally well. The breakdown is not the failure. The failure is not updating the model.

Update the model. Check the diff. Write the post-mortem before you forget what it felt like to be wrong.