10 January 2026

Designing Business Operations That Don't Rely on Heroics

Why MSP environments reward restoration over prevention, and how to design operations that fix root causes without heroics.

it-operations msp root-cause service-management problem-management governance leadership

Root causes don't persist because people don't care. They persist because the system rewards restoration, not prevention.

If you have ever worked in an MSP-led IT environment, this pattern will feel familiar.

The same incidents recur. The same fixes are applied. Service is restored, everyone moves on, and nothing really changes.

For a long time, I thought this meant someone was not trying hard enough. In reality, it usually means the system is working exactly as designed.

This post is about why root causes so often go unaddressed in MSP environments, and how an IT Operations Manager can fix that without becoming a hero, a bottleneck, or a burnout case.

The myth of heroic IT

Most organisations do not explicitly ask for heroic behaviour. It emerges quietly.

An incident happens. Someone stays late to fix it. The business is grateful. Normal service resumes.

The next time it happens, the same person is expected to step in again. Over time, they become the safety net. The organisation feels stable, but only because someone is absorbing risk on its behalf.

Heroic IT feels productive. It feels necessary. It is also one of the most effective ways to prevent an organisation from learning.

Why MSP environments struggle with root causes

This is not about bad suppliers or lazy engineers. It is about incentives.

SLAs reward speed, not durability

Most MSP contracts measure things like response time, resolution time, and availability. Once service is restored and the ticket is closed, the work is considered complete.

A permanent fix that prevents future incidents is often slower, riskier, and outside the contracted scope.

From the MSP's point of view, stopping at restoration is commercially rational.

Root cause work sits in a grey zone

True root cause correction usually involves change. Configuration standardisation. Documentation. Architectural cleanup. Dependency mapping.

That work is neither pure support nor clearly defined project delivery.

If it is not explicitly funded or prioritised, it becomes optional. Optional work rarely survives a busy service desk.

Improvement work is invisible

Incident resolution is visible. A ticket closes. A dashboard turns green.

Preventative work is quiet. When it succeeds, nothing happens. No outage. No alert. No thank you.

In environments driven by urgency, invisible work loses.

Everyone assumes someone else owns prevention

The business assumes IT has it covered. IT assumes the MSP will fix it. The MSP assumes the issue is accepted as part of the environment.

Nobody is being negligent. Ownership is simply unclear.

The hero trap

This is where well-meaning IT managers get stuck.

They know the root cause. They know how to fix it. They know the business will not fund it easily.

So they squeeze it in. Late nights. Quiet changes. Personal effort.

The system never feels the pain, so it never changes.

Over time, the manager becomes indispensable, exhausted, and frustrated. The organisation becomes dependent without realising it.

This is not leadership. It is risk concentration.

The shift from effort to structure

Solving this does not require more energy. It requires better design.

Separate restore from improve

Incident management exists to restore service. That is its job.

Problem management exists to remove causes. That must be treated as planned, deliberate work.

When improvement is expected to happen inside BAU support, it will always lose.

Fund prevention explicitly

Even a small allocation changes behaviour.

Five to ten percent of MSP capacity. A modest monthly remediation budget. A rolling backlog of known problems.

Once prevention is paid for, it becomes legitimate.

Make recurrence visible

Do not obsess over single incidents. Track patterns.

What breaks repeatedly. How often. With what impact.

Frequency multiplied by impact tells you where to focus. It also gives you an evidence base that does not rely on anecdotes or emotion.
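As a rough illustration of frequency multiplied by impact, here is a minimal Python sketch. It assumes each closed ticket has been tagged with a problem label and an impact figure in hours of disruption; the Incident class, its field names, and the sample data are all hypothetical and not tied to any particular ITSM tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    """One closed ticket. Field names are illustrative assumptions."""
    problem: str         # label that groups recurrences of the same underlying issue
    impact_hours: float  # assumed impact measure: hours of disruption


def rank_recurring(incidents):
    """Return (problem, frequency, mean impact, score) tuples, highest score first."""
    impacts = defaultdict(list)
    for inc in incidents:
        impacts[inc.problem].append(inc.impact_hours)

    ranked = []
    for problem, hours in impacts.items():
        frequency = len(hours)
        mean_impact = sum(hours) / frequency
        # Score is frequency multiplied by average impact per incident.
        ranked.append((problem, frequency, mean_impact, frequency * mean_impact))
    return sorted(ranked, key=lambda row: row[3], reverse=True)


# Made-up sample data, purely for illustration.
incidents = [
    Incident("vpn-cert-expiry", 2.0),
    Incident("vpn-cert-expiry", 3.0),
    Incident("vpn-cert-expiry", 1.0),
    Incident("backup-job-fail", 4.0),
    Incident("printer-queue", 0.5),
    Incident("printer-queue", 0.5),
]

for problem, freq, impact, score in rank_recurring(incidents):
    print(f"{problem}: {freq} incidents x {impact:.1f}h average = {score:.1f}")
```

In this sample the recurring certificate issue ranks above the single, larger backup failure, which is the point of tracking patterns rather than individual incidents. Hours of disruption is only one possible impact measure; affected users or cost would slot into the same calculation.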

Force trade-offs into the open

When a permanent fix is declined, record it.

Document the risk. Note the expected recurrence. Capture who accepted it.

This is not about blame. It is about clarity.

Once risk is explicit, it stops living in someone's head.
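One way to picture such a record is as a small, immutable data structure. The sketch below is an assumption about what it could contain, not a standard schema: every field name is hypothetical, the sample values are invented, and the review date is an extra I have added so that an accepted risk resurfaces at a later service review.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RiskAcceptance:
    """A declined permanent fix, written down. All field names are hypothetical."""
    problem: str
    declined_fix: str
    risk: str                            # the documented risk
    expected_recurrences_per_year: int   # the expected recurrence
    accepted_by: str                     # who accepted it
    accepted_on: date
    review_on: date                      # assumption: when to revisit the decision

    def is_due_for_review(self, today: date) -> bool:
        """True once the review date has been reached."""
        return today >= self.review_on


# Invented example values, for illustration only.
record = RiskAcceptance(
    problem="vpn-cert-expiry",
    declined_fix="Automate certificate renewal",
    risk="Remote access outage when the certificate lapses",
    expected_recurrences_per_year=2,
    accepted_by="Head of Operations",
    accepted_on=date(2026, 1, 10),
    review_on=date(2026, 4, 10),
)

print(record.is_due_for_review(date(2026, 2, 1)))
```

Making the record frozen is a deliberate choice in this sketch: a decision to accept risk should be superseded by a new record, not quietly edited.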

Use governance instead of heroics

Service reviews should not just report what happened. They should decide what happens next.

Which recurring issue are we eliminating this month? Which risks are we consciously accepting? What work are we choosing not to do, and why?

Cadence beats effort every time.

What changes when you do this

Things get quieter.

Incidents reduce. Escalations become rarer. The MSP starts suggesting improvements rather than reacting to noise.

IT feels boring in the best possible way.

Boring IT is predictable, resilient, and trusted. It does not rely on individuals stepping in at the last minute. It improves because the system makes improvement inevitable.

The kind of IT worth running

The goal is not perfect systems.

It is systems that get better over time without burning people out.

If your IT operation only works because someone is being heroic, it is fragile by definition.

Design the incentives. Create the feedback loops. Make the trade-offs visible.

Do that, and root causes stop being a personal burden and start becoming an organisational responsibility.

That is where calm, sustainable IT actually begins.
