Your MTTR Is Your Culture: Chaos Engineering and Antifragility in Modern Cyber Operations
operational resilience, chaos engineering, antifragility, MTTR, cyber security
Last night at a RANT forum in Manchester, the conversation inevitably
turned to the recent wave of UK cyber incidents, notably Marks &
Spencer’s ransomware attack earlier this year and Jaguar Land Rover’s
(JLR) factory-stopping cyberattack over the summer.
In the same discussion, people referenced this week’s Cloudflare outage, which turned out to be a configuration error in their bot-management
stack. It took a sizeable chunk of the internet offline for several
hours, including services like X and PayPal, before being contained and
remediated.
On the surface, these events all sit under the banner of “cyber
incidents.” But the timelines tell a more important story:
- Cloudflare: major global outage, largely stabilised within hours, with a public post-mortem and corrective actions already in motion.
- Marks & Spencer: ransomware attack leading to online orders being halted, contactless payments disrupted and digital channels degraded for months, with a projected ~£300m profit impact and full digital recovery stretching into late summer.
- Jaguar Land Rover: a major ransomware-driven cyberattack forcing a proactive global shutdown of IT systems and manufacturing, halting production for five to six weeks, straining thousands of suppliers and now estimated as one of the most expensive cyber incidents in UK history.
These are not just security stories. They are operational resilience stories. And the key differentiator is not whether something went wrong – it’s how prepared the organisation was to fail, and how quickly it could recover when it did.
Your Mean Time To Recover (MTTR) is no longer a pure engineering metric. It is a direct proxy for the culture, discipline and investment behind your operational resilience strategy.
From “Secure” to “Resilient”: A Shift in Mindset
Most organisations still talk about cyber in prevention-first
language:
“We need to stop attacks.” “We need to close vulnerabilities.”
Prevention is essential, but it is no longer sufficient. Modern
attackers, complex supply chains and highly coupled cloud architectures
mean that material incidents are a question of when, not
if.
Regulators have already internalised this. Operational resilience
regimes – from UK financial services regulation to the EU’s Digital
Operational Resilience Act (DORA) – all converge on the same
principle:
- Identify your important business services.
- Set impact tolerances (how long you can realistically be down).
- Prove you can operate within those tolerances under stress.
That last point is where many organisations struggle. The gap between
“we have a plan” and “we have rehearsed this failure mode end-to-end
under realistic conditions” is where days and weeks of unplanned
downtime are born.
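One way to narrow that gap is to treat impact tolerances as data that incidents and rehearsals can be scored against automatically, rather than prose in a policy document. Below is a minimal sketch of that idea; the service names, owners and tolerance values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta

# Illustrative sketch: service names, tolerances and owners are hypothetical.

@dataclass
class ImportantBusinessService:
    name: str
    impact_tolerance: timedelta  # maximum tolerable disruption
    owner: str                   # accountable team or executive

SERVICES = [
    ImportantBusinessService("online-order-capture", timedelta(hours=4), "Retail Ops"),
    ImportantBusinessService("store-payments", timedelta(hours=1), "Payments"),
    ImportantBusinessService("manufacturing-scheduling", timedelta(hours=8), "Plant IT"),
]

def within_tolerance(service: ImportantBusinessService, outage: timedelta) -> bool:
    """Did (or would) this outage stay within the stated impact tolerance?"""
    return outage <= service.impact_tolerance

# Example: a 6-hour loss of order capture breaches its 4-hour tolerance.
print(within_tolerance(SERVICES[0], timedelta(hours=6)))  # False
```

Once tolerances exist in this form, every rehearsal and every real incident produces a simple, uncomfortable answer: did we stay inside the line or not?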
Antifragility: Not Just Surviving Incidents, Learning From Them
The language of resilience is about withstanding shocks. The language of
antifragility goes further: systems get better because they are
stressed, not in spite of it.
In a cyber and infrastructure context, antifragility looks like this:
- Each incident leaves your monitoring, runbooks and architecture genuinely stronger.
- Post-mortems are blameless, widely shared and linked to hard design changes – not shelfware slide decks.
- On-call engineers are trained, rotated and supported, not burnt out and blamed.
- Leadership accepts controlled experimentation and failure as a cost of learning, not something to be hidden.
Cloudflare’s handling of its outage is not perfect – no major incident
response ever is – but it shows antifragile characteristics: rapid
detection, global rollback mechanisms, transparent communication and a
detailed technical post-mortem published within days.
Contrast that with scenarios like M&S and JLR, where recovery has
dragged on for weeks or months, core digital or manufacturing services
were taken fully offline, and the commercial and macro-economic impact
runs into hundreds of millions or even billions.
The technical root causes differ: social-engineering-driven identity
compromise in retail, supply-chain-wide OT/IT dependence in automotive
manufacturing, configuration bugs in a cloud security edge.
The lesson is the same:
Antifragile organisations treat failure as a rehearsed operating
mode, not an exceptional event.
Chaos Engineering: The Missing Discipline in Cyber Resilience
If antifragility is the mindset, chaos engineering is
one of the core practices that operationalises it.
Chaos engineering is the deliberate, safe injection of failure into
live-like environments to validate that:
- Detection works as expected (alerts fire in time and reach the right teams).
- Automated mitigations and fallbacks function (circuit breakers, rate limits, WAF/CDN rollbacks, failovers).
- Manual runbooks are usable under pressure (clear steps, crisp ownership, unambiguous escalation paths).
- Customer and partner impact stays within your stated impact tolerances.
For cyber and infrastructure teams, chaos experiments might include:
- Simulating the loss of a core identity provider or SSO integration.
- Intentionally degrading a payment or order-capture service and measuring time-to-detect and time-to-mitigate.
- Injecting WAF or CDN misconfigurations in a canary environment to test whether monitoring and guardrails catch them before global rollout.
- Running red-team/blue-team exercises that don’t stop at “initial compromise”, but run all the way through to restoration of business services and closure with regulators.
This is not about theatrical tabletop exercises or once-a-year crisis rehearsals. It is continuous engineering work, backed by telemetry, SLOs and clear success criteria – the same discipline we apply to performance and availability.
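To make that concrete, here is a minimal sketch of what one such experiment might look like as code. The hooks inject_fault, revert_fault and alert_fired are hypothetical stand-ins for your own fault-injection and alerting tooling, not a real library API; the point is that the experiment has a defined fault, a guaranteed rollback and a measurable detection target.

```python
import time

# Minimal sketch of a chaos experiment harness. The three hooks
# (inject_fault, revert_fault, alert_fired) are hypothetical callables
# into your own fault-injection and alerting tooling.

def run_experiment(inject_fault, revert_fault, alert_fired,
                   detect_slo_seconds=300, max_runtime_seconds=900):
    """Inject a fault, measure time-to-detect, always roll back."""
    started = time.monotonic()
    inject_fault()  # e.g. degrade the order-capture canary
    try:
        while time.monotonic() - started < max_runtime_seconds:
            if alert_fired():  # did monitoring catch it and page the right team?
                time_to_detect = time.monotonic() - started
                return {
                    "detected": True,
                    "time_to_detect_s": round(time_to_detect, 1),
                    "within_slo": time_to_detect <= detect_slo_seconds,
                }
            time.sleep(10)
        return {"detected": False, "within_slo": False}
    finally:
        revert_fault()  # the experiment ends with the fault removed, pass or fail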
MTTR as a Cultural Mirror
Most post-incident reviews obsess about the trigger:
Was it a phishing email? A third-party compromise? A mis-applied configuration change?
These are important, but they are rarely unique. What really distinguishes one organisation from another is what happens after first detection:
How long did it take to make a risk-informed decision to contain? How quickly could we isolate, segment and route around the failure? How well did we communicate with customers, staff, suppliers and regulators? How fast did we return to a safe, sustainable steady state?
This is where MTTR becomes a cultural metric:
- Low MTTR usually indicates rehearsed playbooks, empowered incident commanders, strong cross-functional drills and a high-trust culture where engineers are not punished for surfacing issues early.
- High MTTR is often a symptom of fragmented ownership, brittle architectures, manual workarounds, decision paralysis and a fear of making visible trade-offs without perfect information.
You can’t “dashboard” your way out of this. The journey from multi-week disruption (M&S and JLR) to multi-hour containment (Cloudflare) runs through culture, operating model and investment, not just tooling.
What Leaders Can Do Now
If you are a CISO, CIO or engineering leader, there are pragmatic moves
you can make immediately:
Re-frame the narrative from “security” to “operational resilience”.
Tie cyber incidents directly to important business services and regulatory impact tolerances. Make the conversation about continuity of service, not just compliance or “patch velocity”.
Institutionalise chaos engineering for cyber.
Start small: quarterly game-days on one critical service, run jointly by SRE, security, and the business owner. Measure detection times, decision latency and recovery – and treat those metrics as seriously as you treat revenue or NPS.
Make MTTR a first-class KPI – and treat it as a learning signal.
Track MTTR across incident classes (identity, data, application, platform, vendor). Look for structural patterns: where are we consistently slow to recover, and why? Invest in automation, simplification and training where it has the biggest impact on MTTR.
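As an illustration of what that tracking can look like, here is a minimal sketch that computes MTTR per incident class. The incident records and field names are hypothetical stand-ins for whatever your incident-management tooling exports.

```python
from collections import defaultdict
from statistics import mean

# Illustrative only: incident records and field names are hypothetical
# stand-ins for whatever your incident-management tooling exports.
incidents = [
    {"class": "identity", "detected_at": 0, "recovered_at": 2 * 3600},
    {"class": "identity", "detected_at": 0, "recovered_at": 16 * 3600},
    {"class": "platform", "detected_at": 0, "recovered_at": 3 * 3600},
    {"class": "vendor",   "detected_at": 0, "recovered_at": 5 * 24 * 3600},
]

def mttr_by_class(records):
    """Mean time to recover, in hours, grouped by incident class."""
    durations = defaultdict(list)
    for r in records:
        durations[r["class"]].append(r["recovered_at"] - r["detected_at"])
    return {cls: round(mean(d) / 3600, 1) for cls, d in durations.items()}

print(mttr_by_class(incidents))
# {'identity': 9.0, 'platform': 3.0, 'vendor': 120.0}
```

Even a rough cut like this makes structural patterns visible and gives the next round of automation, simplification and training a clear target.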
Run blameless post-mortems and close the loop.
Every material incident should result in:
- Concrete engineering changes (architecture, telemetry, automation).
- Updated runbooks and decision frameworks.
- Lessons shared across teams and suppliers, not confined to a single silo.
Align early with operational resilience regulation.
Whether you sit under PRA/FCA regimes in the UK or DORA in the EU, use those frameworks to drive clarity: map important business services, define realistic impact tolerances, and test against them before the next M&S or JLR-style incident forces the issue.
Closing Thought
Cyber incidents and outages are no longer rare shocks; they are part of
the normal operating environment of any digital business, from retailers
to global manufacturers to cloud platforms. The question is not whether
something will fail, but how your organisation behaves when it
does.
If your operational resilience strategy is built purely on “keeping the bad thing out”, you will continue to be surprised, slow and visibly fragile when the inevitable occurs.
If instead you start to embed antifragility and chaos engineering into how you design, deploy and operate your systems, your MTTR will start to tell a different story: one of preparedness, composure and continuous learning.
That is the signal your customers, regulators and shareholders are really looking for.