
Let’s take a look at a couple of different orgs I have had the pleasure to work with and their responses to “problems”. I am reminded of my practice sessions in football and marching band: we practice in small sections (clarinets, flutes, etc.), then come together to rehearse as a larger group.
That gives us the opportunity to inspect our small unit and then see how it integrates across the whole. If we can’t do well in our small sections, we will obviously struggle when the whole group comes together.
In one instance, a Friday storm (a real thunderstorm) knocked out a regional data center hosting a primary supplier’s order management system. In the first ecosystem, leaders had practiced a quarterly drill: the partner liaison paged the on-call, traffic was rerouted to an alternate node, a fallback SFTP feed replaced the real-time API, the joint Slack channel switched to a crisis template, and the time-to-recover clock stopped at 3 hours.
In the second ecosystem, our organization worked with various other departments, and each department worked its own playbook: procurement waited for legal, engineering waited for procurement, stores went out of stock, and customer care improvised for two weeks, as they often do.
The difference was not talent; it was rehearsal, shared roles, and a simple scorecard that made resilience measurable.
What?
Resilience is a capability you can build; in ecosystems, it requires clear roles, repeatable drills, and a one-page scorecard that everyone reads the same way; hardly easy.
Crisis scenarios to rehearse
Supplier failure simulation: primary partner offline; trigger the alternate supplier or capacity plan.
Routing swaps: fail the primary data feed over to a known fallback (API to SFTP, DC A to DC B, partner A to partner B); a minimal failover sketch follows this list.
Incident communication test: joint Slack or Teams template, prewritten customer and stakeholder updates, a single status page, and decision thresholds that move quickly.
Contract fallback: execute a pre-negotiated temporary route or alternate certification lab; time-boxed and cost-capped.
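To make the routing swap concrete, here is a minimal failover sketch in Python; the endpoint URL, the SFTP drop directory, and the function name are hypothetical placeholders, not a real partner integration, so adapt them to your own seam.

```python
# Minimal sketch of the "routing swaps" drill: try the real-time API,
# fall back to the latest SFTP-delivered file, and log the switch time.
# PRIMARY_API and FALLBACK_DIR are made-up placeholders.
import csv
import datetime
import pathlib

import requests  # third-party HTTP client

PRIMARY_API = "https://partner.example.com/api/inventory"  # hypothetical endpoint
FALLBACK_DIR = pathlib.Path("/data/sftp/inventory")        # hypothetical SFTP drop

def fetch_inventory(timeout_s: float = 5.0) -> tuple[list[dict], str]:
    """Return (records, source) and print a timestamp when the fallback kicks in."""
    try:
        resp = requests.get(PRIMARY_API, timeout=timeout_s)
        resp.raise_for_status()
        return resp.json(), "primary-api"
    except requests.RequestException:
        # Primary feed is down: use the most recent file the partner dropped via SFTP.
        latest = max(FALLBACK_DIR.glob("inventory_*.csv"), key=lambda p: p.stat().st_mtime)
        with latest.open(newline="") as fh:
            records = list(csv.DictReader(fh))
        switched_at = datetime.datetime.now(datetime.timezone.utc)
        print(f"failover to {latest.name} at {switched_at.isoformat()}")
        return records, "fallback-sftp"
```

Run it during the drill, not just after an outage; the point is that the switch, and its timestamp, are boring and rehearsed.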
Roles across companies
Incident Commander: one person makes the call; rotates by quarter; the escalation tree is public.
Partner Liaisons: one per firm; they coordinate decisions and timestamps.
Comms Lead: drafts internal and customer updates; coordinates legal and PR.
Ops Lead: executes routing and platform changes; confirms the rollback plan.
Ecosystem Scorecard (reviewed monthly)
Time-to-Recover (TTR): the clock from incident start to restored service at the customer boundary.
Alternate Capacity Readiness (ACR): the percent of load that can be moved within X hours; verified by the last drill.
Shared Backlog Burn-down During Crisis: the rate at which cross-company issues are cleared; shows whether the network is working as one.
Customer Impact Lag: the time between service restored and measured customer impact recovered (conversion, contact rate, NPS).
Decision Latency: the time from exception detection to a cross-company decision; often the true bottleneck.
Joint Defect Escape at the Seam: the quality signal for integration during and after the event.
Use Process Behavior Charts on TTR and decision latency so you react to signals, not noise; see our guidance on the center line and natural limits. For investment decisions, use Throughput Accounting: quantify the contribution lost per hour of outage, then compare it to the cost of automation, cross-training, or an alternate supplier; a small recurring cost that removes expensive outage hours is usually a strong bet.
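For teams that want to see how the center line and natural limits fall out of the data, here is a small XmR-style sketch for TTR; the TTR values are made up, and 2.66 is the standard scaling factor for charts of individual values.

```python
# Process Behavior Chart (XmR) sketch for time-to-recover, in hours.
# The ttr_hours values are illustrative only; feed in one value per drill or incident.
def pbc_limits(values: list[float]) -> dict[str, float]:
    center = sum(values) / len(values)                       # center line: the average
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)         # average moving range
    return {
        "center_line": center,
        "upper_natural_limit": center + 2.66 * mr_bar,
        "lower_natural_limit": max(0.0, center - 2.66 * mr_bar),  # TTR cannot be negative
    }

ttr_hours = [3.0, 5.5, 4.0, 6.0, 3.5, 4.5]
print(pbc_limits(ttr_hours))
# A new TTR beyond the natural limits is a signal worth investigating;
# points inside the limits are routine variation, not news.
```

The same function works for decision latency; just change the units.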
Anchor sponsorship with a change-management coalition; hold the cadence: short cycles, clear roles, lightweight artifacts.
So What?
Resilience is a strategic advantage, not overhead. In ecosystems, unpracticed teams create long decision queues, finger-pointing, and expensive heroics; customers feel the churn, and trust erodes.
Practiced networks reduce time-to-recover, lower expedited costs, and protect revenue because decisions move fast and in one rhythm. A scorecard makes trade-offs explicit; leaders stop funding shiny dashboards and start funding the boring things that save hours when it counts: alternate capacity, automation of failovers, shared runbooks, and joint drills.
Now What?
Run a 30-day resilience upgrade that any set of our teams can copy (obviously, track the metrics that make sense to you).
Week 1: publish the scorecard and roles
Draft a one-page scorecard with TTR, ACR, decision latency, customer impact lag, and joint escape rate; define how each is timestamped across firms (a small timeline sketch follows this list).
Assign roles: Incident Commander, Partner Liaisons, Comms Lead, Ops Lead; publish the escalation tree and on-call schedule.
Set a quarterly drill on the calendar; name the scenario right now.
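A minimal sketch of what "timestamped across firms" can look like, assuming every firm stamps events in UTC on one shared timeline; the event names and times below are illustrative, not from a real incident.

```python
# Shared incident timeline: every firm records the same named events in UTC,
# so TTR, decision latency, and customer impact lag are computed the same way.
from datetime import datetime, timezone

timeline = {  # placeholder timestamps for illustration
    "incident_start":            datetime(2024, 5, 3, 14, 0, tzinfo=timezone.utc),
    "exception_detected":        datetime(2024, 5, 3, 14, 12, tzinfo=timezone.utc),
    "cross_company_decision":    datetime(2024, 5, 3, 14, 47, tzinfo=timezone.utc),
    "service_restored":          datetime(2024, 5, 3, 17, 5, tzinfo=timezone.utc),
    "customer_impact_recovered": datetime(2024, 5, 4, 9, 30, tzinfo=timezone.utc),
}

def hours_between(start: str, end: str) -> float:
    return (timeline[end] - timeline[start]).total_seconds() / 3600

print(f"TTR: {hours_between('incident_start', 'service_restored'):.1f} h")
print(f"Decision latency: {hours_between('exception_detected', 'cross_company_decision') * 60:.0f} min")
print(f"Customer impact lag: {hours_between('service_restored', 'customer_impact_recovered'):.1f} h")
```

Whatever tool you use, the agreement on event names and a single clock is what makes the scorecard readable across companies.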
Week 2: rehearse the smallest drill
Choose one seam; e.g., fail the real-time inventory API to the fallback feed for 30 minutes; run the comms template; record all timestamps.
Debrief in 20 minutes; log what surprised you; update the runbook and scorecard definitions.
Week 3: elevate one constraint
Use the timestamps to find the bottleneck; most ecosystems discover it is decision latency or data failover.
Prepare a one-page Throughput Accounting pitch: current TTR, cost per outage hour, option A (low-cost automation), option B (cross-training), option C (alternate supplier qualification), the expected TTR delta, and a 2-week trial; ask for a small, time-boxed investment. A worked example with made-up numbers follows.
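Here is one way the pitch math might look; every number below is a made-up placeholder, so treat it as a template for the arithmetic rather than a benchmark.

```python
# Throughput Accounting sketch: compare outage hours saved per year against option cost.
# All figures are illustrative assumptions; replace them with your own data.
contribution_per_hour = 40_000   # contribution lost per outage hour (assumed)
outages_per_year = 4             # expected incidents at this seam (assumed)
current_ttr_hours = 12.0         # from your last drill or incident

options = {
    "A: low-cost automation":              {"cost": 15_000, "expected_ttr": 4.0},
    "B: cross-training":                   {"cost": 25_000, "expected_ttr": 6.0},
    "C: alternate supplier qualification": {"cost": 60_000, "expected_ttr": 3.0},
}

for name, opt in options.items():
    hours_saved = (current_ttr_hours - opt["expected_ttr"]) * outages_per_year
    annual_benefit = hours_saved * contribution_per_hour
    print(f"{name}: saves {hours_saved:.0f} outage hours/year, "
          f"worth ${annual_benefit:,.0f} against a cost of ${opt['cost']:,.0f}")
```

If the cheapest option pays for itself in a single incident, the ask almost writes itself.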
Week 4: drill again and lock gains
Re-run the same scenario; compare TTR and decision latency on a PBC; if you see a signal, update the baseline; if not, adjust the fix and try once more.
Fold any new guardrails into the Operating Level Agreement with partners: data failover SLOs, decision thresholds, message templates, and owner names.
One-page scorecard template
Objective: safeguard conversion and trust during incidents.
Metrics: TTR, ACR, Decision Latency, Customer Impact Lag, Joint Escape Rate.
Limits and Targets: PBC center lines and natural limits, plus target ranges where applicable.
Drill Cadence: quarterly scenarios with named owners.
Improvements: the last three changes and their measured effects. (A fill-in sketch of this template follows.)
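If a structured starting point helps, here is one possible way to capture the template so every firm fills it in the same way; the field names mirror the list above, and every value is a placeholder.

```python
# One-page scorecard captured as a plain structure; all values are placeholders.
scorecard = {
    "objective": "Safeguard conversion and trust during incidents",
    "metrics": ["TTR", "ACR", "Decision Latency", "Customer Impact Lag", "Joint Escape Rate"],
    "limits_and_targets": {  # PBC center lines and natural limits, plus targets
        "TTR": {"center_line_h": 4.4, "upper_natural_limit_h": 9.5, "target": "<= 4 h"},
        "Decision Latency": {"center_line_min": 35, "upper_natural_limit_min": 80, "target": "<= 30 min"},
    },
    "drill_cadence": {"frequency": "quarterly", "next_scenario": "API-to-SFTP failover", "owner": "Ops Lead"},
    "improvements": [  # the last three changes and their measured effects
        {"change": "automated data failover", "effect": "placeholder"},
        {"change": "10-minute decision rule", "effect": "placeholder"},
        {"change": "shared comms template", "effect": "placeholder"},
    ],
}
```

Keep it to one page; the moment it needs a scroll bar, it stops being read the same way by everyone.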
Facilitator scripts you can use
“We will rehearse before we need it; small drills, short debriefs.”
“Decisions beat drama; if we cannot decide in 10 minutes, we escalate (only once).”
“We fund the cheapest hour we can save; show me the outage cost and the option that removes it.”
Let’s Do This!
Ecosystem agility is learning plus impact that crosses company lines; practice it before you need it.
Run one drill a quarter; publish a scorecard everyone can read at a glance; invest in the hour-savers that reduce time-to-recover; update your operating agreement as you learn.
When the next storm hits, your customers should notice quick recovery and quiet confidence, not chaos and apologies. That is resilience you can bank on.