When AI Goes Wrong: How AI Automation Alone Wasn't Enough for Amazon AWS

Did You Know?

In October 2025, AWS experienced a major outage in its US-East-1 region. The event, which lasted roughly 15 hours and impacted thousands of companies and millions of users, was caused by a DNS (Domain Name System) resolution failure in the Amazon DynamoDB service.

At the same time, AWS is under pressure to massively scale its compute infrastructure (some sources estimate a $100 billion investment). Yet it is reportedly shrinking parts of its workforce, specifically engineering and operations headcount.

The argument? Some companies, in this case a major one, are jumping on the “AI will replace human workers” bandwagon. However, when an infrastructure failure happens, you realize that human knowledge, context, experience, and resilience still matter deeply.


So What?

From a business transformation perspective, this incident carries several implications for organizations aiming to leverage AI, optimize operations, adopt agile and lean practices, and build resilient systems.

1. Human + AI, not AI alone
The incident underscores that replacing skilled workers with AI or automation prematurely can weaken an organization’s resilience. When something novel fails, machines often cannot cope without human judgment, institutional knowledge, or experience. The AWS example illustrates this: a bug in DNS automation cascaded across dependent services, and recovery arguably took longer because human fallback and institutional continuity had been diminished.

2. Reliance on infrastructure = risk in operations
Even for an industry giant like AWS, an internal automation defect caused broad failure across multiple dependent systems (DynamoDB → EC2 → many services).

For companies adopting AI, machine learning, data mining, generative AI, and the like, dependence on complex infrastructure (cloud, data pipelines, model operations) only increases. Without a robust operational design (covering failure modes, human oversight, and residual competence), the risk grows significantly.

3. Data & automation maturity matters
Many AI tools are unreliable in complex, novel tasks. For organizations undertaking data mining, AI adoption, or agile transformations with AI-enabled teams, recognizing the current limitations of AI is crucial. You must focus on the maturity of your data, your processes, your operational guardrails, and the human context.

4. Outcome-over-output mindset in AI transformation
Business value, early impact, minimal cost, and reduced friction are paramount. The AWS incident reminds us that even when the output (automation, an AI rollout) appears efficient, the outcome (resilience, reliability, business continuity) may suffer if human and operational risks are not managed.

5. Strategic workforce & capability planning
AWS may have reduced human capacity (engineers, operational staff) while ramping up automation. Whether or not the causality is confirmed, the timing and effect raise organizational questions: do you have the right mix of talent, institutional memory, humans in the loop, and automation? If you lean too heavily on machines without the human scaffold, you risk operational fragility.


Now What?

For organizations leveraging AI, data mining, and agile and lean transformation, here are actions and considerations based on the lessons from AWS.

Actions for organizations

  • Conduct an AI-and-automation readiness assessment:
    • Map critical operational flows (e.g., data pipelines, model deployment, cloud infrastructure, business processes).
    • Identify where automation/AI is currently applied or planned, and assess what human oversight, institutional knowledge, or fallback exists.
    • Evaluate failure scenarios (what if the automation fails? what if the model hallucinates? what if the data pipeline breaks?) and ensure human-in-the-loop mechanisms are established (see the first sketch after this list).
  • Design for humans + machines:
    • Adopt a “centaur” model for AI: human experts paired with AI tools.
    • Ensure training programs emphasize human judgment, domain expertise, anomaly detection, and escalation protocols for when automation does not behave as expected.
    • For product teams, data/AI teams, operations teams: create cross-functional squads that include data scientists, operations engineers, domain experts, and reliability specialists, not just AI technologists.
  • Embed resilience & operational excellence into AI deployment:
    • Build in manual fallback or human override paths for critical systems.
    • Ensure system monitoring, alerting, and incident response paths include humans with domain context.
    • Use lean and flow principles: minimize handoffs, detect flow interruptions early (e.g., a broken pipeline, model drift, automation misfires; see the second sketch after this list), and reduce cycle time for incident recovery.
  • Prioritize high-value, low-risk AI experiments:
    • Given that many corporate AI initiatives generate zero or low returns (as several studies suggest), focus first on use cases where the outcome is clear, the data is solid, and human oversight is feasible.
    • Leverage agile product development principles: run experiments, gather feedback, learn quickly, scale when value is proven.
  • Workforce strategy aligned with transformation goals:
    • Don’t treat workforce reduction purely as cost-cutting by replacing humans with AI. Instead, plan for workforce upskilling (AI-literate roles, data ops, human-in-the-loop operations).
    • Retain institutional knowledge, document systems, train new staff, and keep “tribal knowledge” visible. Reporting on the AWS incident points out that when key expertise departs, resilience suffers.
    • Use change management, leadership alignment, and a culture of continuous learning to support the shift.
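
To make the human-in-the-loop mechanisms and override paths above concrete, here is a minimal Python sketch of an escalation guardrail: automation proposes a change, cheap invariant checks run first, and anything surprising is routed to the on-call engineer instead of being applied silently. Everything here (ChangePlan, guardrail, notify_oncall, the checks themselves) is a hypothetical illustration, not an AWS API; the real invariants would be whatever your domain experts consider “impossible” states.

    from dataclasses import dataclass
    from enum import Enum

    class Decision(Enum):
        APPLY = "apply"          # automation proceeds on its own
        ESCALATE = "escalate"    # a human must approve before anything changes

    @dataclass
    class ChangePlan:
        """A proposed automated change, e.g. a DNS record update."""
        target: str
        old_value: str
        new_value: str

    def guardrail(plan: ChangePlan) -> Decision:
        """Cheap invariant checks run before an automated change is applied."""
        # Never let automation write an empty record: an "impossible"
        # state like this is exactly the kind that cascades.
        if not plan.new_value:
            return Decision.ESCALATE
        # A write that changes nothing usually signals a logic bug
        # upstream; pause and let a human look.
        if plan.new_value == plan.old_value:
            return Decision.ESCALATE
        return Decision.APPLY

    def apply_change(plan: ChangePlan, notify_oncall) -> None:
        """Apply a change automatically, or hand it to the on-call engineer."""
        if guardrail(plan) is Decision.ESCALATE:
            notify_oncall(f"Automation paused: review change to {plan.target} "
                          f"({plan.old_value!r} -> {plan.new_value!r})")
            return  # the human decides; automation does not proceed
        # ...apply the change here (omitted)...

The shape matters more than the specific checks: the default path is automated, the surprising path is human, and nothing fails silently.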
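
The second sketch is for detecting flow interruptions early: a crude mean-shift check over a numeric health metric (latency, error rate, a model score). A real system would use PSI or KS tests, or an off-the-shelf monitoring product; the function name, sample data, and threshold here are illustrative only.

    import statistics

    def flow_interrupted(baseline: list[float], recent: list[float],
                         z_threshold: float = 3.0) -> bool:
        """True if recent values have shifted away from the baseline by
        more than z_threshold standard errors: a cheap early-warning
        signal for drift, a broken pipeline, or a misfiring automation."""
        mu = statistics.mean(baseline)
        sigma = statistics.stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any movement at all is a change.
            return statistics.mean(recent) != mu
        std_err = sigma / (len(recent) ** 0.5)
        return abs(statistics.mean(recent) - mu) / std_err > z_threshold

    # Example: yesterday's p95 latencies vs. the last few minutes.
    if flow_interrupted([0.21, 0.19, 0.22, 0.20, 0.21], [0.45, 0.52, 0.48]):
        print("Possible drift or pipeline break: page a human with context.")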

Catalyst Leadership Questions

  1. Are we treating AI as a substitute for humans, or a partner with humans?
    Probing question: Where in our operations have we shifted tasks to AI/automation without ensuring human oversight or a fallback?
  2. What happens if our AI or automation fails? Do we have human competence and continuity in place?
    Probing question: Can we identify the top 3 risks if a key automation pipeline fails, and who would be responsible for responding?
  3. Is our team’s capability aligned for AI-enabled workflows, including data scientists, operations engineers, reliability specialists, and domain experts?
    Probing question: Who are the humans currently maintaining our model and data systems, and what knowledge would leave with them if they left?
  4. Are we prioritizing AI projects based on business value, rather than hype?
    Probing question: Of our planned AI initiatives, which have clear measurable business outcomes and which are “because we can”?
  5. Do we design for flow, feedback, and resilience rather than just speed and efficiency?
    Probing question: When an AI release goes wrong, how quickly can we detect it, stop it, remediate it, and learn from it?

Let's Do This!

We’ve examined a high-profile example where even the most powerful cloud-infrastructure company experienced an event that highlights the limits of assuming AI and automation can replace human knowledge and expertise. For organizations pursuing transformation with AI, it’s a reminder: focus on value, resilience, human-in-the-loop operations, operational excellence, and the flow of outcomes, not just output.

Picture this: you’re running a factory and swap out most human operators for robots overnight. The robots execute well for a while; then one robot fails unexpectedly. Without human oversight, the line grinds to a halt, costing you hours and thousands of dollars. That’s essentially what happened in the cloud world.

Let’s apply these lessons to the real world of data, AI, product teams, and business operations, because “automate and forget” is not a strategy for sustainable transformation.