When Agile Meets Adversity: Lessons from the 2024 CrowdStrike Outage

Disclaimer

I don't have all the facts about what happened in the last few days regarding CrowdStrike and Microsoft. I simply want to use it as an example of something I find very common in the world I work in: being agile while being strong in testing.

It is not my intent to assume who failed or what they should have done; much smarter people than I am are on the case. I think it's a great example for my topic, though, so let's dive in.

The Crowdstrike Incident

Beginning at 04:09 UTC on Friday, July 19, 2024, a critical issue at CrowdStrike led to a global outage impacting multiple industries. The problem originated from a faulty configuration update to the CrowdStrike Falcon sensor software deployed on Windows machines.

"Customers running Falcon sensor for Windows version 7.11 and above, that were online between Friday, July 19, 2024 04:09 UTC and Friday, July 19, 2024 05:27 UTC, may have been impacted. Systems running Falcon sensor for Windows 7.11 and above that downloaded the updated configuration from 04:09 UTC to 05:27 UTC – were susceptible to a system crash."

This update caused systems to enter recovery mode or display the Blue Screen of Death (BSOD), significantly disrupting operations across various sectors, including airlines, banks, and retailers.

If you have worked in the technology space for any length of time, you have likely been part of some urgent catastrophe in your career. Most professionals are doing their best to guard against product downtime, defects, security threats, and hacked or leaked data...

As our available tools improve, we can put these protective measures in place faster (which also means bad actors have better tools to do their bidding faster as well). I suppose every profession is subject to the cat-and-mouse game between bad actors and people doing their best to provide great products, but it feels exponential in the technology space.

It's Me, Hi, I'm the Problem, It's Me

Technology and product development is a hyper-competitive space: launch first (or earlier than others) or languish. Executives and business stakeholders alike are under enormous pressure to perform, and that pressure seeps into the culture of the organization. Some leaders handle it well; others...do not.

What if we, the professionals, are our own worst enemy, going faster than we safely can in order to satisfy the business? What if our leadership is the worst enemy, creating a culture of "go fast or else"?

Finger-pointing isn't very helpful. What matters is determining what happened and finding ways to mitigate it without creating a bunch of ossified, calcified processes that rarely get re-evaluated, slowing teams down unnecessarily. Worse is when, after a situation like this, a recalcitrant auditing authority is appointed that rarely weighs the actual risk of an issue against mitigating work that may create more waste than the risk demands.

You've seen these processes, right? Someone claims a developer can't deploy to production; it has to go through a separate group (usually the group that knows the least about the change going to production, LOL).

Another claims that because of XYZ Compliance Policy, we can't do ABC. Dig into the actual wording of a governance policy before simply taking an auditor's word for it (their sole job is to identify risk, not necessarily to care about the mitigation's effect on productivity). I digress, though; let's refocus on the real intent of this post.

So What...

The focus here is that sometimes our own people are our own worst enemy. The recent CrowdStrike events seem to point to an internal deployment or update of some kind, and the resulting outage clearly demonstrated the effect cloud computing has had on every industry. Of course, the first question...did anyone test that the update worked? If I had to bet, they did, but if they didn't, they don't deserve to be serving so many of the world's industries anyway.

Being agile has a great relationship with quality. The more we invest in quality up front, the lower the risk to our deployments and the faster (and more effectively) we go.

Go fast! But go right...(Festina lente).

Well, sometimes we just aren't sure of the right balance. If I am building an online movie theatre ticketing product and I get something wrong...whoops, the worst case is that you miss your movie. Perhaps you get popcorn delivered to your seat instead of Junior Mints...whoops! Worse, you didn't even get your food (which you probably didn't need anyway, you're welcome).

If I am making a pacemaker or missile-defense avionics, that whoops is not acceptable. We might still build iteratively and incrementally, but with far more testing than we can do in a single iteration or Sprint. We might even spend Sprints solely focused on hardening or stabilization testing that simply couldn't be accomplished in our 1-4 week Sprint (yes, I think that is OK in this example because of the testing contract required; more on that later). However, that work is still visible on a product backlog and built into the overall release plan (or release burn-down).

Scrum (a way to be agile, not THE way) says the feature has to be usable by the end user so they can validate it (proving it has gone through some testing, works as intended, and doesn't break other things). It also used to emphasize the phrase "potentially releasable," meaning the increment should be able to go to production by the end of the Sprint.

It doesn't say the feature has enough value to release it (the Product Owner has the final word on releasing, and yes, we should be so good that we can release at any time, but most teams have to work for some time to get there).

Key Agile Practices to Mitigate Risks

In Scrum, it's generally discouraged to have backlog items that focus solely on testing. Instead, Scrum promotes an integrated approach where testing is a part of the development process for each Product Backlog Item. The Definition of Done in Scrum should include criteria that ensure testing is part of completing any item.

This means that each item should be fully developed, tested, and integrated within the Sprint. That could ultimately help us reduce hardening and stabilization activities. You have to start somewhere, though, and sometimes making those activities visible on the backlog is a good way to start learning how to eventually evolve them into your Definition of Done (shift left).
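
To make that concrete, here is a minimal sketch, assuming a Python codebase, of how a team might automate part of its Definition of Done as a gate in CI. The specific commands (pytest and ruff) are stand-ins for whatever checks your team has actually agreed to.

```python
# Hypothetical Definition of Done gate: run this in CI so a Product Backlog
# Item can't be called "done" until its automated checks pass. The commands
# below assume pytest and ruff are installed; swap in your team's own checks.
import subprocess
import sys

CHECKS = [
    (["python", "-m", "pytest", "--maxfail=1"], "unit and integration tests"),
    (["python", "-m", "ruff", "check", "."], "static analysis"),
]

def main() -> int:
    for command, label in CHECKS:
        print(f"Running {label}...")
        if subprocess.run(command).returncode != 0:
            print(f"FAILED: {label}. This item does not meet the Definition of Done.")
            return 1
    print("All automated Definition of Done checks passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```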

Other practices that can help:

  • Continuous Integration and Continuous Deployment (CI/CD): Regular integration of code changes and automated deployments must be paired with comprehensive automated testing to detect issues early. The only real overhead then is breaking the work down small enough that it is actually usable and deployable within a Sprint.
  • Test-Driven Development (TDD): Writing tests before developing new features ensures that code meets quality standards from the outset (see the sketch after this list).
  • Cross-Functional Teams: Including security experts in agile teams ensures that potential vulnerabilities are considered and addressed throughout the development process.
  • Task Breakdown: The team should break down their work into smaller tasks on the Sprint Backlog, including development and testing tasks. These tasks can then be managed within the Sprint, ensuring that testing is not an afterthought but an integral part of the workflow.
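
As promised above, here is a minimal TDD sketch in Python: the test comes first and fails, then we write just enough code to make it pass. The parse_sensor_version function and its module are invented for this illustration.

```python
# test_sensor_version.py -- in TDD this file is written FIRST, and it fails
# until the implementation below exists. parse_sensor_version is a
# hypothetical function invented for this illustration.
import pytest

from sensor_version import parse_sensor_version

def test_parses_major_and_minor():
    assert parse_sensor_version("7.11") == (7, 11)

def test_rejects_garbage():
    with pytest.raises(ValueError):
        parse_sensor_version("not-a-version")
```

And the simplest implementation that makes those tests pass:

```python
# sensor_version.py -- written AFTER the tests, doing no more than they demand.
def parse_sensor_version(text: str) -> tuple[int, int]:
    parts = text.split(".")
    if len(parts) != 2 or not all(part.isdigit() for part in parts):
        raise ValueError(f"invalid sensor version: {text!r}")
    major, minor = parts
    return (int(major), int(minor))
```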

Balancing Deployment Frequency and Testing Rigor

A key takeaway from the incident is the need to balance frequent deployments with rigorous testing, especially for critical systems. As we explored earlier, not all applications require the same level of testing rigor. But if you have a single point of failure that can bring down airlines, banks, and others in one fell swoop, you need that rigorous testing, or you need to re-evaluate whether a single point of failure is the only way.
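
One way to avoid betting the whole fleet on a single simultaneous push is a staged (ring-based) rollout with a health gate between rings. The sketch below is a hypothetical Python illustration; deploy_to and error_rate are stand-ins for whatever deployment and telemetry hooks your platform really provides.

```python
# Hypothetical staged rollout: push an update to progressively larger
# "rings" of machines, halting if the current ring's error rate exceeds a
# threshold. deploy_to() and error_rate() are stand-ins for real
# deployment and telemetry hooks.
import random
import time

RINGS = [("canary", 0.01), ("early adopters", 0.10), ("broad", 1.00)]
MAX_ERROR_RATE = 0.001  # halt the rollout above a 0.1% error rate

def deploy_to(ring: str, fraction: float) -> None:
    print(f"Deploying to {ring} ring ({fraction:.0%} of fleet)...")

def error_rate(ring: str) -> float:
    # Stand-in for real telemetry; here we simulate a healthy fleet.
    return random.uniform(0.0, 0.0005)

def rollout() -> bool:
    for ring, fraction in RINGS:
        deploy_to(ring, fraction)
        time.sleep(1)  # in reality: a soak time of hours or days
        rate = error_rate(ring)
        if rate > MAX_ERROR_RATE:
            print(f"Halting: {ring} ring error rate {rate:.4%} exceeds threshold.")
            return False
        print(f"{ring} ring healthy (error rate {rate:.4%}); promoting.")
    return True

if __name__ == "__main__":
    rollout()
```

A few practices support this kind of balance: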

  • Automated Testing: Implementing extensive automated test suites allows for rapid yet reliable deployments and lets you move your rigorous testing closer to the front of the line rather than piling it all up at the end.
  • Incremental Hardening: Prioritize security enhancements and deploy them incrementally to maintain a secure production environment while continuing to deliver new features. This includes having environments that mimic the products you support.
  • Risk-Based Testing: Tailor the level of testing to the product's risk profile, ensuring that critical systems receive the highest scrutiny (see the sketch after this list).
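
To illustrate the risk-based testing idea, here is a minimal, hypothetical Python sketch that selects test suites by risk tier: the higher the tier of a change, the more suites it must pass before release. The tier names and suite paths are invented for this example.

```python
# Hypothetical risk-based test selection: higher-risk changes must pass
# more (and slower) test suites before release. Tier names and suite
# paths are invented for this illustration.
import subprocess
import sys

SUITES_BY_TIER = {
    "low": [
        ["python", "-m", "pytest", "tests/unit"],
    ],
    "medium": [
        ["python", "-m", "pytest", "tests/unit"],
        ["python", "-m", "pytest", "tests/integration"],
    ],
    "critical": [
        ["python", "-m", "pytest", "tests/unit"],
        ["python", "-m", "pytest", "tests/integration"],
        ["python", "-m", "pytest", "tests/soak"],  # long-running hardening tests
    ],
}

def run_for_tier(tier: str) -> int:
    for command in SUITES_BY_TIER[tier]:
        if subprocess.run(command).returncode != 0:
            return 1
    return 0

if __name__ == "__main__":
    # Usage: python run_risk_tests.py critical
    tier = sys.argv[1] if len(sys.argv) > 1 else "low"
    if tier not in SUITES_BY_TIER:
        sys.exit(f"unknown risk tier: {tier!r}")
    sys.exit(run_for_tier(tier))
```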

Wrap It Up...

The CrowdStrike incident underscores the importance of quality and security in an agile environment. Agile does not ignore testing contracts, despite the many people who believe agile is a process that says "just throw it out there and see if it works."

Agile principles advocate for rapid deployment and iterative development, but these must be balanced with robust testing and quality assurance to prevent such widespread failures.

Speed and flexibility are critical in a hyper-competitive market, but they should not come at the expense of quality and security. Agile teams must incorporate robust testing, continuous monitoring, and cross-functional collaboration to deliver secure and reliable software. Leaders must foster a culture where it is OK for people to say "slow down" and to shout when there's too much risk.

Who knows best how to balance these things? Usually your team does. Go ask the team!

Join us for one of our workshops!

If you want to learn more about Scrum and agile, join us for an upcoming workshop where we explore in detail some of the challenges facing teams and organizations with regard to business agility.

Register