Analyzing the CrowdStrike Update Outage: Insights and Lessons from Maxine Holt on the Impact and Future of Cybersecurity Practices
On July 19, 2024, a critical update to CrowdStrike’s Falcon agent led to a significant global IT outage, affecting numerous systems running Microsoft Windows and causing widespread disruption. To gain deeper insights into this incident, we sat down with Maxine Holt, Senior Director of Cybersecurity Research at Omdia. With her extensive experience in analyzing cybersecurity trends and incidents, Maxine provided valuable perspectives on the factors contributing to the outage, the effectiveness of the response, and the broader implications for cybersecurity practices. In our conversation, she discussed the challenges faced by various industries, the lessons learned, and how this event may shape future cybersecurity strategies and vendor relationships.
Can you provide an overview of the events leading to the global IT outage caused by CrowdStrike’s update? What were the main factors that contributed to the widespread impact of this outage?
Software vendors frequently provide updates to their products, sometimes to the whole product and sometimes just to internal files. However, on 19 July, one such update to internal files caused issues, and machines running Microsoft Windows were “bricked” en masse. This means that the operating system detects an unrecoverable failure and is designed to shut the system down rather than risk further damage. In this case, the failure came from CrowdStrike’s Falcon agent as it failed to process an internal file.
How common are defects in software updates, and why did this one have such a catastrophic effect?
Although rare, defects in software updates do happen. Vendors have internal testing as part of the regular software development process, and we don’t yet know what the testing regime was at CrowdStrike prior to the deployment of this update. However, this update had such a big effect because of three things: (a) as a security tool, the CrowdStrike software needs to run in kernel mode, which can essentially be summarized as highly privileged access to the system and its resources, and recovery from kernel-mode failures is much more laborious than from non-kernel-mode failures; (b) it appears at this point that the failure condition could be triggered relatively frequently; and (c) CrowdStrike is widely deployed in many enterprises, so the failure affected numerous systems.
What role did the dependency on cloud services play in the extent of the disruption?
More than the dependency on cloud services, it was the fact that the recovery process requires accessing the affected machine (real or virtual) in what’s referred to as “Safe Mode”. Doing that requires special administrative passwords and direct console access, which is difficult for physical systems and more laborious than usual for virtual ones.
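To illustrate why recovery was so laborious: the widely circulated workaround was to boot each affected machine into Safe Mode or the Windows Recovery Environment and delete the faulty channel file by hand, often after entering a BitLocker recovery key. The sketch below expresses that single manual step as a script, assuming the standard Falcon driver directory and the published file pattern; it is illustrative only, not CrowdStrike’s own tooling.

```python
# Illustrative sketch of the manual per-machine workaround, not CrowdStrike tooling.
# Assumes the standard Falcon driver directory and the published "C-00000291*.sys"
# file pattern; in practice this step had to be performed from Safe Mode or the
# Windows Recovery Environment on each affected machine.
import glob
import os

CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(directory: str = CROWDSTRIKE_DIR) -> None:
    """Delete the defective channel files so the machine can boot normally."""
    for path in glob.glob(os.path.join(directory, FAULTY_PATTERN)):
        print(f"Removing {path}")
        os.remove(path)


if __name__ == "__main__":
    remove_faulty_channel_files()
```

The point is less the script itself than the operational burden it represents: because the crash occurred before the system finished booting, nothing like this could be pushed remotely at scale, so it had to be carried out console by console.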
What immediate steps should organizations take when facing an IT outage of this magnitude?
This is one of the cases where the organization needs to quickly triage and determine whether it should enact its business continuity plans. This will vary by organization and may be done on a per-region or per-application basis. It’s very likely – and recommended – that part of those plans includes proper communications with customers, partners, and other stakeholders.
How effective, in your opinion, were the mitigation actions taken by CrowdStrike and Microsoft in this scenario?
Microsoft investigated the issue and quickly ascertained that a third-party update was the cause. According to the CrowdStrike website, the problem was identified very soon after the update was released, and the sensor configuration update causing the problems was remediated in 78 minutes. Incident detection and response were reasonably quick, but that wasn’t enough to limit the damage. As one of our analysts put it, this was a bit of a “perfect storm”: a catastrophic failure of a component that runs in privileged mode across a wide swath of enterprise devices, where the fix requires laborious administrative access to each affected system.
Microsoft was clear that the fault wasn’t its responsibility, and that is understandable. CrowdStrike quite rightly focused on remediation in the first instance and was quick to share immediate technical guidance on how to fix the issue. That’s positive, of course.
Which industries were hit hardest by the outage, and what were the specific challenges they faced?
We can pick up on financial services. Although we are still assessing the true impact, the outage affected hundreds, if not thousands, of banks. It hit several segments of financial services, including the ability to access vital banking services, send and receive payments, or even trade in financial markets. The UK and European banking sectors were already reeling from the outage the day before (on Thursday) that hit the Swift network, impacting high-value and time-sensitive transactions, and specifically the CHAPS system in the UK, which is used for house purchases. This follows similar outages that hit the UK’s Faster Payments system in June, which meant several people received their salaries late. In terms of response, it was largely manual. IT functions needed to find workarounds for the problem, and IT executives reported that if a reboot didn’t fix the issue, a physical presence to update machines manually was the only option. Cloud customers also had issues around rebooting in Safe Mode.
Many IT professionals will have worked long into the night and the following weekend to gradually bring systems back online. Even as systems return, the ripple effect of the outage will result in other delays and business impacts, including non-security-critical patch deployments, software enhancements, and backlogs of other business improvement initiatives.
Overall, every industry was affected because nearly every industry is digitally dependent, and Windows is a key OS for many industries. Digital dependence demands digital resilience, and this resilience wasn’t available on 19 July.
How did the outage impact essential services such as healthcare, banking, and transportation?
For financial services, see the previous answer. For healthcare, in the UK, for example, many professionals were unable to access patient records. Of course, these systems are critical to patient health, and outages here have a potentially life-threatening impact. The focus of information security is to ensure the confidentiality, integrity, and availability (CIA) of information and systems, and on 19 July many organizations in multiple industries did not have the availability needed.
What long-term effects do you anticipate for businesses affected by this outage? How might this incident influence future decisions regarding the use of cloud services and cybersecurity solutions?
Cybersecurity technology is an essential component of the people, process, and technology triumvirate needed to protect the CIA of digital information and systems. As such, the need for it will not diminish. In the immediate aftermath, CrowdStrike will perhaps have to work (much) harder to convince prospects that its cybersecurity capabilities are worth the investment, and other cybertech vendors will have to work hard to convince their prospects that the same wouldn’t happen with their software. Longer term, deployment of kernel-mode updates will need to be reviewed, and organizations will need support for “staging posts” for these updates, to test them before deployment. In other words, the vendor supplies the software to a staging post, the staging post tests it, and only then is it deployed to the broader fleet.
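To make the “staging post” idea concrete, here is a minimal, hypothetical sketch of how an organization might gate vendor updates: each update lands in a small canary ring first, and only updates that pass health checks there are promoted to the wider fleet. The ring names, health check, and deployment function are illustrative assumptions, not any vendor’s actual API.

```python
# Hypothetical sketch of a "staging post" (staged rollout) for vendor updates.
# The ring definitions, deploy_to(), and health_check() are illustrative
# placeholders, not any vendor's real tooling.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    name: str
    hosts: List[str]


def deploy_to(ring: Ring, update_id: str) -> None:
    # Placeholder: push the vendor update to every host in the ring.
    print(f"Deploying {update_id} to {ring.name} ({len(ring.hosts)} hosts)")


def health_check(ring: Ring) -> bool:
    # Placeholder: e.g. confirm hosts still boot, agents report in, no crash spike.
    print(f"Health-checking {ring.name}")
    return True


def staged_rollout(update_id: str, rings: List[Ring],
                   check: Callable[[Ring], bool] = health_check) -> bool:
    """Promote the update ring by ring; halt at the first failed health check."""
    for ring in rings:
        deploy_to(ring, update_id)
        if not check(ring):
            print(f"Halting rollout of {update_id}: {ring.name} failed health checks")
            return False
    return True


if __name__ == "__main__":
    rings = [
        Ring("canary", ["test-vm-01", "test-vm-02"]),
        Ring("early-adopters", [f"dept-a-host-{i}" for i in range(10)]),
        Ring("fleet", [f"host-{i}" for i in range(1000)]),
    ]
    staged_rollout("example-vendor-update", rings)
```

The design point being argued for is that the decision to halt happens inside the customer’s environment, before an update reaches mission-critical systems, rather than relying solely on the vendor’s pre-release testing.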
What are the key lessons that the cybersecurity industry should learn from this incident?
This wasn’t a cyberattack; it was a defective software update that, realistically, any software vendor delivering updates affecting software running in kernel mode – cybersecurity or otherwise – could have introduced. As such, it’s the tech industry that needs to learn from this, not just the cybersecurity industry. There’s something to be said for extending a “shared responsibility” model to kernel-mode updates (the “shared responsibility” model currently applies mainly to cloud environments), supporting customers in their use of technology.
In your opinion, how will this incident shape the future of cybersecurity practices and vendor relationships? What steps should the cybersecurity community take to prevent and better handle such widespread disruptions?
The desire – and need – for vendor consolidation in the cybersecurity industry will continue. However, Omdia has spoken to a CISO whose comment encapsulates the requirement: “Consolidating with fewer vendors means that any issue [such as the CrowdStrike one] has a huge operational impact. Businesses must demand rigorous testing and transparency from their vendors.”
The obvious statement for the cybersecurity community is to have operational practices in place that support testing updates from vendors before deploying them automatically, particularly to systems identified as mission-critical. And the obvious statement for the tech vendors is to thoroughly test updates before deployment. However, these are obvious statements, and there’s a lot more to be done to determine how responsibility for these updates is shared between vendors and customers.