On July 19, 2024, a faulty update of the CrowdStrike Falcon EDR software (specifically, a file loaded by a driver) caused millions of Windows computers worldwide to crash with a blue screen, while macOS and Linux systems were not affected. During the night I saw the first clues about the internals on X, and readers also pointed out details in comments. In this article I summarize the results of the various analyses.
Review of the incident
On July 19, 2024, Windows systems with the EDR software CrowdStrike Falcon installed experienced outages worldwide, starting in the early hours of the morning. CrowdStrike Falcon is a widely used endpoint detection and response (EDR) protection software for end devices. During operation, updates are regularly rolled out by means of so-called channel files, which CrowdStrike uses to distribute dynamic updates and detection rules.
On July 19, 2024, the vendor rolled out an update for the CrowdStrike Falcon sensors worldwide. As a result, a blue screen of death (BSOD) with the error message PAGE_FAULT_IN_NONPAGED_AREA was triggered on the affected systems. The cause was the driver csagent.sys together with a faulty .sys file that was distributed by the update and loaded by this driver.
This was the point at which a large part of the Windows ecosystem went into "Blue Friday" mode. Computers running Windows that were configured to automatically restart after a fatal error would fall into a BSOD restart loop. Systems that remained in BSOD showed the famous blue screen seen in the tweet above.
As a result, millions of Windows computers were affected worldwide and the IT infrastructure of companies failed. Airports in Australia (Melbourne) and India, as well as Hamburg and BER in Berlin, came to a standstill. Banks, police and fire departments, the cash register and merchandise management systems of retailers, and many other companies were affected; nothing worked anymore. An incomplete list of affected locations can be found in this tweet. According to an analysis I came across, CrowdStrike serves 15% of the global market for security solutions.
It is now clear that this was the largest computer breakdown ever recorded worldwide. The damage caused by the outage runs into the billions, and some of the systems will take some time to repair. On the morning of July 19, 2024, I promptly wrote up the findings in the articles Worldwide outage of Microsoft 365 (July 19, 2024) and Windows systems throw BSOD due to faulty CrowdStrike update. Those articles also gave early pointers to various workarounds for removing the trigger of the blue screen.
Analysis: An empty file causes BSOD
It quickly became clear that files matching the pattern C-00000291*.sys were responsible for the failure of the Windows systems. Kevin Beaumont wrote: "The .sys files causing the problem are channel update files that cause the top-level CS driver to crash because they are invalidly formatted." These files are located in the following Windows directory:
C:\Windows\System32\drivers\CrowdStrike
Deleting these files was enough to solve the problem. CrowdStrike also published corresponding guidance quite quickly (see my article mentioned above).
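The cleanup step is simple enough to sketch in a few lines. The following Python snippet is only an illustration of the logic (find everything matching the pattern, then delete it); the function names are my own, and in practice the deletion was typically done by hand or with a script in the recovery environment, where Python is usually not available.

```python
from pathlib import Path

# File name pattern of the faulty channel files, as named in the analyses.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_channel_files(directory, pattern=CHANNEL_FILE_PATTERN):
    """Return all files in `directory` whose name matches the pattern."""
    return sorted(Path(directory).glob(pattern))


def remove_channel_files(directory, dry_run=True):
    """List the matching files and delete them unless dry_run is set.

    Returns the list of matching paths so the caller can log them.
    """
    matches = find_channel_files(directory)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Called as remove_channel_files(r"C:\Windows\System32\drivers\CrowdStrike"), the function only reports the matches; passing dry_run=False actually deletes them. The dry run is the default on purpose, because deleting files from a driver directory is not something to do by accident.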
I stumbled across the above tweet on X, where someone wrote that the faulty file contained only zeros. But there are other tweets, like this one, that show a file with content, and some of those .sys files also triggered the blue screen. I then came across more tweets with analyses on X. Patrick Wardle posted this tweet about it.
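Anyone who still has a copy of such a file can check the "only zeros" claim for themselves. Here is a minimal Python sketch; the function name is_all_zero and the chunk size are my own choices, and treating an empty file as "not all zeros" is an assumption made for this example.

```python
def is_all_zero(path, chunk_size=65536):
    """Return True if the file has at least one byte and every byte is 0x00."""
    seen_data = False
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        while chunk := f.read(chunk_size):
            seen_data = True
            if chunk.count(0) != len(chunk):
                return False
    # An empty file contains no zero bytes, so it is reported as False.
    return seen_data
```

Since the tweets show both zero-filled files and files with content, a check like this only tells you which variant you have in front of you, not whether a given file would crash the driver.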
The driver file CSAgent.sys crashes and subsequently triggers the blue screen, because it attempts to access an invalid address in memory. Wardle writes that files such as 'C-00000291-…32.sys' probably contain obfuscated data.
Zach Vorhies has also tweeted some "insights", but his conclusion does not seem to be correct. Tavis Ormandy has analyzed the CrowdStrike driver CSAgent.sys. It crashes exactly at offset 0xe35a1, where the bytes "45 8b 08" are located. Translated into assembler, this machine code is the instruction mov r9d, [r8]. Immediately before this instruction there is a NULL check (test r8, r8; jz), see tweet no. 6 in his thread.
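The three bytes can be decoded by hand, which makes it easy to verify the disassembly. The following Python sketch handles only this one instruction form (a REX prefix, the opcode 8B for "load register from memory", and a ModRM byte with plain register-indirect addressing). The function name decode_mov_load is my own; this is not a general x86-64 disassembler.

```python
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"] + [
    f"r{i}" for i in range(8, 16)
]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"] + [
    f"r{i}d" for i in range(8, 16)
]


def decode_mov_load(code: bytes) -> str:
    """Decode a 3-byte 'REX 8B ModRM' instruction of the form mov reg, [reg]."""
    if len(code) != 3:
        raise ValueError("expected exactly three bytes")
    rex, opcode, modrm = code
    if rex & 0xF0 != 0x40:
        raise ValueError("first byte is not a REX prefix")
    if opcode != 0x8B:
        raise ValueError("only opcode 8B (mov r, r/m) is handled")

    rex_w = (rex >> 3) & 1  # 1 = 64-bit operand size
    rex_r = (rex >> 2) & 1  # extends the ModRM reg field
    rex_b = rex & 1         # extends the ModRM rm field

    mod = modrm >> 6
    reg = (modrm >> 3) & 7
    rm = modrm & 7
    # mod 00 with rm 100 (SIB byte) or 101 (RIP-relative) is not handled here.
    if mod != 0 or rm in (4, 5):
        raise ValueError("only plain [register] addressing is handled")

    dest = (REG64 if rex_w else REG32)[reg | (rex_r << 3)]
    base = REG64[rm | (rex_b << 3)]
    return f"mov {dest}, [{base}]"


print(decode_mov_load(bytes.fromhex("458b08")))  # mov r9d, [r8]
```

In words: the prefix 0x45 sets the REX.R and REX.B bits, so the register numbers 1 and 0 from the ModRM byte 0x08 become 9 and 8. The instruction therefore loads a 32-bit value into r9d from the address held in r8, which is exactly the pointer that the preceding test r8, r8 compared against zero.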
Tavis Ormandy assumed that invalid pointers were read in a loop from the faulty config file *291*.sys. As a result, the target addresses (previously uninitialized variables) were not filled with correct data and contained random garbage instead.
This could be the reason why the crash address differs from person to person, because it results from the uninitialized junk data. The mov instruction then accesses illegal memory areas (which vary depending on the computer), and the kernel treats this as a fatal error.
The bug had existed for some time but was never noticed, because valid start addresses were always supplied from the .sys file. In a Windows kernel driver, such attempts to access invalid memory areas always lead to a blue screen. Another analysis can be found on X in this tweet (and the subsequent tweets).
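Why the NULL check did not help can be shown with a toy model. In the following Python sketch, the "memory", the addresses and all function names are made up for illustration and have nothing to do with the real driver code. The only point is that a garbage pointer is not zero, so it passes a NULL check and is then dereferenced anyway.

```python
class PageFault(Exception):
    """Stands in for the fatal error the kernel raises on an invalid access."""


# Hypothetical mapped memory: address -> 32-bit value stored there.
MAPPED = {0x1000: 7, 0x1004: 9}


def load_dword(addr):
    """Model of 'mov r9d, [r8]': reading an unmapped address is fatal."""
    if addr not in MAPPED:
        raise PageFault(hex(addr))
    return MAPPED[addr]


def read_entry_null_check_only(ptr):
    """Mirrors 'test r8, r8; jz': only a zero pointer is rejected."""
    if ptr == 0:
        return None
    return load_dword(ptr)


def read_entry_validated(ptr):
    """Rejects every pointer that does not refer to known-good memory."""
    if ptr not in MAPPED:
        return None
    return load_dword(ptr)


garbage = 0xDEADBEEF  # stands in for uninitialized junk data
print(read_entry_validated(garbage))  # None, the entry is skipped
try:
    read_entry_null_check_only(garbage)
except PageFault as fault:
    print("fatal access at", fault)
```

As long as the input only ever produces valid addresses, both variants behave identically, which fits the observation that the bug went unnoticed for some time. A real kernel driver cannot simply look up a set of "known-good" addresses like this toy does, so the practical equivalent is to validate the file contents (sizes, counts, offsets) before any pointer is derived from them.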
Only Windows was affected
This outage only affected Windows systems in the corporate sector – private individuals do not generally use EDR systems for monitoring. Some companies have paid a lot of money to be optimally protected against cyber attacks by the CrowdStrike Falcon EDR solution. It has now proved fatal that Windows has a virtual monopoly position in companies.
The CrowdStrike Falcon EDR solution is also available for macOS and Linux. The above-mentioned consequences did not occur there. But at the end of April 2024, Debian Linux 12 experienced a kernel panic in connection with Falcon sensor versions 7.10 to 7.14 after updating to kernel version 6.1.0-20. Official workaround: uninstall and wait for a new version, or run the software in "user mode".
At this point, there will (hopefully) be discussions about responsibilities and liability for software errors and the necessary consequences will be drawn. A situation in which no one is liable for software errors only exists in IT.
There are restore options in the BTRFS or ZFS file systems on Linux. There is no such thing in Windows, where administrators have to boot the machines into WinRE mode, enter the BitLocker recovery key if necessary and then delete the faulty driver files.
There are approaches to automate this, and I posted another workaround via the CrowdStrike management console in the article Windows systems throw BSOD due to faulty CrowdStrike update. There is also the vague suggestion that after rebooting 15 times, Windows systems were running again (possibly because the system then managed to pull an update to the correct .sys signature file).
There is also the suggestion to boot the machines under WinRE with a network connection in order to have the update installed automatically. However, I cannot say whether this really works.
CrowdStrike has now issued an official statement on the incident. When the bug appeared, I was unable to view the CrowdStrike engineering messages, because these can only be viewed by customers with a user account. By the way, Morten Knudsen has published this summary of how to get your Windows systems up and running again in various scenarios.
Similar articles:
Worldwide outage of Microsoft 365 (July 19, 2024)
Windows systems throw BSOD due to faulty CrowdStrike update
Why numerous IT systems around the world failed due to two errors on July 19, 2024
CrowdStrike analysis: Why an empty file led to BlueSceen
Nachlese des CrowdStrike-Vorfalls, der bisher größten Computerpanne aller Zeiten