CrowdStrike Outage - Customer FAQ
As it will take some time for CrowdStrike to complete its Root Cause Analysis (RCA), and therefore for us to finalise our incident report, we have provided this FAQ to assist customers in communicating with internal stakeholders. We will produce an interim incident report as quickly as possible, which will provide more detail than this FAQ.
Frequently Asked Questions:
What failed in CrowdStrike and caused the outage?
The issue was caused when CrowdStrike released a faulty sensor configuration update to Windows systems. Sensor configuration updates are an ongoing part of the protection mechanisms of the Falcon platform. This configuration update triggered a logic error resulting in a system crash and blue screen (BSOD) on impacted systems.
Unlike full software updates, these configuration updates are released frequently to detect new threats and ensure devices remain protected, much like signature updates in traditional antivirus products. The faulty update was downloaded automatically by devices running CrowdStrike. CrowdStrike has stated that the update was “designed to target newly observed, malicious named pipes being used by common C2 frameworks in cyberattacks”.
The update was delivered as a “channel file” - simply the type of file CrowdStrike uses to package these updates.
This issue did not impact Mac or Linux devices running CrowdStrike.
Was I exposed to cyber security risks during this time?
The issue and the workaround did not increase exposure to cyber security risks. Prior to deploying any workarounds, we tested to ensure security functionality remained operable. The advisories released by CrowdStrike did not explicitly state whether CrowdStrike would continue to function as usual with the workaround in place, so we proceeded with caution.
CrowdStrike says they fixed it quickly, so why did the response take so long?
While CrowdStrike did revoke the faulty update relatively quickly, many devices had already downloaded it and crashed with a BSOD. CrowdStrike released a corrected update; however, as many of the impacted devices could not start normally, they were unable to download the fix.
These devices require manual intervention, which can take 15-30 minutes per device and requires hands-on access to the affected machine.
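For reference, the widely published workaround involved booting each affected device into Safe Mode or the Windows Recovery Environment and deleting the faulty channel file (Channel File 291) before restarting. The sketch below illustrates only that file-removal step; the path and filename pattern follow CrowdStrike's public guidance, and in practice this work was carried out by hand rather than with a script.

```python
# Illustrative sketch only: remove the faulty Channel File 291, assuming the
# device has already been booted into Safe Mode or the Recovery Environment.
from pathlib import Path

# Directory CrowdStrike's public guidance points at for channel files.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_faulty_channel_file() -> list[str]:
    """Delete any copies of the faulty channel file and return their names."""
    removed = []
    for channel_file in DRIVER_DIR.glob("C-00000291*.sys"):
        channel_file.unlink()              # remove the corrupted content update
        removed.append(channel_file.name)
    return removed

if __name__ == "__main__":
    print("Removed:", remove_faulty_channel_file() or "no matching files found")
```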
Don’t you have remote tools to fix these problems?
As affected systems weren’t functioning properly, we were unable to use the remote management agents installed on these devices. This was because devices were either not booting at all, booting into “Safe Mode” (where only limited functionality is enabled to prevent further crashes), or booting and then restarting before they could connect to remote management tools.
Why was this not picked up in testing?
CrowdStrike has yet to comment on this. This is something we expect to learn more about once the RCA is complete.
We have a policy of running CrowdStrike software updates one step behind the latest version (N-1); however, this doesn’t apply to content updates, which are necessary to keep protections against threats current.
We have seen reports that this was the result of a programming error involving null pointers in the update.
CrowdStrike has stated that “This is not related to null bytes contained within Channel File 291 or any other Channel File”; however, the RCA is yet to be completed, so anything further at this stage should be treated as speculation.
Has CrowdStrike made any statements on whether customers will be compensated for the outage?
CrowdStrike has not commented on compensation at this stage.
How long do you expect the recovery effort to take?
We prioritised recovering servers running your key systems, with most of this completed by the early hours of Saturday morning. We then moved to address end users with operationally impacting issues.
We are actively working on recovering end user devices now and are making good progress. It is likely that a large portion of the issues will be addressed by the close of business today (Monday); however, we expect that recovering all end user devices could take some days, particularly where complications such as remoteness or other technical issues impede recovery.
How can this be prevented in future?
It is too early to say before the CrowdStrike RCA is complete; however, this outage reinforces the need for good business continuity plans (BCP), system resilience, and a defence-in-depth strategy with multi-layered security controls in case any one control fails.
Have there been any security threats related to this?
While the faulty update didn’t create any security issues itself, there has been a subsequent rise in phishing attacks and look-alike CrowdStrike support domains offering help but carrying embedded malware. An example of this is crowdstrikebluescreen[.]com.
We have deployed these Indicators of Compromise (IOCs) to security platforms we manage for customers with those services as appropriate.
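As an illustration of how domain IOCs like the one above are used, the sketch below checks observed hostnames against a small blocklist. It is a simplified, hypothetical example; apart from crowdstrikebluescreen[.]com, which is quoted above, the entries and function names are placeholders rather than details of the platforms we manage.

```python
# Hypothetical sketch: match observed hostnames against defanged domain IOCs.
IOC_DOMAINS = {
    "crowdstrikebluescreen[.]com",    # look-alike domain referenced above
    "example-crowdstrike-fix[.]com",  # placeholder entry for illustration
}

def refang(domain: str) -> str:
    """Convert a defanged IOC (using [.]) back to a normal hostname."""
    return domain.replace("[.]", ".").lower()

def is_known_ioc(hostname: str) -> bool:
    """Return True if the hostname matches a domain on the IOC blocklist."""
    blocklist = {refang(d) for d in IOC_DOMAINS}
    return hostname.lower().rstrip(".") in blocklist

if __name__ == "__main__":
    for host in ("crowdstrikebluescreen.com", "crowdstrike.com"):
        print(host, "->", "block" if is_known_ioc(host) else "allow")
```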
Does this impact my backups and updates/patching?
Due to the impacts to servers over Friday night / Saturday morning, a number of backups failed as the systems were offline. We have worked through the failed jobs and rerun them where possible, or they have picked up again on their regular schedule. Our BAU monitoring of backup jobs will pick up any failures, and we will continue to address these if they have not already been remediated.
Some update/patching schedules were deferred in consultation with affected customers to reduce the risk of ongoing disruption and to let systems stabilise without the compounding change that updates can introduce on top of the CrowdStrike outage. This only applies to a small number of customers.