
CrowdStrike releases a root cause study of the widespread Microsoft meltdown.

Experts say CrowdStrike should be "very embarrassed" after delivering its Root Cause Analysis (RCA) of the defective software update that caused arguably the worst worldwide IT outage in history.

It boiled down to a common mistake that first-year programming students are taught to avoid.

On July 19, the now-infamous Blue Screen of Death (BSOD) Friday, about 8.5 million Windows systems worldwide crashed due to a faulty update to CrowdStrike's Falcon sensor software.

A few days after the incident, the US cybersecurity business issued a preliminary report.

A more in-depth 12-page examination has now established the source of the problem: a single undetected extra input field in a sensor update.

Falcon's privileged access

CrowdStrike primarily provides ransomware and malware protection and internet security products to enterprises and large organizations.

The widespread outage has been traced to its Falcon sensor software, which is used to detect threats and help shut them down.

According to Sigi Goode, an information systems professor at the Australian National University, Falcon had unusually privileged access.

It runs in what is known as the Windows kernel.

"It's sitting as close to the engine that powers the operating system as possible," Professor Goode added.

"Kernel mode is constantly watching what you're doing and listening to requests from the applications you're using, and servicing them in a way that appears seamless to you."

He described kernel mode as the traffic police that Falcon sits alongside, saying, "I don't like the look of that vehicle, we should take a look at it".


The 21st-field culprit

CrowdStrike regularly updates Falcon.

On July 19, the business sent a Rapid Response Content update to specific Windows hosts.

CrowdStrike dubbed it the "Channel File 291 Incident" in the RCA, and it involved the introduction of a new capability into Falcon's sensors.

According to Professor Goode, sensors are like "a pathway for evidence," telling Falcon what type of suspicious activity to search for.

When an update is received, the location or number of sensors is changed to detect a potential attack.

In this case, Falcon expected the update to have 20 input fields, but it included 21.

CrowdStrike stated that the "count mismatch" was the cause of the global crash.

"The Content Interpreter expected only 20 values," the RCA report says.

"Therefore, the attempt to access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash."

Because Falcon is so deeply embedded in the heart of Windows, when it crashes it brings down the entire system, resulting in the BSOD.

Professor Goode stated that one of the most prevalent ways to breach a system was to flood memory.

Essentially, you instruct the machine to search for something "out of bounds".

"It was looking for something that wasn't there," according to him.


How can this happen?

CrowdStrike has apologised for the failure, which has led to its CEO, George Kurtz, being called to testify before the US Congress to explain what happened.

"We are using the lessons learned from this incident to better serve our customers," Mr Kurtz said in a statement released this week.

"To this end, we have already taken decisive steps to help prevent this situation from repeating, and to help ensure that we — and you — become even more resilient."

CrowdStrike's quality assurance (QA) practices have been called into question.

According to the firm, its updates "go through an extensive QA process, which includes automated testing, manual testing, validation, and rollout steps".

However, Rapid Response Content, which was used in this instance, follows a different method.

In the report, CrowdStrike admits that "lack of a specific test for non-wildcard matching criteria in the 21st field" contributed to "the confluence of these issues that resulted in a system crash".


Toby Murray, an associate professor at the University of Melbourne's School of Computing and Information Systems, described the "dodgy data file update" as "embarrassing".

This kind of mistake shouldn't be happening, as a staged deployment process should have been in place prior to release.

CrowdStrike announced it had engaged two independent software security vendors to conduct further review of the Falcon sensor code for both security and quality assurance.
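The staged deployment Professor Murray describes can be gated with a few lines of logic. This is a hedged sketch under assumed names (`in_rollout`, `host_id`), not CrowdStrike's actual rollout mechanism: each host is deterministically bucketed, and the rollout percentage is raised only after the small canary ring stays healthy:

```python
import hashlib

def in_rollout(host_id: str, rollout_percent: int) -> bool:
    """Deterministically map a host into a 0-99 bucket and gate the update,
    so a bad update reaches a small canary ring before the whole fleet."""
    bucket = int(hashlib.sha256(host_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

hosts = [f"host-{i}" for i in range(1000)]
canary = [h for h in hosts if in_rollout(h, 1)]   # ~1% canary ring first
print(f"{len(canary)} of {len(hosts)} hosts receive the update in stage 1")
```

Because the bucketing is a pure function of the host ID, each stage is a superset of the previous one: raising the percentage from 1 to 10 to 100 only ever adds hosts, so a crash like Channel File 291 would surface in the canary ring rather than on 8.5 million machines at once.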

Calls for accountability

In the wake of the outage, regulators and businesses have been considering legal implications.  

The incident caused congestion at airports, stopped store check-outs, and made it difficult for media outlets to report the story.

The impact on Australian enterprises alone is estimated at more than $1 billion.

Innes Willox, CEO of the Australian Industry Group, told ABC's The Business that the damage cost from the error might be in the billions of dollars.

However, he added it was unclear if affected organizations would be able to seek reimbursement from CrowdStrike for any losses caused by the disruptions.

Delta Air Lines announced last week that the downtime cost the company $US500 million ($760 million) and that it intended to sue the cybersecurity firm for compensation.

CrowdStrike has denied the allegations, stating in a letter from an external counsel that it is "highly disappointed by Delta's suggestion that CrowdStrike acted inappropriately and strongly rejects any allegation that it was grossly negligent or committed misconduct".

Delta canceled more than 6,000 flights over six days, affecting nearly 500,000 people.

The US Transportation Department is investigating why Delta took so much longer than other airlines to recover from the outage.

Key takeaways

The CrowdStrike IT outage, caused by a simple but catastrophic software update error, serves as a stark reminder of the importance of strict cybersecurity regulations and effective IT infrastructure management. Here are the key takeaways from this incident:


  1. The Consequences of Poor Quality Assurance: A weakness in CrowdStrike's QA process resulted in a global outage, affecting millions of users and costing corporations billions. This stresses the critical significance of rigorous testing and validation in software updates, especially when working with sensitive systems like those included in an operating system's kernel.

  2. The Significance of Incident Response Planning: The lengthy outage and widespread disruption demonstrate the value of having a well-planned incident response strategy. Quick recovery from IT failures is crucial for minimizing operational and financial consequences.

  3. Legal and Reputational Risks: The incident has resulted not just in financial losses, but also in potential legal battles, as indicated by Delta Air Lines' decision to sue CrowdStrike. This is a cautionary tale regarding the legal and reputational fallout from IT failures.

  4. The Importance of Resilient IT Systems: The incident highlights the importance of having resilient IT systems that can withstand and quickly recover from unexpected difficulties, reducing the likelihood of extensive outages.

How GIGAMiT Can Help 

Gigamit provides a comprehensive range of IT services to help organizations avoid the mistakes that resulted in the CrowdStrike outage. Here's how Gigamit can help your company:

  • Custom IT Solutions: Gigamit offers tailored IT solutions that are carefully built and properly tested to match your organization's specific demands, lowering the likelihood of similar difficulties happening.

  • Proactive Incident Response Services: Gigamit's incident response and crisis management services enable your company to quickly recover from IT disruptions, reducing downtime and operational damage.

By partnering with Gigamit, your business can build a more secure, resilient IT infrastructure that is better equipped to handle the challenges of the modern digital landscape.

Contact Us Today

