By Chethana Janith, Jadetimes News
SINGAPORE - A software glitch that unleashed global chaos on July 19, crippling essential services and causing massive confusion at airports and hospitals, raises the question: Why were there no backup plans?
American cyber security firm CrowdStrike, the company responsible for crashing millions of Microsoft Windows computers from Australia and Asia to Europe and the United States, has a lot to answer for.
But it is also time for businesses to reflect on whether they have done enough to avoid the risk of depending too heavily on a single vendor.
To quickly recap, the carnage on July 19 came after CrowdStrike pushed out a defective software update to all its customers. Many computers displayed the “blue screen of death”.
The update file contained the latest malware signatures, which must quickly reach all endpoint devices, such as laptops, mobile phones and webcams, to protect them against the latest threats.
Unlike the usual security software patches, which customers’ information technology (IT) departments can test before mass deployment, malware signature files go straight to the endpoint devices. So there is no way for IT departments to test them for operational disruptions.
Thus, the onus falls on the vendor, in this case, CrowdStrike, to do the necessary testing.
“It is like customers would expect a fire extinguisher to be tested and for it to work the moment it needs to be activated,” said Mr. Ian Loe, a Singapore-based tech veteran of almost 30 years with experience managing large technology teams.
“Seems like in its rush to push out an update, CrowdStrike did not sufficiently test the update, or it would have discovered the flaw before rolling it out to customers,” he added.
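For readers who want a concrete picture of what such vendor-side testing might involve, here is a minimal sketch in Python. It is purely illustrative and not CrowdStrike’s actual pipeline; all names, such as SignatureUpdate and the lab machine names, are hypothetical. The idea is simply that because customers cannot test signature files themselves, the vendor should verify each one on its own test fleet before releasing it.

```python
# Purely illustrative sketch (not CrowdStrike's real tooling) of a vendor-side
# gate: since customers cannot test signature files themselves, the vendor
# verifies each update on its own test machines before releasing it.
# All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class SignatureUpdate:
    version: str
    payload: bytes


def machine_stays_healthy(machine: str, update: SignatureUpdate) -> bool:
    # Placeholder health probe: in this toy example, an empty signature file
    # stands in for the kind of defect that would crash the machine.
    return len(update.payload) > 0


def release_if_safe(update: SignatureUpdate, test_fleet: list[str]) -> bool:
    """Publish the update only if every internal test machine survives it."""
    failures = [m for m in test_fleet if not machine_stays_healthy(m, update)]
    if failures:
        print(f"Blocking release of {update.version}; failures on: {failures}")
        return False
    print(f"Signature update {update.version} passed, releasing to customers.")
    return True


if __name__ == "__main__":
    fleet = ["win11-lab-01", "win10-lab-02", "server2019-lab-03"]
    release_if_safe(SignatureUpdate("v1.0.42", b"\x00" * 64), fleet)
```

A real pipeline would watch for crashes and boot loops on live machines rather than inspect the file itself, but the gating logic is the point: nothing ships until the vendor’s own fleet has survived it.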
CrowdStrike, like rivals SentinelOne, Trend Micro and Palo Alto Networks, is popular as its technology can automatically detect and lock down threats.
In the aftermath of the incident, Trend Micro said that while it is important for software updates to be pushed out quickly, the firm takes steps to mitigate the risk of its updates being buggy.
“We take a ring deployment approach that allows us to roll out software updates in batches starting with our internal deployment, and then to groups of customers to limit exposure if issues are found.
“Additionally, we have blue screen of death monitoring and operational capabilities to roll back affected builds rapidly,” Trend said on its website.
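To illustrate the idea, and only the idea, here is a hedged sketch of a ring deployment in Python. The ring names, hosts and health check are invented for the example and do not reflect Trend Micro’s actual implementation; the point is that an update reaches small groups first and is rolled back the moment a problem appears.

```python
# Illustrative ring deployment: push an update to small batches of machines,
# widening the rollout only while every machine stays healthy, and roll back
# everything deployed so far if one fails. All names are hypothetical.
from typing import Callable


def ring_deploy(version: str, rings: list[list[str]],
                is_healthy: Callable[[str], bool]) -> bool:
    deployed: list[str] = []
    for ring_number, ring in enumerate(rings):
        print(f"Deploying {version} to ring {ring_number} ({len(ring)} hosts)")
        for host in ring:
            deployed.append(host)
            if not is_healthy(host):
                # Stop the rollout and revert every host touched so far.
                print(f"Problem detected on {host}; rolling back {len(deployed)} hosts")
                for rollback_host in deployed:
                    print(f"  reverting {rollback_host} to the previous build")
                return False
    print(f"{version} reached all rings with no issues")
    return True


if __name__ == "__main__":
    rings = [
        ["internal-01", "internal-02"],                      # ring 0: vendor's own machines
        ["customer-a-01", "customer-b-01"],                  # ring 1: a small customer batch
        ["customer-c-01", "customer-d-01", "customer-e-01"]  # ring 2: wider rollout
    ]
    ring_deploy("update-v2", rings, is_healthy=lambda host: True)
```

Had the July 19 update gone out this way, the defect would, in principle, have been caught in the earliest ring rather than on millions of customer machines at once.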
CrowdStrike was nothing short of apologetic after causing the massive outage. Its chief executive George Kurtz told NBC News that the company was “deeply sorry” for the incident and “would make sure every customer is fully recovered”, although it would take time.
Still, the firm, and Mr. Kurtz especially, has a lot to answer for.
The snafu on July 19 is reminiscent of a similar incident in 2010, when cyber security firm McAfee, where Mr. Kurtz was chief technology officer, rolled out faulty software updates that froze computers from Australia and Europe to the US.
The update mistakenly identified a critical Windows file as a worm and quarantined it, crashing tens of thousands of computers.
Australian supermarket chain Coles had to temporarily close several stores in the country after the McAfee update brought down 1,100 checkout terminals. Chip giant Intel, the Kentucky State Police in the US and hospitals in Rhode Island were among those affected.
By comparison, the fallout from the July 19 incident is far greater, given that many more organisations have rapidly digitalised since 2010. And since CrowdStrike’s offerings are tailored for large organisations, including governments, banks, airlines and healthcare institutions, the severity of the damage was also much higher.
Compensation for customers looks to be in order, but the exact payout depends on the sort of contracts customers signed. Insurers could also face a raft of business disruption claims.
As bean counters tally the damage, the outage also exposes how essential services around the world have put all their eggs in two baskets: Microsoft and CrowdStrike.
It is also shocking that many organisations had no business continuity plans and had to shut down or return to manual paperwork. On July 20, snaking queues were still seen at AirAsia’s counters at Changi Airport, with passenger check-ins done manually. Why were there no backup computer systems?
The worst might not be over, even as businesses spend the next few days manually restoring crashed computers. Leaders of the most affected companies will need to answer questions about their backup plans and vendor diversity to restore public and investor confidence.
What is certain is that the growing complexity and interdependence of online systems will mean more potential points of failure or a protracted service recovery if anything goes wrong. So there is no shortcut.
As Mr Loe aptly put it: “Robust testing is understated in today’s complex and highly connected tech world. When something goes wrong, it can be very serious.”