We all saw the news of the British Airways system failure this Bank Holiday weekend, which resulted in thousands of flight cancellations and frustrated, angry passengers.
British Airways CEO, Alex Cruz, of course apologised but denied the problems were due to thousands of IT job cuts. He is more than likely correct. The apparent cause of the problem – a power surge on servers hosting some of their critical systems – would not have been prevented by more IT staff.
Mr Cruz explained that a power surge at 9.30am on Saturday affected the company’s messaging systems as well as operational applications. The real problem was that their ‘backup system’ failed to work properly. Backup system? In our opinion, systems with such high levels of criticality should not rely on ‘backup’ or standby/failover systems. These kinds of ‘react to failures if they occur’ technologies are 20 years out of date because, by their very nature, they REACT to a problem once it has occurred rather than PREVENT it in the first place. Even worse, when the backup servers or systems themselves fail to react, the result is a system-wide catastrophe with long-term downtime (as at British Airways this weekend).
More staff might have shortened the time taken to resolve the issue, but the outage itself could more than likely have been avoided in the first place. Why would a standby system fail to work? Very commonly, it is because failover to the standby system is not regularly tested. But why accept a system that will result in a period of downtime whenever the main server or servers fail?
The only way to protect against such hardware failures (which are inevitable from time to time) is to deploy FAULT TOLERANT protection of critical systems. What does that mean? In the very simplest form, two servers work in real time as a single processing resource to prevent rather than react to outages – so whatever happens on server A also happens on server B at the same time. If you lose server A (due to a power surge for example) then everything continues to run on server B with zero downtime and zero data loss. By having separate power feeds and UPS protection to these separate servers, there is zero impact when a power surge hits one of the supplies to the computer room and/or data centre.
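To make the idea concrete, here is a minimal sketch in Python of the "whatever happens on server A also happens on server B" principle. It is a toy illustration only: real fault tolerant platforms usually do this mirroring at the hardware or virtualisation layer rather than in application code, and the names used here (Replica, FaultTolerantPair, book, lookup, and the passenger and flight labels) are hypothetical, invented purely for the example.

```python
class Replica:
    """One of two servers holding an identical copy of the state (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.bookings = {}

    def apply(self, passenger, flight):
        # A dead server cannot process anything.
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.bookings[passenger] = flight


class FaultTolerantPair:
    """Applies every write to both replicas so either one can carry on alone."""

    def __init__(self, a, b):
        self.replicas = [a, b]

    def book(self, passenger, flight):
        # Mirror the write onto every replica that is still running.
        applied = 0
        for replica in self.replicas:
            if replica.alive:
                replica.apply(passenger, flight)
                applied += 1
        if applied == 0:
            raise RuntimeError("both replicas are down")
        return applied

    def lookup(self, passenger):
        # Any surviving replica can answer, because each holds the full state.
        for replica in self.replicas:
            if replica.alive:
                return replica.bookings.get(passenger)
        raise RuntimeError("both replicas are down")


server_a = Replica("server A")
server_b = Replica("server B")
pair = FaultTolerantPair(server_a, server_b)

pair.book("passenger-1", "flight-100")
server_a.alive = False  # simulate a power surge taking out server A
pair.book("passenger-2", "flight-200")

print(pair.lookup("passenger-1"))
print(pair.lookup("passenger-2"))
```

In this sketch, losing server A involves no switchover step and no restore from backup: server B already holds every booking, so lookups and new bookings simply continue.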
Fault tolerance can protect multiple systems to the very highest levels of resilience. So a major cause of passenger frustration – lack of communication about the IT problem – is also resolved because both critical operational AND passenger messaging systems can be deployed on the same platform.
It’s a simple approach which is surprisingly cost-effective, even before counting the cost of an IT meltdown. Take into account the inevitable cost of compensation, lost new business and damaged reputation that British Airways has suffered this weekend, and it becomes an absolute steal.