The Stratus Downtime Prevention Buyers Guide discusses the 6 questions you should be asking to prevent downtime – including server failures. The guide recommends posing questions such as “in the event of a server failure, what is the process to restore applications to normal processing operation?” and “how long does it take?” It also compares the different levels of downtime that can be expected with specific systems. When you are working with an availability solutions provider, it’s important to establish which system will provide the fastest recovery time. Or, better yet, which system will ensure that your customers don’t even realise a server has gone down.
There have been many high-profile outages in recent years … The supposedly safe hands of Amazon Web Services (AWS) experienced a 4-hour outage in 2017. Many service providers, Netflix for example, use AWS as their back-end supplier. So whilst 4 hours to restore such a massive system is reasonably impressive, those 4 hours were very expensive indeed. AWS was prepared to react to failures but had not considered preventing the outage in the first instance.
Machines are machines – servers are servers. No matter how good they are, they will develop faults or fail completely. Even a Rolls-Royce has an airbag because a car crash is always a possibility.
Avoiding Downtime – Your Options
If you rely on standalone servers, your recovery time could range from minutes to days. That’s because of the high level of human interaction required to restore the applications and data from backup – provided you’ve been backing up your system on a regular basis.
With high availability clusters, processing is interrupted during a server outage and recovery can take from minutes to hours. Again, this depends on how long it takes to check file integrity, roll back databases and replay transaction logs once availability is restored. If the cluster was sized correctly during the initial planning stages, users should not experience slower application performance whilst the faulty server is out of operation; however, they may need to rerun some transactions using a journal file once normal processing resumes.
Fault tolerant solutions proactively prevent downtime with fully replicated components that eliminate any single point of failure. Some platforms automatically manage their replicated components, executing all processing in lockstep. Because replicated components perform the same instructions at the same time, there is zero interruption in processing – even if a component fails. This means that, unlike a standalone server or high availability cluster, the fault tolerant solution keeps on functioning while any issue is being resolved.
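To make the lockstep idea a little more concrete, here is a toy sketch in Python. It is not how any real fault tolerant platform is built (that work is done in hardware and system software); the `Replica` class, the `run_in_lockstep` function and the `fail_before` parameter are all hypothetical names invented for this illustration. It only shows the principle: every healthy replica receives the same instruction at the same step, so when one fails the other already holds the same state and carries on without a recovery pause.

```python
class Replica:
    """One of the duplicated components (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.total = 0  # stands in for the application state

    def execute(self, amount):
        # Apply one "instruction": add an amount to the running total.
        self.total += amount
        return self.total


def run_in_lockstep(replicas, amounts, fail_before=None):
    """Feed each instruction to every healthy replica at the same step.

    fail_before is an optional (step_index, replica) pair used only to
    simulate a component fault just before that step runs.
    """
    results = []
    for step, amount in enumerate(amounts):
        if fail_before is not None and fail_before[0] == step:
            fail_before[1].healthy = False  # simulated hardware fault

        outputs = [r.execute(amount) for r in replicas if r.healthy]
        if not outputs:
            raise RuntimeError("all replicas have failed")

        # Healthy replicas saw identical inputs, so their outputs agree;
        # any one of them can supply the result for this step.
        results.append(outputs[0])
    return results


a, b = Replica("A"), Replica("B")
# Replica A fails just before the second instruction; B keeps going.
results = run_in_lockstep([a, b], [5, 10, 20], fail_before=(1, a))
print(results)  # [5, 15, 35]
```

Every instruction still produces a result even though replica A is lost partway through. Contrast this with the cluster case above, where work stops until files are checked and transaction logs are replayed.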