The hurricanes hitting the east coast of the USA are the most recent in a series of natural disasters this year. There have been earthquakes in New Zealand, floods in Australia and, worst of all, the tsunami in Japan.
Aside from the human consequences, in highly developed societies with their dependence on IT services, the disruptive effects are severe. Data centres can be destroyed. Even if they survive, the infrastructure services on which they depend – power, telecoms, water – are likely to be disabled, probably for extended periods. And it’s not just widespread natural disasters that can cripple IT systems. Localised events such as a fire or loss of power can have the same effect.
Those responsible for providing the services must therefore have comprehensive plans for maintaining them in the event of any disaster. Legislation or regulation may insist on disaster recovery (DR) plans for some sectors such as banking or emergency services. Even where there are no external requirements, common sense points strongly to the need for a DR strategy.
Or so you would think. I have been surprised by the apparent lack of concern in some cases. The degree of negligence varies. At its most extreme – unique, I hope – I found one organisation with no plans at all. When I asked why, I was told ‘your systems are very reliable – they don’t fail’. Since we were talking about ClearPath systems, that’s true. But they’re not waterproof, fireproof, or, sadly, bombproof.
More commonly, the plans may be insufficiently comprehensive. And even where they are well defined, they may not be adequately tested, or even testable. I know of one example, for instance, where there was a complete plan but it took so long to execute – two or three days – that it simply could not be tested. Its chances of working were about zero.
So what should we do? Here are three recommendations.
First, have a comprehensive plan. This seems blindingly obvious although apparently not to everyone. The plan should of course match the business needs of the organisation concerned. Not everyone needs to recover a thousand miles away in one second. But everyone without exception needs a plan.
Secondly, automate the process as much as possible. While disasters do occur, they are not that frequent. Expecting operators to execute unfamiliar, complex scripts reliably when under great pressure is asking for trouble – they will make mistakes. The two-to-three day recovery process I mentioned was subsequently reduced to less than 30 minutes by automation.
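To make the automation point concrete, here is a minimal sketch of an automated recovery runbook. The step names are hypothetical, purely for illustration, and bear no relation to any specific ClearPath or SMA tool; the idea is simply that an ordered, scripted sequence replaces an operator working through unfamiliar instructions under pressure.

```python
import time

def run_runbook(steps):
    """Execute ordered recovery steps, timing each one.

    Stops at the first failure so the operator sees exactly
    where recovery stalled, instead of guessing mid-crisis.
    """
    completed = []
    for name, action in steps:
        start = time.monotonic()
        try:
            action()  # each action is an automated step, not a manual task
        except Exception as exc:
            return completed, f"FAILED at '{name}': {exc}"
        completed.append((name, time.monotonic() - start))
    return completed, "OK"

# Hypothetical recovery steps for illustration only; real steps would
# invoke the site's own failover commands.
steps = [
    ("halt primary workloads", lambda: None),
    ("promote standby database", lambda: None),
    ("redirect network traffic", lambda: None),
    ("verify application health", lambda: None),
]

completed, status = run_runbook(steps)
print(status, [name for name, _ in completed])
```

Because every step is scripted and timed, the same runbook that performs the recovery also produces the evidence that it worked, which is what makes the regular testing in the next recommendation practical.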
Finally, test the process regularly to make sure it works. The tests should be realistic, perhaps pulling the plug unexpectedly to see what happens. High levels of automation help – more frequent tests can be carried out if the time to execute them is short.
A number of ClearPath users do have comprehensive plans in place and have followed something like the recommendations I have made. The tools are available – for example Operations Sentinel, Business Continuity Accelerator, XTC, and OpCon/xps from partner SMA – to implement any level of DR required. Combined with the inherent reliability of the systems, heavily automated DR can all but eliminate downtime for critical users.
I’ve written more on the subject in a White Paper called Unisys ClearPath Systems Management: Maximizing IT service availability.