On Being Engineers: Deploying Systems
Late last year, I wrote a piece — On Being Engineers — about the poor quality of many of the IT applications we have to use, and what we might do about it. I argued that we need to take systems engineering seriously. It’s not sufficient just to apply engineering discipline to software implementation. It needs to include commissioning the system in the first place and deploying the result. I looked at commissioning last time; I’ll now consider deployment.
Poor quality deployment will degrade an otherwise well-constructed system. It therefore matters that we apply rigorous engineering standards to the activities involved: initial installation and ongoing operation, providing resilience, and handling updates and new releases.
Start with installation. With a bit of luck, the system will have been put through its paces by the development team. What then needs to be done depends on the scale of the application. For critical systems there’s a lot of work.
A suitable hardware configuration has to be selected, able to sustain the anticipated load and cope with exceptional conditions. Traffic surges, for example, can bring a system to its knees. In some cases, we know that surges can occur but with little or no warning. A natural event such as an earthquake, for instance, puts emergency systems under pressure; they must be able to cope immediately. Anticipated surges, where the timing is known in advance, present fewer problems, although apparently not to everyone. Amazingly, there are cases where a surge must have been anticipated months or even years in advance yet the system instantly fell over when it happened. Opening the booking for a major sporting event is an example.
And before the system can be exposed to the outside world, there are likely to be many other activities such as training and publicity.
Turn now to resilience. Systems must be able to respond to a wide variety of failures, up to and including complete loss of the live system. The speed of response and the consequent cost must of course be consistent with the potential impact of failure. Critical systems may only be able to tolerate a few minutes per year out of action. The equipment, personnel and operational procedures must be able to cope.
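To put "a few minutes per year" in perspective, it helps to convert an availability target into a downtime budget. The following is a minimal sketch of that arithmetic; the function name and the sample targets are illustrative assumptions, not figures from any particular system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability
    target expressed as a fraction (for example 0.99999 for "five nines")."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Illustrative targets only.
    for target in (0.999, 0.9999, 0.99999):
        budget = downtime_minutes_per_year(target)
        print(f"{target:.3%} availability allows {budget:.1f} minutes per year")
```

On this arithmetic, "three nines" leaves nearly nine hours of downtime a year, while "five nines" leaves a little over five minutes, covering planned and unplanned outages alike. That is the kind of budget within which the equipment, personnel and procedures have to work.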
After cutover, we have to handle updates and new releases to fix problems and provide additional features. Changes must be applied with little planned downtime and, critically, minimal unplanned disruption. If a new release is installed overnight, users the next morning do not want to find that it doesn’t work.
Unsurprisingly, the platforms on which critical applications are deployed have a significant impact on the success of deployment. Unisys ClearPath systems are designed for these environments.
Metering allows systems to be configured with a lot of excess capacity to absorb traffic surges immediately, while the cost reflects average use, which is typically much lower.
The systems are reliable and secure, reducing the risk of disruption from failure and security breach. Extended Transaction Capacity (XTC) for Dorado systems, where multiple partitions are clustered with shared databases, not only increases overall performance but provides additional resilience. Partitions can fail; the remainder absorb the load, delivering uninterrupted service. The Business Continuity Accelerator (BCA) for Libra systems enables rapid recovery to another platform following any system or environmental failure.
ClearPath systems are designed to allow many changes and upgrades without taking the system down. XTC and BCA also play a role in reducing and in some cases eliminating planned downtime. XTC allows partition-by-partition upgrade while maintaining service. With BCA, production can be rapidly transferred to another platform containing a new release while maintaining a secure fall-back position to the original system.
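The partition-by-partition idea can be sketched in a few lines. This is a generic illustration of a rolling upgrade, not a description of how XTC itself is implemented; the Partition class, the rolling_upgrade function and the min_in_service parameter are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    """Hypothetical stand-in for one member of a cluster."""
    name: str
    version: str
    in_service: bool = True


def rolling_upgrade(partitions: List[Partition], new_version: str,
                    min_in_service: int = 1) -> List[int]:
    """Upgrade partitions one at a time.

    Before a partition is taken out of service, check that enough of the
    others are still serving. Returns, for each step, how many partitions
    were carrying the load while one was being upgraded.
    """
    serving_during_step = []
    for target in partitions:
        others_serving = sum(
            1 for p in partitions if p.in_service and p is not target
        )
        if others_serving < min_in_service:
            raise RuntimeError(
                f"cannot take {target.name} out of service: "
                f"only {others_serving} other partition(s) serving"
            )
        target.in_service = False      # drain: the remainder absorb the load
        serving_during_step.append(others_serving)
        target.version = new_version   # apply the new release
        target.in_service = True       # rejoin the cluster
    return serving_during_step


if __name__ == "__main__":
    cluster = [Partition("A", "1.0"), Partition("B", "1.0"), Partition("C", "1.0")]
    print(rolling_upgrade(cluster, "2.0"))
    print(cluster)
```

The point of the check at the top of the loop is that the upgrade refuses to proceed rather than interrupt service. A single-partition configuration fails that check, which is why the clustered arrangement matters for eliminating planned downtime.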
Finally, success depends on disciplined operation, thorough training and practice of critical functions such as disaster recovery (DR). High levels of automation are essential. Operation, especially procedures such as DR, is error-prone if left to manual action. Operations Sentinel and OpCon from SMA provide facilities for routine and abnormal condition automation across multiple platforms as well as ClearPath systems.
None of this is magic. It’s just good engineering practice combined with the right platforms for the job.