A recent article by Mark Cavage provided some illuminating thoughts on building distributed systems (‘There is No Getting Around It: You Are Building a Distributed System’, Mark Cavage, CACM June 2013, pp 63-70). Cavage’s view is that distributed systems – systems spanning more than one machine – are ‘difficult to understand, design, build and operate’. He states that ‘They introduce exponentially more variables into a design than a single machine does, making the root cause of an application problem much harder to discover’.
He points out that problems often appear not as hard failures but as degraded performance or intermittent errors. Many readers – and your correspondent! – will have experienced the frustration of attempting to solve these problems. Even worse, they may go away when we try to investigate them – the term ‘Heisenbug’ has been coined to describe this phenomenon. (It is named after the physicist Werner Heisenberg, who noted that observing a physical system alters its state.)
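Many Heisenbugs are, at root, timing-sensitive races. A minimal sketch – my own illustration, not an example from Cavage’s article – shows the shape of the problem: an unsynchronised counter can silently lose updates depending on thread scheduling, and adding instrumentation to watch it perturbs the timing enough that the loss may vanish.

```python
import threading

def increment_unsafe(counter, n):
    # Read-modify-write with no lock: if a thread switch lands between
    # the read and the write, one update is silently lost. Whether that
    # happens depends entirely on scheduling -- a classic Heisenbug shape.
    for _ in range(n):
        counter["value"] = counter["value"] + 1

def increment_safe(counter, lock, n):
    # The same loop, serialised by a lock, always totals correctly.
    for _ in range(n):
        with lock:
            counter["value"] = counter["value"] + 1
```

Run several threads of `increment_unsafe` and the total may or may not fall short from one run to the next; the locked version is deterministic. Printing the counter to investigate slows each iteration and can make the shortfall disappear.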
What can we do about creating better distributed systems, given the increasing role they are likely to play in future? Cavage highlights the importance of sound software engineering principles. That’s hard to dispute but, as he states, ‘Too many applications are being built by stringing together a distributed database system with some form of application and hoping for the best. Instead – as with all forms of engineering – a methodical, data-driven approach is needed.’
He stresses the importance of architecture, advocating service-oriented architecture (SOA) as a solid foundation. He then goes on to sketch an approach to building an example system using SOA. I would add one point he does not mention but perhaps implies: good people are required. Development processes make skilled people more productive; they don’t make unskilled people perform.
Is there something else we can do to improve the situation? I think there is – in fact, it’s already known, as I will try to explain.
While Heisenbugs do appear in mainframes, their incidence is much lower. There are two good reasons for this. First, mainframes contain all of the components within a single machine. Communication between components or services is within the control of a single OS instance – and mainframe operating systems such as OS 2200 and MCP are pretty good at managing what’s going on in the machine.
Secondly – and, I believe, of the greatest importance – there is the integrated stack of hardware and software delivered by mainframe vendors. Most or all of the pieces are designed, built and tested by the vendor before delivery. They therefore work properly, providing a solid foundation for applications using them. I’ve written before of the importance of the integrated stack in a blog and a white paper. The notion of integrated stacks is spreading widely; vendors not traditionally thought of as mainframe suppliers have adopted the approach.
But, given the spread of distributed systems, how does the integrated stack help in their construction? I think the answer lies in the fabric infrastructure. The heart of a fabric is a set of loosely coupled servers interconnected by high-speed communications technology such as InfiniBand. That is not sufficient, however. The structure needs to be securely housed and delivered as an integrated stack, comprising all the hardware and software elements.
I would identify two software components as being of critical importance. The first is the software providing interworking between applications, both within and between the component servers. The second is systems management, especially automation and testing tools, to ensure that the whole infrastructure is as reliable and sound as possible.
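One job such interworking software does beneath applications is absorbing the intermittent errors described earlier, rather than surfacing each one to the caller. As a hedged sketch – the function name and retry policy below are my own assumptions, not any specific product’s API – a retry wrapper with exponential backoff looks like this:

```python
import time

def call_with_retries(operation, attempts=3, base_delay=0.01):
    # Retry an inter-service call on transient connection errors,
    # doubling the delay after each failed attempt (exponential backoff).
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            time.sleep(base_delay * (2 ** attempt))
```

A call that fails twice and then succeeds returns normally; only a persistent failure propagates. The point is that this smoothing lives in the fabric’s software layer, not in every application.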
It will never be possible to eliminate all the Heisenbugs that may infest distributed systems. But the fabric infrastructure approach should go a long way towards improving the situation by providing a solid platform on which to build applications.