AN EXPERIMENTAL EVALUATION OF THE ASSUMPTION
OF INDEPENDENCE IN MULTI-VERSION PROGRAMMING*
John C. Knight Nancy G. Leveson
Department of Computer Science Department of Computer Science
University of Virginia University of California
Charlottesville, Virginia, 22903 Irvine, California, 92717
(804) 924-7605 (714) 856-5517

ABSTRACT

N-version programming has been proposed as a method of incorporating fault tolerance into software. Multiple versions of a program (i.e. ‘‘N’’) are prepared and executed in parallel. Their outputs are collected and examined by a voter, and, if they are not identical, it is assumed that the majority is correct. This method depends for its reliability improvement on the assumption that programs that have been developed independently will fail independently. In this paper an experiment is described in which this fundamental assumption is tested. A total of twenty-seven versions of a program were prepared independently from the same specification at two universities and then subjected to one million tests. The results of the tests revealed that the programs were individually extremely reliable but that the number of tests in which more than one program failed was substantially greater than expected. These results are presented along with an analysis of some of the faults that were found in the programs. Background information on the programmers is also summarized. The conclusion from this experiment is that N-version programming must be used with care and that analysis of its reliability must include the effect of dependent errors.
Keywords and Phrases:
Multi-version programming, N-version programming, software reliability, fault-tolerant software, design diversity.
*This work was sponsored in part by NASA grant number NAG1-242 and in part by a MICRO grant cofunded by the state of California and Hughes Aircraft Company.
1. INTRODUCTION

Multi-version or N-version programming has been proposed as a method of providing fault tolerance in software. The approach requires the separate, independent preparation of multiple (i.e. ‘‘N’’) versions of a piece of software for some application. These versions are executed in parallel in the application environment; each receives identical inputs and each produces its version of the required outputs. The outputs are collected by a voter and, in principle, they should all be the same. In practice there may be some disagreement. If this occurs, the results of the majority (assuming there is one) are assumed to be the correct output, and this is the output used by the system.
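The voting step described above can be sketched as follows. This is an illustrative fragment of our own, not code from the experiment; the function name and the string outputs are hypothetical.

```python
# Illustrative N-version voter (hypothetical; not code from the experiment).
# Collect the outputs of the N versions and accept the strict-majority
# value if one exists; otherwise report a failure to agree.
from collections import Counter

def majority_vote(outputs):
    """Return the strict-majority output among the version outputs,
    or None if no strict majority exists."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

print(majority_vote(["launch", "launch", "no-launch"]))  # -> launch
print(majority_vote(["a", "b", "c"]))                    # -> None
```

A practical voter must also cope with outputs, such as floating-point values, that may legitimately differ slightly between correct versions; exact-equality voting is the simplest case.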
Separate development can start at different points in the software development process. Since each version of the software must provide the same functional capability, there must exist some common form of system requirements document. Coordination must also exist if the versions are to provide data to the voter, especially if intermediate data is compared as well as the final output data. Obviously, all design specifications must be redundant and independent for the versions to have any chance of avoiding common design faults. An interesting approach to dual specification was used by Ramamoorthy et al., where two independent specifications were written in a formal specification language and formal mathematical techniques were then used to verify consistency between the specifications before the next step in development proceeded. Thus they were able to detect specification faults by using redundancy and to repair them before the separate software versions were produced. Kelly and Avizienis [3,4] also used separate specifications for their N-version programming experiment, but the specifications were all written by the same person, so independence was syntactic only (three different specification languages were used).
N-version programming is faced with several practical difﬁculties in its implementation such as isolation of the versions and design of voting algorithms. These difﬁculties have been summarized comprehensively by Anderson and Lee  and will not be discussed here.
The benefit that N-version programming is intended to provide is improved reliability. It is assumed in the analysis of the technique that the N different versions will fail independently; that is, faults in the different versions occur at random and are unrelated. Thus the probability of two or more versions failing on the same input is very small. Under this assumption, the probability of failure of an N-version system is, to a first approximation, proportional to the N’th power of the probability of failure of the individual versions. If the assumption is true, system reliability could be higher than the reliability of the individual components.
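To make the claim concrete (this calculation is our own illustration, not the model developed later in the paper), the probability that two or more of N independently failing versions fail on the same input is a binomial tail probability:

```python
# Probability that at least k of n versions fail on the same input,
# assuming each fails independently with probability p.
# Illustrative sketch only; parameter values below are hypothetical.
from math import comb

def coincident_failure_prob(p, n, k=2):
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Three versions, each failing with probability 1e-3: under independence,
# coincident failure of two or more is on the order of 3e-6.
print(coincident_failure_prob(1e-3, 3))
```

If the versions do not in fact fail independently, this small figure no longer holds, which is precisely the concern examined in this paper.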
We are concerned that this assumption might be false. Our intuition indicates that when solving a difﬁcult intellectual problem (such as writing a computer program), people tend to make the same mistakes (for example, incorrect treatment of boundary conditions) even when they are working independently.
Some parts of a problem may be inherently more difﬁcult than others. In the experiment described in this paper, the subjects were asked in a questionnaire to state the parts of the problem that caused them the most difﬁculty. The responses were surprisingly similar.
It is interesting to note that, even in mechanical systems where redundancy is an important technique for achieving fault tolerance, common design faults are a source of serious problems. An aircraft crashed recently because of a common vibration mode that adversely affected all three parts of a triply redundant system. Common Failure Mode Analysis is used in critical hardware systems in an attempt to determine and minimize common failure modes.
If the assumption of independence is not borne out in practice for an N-version software system, analysis based on it will overestimate the reliability. Recent work has shown that even small probabilities of coincident errors cause a substantial reduction in reliability. This could be an important practical problem since N-version programming is being used in existing crucial systems and is planned for others. For instance, dual programming has been used in the slat and flap control system of the Airbus Industrie A310 aircraft. The two programs are executed by different microprocessors operating in parallel; a disagreement between their outputs greater than a defined threshold causes the system to disconnect after a preset time delay. On the A310 it is sufficient to know that there has been a failure, as backup procedures allow the continued safe flight and landing of the aircraft. Dual programming has also been applied to point switching, signal control, and traffic control in the Gothenburg area by Swedish State Railways. In the latter system, if the two programs show different results, signal lights are switched to red. Dual programming has further been proposed for safety systems in nuclear reactors. Voges, Fetsch, and Gmeiner have proposed its use in the design of a reactor shutdown system which serves the purpose of detecting cooling disturbances in a fast breeder reactor and initiating automatic shutdown of the reactor in case of possible emergency. Also, both Ramamoorthy et al. and Dahll and Lahti have proposed elaborate dual development methodologies for the design of nuclear reactor safety systems.
A common argument [2,10,12] in favor of dual programming is that testing of safety-critical real-time software can be simplified by producing two versions of the software and executing them on large numbers of test cases without manual or independent verification of the correct output. The output is assumed correct as long as both versions of the program agree. The argument is made that preparing test data and determining correct output is difficult and expensive for much real-time software. Since it is assumed ‘‘unlikely’’ that two programs will contain identical faults, a large number of test cases can be run in a relatively short time and with a large reduction in the effort required for validation of test results.
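This back-to-back testing argument can be sketched as follows. The fragment is our own illustration; version_a and version_b are hypothetical stand-ins for two independently written implementations.

```python
# Back-to-back testing sketch (illustrative; version_a and version_b are
# hypothetical stand-ins for two independently written implementations).
# Outputs are accepted whenever the two versions agree, so no independent
# oracle is required -- but a fault shared by both versions passes undetected.
import random

def version_a(x):
    return x * x    # first "independently written" implementation

def version_b(x):
    return x ** 2   # second "independently written" implementation

def back_to_back(n_cases, seed=0):
    """Run both versions on random inputs; return the inputs on which
    they disagree."""
    rng = random.Random(seed)
    return [x for x in (rng.randint(-1000, 1000) for _ in range(n_cases))
            if version_a(x) != version_b(x)]

print(len(back_to_back(10000)))  # 0 disagreements; agreement is not correctness
```

The final comment is the crux: zero disagreements validates the pair only under the assumption that the two versions do not share a fault.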
In addition, it has been argued that each individual version of the software can have lower reliability than would be necessary if only one version were produced. The higher required software reliability is assumed to be obtained through the voting process*. The additional cost incurred in the development of multiple software versions would be offset by a reduction in the cost of the validation process. It has even been suggested that elaborate software development environments and procedures will be unnecessary when multiple versions are developed.
The important point to note is that all of the above arguments in favor of using redundant programming hinge on the basic assumption that the probability of common mode failures (identical incorrect output given the same input) is very low for independently developed software. Therefore, it is important to know whether this assumption is correct.
Several previous experiments have involved N-version programming, but none has focused on the issue of independence. In two [2,11], independence was assumed and therefore not tested. In each of these, the two versions developed were assumed to be correct if the two outputs from the test cases agreed, and no attempt was made to verify the correctness of the output independently. Thus common errors would not necessarily have been detected. In other experiments, common errors were observed, but since independence was not the hypothesis being tested, the design of the experiments makes it impossible to draw any statistically valid conclusions. Kelly and Avizienis [3,4] report finding 21 related faults; one common fault was found in practical tests of the Halden nuclear reactor project; and Taylor reports that common faults have been found in about half of the practical redundant European software systems.
In summary, although there is some negative evidence which raises doubts about the independence assumption, there has been no experiment that attempted to study this assumption in a manner from which clear evidence for or against it can be drawn. Because the independence assumption is widely accepted, and because of the potential importance of the issue in terms of safety, we have carried out a large-scale experiment in N-version programming to study this assumption. A statistically rigorous test of independence was the major goal of the experiment, and all of the design decisions that were taken were dominated by this goal.
*One might note that even in hardware Triple Modular Redundancy (TMR) systems, from which the idea of N-version programming arises, overall system reliability is not improved if the individual components are not themselves sufficiently reliable. In fact, incorporating redundancy into a system can actually reduce overall system reliability due to the increased number of components.
The remainder of this paper is organized as follows. In section two we describe the experiment itself, and we review the backgrounds of the programmers and their activities during the experiment in section three. The results of the tests performed on the various versions are presented in section four. Section five contains a model of independence and a statistical test of the hypothesis that the model is valid. Some of the faults that have been found in the programs used in this experiment are described in section six, and various issues arising from this experiment are discussed in section seven. Our conclusions are presented in section eight, and the requirements specification used in the experiment is included as an appendix.
2. DESCRIPTION OF EXPERIMENT

In graduate and senior level classes in computer science at the University of Virginia (UVA) and the University of California at Irvine (UCI), students were asked to write programs from a single requirements specification. The result was a total of twenty-seven programs (nine from UVA and eighteen from UCI), all of which should produce the same output from the same input. Each of these programs was then subjected to one million randomly-generated test cases.
In order to make the experiment realistic, an attempt was made to choose an application that would normally be a candidate for the inclusion of fault tolerance. The problem that was selected for programming is a simple (but realistic) anti-missile system that came originally from an aerospace company. The program is required to read some data that represents radar reﬂections and, using a collection of conditions, has to decide whether the reﬂections come from an object that is a threat or otherwise. If the decision is made that the object is a threat, a signal to launch an interceptor has to be generated. The problem is known as the ‘‘launch interceptor’’ problem and the various conditions upon which the decision depends are referred to as ‘‘launch interceptor conditions’’ (LIC’s). The conditions are heavily parameterized. For example, one condition asks whether a set of reﬂections can be contained within a circle of given radius; the radius is a parameter.
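As an illustration of the kind of geometric test involved (our own sketch, not code from the experiment; the actual LIC definitions appear in the requirements specification in the appendix), the circle-containment condition for three points can be decided from their smallest enclosing circle:

```python
# Can three radar points be contained in a circle of a given radius?
# Illustrative sketch of one launch interceptor condition; the exact
# definition is given in the requirements specification.
from math import dist

def fits_in_circle(p1, p2, p3, radius):
    a, b, c = dist(p2, p3), dist(p1, p3), dist(p1, p2)
    longest = max(a, b, c)
    others_sq = a*a + b*b + c*c - longest**2
    # Right, obtuse, or degenerate triangle: the smallest enclosing
    # circle has the longest side as its diameter.
    if longest**2 >= others_sq:
        return longest / 2 <= radius
    # Acute triangle: the smallest enclosing circle is the circumcircle,
    # whose radius is abc / (4 * area).
    area = abs((p2[0]-p1[0])*(p3[1]-p1[1]) - (p3[0]-p1[0])*(p2[1]-p1[1])) / 2
    return a * b * c / (4 * area) <= radius

print(fits_in_circle((0, 0), (2, 0), (1, 1), 1.0))   # -> True
print(fits_in_circle((0, 0), (2, 0), (1, 1), 0.9))   # -> False
```

Note that even this small fragment contains a boundary decision (points exactly on the circle, handled here with `<=`); such boundary conditions are exactly the kind of detail on which independent programmers may make the same mistake.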
The launch interceptor problem had been used previously in a study of N-version programming with N equal to three that was carried out at the Research Triangle Institute (RTI). We chose this problem because of its suitability and because we were able to use the lessons learned in the experiment at RTI to modify our own experiment. RTI had prepared a requirements specification and had experienced some difficulties with unexpected ambiguities and similar problems. We were able to rewrite the requirements specification in the light of this experience. Thus the requirements specification had been carefully ‘‘debugged’’ prior to use in this experiment.
The requirements speciﬁcation was given to the students and they were asked to prepare software to comply with it. No overall software development methodology was imposed on them. They were required to write the program in Pascal and to use only a speciﬁed compiler and associated operating system. At UVA these were the University of Hull V-mode Pascal compiler for the Prime computers using PRIMOS, and at UCI these were the Berkeley PC compiler for the VAX 11/750 using UNIX.