RAC is a DoD Information Analysis Center Sponsored by the Defense Technical Information Center and Operated by IIT Research Institute INSIDE T h e J o u r n a l o f t h e 7 Industry News 12 RAC Introduces Its New Professional Sustainment Program Series Insert Training Flyer 13 Program Managers Handbook Common Practices to Mitigate the Risk of Obsolescence 19 PRISM Column 21 Future Events 22 From the Editor Reliability Analysis Center Fourth Quarter - 2002 Introduction The complexity of modern digital systems, made more problematic by the microelectronics revo- lution, is a two-edged sword. Complexity cre- ates both the need for fault tolerance and the means to implement various solutions. With fault-tolerant computing, correct output is obtained in spite of the presence of internal errors called faults. Any system with redundant com- ponents or functions can be made fault-tolerant. For the designer who keeps fault-tolerance in mind during the early stages of design, many techniques are available: error-correcting codes added to input, internal, and output data streams; parallel, standby, or voting redundancy schemes; and where applicable, parallel paths through a network. Software also can be made redundant if one employs two independent software teams to develop different yet functionally identical ver- sions of the software; however, any common errors will limit the redundant effect. Availability. Redundancy also provides an opportunity to significantly increase the reliabili- ty, the probability of no system failure, by intro- ducing the possibility of repair during system operation. Furthermore, the availability, the probability that the system is up, is often com- puted from the ratio of (uptime)/(uptime + down- time), which also measures the beneficial effects of repair. The analysis of system availability generally involves the formulation and solution of a Markov probability model. The use of Laplace transforms and appropriate use of sim- plifying theorems greatly facilitates the analysis effort and provides insight into system behavior. The design of a large system often is driven by functional requirements; however, when reliabil- ity considerations are included, the apportion- ment of redundancy throughout the system can be used as a means of optimizing redundancy. There are several simple, heuristic apportion- ment schemes as well as a new technique of reli- ability optimization, which uses upper and lower bounds to limit the computations to a modest number of feasible solutions. Typical System Reliability and Availability. The reader can appreciate some of the gains in system reliability and availability achieved in the last half of the 20th century by studying the compar- ison in Table 1. Coding Everyone knows the cliché describing computer systems Garbage in, garbage out. One might describe error-correcting codes as Garbage in, correct information out. This is accomplished by adding extra (redundant) bits to the input that act as error checks. Schemes such as adding redun- dant bits are used to detect and correct errors in bit transmissions that occur in communication (e.g., modems, satellites, etc.), memory reads and writes in a CPU, and computer network messages. One can assume that errors in transmitted binary words occur due to noise that affects individual bits in a word, causing errors, i.e., zero bits cor- rupted to ones or one bits corrupted to zeroes. Assume that noise is rare and occurs in short pulses with duration less than the width of a transmission pulse. The simplest technique for error checking is to transmit each word twice, and then sequentially compare each bit in the first transmitted word with the corresponding bit in the second transmitted word. If a given bit is the same in both words, the bit is passed. If not, an error is detected and word can be retransmit- ted until two identical copies are then obtained, correcting the error. Clearly, this scheme at least By: Martin L. Shooman, Polytechnic University, Brooklyn, NY and Martin L. Shooman & Associates Reliability of Fault-Tolerant Computing Systems and Networks T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r F o u r t h Q u a r t e r - 2 0 0 2 2 doubles the length of each transmitted word (i.e., 100% or more overhead is required for error detection). More efficient meth- ods exist with less overhead. Parity Bit Codes. The simplest scheme is to append a single bit called a parity bit to the word. If an even or odd parity scheme is chosen, the parity bit is set to 1 or 0 so that the number of 1s in the word (including the parity bit) is an even or odd number respec- tively. The parity bit is generally computed using a tree network of EXOR gates. The information word with parity bit attached is then transmitted and when the word is received, another EXOR tree is used to check that the parity is still even or odd. Clearly this scheme detects single, triple, and any odd number of bit errors. If the scheme is applied to an 8-bit byte, the parity bit is a ninth bit and the overhead due to checking is 12.5%. Parity bit error detec- tion codes have such small overhead that they are used in a wide variety of applications. There are inexpensive, standard code gen- eration (coder) and checking (decoder) digital circuits (74LS280). Of course the code does not detect double, quadruple, or any even number of errors since the parity is not changed in these cases. The probability of an undetected error is the probability that an even number of errors occur, which can be approximated by the most significant term the Binomial probability of two errors out of nine bits, which is given by 36q2 (1 - q)7 where q is the probability of bit error. Without parity bit checking, all errors are undetected and the binomial probability of one or more errors in 8 bits is given by 1 - (1 - q)8. The ratio of these two expressions gives the error reduction factor due to parity codes that can be as large as 2 x 107 when q = 10-8. This analysis neg- lects the possibility that an undetected error can occur due to a failure of the coding or decoding circuitry. When these failure probabilities are included in the analysis, the error reduction fac- tor is decreased, especially for low transmission rates and small values of q. For such values one has to wait many days or years before multiple errors occur and the probability of chip failure during this time period becomes significant. The results for a parity bit code are shown in Figure 1, where the downward curve at the top left of the diagram is due to chip failures. Hamming Codes. The parity bit code is the least complex of the many code types commonly used and is the simplest member of an entire family of codes developed by Hamming [Reference 1] that use extra check bits to detect multiple errors and correct some errors. Although the more complex Hamming codes have higher overhead, their error-correcting ability accounts for their wide usage. Thus, Hamming Single Error Correcting and Single Error Detecting (SECSED) and Hamming Single Error Correcting and Double Error Detecting (SECDED) codes are popular. All Hamming codes assume that the probability of mul- tiple errors in a transmitted word is small; however in the case of many storage devices, such as CDs, CD-ROMs, DVDs, hard and floppy disk drives, errors often occur in bunches called bursts. Bursts require a different formulation and the Reed-Solomon (RS) code is commonly employed, [Reference 5]. Redundancy Four major approaches are used to improve system reliability: (1) simplify the system to use fewer components or simpler algo- rithms and software; (2) lower component and software failure rates; (3) provide redundant components or logical paths through the system; and (4), for redundant components, repair failed hardware and software components restoring full redundancy. In this section, we will assume that the system designer has taken advantage of (1) and (2) although this is not always the case in practice, and focus on (3) and (4). One can view the material covered in the last section of this article on apportionment and redundancy optimization as an extension of redundancy as described in this section. Table 1. Comparison of Reliability and Availability for Various Fault-Tolerant Systems (Reference 3, pages 16, 17) Application Reliability Availability* 1964 NASA Saturn Launch computer R(250 hr.) = 0.99 ------ Apollo NASA Moon Mission R(mission) = 15/16 = 0.938 ------ NASA Space Shuttle R(mission) = 109/110 = 0.991 Through flight 110 in April 2002. Note: Challenger flight exploded on 1/28/86. ------ Bell Labs ESS Telephone Switching System, 1966 ------ Requirement of 2 hr. of downtime in 40 yr. 0.9999943 Tandem computer goals in 1980s ------ Tandem goals 0.999996 Stratus computer (web site quote 2001) ------ Stratus quote 0.9999905 Vintage 1985 Single CPU, transaction-processing ------ 0.997 [Reference 4, p. 586] Vintage 1985 Two in parallel CPU, transaction-processing ------ 0.999982 Vintage 1985 Two in standby CPU, transaction-processing ------ 0.9999911 *Note that the ESS availability goal of 40 years is clearly a steady state availability. Although not stated, the Tandem, Stratus, and Vintage 1985 num- bers as well as the derived parallel and standby system availabilities are undoubtedly steady state values. T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r F o u r t h Q u a r t e r - 2 0 0 2 3 In general redundancy improves system reliability (and availabil- ity) because it provides alternate paths through the system so that if one fails, there is another that continues operation. Redundancy was prevalent during the era of analog circuitry in the mid 20th century. However, with the advent of digital circuitry in the later half of last century, it became easy to implement a wide variety of simple as well as complex redundancy schemes. System and Component Redundancy. Most systems are com- posed of components. Thus, reliability can be considered at a system or component level. Consider, for example, a system composed of two components A and B that must both work for successful operation. If one employs system redundancy, a sec- ond system A'B' is connected in parallel so that if either system fails, the second system takes over. If components A and B have an equal probability of failure q, then a single system has an approximate probability of failure of 2q. In the case of system redundancy, both AB and A'B' must fail for the system to fail and if the two replicas are identical the approximate probability of failure is (2q) x (2q) = 4q2. However, there is another possibility of using component redun- dancy, i.e., to parallel A and A' and parallel B and B'. In such a case, the failure probability of the AA' cluster is approximately q2 and that of the BB' cluster is approximately q2 and the failure of either cluster 2q2. Thus, the reliability using system redun- dancy is superior to a single system (2q > 4q2), and the reliabili- ty using component redundancy is still better (2q > 4q2 > 2q2). Detailed analysis verifies this result. Thus far, we have neglected the device needed to connect items in parallel, which can be termed the coupling device. A complete analysis of the two redundancy schemes just discussed requires including the reliability of the coupling device. Clearly, the relia- bility of the coupling device appears as a series element and when system reliability is used one coupling device is required and the reliability of the coupler is rc. If there are five major components in the system and component redundancy is used, the reliability of the system of couplers is (rc)5. If coupler unreliability is signifi- cant, it can erode the advantages of component reliability over system reliability, and a detailed analysis is needed. Standby Systems. In a parallel system the extra redundant sys- tem(s) or component(s) is powered up and can fail although only one system or component is needed for system operation. A standby system improves on this situation by using one on-line system or component to provide operation and keeping the power off for the standby elements. As long as the non-powered standby elements have very low or zero failure rates, the stand- by system is superior to the parallel system. The reliability of a single element with a constant failure rate l is rs = e- lt and that of a parallel system is rp = 2e- lt - e-2 lt . One can show that the relia- bility of a standby system is rs = e- lt + lte- lt and that it is always better than a parallel system. However, the reliability of the coupler can have a significant effect on the results. In the case of a standby system, the cou- pler is more complex than in a parallel system. In essence it is a smart switch, which detects the failure of the on-line unit, pow- ers up the standby unit, and switches the inputs and outputs to the standby unit. A simple comparison of the two systems can be made by neglecting the coupler effect in a parallel system and assuming that the reliability of the switch in a standby system is given by e- lSt and that it multiplies the expression for rs. A com- parison of the two expressions is given in Figure 2. This analy- sis shows that the standby system is superior to the parallel sys- tem as long as the failure rate of the switch is 0.5l or smaller. Figure 1. Undetected Error Improvement Ratio for a Single Byte with a Parity Bit Versus Error Probability q and Transmission Rate B in Bits per Second (Reference 3, Figure 2.5, page 45) 10-8 10-7 10-6 10-5 107 106 105 Bit Error Probability, q B = infinity B = 56000 B = 9600 B = 1200 B = 300 Improv eme ntRat io T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r F o u r t h Q u a r t e r - 2 0 0 2 4 Repair. Repair can further improve reliability in a system with redundancy when repairs of failed redundant paths can be made while the system continues to operate. The beneficial effects of repair in the case of a parallel or standby system are obvious. When one item fails, it is repaired while the other item continues operation. The system only fails if the remaining unit goes down before the failed unit is repaired and placed back in service. Thus, it is a race to see whether repair occurs before a second failure happens. However, the repair rate is much larger than the failure rate in all practical cases; thus, repair continues to win for many cycles and the time to system failure is greatly increased by repair. For example, in the case of an automobile, we might estimate that the failure rate (unscheduled maintenance) is once per year. However, the repair plus waiting time is typically about one day yielding a repair rate of 365 repairs per year. Markov Models. For a complete analysis of repair effects, we must make a Markov model for the system with repair. The first step is to identify the system states. As an example let us con- sider two elements with repair (either parallel or standby). The elements are one and two and the state where both are good is S0 = x1x2 where both are bad S2 = 1 2 and where one is good S1 = 1 x2 + x1 2 . These three states are shown as nodes in Figure 3. The arcs between states, which go from left to right, represent failures and are labeled with failure rates l and the arcs directed from right to left represent repair and are labeled with repair rates m. The probability of being in each of the states is obtained by solving a first order differential equation for each probability and, since the equations are coupled, the solution becomes that of a third order differential equation. The Laplace transform method can be used to simplify the solution work, however much algebra is still involved. However, if we are satisfied with less than the complete solution three Laplace transform theorems for the initial value, the final value, and the integral of a function can be used to easily find: (a) the mean time to failure, (b) the final value, and (c) the Taylor series approximation for the initial behavior. If we are only interested in the system mean time between failure, MTBF, we obtain the results given in Table 2. Note that if the repair rate is 10 times the failure rate, a standby system has 12 times the MTTF of a single element, which is 6 times better than without repair. For practical systems where the repair rate greatly exceeds the failure rate, very large gains in system MTTF are realized. Figure 3. A Markov Reliability Model for Two Identical Parallel Elements and k Repairmen (Reference 3, Figure 3.14, page 112) Table 2. Comparison of the MTTF for the Various Systems Represented by Figure 3 (Reference 3, page 115) Figure 2. A Comparison of a Two-Element Ordinary Parallel System with a Two-Element Standby System with Imperfect Switch Reliability (Reference 3, Figure 3.13, page 110) 1.0 0.8 0.6 0.4 0.2 0 0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 Normalized Time t = l t Standby ls = 0.5l Standby ls = l Standby ls = 0 (perfect switch) Standby ls = 0.1l Parallel Reliability R(t) x x x x 1- l' Dt 1- (l + m') Dt 1 S0 = x1 x2 S1 = x1 x2 + x1 x2 S2 = x1 x2 l Dt l' Dt m' Dt Element Formula MTBF For l = 1, m = 10 Single 1/l 1.0 2 parallel no repair 1.5/l 1.5 2 standby no repair 2/l 2.0 2 parallel repair (3l + m)/2l2 6.5 2 standby repair (2l + m)/l2 12.0 where: l' = 2l (ordinary system) l' = l (standby system) m' = m (one repairman) m' = km (k repairman) T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r F o u r t h Q u a r t e r - 2 0 0 2 5 Related Concepts. Space does not allow a full description of many related issues; however, a few will be briefly discussed. Availability: Reliability measures the probability that a sys- tem is always up, working in interval 0 to t. However, cases when a system goes down for a few minutes and is quickly repaired are considered examples of good availability. Availability is the probability of being up at time t (even if it has failed and been repaired one or more times) and can be calculat- ed from a Markov model such as the one shown in Figure 3, if an additional repair branch is added from state two (S2) back to state one (S1). Coverage: In a standby system the coupler must sense failure of the on-line element and switch in the standby one; oth- erwise the system fails. Since the coupler cannot detect all fail- ures, one should model the fraction detected (called coverage) in a Markov analysis. Typical values for coverage are 0.90 and 0.95. Low coverage can significantly decrease the improvement in reliability achieved with redundancy. N-Modular Redundancy: The problems in designing a reli- able coupling device in a parallel or standby system can be avoided if N-Modular redundancy is used. In the most common case, N = 3, and the method is called TMR, triple modular redun- dancy, where three copies of a digital circuit are employed and their outputs are compared in a simple digital circuit that pro- duces an output agreeing with the majority. Thus, the system succeeds if there are no failures or one failure, the reliability is better than a single component in the high reliability region, (lt is small), and the availability can be high with repair. NonStop Computing: Many computing applications require very high levels of reliability and availability and Tandem com- puters (now a division of Compaq) coined the term NonStop computing in the 1980s to appeal to the on-line transaction pro- cessing market. Their system utilized software and hardware redundancy, as did their competitor Stratus, to achieve high lev- els of availability, (see Table 1). RAID: Data storage in modern computer system is general- ly implemented using a Redundant Array of Independent Disks (RAID). Many RAID techniques employ redundant disks and error-correcting bits applied to the data and spread over many disks. These techniques utilize 50 to several hundred disks and achieve huge storage densities and high reliability. Networks Much of the world is interconnected by a network(s). For many, this is the Internet at home and a local area network at work link- ing computers, printers, workstations, and the Internet. Several methods can be used to compute the reliability and availability of a network. However, we will focus on the most straightfor- ward approach. First, one should state that the network field uni- formly computes the availability of a network and call this the network reliability. The model of a network is based on the mathematical graph of the network where network nodes (ver- tices) represent computers, printers, terminals, etc. and the branches (arcs) in the network represent the communication links between the nodes. The simplest model assumes that the nodes do not fail but the links do. We define a path as a collec- tion of branches that provides communications between a speci- fied pair of nodes. The four-node network shown in Figure 4 has six branches. One method of analyzing the network is to examine all combinations of the six branches, where each branch can be either up or down. For example, the combination called event 41 is denoted by E41 = 123456 where links 1, 3, and 4 are down but 2, 5, and 6 are up. The combination E41 provides a path between nodes a and b, so we would say that E41 provides two terminal connections between ab. A detailed study of E41 shows that it also provides a connection between all the other node pairs. Thus, we can say that E41 provides all terminal connection. Since each of the branches has two states (up and down) and there are 6 branches, there are 26 = 64 different combinations of events. All of the events as formulated are mutually exclusive. Thus, the two ter- minal reliability between nodes s and t is the sum of the proba- bilities of each event that provides connection between s and t. Similarly, the all-terminal reliability is the sum of the probabili- ties of all the events that provide all terminal connection. Again, we emphasize that the term reliability really means availability because we speak of each branch being up or down, which means that they can fail and be repaired. Figure 4. A Four-Node Graph Representing a Computer or Communication Network (Reference 3, Figure 6.1, page 285) The preceding method for calculating reliability is called State- Space Enumeration and will become clearer as we fill in the details for calculating the two terminal reliability, Rab. The eas- iest way to explore the 64 combinations is to group them by the number of failed branches and explore each grouping. There is only one combination of no failures, E1 = 123456, and clearly this provides a connection between a and b. There are 6 combi- nations of one failure and all provide connection between a and b: E2 = 123456, E3 = 123456, E4 = 123456, E5 = 123456, E6 = 123456, and E7 = 123456. Clearly, as long as edge 1 is up, there is a connection, and E3 through E7 provide a connection. In the case of E2 although 1 is down, 25 provides a connection. In a similar manner, all the events are explored and the conclusion is that there are 48 good events, [the details are given in 1 2 3 4 5 6 b cd a T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r F o u r t h Q u a r t e r - 2 0 0 2 6 Reference 3, pp. 288-292]. The reliability expression is given by Equation (1). Rab = [P(E1)] + [P(E2) + ×××× + P(E7)] + [P(E8) (1) + ×××× + P(E22)] + [P(E23) + ×××× + P(E34) + P(E37) + ×××× + P(E42)] + [P(E43) + ×××× + P(E47) + P(E50) + P(E56)] + [P(E58)] If all edges are identical and have a probability of being up of p and a probability of being down of q, then Equation 1 becomes: Rab = p6 + 6 p5q + 15 p4q2 + 18 p3q3 + 7 p2q4 + pq5 (2) For the case where p = 0.9 and q = 0.1, substitution in Equation (2) yields an availability of 0.997848. To improve network reliability we would have to increase p or add more branches to the network. Many other more sophisticated techniques are available for com- puting the availability of a network, including approximate tech- niques and computer programs to help with the computations. There are network design techniques that help with the design of a reliable network given a set of nodes to connect, and others that suggest how to obtain the biggest increase in reliability by adding additional branches to an existing network. Apportionment and Redundancy Optimization In discussing redundancy in the preceding section, it was assumed that the unit in question was a component or a modest- ly sized system. A more global issue is, given a large system that has unsatisfactory reliability or availability, how can we best improve the reliability or availability by using a limited amount of redundancy in the system. This is essentially top down redun- dancy and although less efficient than bottom up redundancy, it fits in well with a system engineering approach to system design. Assume a block diagram represents the overall structure of the system, where each level of decomposition is represented by n submembers. If the decomposition produces 2 or 3 submembers, too little decomposition occurs; if there are more than 10 sub- members things are too complex. For a rationale for why 5-9 sub- members is a good goal [Reference 3, pp. 337-342]. In general, one has an overall system goal Rs (or As) and the reliability of all the submembers combine to yield the system reliability. If all the n submembers are independent and have an equal reliability r, then Rs = rn. Since we know Rs we can solve for r obtaining r = (Rs)1/n. This is a gross approximation and an easy first start. Assigning a set of reliabilities for the submembers that equals the system goal is generally called apportionment. A brief study of the reliability of each of the submembers will show how easy it is to reach the r goal for each submember. In general, some are reachable and some will be very difficult. Seldom does one discover that all the submember goals are easily met. However, sometimes all of them appear to be insurmountable goals. A better procedure than the equal reliability assumption is to use the preliminary analysis of the subsystems to establish a set of ratios of the initial failure rates. Suppose that n = 5 and the ini- tial design of subsystems one and two seem to have roughly equal reliabilities and a failure rate l, then l1 = l2 = l. However, sup- pose three and four are more difficult and have roughly double the failure rate, l3 = l4 = 2l. Furthermore, subsystem 5 is the most difficult and has a failure rate l5 = 6l. Although these are all just initial estimates, one can use them to obtain a better second round of estimates. The system reliability becomes: Rs = (e- lt )(e- lt )(e-2 lt )(e-2 lt )(e-6 lt ) = e-12 lt (3) Solving for l yields l = -ln(Rs)/12 and, using this value for l, we obtain reliability goals for each of the submembers. It is always useful to apply these simple techniques of apportionment (and sometimes the more complex procedures [Reference 3, pp. 342- 351]) in the initial design stages. One can formulate a more exact approach to reliability opti- mization using redundancy. As an example, assume that n = 3, that the initial estimates for the reliability of each subsystem is R1 = 0.85, R2 = 0.5, and R3 = 0.3. The system goal is Rg = 0.9; however, the initial design has a reliability of 0.85 x 0.5 x 0.3 = 0.1275. Each subsystem has a cost of 1 unit and the total cost budget is 16 units; thus 13 units are available to assign parallel elements to the three subsystems to achieve the system reliabili- ty goal. Our objective is to find the set of solutions where the system reliability Rs ³ Rg, the cost of each of these solutions, and then to choose among these possibilities. A brute force approach is to recognize that given the cost con- straints, we could assign to subsystem one, 0 to 13 parallel redun- dant units. Similarly, there are 14 different choices for subsystems two and three, and the total space of possible combinations is 14 x 14 x 14 = 2,744! Clearly with modern computational power, a brute force computation and ordering of the 2,744 cases is possi- ble. However, such an approach becomes impossible for any prac- tical size system. In the past, gradient search approaches (greedy algorithms) and dynamic programming have been used to reduce the number of cases that must be compared [Reference 3, pp. 369- 378]. Recently, a bounded enumeration approach has been dis- covered that allows a great reduction in effort only 10 combina- tions need to be explored [Reference 3, pp. 351-366]. The first bounds to apply are the lower bounds that are derived from the fact that each subsystem must have a reliability equal to or exceeding the system goal and these bounds assume all other subsystems have a reliability equal to unity. Thus, (1 - [1 - Ri]n i ) > Rg (4) Solving for ni from Equation (4) yields: ni = log(1 - Rg)/log(1 - Ri) (5) For the three subsystems of our example, n1 = 1.21, n2 = 3.32, and n3 = 6.46. Rounding to the next highest integer, n1 = 2, n2 = 4, (Continued on page 11) T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r F o u r t h Q u a r t e r - 2 0 0 2 11 Unreliable System Costs >75,000 Patients Lives Annually, Reliability Review, published by the Reliability Division of ASQ, September 2002, page 3. Poor process reliability is result- ing in the spread of infections in hospitals leading to more than 75,000 deaths annually. These infections are preventable if the same approach used to develop and control industrial processes is applied by hospitals. Step-Stress Accelerated Life Testing: Test Plan, Model, and Application, Reliability Review, published by the Reliability Division of ASQ, September 2002, page 13. A plan and model for conducting accelerated life testing is presented together with a case study in which the model is applied. Human Systems Integration, Program Manager, published for the Defense Acquisition University, July-August 2002, page 88. The authors explore the importance of designing for interaction of human beings with everything in the environment associated with systems. They describe the organizational and analytical processes associated with human systems integration. ISO/TS 16949 the Clear Choice for Automotive Suppliers, Quality Progress, published by ASQ, October 2002, page 44. General Motors Corp., Daimler-Chrysler, and Ford Motor Company are requiring their suppliers to transition from QS- 9000 to ISO/TS 16949 and to be certified for the latter by July 2004. The change is intended to facilitate access to various world markets. Reliability of Fault-Tolerant . . . (Continued from page 6) and n3 = 7. Thus, 13 units have been allocated (including the ini- tial 3), and only 3 cost units are left. Consequently, four addi- tional possibilities remain for each of the three units, 0,1,2,3 addi- tional units, and the number of cases to enumerate is 4 x 4 x 4 = 64. This number can be further reduced by calculating the upper bounds. Consider the allocation of 3 units to subsystem 1. If this is done, no additional redundancy can be applied to subsystems 2 and 3 because all the resources are expended. If we consider allocations above the minimum bound as incremen- tal policies, Dn1 = 3, Dn2 = 0, Dn3 = 0, or, more compactly, the incre- mental policy is (3,0,0), and sixteen of the 64 combinations are eliminated. Similar reasoning leads to the conclusion that (2,1,0) and (2,0,1) are allowed policies. Similarly (1,2,0), (1,1,1), and (1,0,2) are possible policies. Lastly, (0,3,0), (0,2,1), (0,1,2), and (0,0,3) are possible solutions for a total of 10 policies to explore. The solutions outlined are clearly displayed by the search tree shown in Figure 5. The branches radiating from the top level node represents the four incremental choices for n1, and the branches radiating from these four nodes represent the nine possible combi- nations for n1 and n2. The final branches represent the six addi- tional choices that can be made. The numbers at the bottom of each terminating path are the reliabilities achieved by the policy. Inspection of Figure 5 shows that the optimum incremental pol- icy is (1,1,1) with a reliability of 0.9098 and that the incremen- tal policy (0,1,2) is a close second with a reliability of 0.9087. Practical considerations that are unstated may make the choice of the second or perhaps third best policy advisable in some cases. This simple example explains the method and illustrates the huge reduction in computational cases due to the lower and upper bounds. It is fairly easy to program the algorithm and a personal computer provides ample computational power even in a large-scale problem. Conclusion This article has highlighted some of the more important topics that must be considered in the area of fault-tolerant computing. Because of the complexity of digital systems, many of these techniques must be used in combination to design a highly reli- able, highly available complex system. References 1. Hamming, R.W., Error Detecting and Correcting Codes, Bell System Technical Journal, 29 April 1950, pp. 147-160. 2. Shooman, M.L. and C. Marshall, A Mathematical Formulation of Reliability Optimized Design, Proceedings of 16th IFIP Conference on System Modeling and Optimization, Compiegne, France, July 5-9, 1993. (Also published in Henry, J. and J.P. Yvon, System Modeling and Optimization, Lecture Notes in Control and Information Sciences 197, Springer-Verlag, London, 1994, pp. 941-950.) n1 n2 n3 0.8602 0.8830 0.8885 0.9002 0.9098 0.8966 0.9067 0.9087 0.8905 0.8900 3 2 1 0 0 1 2 01 0123 1 2 31 2 1 Figure 5. A Search Tree for the Example (Reference 3, Figure 7.6, page 359) T h e J o u r n a l o f t h e R e l i a b i l i t y A n a l y s i s C e n t e r F o u r t h Q u a r t e r - 2 0 0 2 12 3. Shooman, M.L., Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, John Wiley, New York, 2002. 4. Siewiorek, D.P. and R.S. Swarz, Reliable Computer Systems Design and Evaluation, 3rd ed., A.K. Peters, , 1998. 5. Wicker, S.B. and V.K. Bhargava, Reed-Solomon Codes and their Applications, IEEE Press, New York, 1994. About the Author Martin L. Shooman, PhD, is currently President of the consult- ing firm of Martin L. Shooman & Associates. He served for many years as a Professor of Electrical Engineering and Computer Science at Polytechnic University in Brooklyn, NY. Dr. Shooman has been a Visiting Professor at MIT and Hunter College, and a consultant to Bell Laboratories, NASA, IBM, the U.S. Army, and many other government and commercial organ- izations. A Fellow of the IEEE, Dr. Shooman has received five Best Paper awards from the IEEE Reliability and Computer Societies. He has contributed to over 100 papers and reports to the research literature, and has given special courses in Britain, Canada, France, Israel, and throughout the U.S. Dr. Shooman is the author of Probabilistic Reliability: An Engineering Approach and Software Engineering: Design, Reliability, and Management. His latest book is Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. Published in January 2002, the book is 552 pages and is priced at $110.00. It can be ordered from the publisher, John Wiley & Sons. ISBN: 0-471-29342-3 NORTH AMERICA: John Wiley & Sons, Inc. 1-800-225-5945 REST OF THE WORLD John Wiley & Sons, Ltd. +44(0) 1243 779777 RAC Introduces Its New Professional Sustainment Program Series By: Tim Cathcart and Joel Manary Introduction Sustainment professionals need a parallel set of skills and tools. One set needs to focus on management aspects of integration of support elements and integration of sustainment issues with other program management functional areas. Numerous training and cer- tification programs address this need. The other set needs to focus on the engineering and other technical aspects of sustainment in a practical and user-oriented fashion. No training and certification in a condensed and user-oriented context has been available. The Reliability Analysis Center (RAC) has recently developed a new professional certificate program specifically designed for the unique problems currently facing the sustainment communi- ty. The Professional Sustainment Program provides the essential technical skills, methods, and tools to implement many new strategies and principles required in economically sustaining an enterprise and the products created by that enterprise. The Professional Sustainment Program series consist of several courses that comprehensively integrate traditional sustainability analysis and operational effectiveness methodologies with new research findings from several fields of studies, such as supply chain management, integrated lean enterprises, and e-logistics. The Office of the Secretary of Defense has directed all DoD logistic-wide initiatives to adopt commercially proven prac- tices and strategies. These logistic transformation objectives include the implementation of many commercial practices such as supply chain management and agile manufacturing and MRO (Maintenance Repair Overhaul) concepts. Many DoD organiza- tions have established transformation offices to implement these new strategies. The organizations using system continue to demand increased attention to sustainment issues affecting sys- tems readiness, and the responsiveness of the logistics infra- structure. The rush to field systems without settling sustainabil- ity requirements continues to plague acquisition projects. Professional Sustainment Program The Professional Sustainment Program series was developed to provide an educational resource for the sustainment community. It will help them develop the management and technical skills needed to design and implement cost-effective, integrated sus- tainment networks and agile organizational structures, while still addressing the unique problems facing the sustainment commu- nity such as aging systems and providing life cycle support for Commercial-Off-The-Shelf items. Commercially proven supply chain management and lean enter- prise practices have significantly benefited the manufacturing and retail industries but have been difficult to apply in the defense industry because of the high degree of variability in both source material and low volume production requirements. Under ideal conditions, a sustainment supply chain network would be responsive and flexible enough to meet varying demand conditions. The right types of material and parts would be available in the right quantities, at the right place, at the right time, at an affordable cost. Parts and material shortages coupled with increased maintenance requirements are just some of the issues facing the sustainment community in todays environ- ment. The logistic transformation from a cold war mass pro- duction operational model into a smaller and more mobile lean and agile post cold war operational model will require signifi- cant cultural and infrastructure changes.