Ensuring Reliable Performance

About the RIAC Blueprints
The RIAC "Blueprints for Product Reliability" are a series of documents published by the Reliability Information Analysis Center (RIAC) to provide insight into, and guidance in applying, sound reliability practices. The RIAC is the Information Analysis Center chartered to be a centralized source of data, information and expertise in the subjects of reliability, maintainability and quality. While sponsored by the US Department of Defense (DoD), RIAC's charter addresses both military and commercial communities with the requirement to disseminate guidance information in these subjects. The Blueprints serve to provide information on those approaches to planning and implementing effective reliability programs based on experience, lessons learned, and state-of-the-art techniques. To make the Blueprints as useful as possible, the approaches and procedures are based on the best practices used by commercial industry and on the concepts documented in many of the now-rescinded military standards. The tree shown in Figure 1 depicts the Blueprints that make up the series.

In the government sector, and in particular the DoD, significant changes have been made regarding the acquisition of new products. Previously, by imposing standards and specifications, a DoD customer would require contractors to use certain analytical tools and methods, perform specific tests in a prescribed manner, use components from an approved list, and so forth. Current policy emphasizes the use of commercial technology as well as specifying "performance-based" requirements only, with suppliers left to determine how to best achieve them.

Figure 1. RIAC Blueprints for Product Reliability

Users of the RIAC Blueprints
The Blueprints are designed for use in both the government and private sectors. They address products ranging from completely new commercial consumer products to highly specialized military systems. The documents are written in a style that is easy to understand and implement whether the reader is a manager, design engineer or reliability specialist. In keeping with the new philosophy of the DoD, which is now similar to that of the private sector, the Blueprints do not provide a cookbook of reliability tasks that should be applied in every situation. Instead, some general principles are cited as the underpinnings of a sound reliability program. Then, many of the tasks and activities that support each principle are highlighted in detail sufficient for the user to determine if a task or activity is appropriate to his or her situation.



SECTION ONE - INTRODUCTION

The purpose of this Blueprint, Ensuring Reliable Performance, is to provide guidance in planning and implementing actions that will ensure that a product achieves and maintains an acceptable level of reliability performance over its entire life cycle, i.e., it continues to meet its customers' reliability needs.

Reliability is traditionally considered to be a performance attribute that is concerned with the probability of success and frequency of failures, and is defined as:

       The probability that an item will perform its intended function under stated conditions, for either a specified interval or over its useful life.

A distinction needs to be made, however, between inherent reliability and achieved reliability. Inherent reliability is that which should be expected based on the product design approach if production and handling factors do not degrade it. Achieved reliability is that which is achieved during actual customer use when the inherent reliability may have been degraded by production and handling factors.

There are many actions that may be taken to ensure that a product will be reliable. Each has its own benefits and penalties. An effective product development will include a mix of reliability tasks that are selected to be cost-effective to that particular program and to add value for the customer. Section Two discusses the relationship of inherent and achieved reliability, and Section Three presents reliability tasks that have proven useful in ensuring reliability, which can be applied where they add value to a product.

The discussion of each reliability task will consider:
  • Purpose (what)
  • Benefit (why)
  • Timing (when)
  • Application guidelines (how)



SECTION TWO - INHERENT VS. ACHIEVED RELIABILITY

Reliability tasks should be selected to fit a specific product development program. If a commercial off-the-shelf product is purchased, it makes little sense for the customer to conduct a statistical design of experiments, since the customer does not control the design or production of the product. On the other hand, the use of environmental stress screening may prove quite useful in ensuring the reliability of a purchased product before passing it on. To effectively tailor a reliability program, a program manager should be able to determine which elements will be most useful in achieving the program objective. This Blueprint will provide a familiarity with those program tasks concerned with ensuring reliability.

2.1 Tailoring the Program

It is often said, "reliability must be designed into a product." This emphasizes the fact that nothing can make a poor design reliable. However, it is quite possible, and quite common, for a good design to be compromised by other factors. For example, a poor lot of parts, faulty workmanship, shipping/transportation stresses, or inadequate manufacturing processes that introduce defects can cause failures even in a well-designed product. Hence, the achieved reliability of the product (as experienced by the user) may be far worse than the reliability inherent in its design.

A reliability prediction made for a product is based on its design and is an estimate of inherent reliability, since it assumes part failure rates, manufacturing quality, and handling factors are all as expected. To assure that the achieved reliability is reasonably close to predicted reliability, the product developer may undertake a variety of reliability tasks. The exact mix of tasks will depend on the particular product objectives.

Table 1 provides preliminary guidance as to the usefulness of the reliability tasks that will be discussed in Section Three under a variety of program scenarios. In the table, a "plus" sign (+) indicates that the activity offers value to the product under that circumstance. A "minus" sign (-) means that the activity is probably not cost effective for that circumstance. A "question mark" (?) indicates that the activity may or may not add value for that circumstance, depending on the type of product. Program variables considered are New Development (i.e., a product will be designed and built for the first time), COTS (an item available as a commercial off-the-shelf product), Safety Critical (e.g., a nuclear plant control system), Dormancy (i.e., an item to be subjected to long periods of storage or other nonoperating usage), Long Life (an item likely to be in service for a relatively long time, like the B-52 bomber), Harsh Environment (e.g., high shock, rapid thermal cycling, etc.) and S/W (Software Development). There are other factors not in the table which should also be considered. These might include the suppliers' reputations, the leverage of the producer in dealing with suppliers, the customer's expectations, and the relative importance of reliability and program cost.

Table 1. Techniques to Ensure Reliability
Element | New Dev. | COTS | Safety Critical | Dormancy | Long Life | Harsh Env. | S/W Dev.
Critical Item Control | + | - | + | ? | + | + | ?
Design of Experiments | + | - | ? | ? | + | + | ?
Environmental Characterization | + | + | + | + | + | + | ?
Environmental Stress Screening | ? | + | + | ? | ? | + | -
FMECA | + | - | + | + | + | + | +
FRACAS | + | + | + | + | + | + | +
Inspection | ? | + | + | + | ? | ? | +
Life Cycle Planning | + | - | + | + | + | + | ?
Market Survey | + | + | + | + | + | + | +
Parts Obsolescence | + | + | + | ? | + | ? | -
PRAT | ? | ? | + | ? | ? | ? | ?
Repair Strategy | + | - | + | ? | ? | ? | +
Statistical Process Control | + | ? | + | ? | ? | ? | -
Supplier Control | + | + | + | ? | + | + | +
Test Strategy | + | - | + | + | + | + | +
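As a rough illustration of how Table 1 might be consulted when tailoring a program, the sketch below transcribes the ratings into a lookup and lists the tasks carrying a given rating for one program scenario. The names `SCENARIOS`, `TABLE_1` and `tasks_rated` are assumptions made for this example only; they are not part of the Blueprint, and the table remains preliminary guidance rather than a decision rule.

```python
# Ratings transcribed from Table 1; each string follows the column order in SCENARIOS.
SCENARIOS = ["New Dev.", "COTS", "Safety Critical", "Dormancy",
             "Long Life", "Harsh Env.", "S/W Dev."]

TABLE_1 = {
    "Critical Item Control":          "+-+?++?",
    "Design of Experiments":          "+-??++?",
    "Environmental Characterization": "++++++?",
    "Environmental Stress Screening": "?++??+-",
    "FMECA":                          "+-+++++",
    "FRACAS":                         "+++++++",
    "Inspection":                     "?+++??+",
    "Life Cycle Planning":            "+-++++?",
    "Market Survey":                  "+++++++",
    "Parts Obsolescence":             "+++?+?-",
    "PRAT":                           "??+????",
    "Repair Strategy":                "+-+???+",
    "Statistical Process Control":    "+?+???-",
    "Supplier Control":               "+++?+++",
    "Test Strategy":                  "+-+++++",
}


def tasks_rated(scenario, rating="+"):
    """Return the Table 1 tasks that carry `rating` for one scenario."""
    column = SCENARIOS.index(scenario)
    return [task for task, row in TABLE_1.items() if row[column] == rating]


# Tasks that Table 1 marks as probably not cost effective for a COTS purchase.
print(tasks_rated("COTS", "-"))
```

A "?" rating still needs engineering judgment, so a lookup like this can only shortlist candidates; the other factors named above (supplier reputation, customer expectations, program cost) are not captured.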

2.2 Product Program Phases

Each product, from the simplest to the most complex, passes through a sequence of phases during its life cycle. The definitions of the phases vary among commercial companies, and within the military. Table 2 describes the sequence of general phases that will be used in this document to describe a product's life, and the appropriate timing of applying tasks to ensure reliable product performance.

Table 2. Product Life Cycle Phases
Concept/Planning | Design/Development | Production/Manufacturing | Operation/Repair | Wearout/Disposal
  • Formulate ideas, estimate resources and financial needs
  • Identify risks & requirements
  • Program objective
  • Identify and allocate needs and requirements
  • Propose alternate approaches
  • Design and test the product
  • Develop manufacturing, operating, and repair/maintenance tasks
  • Refine and implement manufacturing procedures
  • Finalize production equipment
  • Establish quality processes
  • Build & distribute the product
  • Implement operating, installation and training procedures
  • Provide repair and maintenance service
  • Repair warranty items
  • Provide for performance feedback
  • Implement refurbishment and disposal tasks
  • Resolve potential wearout issues

What distinguishes one phase from the next is generally a decision milestone, sometimes referred to as a "gate." It represents a point in time where the program can go forward or stop. For many products, the phases may be abbreviated or combined. For example, the Concept/Planning and Design/Development phases may be combined under a compressed schedule for a new product that is simply an update or slightly modified version of an older, proven product. Reliability tasks for this type of program would concentrate only on the differences between the old and the modified product. As a result, the number of engineering tasks would be reduced. Tasks required to ensure product reliability, however, would be applied to both the old and new elements of the modified product. It is important to understand that tasks performed in one phase are often the result of the analysis, trade-offs and planning performed in an earlier phase. For example, to ensure that the inherent reliability of printed circuit boards is retained throughout the Production/Manufacturing phase of the product life cycle, test and repair strategies would need to be developed during the Concept/Planning phase.

2.3 Tasks to Ensure Inherent Reliability

Section Three of this Blueprint provides insight into those tasks that will help to ensure that the inherent design reliability of the product is maintained during customer use. Table 3 represents those tasks (historically classified as design, analysis and test) that have been proven to have a positive influence on ensuring reliable product performance when properly tailored to add value for the customer.



SECTION THREE - TASKS FOR ENSURING RELIABLE PERFORMANCE

3.1 Life Cycle Planning

3.1.1 Purpose. Life cycle planning is the development of design guidance, and test and repair strategy through consideration of the expected conditions impacting the product from its introduction into the market place to its disposal. It helps ensure the achievement of satisfactory reliability by accounting for the environmental and use factors which will impact the reliability.

3.1.2 Benefit. Life cycle planning provides the best possible way for ensuring that the product will operate reliably and economically over its planned lifetime. The product must have enough strength to meet its most severe environmental stress, and enough endurance to last its entire life cycle. Without life cycle planning, the product may be underdesigned, resulting in poor achieved reliability and dissatisfied customers, or overdesigned, resulting in unnecessary product costs.

Table 3. Reliability Tasks Relevant to Ensuring Reliable Performance
Type of Activity | Tasks and Description | Section
DESIGN | Critical Item Control. Monitoring in-house and suppliers' activities to reduce the risk to product reliability from items identified as critical. Can include hardware and software. | 3.2
DESIGN | Environmental Characterization. Iterative assessment of the operational stresses the product can be expected to experience to ensure inherent reliability reflects actual use. | 3.3
DESIGN | Supplier Control. Monitoring suppliers' activities to assure that purchased hardware and software will have adequate reliability. | 3.4
ANALYSIS | Design of Experiments (DOE). Systematically determining the impact of process and environmental factors on a desired product parameter, in order to reduce product variability by controlling the factors. | 3.5
ANALYSIS | Failure Modes, Effects & Criticality Analysis (FMECA). Systematically determining the effects of part or software failures on the product's ability to perform its function. This task includes FMEA. | 3.6
ANALYSIS | Failure Reporting, Analysis & Corrective Action System (FRACAS). A closed-loop system of data collection, analysis and dissemination to identify and correct failures of a product or process. | 3.7
ANALYSIS | Life Cycle Planning. Determining reliability (and other) requirements by considering the impact over the expected useful life of the product. | 3.1
ANALYSIS | Parts Obsolescence. Analysis of the likelihood that changes in technology will make the use of a currently available part undesirable. | 3.8
ANALYSIS | Repair Strategies. Determination of the most appropriate or cost effective procedures for restoring product operation after it fails. | 3.9
TEST | Environmental Stress Screening. Operating a product under high stress to identify defects (by causing them to become failures) in order to eliminate them before a product is shipped to the customer. | 3.11
TEST | Production Reliability Acceptance Test (PRAT). Testing a product during production to ensure that its inherent reliability has not degraded. | 3.12
TEST | Test Strategy. Determination of the most cost effective mix of tests for a product. | 3.10
OTHER | Statistical Process Control (SPC). Comparing the variability in a product against statistical expectations to identify any need for adjustment of the production process. | 3.13
OTHER | Market Survey. Determining the needs and wants of potential customers, their probable reaction to potential products, and their level of satisfaction with existing products. | 3.14
OTHER | Inspection. Comparing a product to its specifications, as a quality check. | 3.15

3.1.3 Timing. Life cycle planning should be done in the Concept/Planning phase before the product is designed, as the design will be the largest determiner of its reliability and longevity. Life cycle planning should be used in developing environmental characterization approaches and test strategy. It may be necessary to redo life cycle planning during the Operation/Repair phase, when circumstances require the extension of the use of the product beyond its original planned useful life.

3.1.4 Application Guidelines. The key to life cycle planning is the determination of the environments to which the product will be subjected. A personal computer for home use will not experience stresses as severe as the engine controls of an airliner and this difference will translate into different design rules and test strategies needed to ensure adequate achieved reliability. The transportation, handling and storage stresses expected must also be determined for realistic planning to ensure reliable performance. Besides the physical environment (e.g., temperature, vibration, etc.), the use environment (e.g., speed, cycle rate, miles, etc.) should also be considered, especially for mechanical parts which may exhibit wearout modes of failure. If the product will be subjected to significant periods of dormancy during its lifetime, its design and packaging must include remedies for related failure mechanisms (e.g., corrosion, outgassing, etc.).

3.2 Critical Item Control

3.2.1 Purpose. A critical item is one that, by virtue of high cost, limited availability or known reliability problems, could jeopardize the success of the product. Critical item control represents those actions taken to prevent problems with those items having a negative impact on product reliability, thereby ensuring reliable performance.

3.2.2 Benefit. Management should give priority to those considerations which are most important to the success of their products. By definition, critical items are the product components with the greatest potential impact on the product's success. Making critical item control a program task formalizes and organizes action for the handling of critical items, for the review and guidance of management.

3.2.3 Timing. A critical item list should be created as soon as conceptual planning identifies product components. For example, when Lockheed decided that the SR-71 would be constructed of titanium, the known difficulty of working with the metal immediately made it a critical item. A critical item list should expand as the design process identifies more components and shrink as perceived problems are overcome during the Design/Development phase. In most cases, critical item control will extend through the Production/Manufacturing and Operation/Repair phases of the product life cycle.

3.2.4 Application Guidelines. Items are declared critical by the authority of the product development manager, for any of the reasons given in Table 4. When an item is declared critical, it is entered on a critical items list maintained by the product development manager or a delegated deputy. Responsibility is assigned by the manager to an individual or team to take action to minimize the risk in using the item. The responsible party will create and implement an action plan to remedy the cause of criticality. Table 4 shows some possible actions to counter some causes of criticality.

Table 4. Countering Criticality
Cause of Criticality | Possible Corrective Actions
High cost | Redesign item, change item manufacturing process, redesign product to eliminate item
Potential safety hazard | Redesign to minimize hazard, add protective devices
Known difficulties (e.g., reliability problems, manufacturability, etc.) | Find and eliminate root cause of problems, back-up with alternatives, redesign
Limited sources | Develop more suppliers, use substitutes, improve yield of manufacturing process
No known substitutes | Supplier control, develop alternatives
Technical uncertainty | Testing, process control and improvement, back-up with alternatives
Special support needs (e.g., cryogenics, high voltage, etc.) | Redesign to eliminate item, develop improved items without special needs, improve support system reliability

A periodic review of critical items should be included in every program schedule. Items should be removed from the critical item list only when the product development manager is satisfied that the item no longer poses a significant risk. All known critical items should be in reasonable control before a program transitions to the next program phase.
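The bookkeeping described in these guidelines can be pictured as a small register. The sketch below is only one possible arrangement, assuming a single list keyed by item name; the class and field names (`CriticalItem`, `CriticalItemList`, `in_control`, `manager_satisfied`) and the owner shown are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CriticalItem:
    name: str
    cause: str          # cause of criticality, e.g. a row of Table 4
    owner: str          # individual or team assigned by the manager
    action_plan: str    # plan to remedy the cause of criticality
    in_control: bool = False  # updated at each periodic review


@dataclass
class CriticalItemList:
    items: dict = field(default_factory=dict)

    def declare(self, item: CriticalItem) -> None:
        """Enter an item that the manager has declared critical."""
        self.items[item.name] = item

    def remove(self, name: str, manager_satisfied: bool) -> None:
        """Remove an item only when the manager judges the risk resolved."""
        if not manager_satisfied:
            raise ValueError(f"{name} still poses a significant risk")
        del self.items[name]

    def ready_for_next_phase(self) -> bool:
        """True when every listed item is in reasonable control."""
        return all(item.in_control for item in self.items.values())


register = CriticalItemList()
register.declare(CriticalItem(
    name="Titanium structure",
    cause="Known difficulties",
    owner="Materials team",  # hypothetical owner, for illustration
    action_plan="Find and eliminate root cause of problems",
))
print(register.ready_for_next_phase())  # item is not yet in control
register.items["Titanium structure"].in_control = True
print(register.ready_for_next_phase())  # all listed items now in control
```

The phase-gate check mirrors the rule that all known critical items should be in reasonable control before a program transitions; what counts as "reasonable control" remains a management judgment that the flag merely records.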

3.3 Environmental Characterization

3.3.1 Purpose. The reliability of a product is directly related to its intended use environment. Environmental characterization is the determination of the parameters of the use environment so that action can be taken to correct any factors which may degrade the inherent reliability of the product.

3.3.2 Benefit. As Table 5 illustrates, the effects of environmental stresses are generally predictable, and can be countered by effective product design. However, the design is not likely to be effective unless the use environment is known or sufficiently estimated through environmental characterization. Inappropriate design resulting from a lack of environmental knowledge may result in the adverse effects of Table 5, with subsequent degradation of inherent product reliability. The amount of degradation will be proportional to the mismatch between the product design and its environment, and can be disastrous. For example, a radio which performs well in a commercial airliner may be completely unsuitable for the harsher environment of a fighter plane, resulting in a high level of risk to mission success and personnel safety.

Table 5. Environmental Effects
Stress | Effects | Countermeasures
High temperature | Parameter shifts, material softens, evaporation, outgassing, reduced viscosity, faster chemical reactions, materials expand | Cooling systems, insulation, heat-resistant materials
Low temperature | Embrittlement, cracking, parameter shifts, increased viscosity, ice formation, materials contract | Heating systems, insulation, cold-resistant materials
Thermal shock | Cracking, crazing, delamination, mechanical failures | Stress relief, insulation, thermal buffers
Mechanical shock | Mechanical failures, deformation, displacement | Stronger parts, shock absorbers
Vibration | Material fatigue failures, loosening, transient failures, increased wear | Stiffer components, vibration mounts
Humidity | Electrical shorts, enhanced oxidation, swelling | Hermetic seals, protective coatings
Dryness | Embrittlement, granulation, enhanced evaporation | Humidifiers, seals, protective coatings
Salt spray | Galvanic corrosion, degraded insulation, enhanced oxidation | Seals, coatings, avoid dissimilar metals, non-metal components
Electromagnetic radiation | Spurious signals, jamming, data corruption | Shielding, grounding, frequency selection, resistant part types
Nuclear/cosmic radiation | Damage to microcircuits, heating, data corruption, changes in parameters | Shielding, component selection, error recovery methods
Sand/dust | Scratching, friction, clogged openings, wear, abrasion, corona paths | Filters, seals
Low air pressure | Insulation breakdown, outgassing, container failures, loss of cooling effectiveness | Pressurization, stronger containers, non-convective cooling, material selection

3.3.3 Timing. Environmental characterization should precede the Design/Development phase so that the design will address the predominant environmental factors. In addition, environmental characterization should precede any application of the product which was not considered in its original design and development. For example, a CD player designed for home use should not be installed into an automobile without considering the effects of the more severe automotive environment on the product's reliability.

3.3.4 Application Guidelines. The best way to perform environmental characterization is to measure the use environment directly, where it is possible. There are a variety of measuring tools, from a crayon that changes color with temperature to a miniature Time Stress Measurement Device that will record a variety of environmental factors at a location on a printed circuit board.

When the environment cannot be measured directly, it must be estimated from available data. These might include measurements of similar environments, weather statistics, etc. Discussions with potential users can also be valuable inputs.

It is important to note that the end-use operating environment should not be the only consideration. A product is also subject to storage and shipping environments which have an impact on reliability. For example, a computer intended for laboratory use will likely encounter more severe stresses during transportation and storage than it will in use. These environments must also be factored into the design and packaging of the product to preserve inherent reliability.

To illustrate the range of stresses that a product may experience, Table 6 provides some values of environmental factors for different applications. Not all products will be exposed to the extreme values listed, but the data shows potential areas of concern. Besides the stresses listed in the table, there will be others dependent on application, such as salt spray, radiation, sand and dust. Space systems will experience high vacuum, cold temperatures to absolute zero, and solar radiation. Products used in military combat systems may experience extreme vibration or shock from gunfire. All products will have some transportation stresses.
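One simple use of such data is a first-pass screen of a product's rated operating temperature range against the operating ranges listed in Table 6. The sketch below transcribes a subset of those ranges; the function name `uncovered_applications` and the 0°C to 50°C rating in the example are assumptions for illustration, and a real characterization would also consider storage, transportation, vibration, shock and humidity.

```python
# Operating temperature ranges in degrees C, transcribed from Table 6 (subset).
OPERATING_RANGE_C = {
    "Truck": (-40, 55),
    "Shipboard Internal": (0, 50),
    "Submarine": (22, 25),
    "Airplane": (-30, 71),
    "Helicopter": (-30, 65),
}


def uncovered_applications(rated_min_c, rated_max_c):
    """List applications whose operating range exceeds the product's rating."""
    return [
        app
        for app, (low, high) in OPERATING_RANGE_C.items()
        if rated_min_c > low or rated_max_c < high
    ]


# Hypothetical product rated for 0 to 50 degrees C operation.
print(uncovered_applications(0, 50))
```

Applications returned by the screen are those where the design, or the decision to use the product there, would need a closer look before the rated range can be relied upon.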

3.4 Supplier Control

3.4.1 Purpose. Supplier control is intended to ensure that the reliability of purchased parts and components being used in a product remains good enough to permit the product to achieve its inherent reliability.

3.4.2 Benefit. Supplier control is the means of ensuring that the reliability of purchased components remains known and satisfactory despite possible changes in design and production processes between the original selection of the component and its use in the product.

3.4.3 Timing. Supplier control begins with the initial selection of vendors for the product components in the Design/Development phase, and continues throughout the life of the product. A partnership between parts suppliers and the product developer should be established when the decision is made to procure a part, and continue actively until the product reaches the end of its life cycle.

Table 6. Potential Environmental Stresses
Application | Power Fluctuation | Hot Extreme | Cold Extreme | Operating Range | Vibration | Shock | Humidity
Fixed Ground | ±7% | 85°C | -54°C | 25°C (A/C); 40°C (Non A/C); 60°C (Tropical) | - | - | -
Truck | ±10% | 85°C | -54°C | -40°C to +55°C | 5-200Hz 3.5G | 11.2G (mounted); 18.5G (cargo) | 0-100%
Tracked Vehicle | ±10% | 85°C | -54°C | -40°C to 55°C | 5-500Hz 4.2G | No data | 0-100%
Truck Mounted Shelter | ±10% | 85°C | -54°C | -40°C to +55°C | 5-200Hz 3.5G | 11.2G | 0-100%
Man Pack | +33% -20% | 85°C | -54°C | -40°C to +55°C | - | - | 0-100%
Shipboard External | ±7% | 65°C | -50°C | -32°C to +48°C | 1-50Hz 1G | - | 100%
Shipboard Internal | ±7% | 65°C | -50°C | 0°C to +50°C | 1-50Hz 1G | - | -
Submarine | ±7% | 65°C | -50°C | 22°C to +25°C | 1-50Hz 1G | - | -
Small Craft | ±10% | 71°C | -62°C | -54°C to +65°C | 1-50Hz 1G | - | 100%
Train | ±10% | 54°C | -32°C | -40°C to +55°C | 30-100Hz 2G | 20G (mounted); 70G (cargo) | -
Airplane | ±10% | 71°C | -54°C | -30°C to +71°C | 3-1000Hz 5G | - | 0-100%
Helicopter | ±10% | 85°C | -62°C | -30°C to +65°C | 3-500Hz 4G | - | 0-100%
Missile | ±10% | 71°C | -64°C | -22°C to +43°C | 3-5000Hz 30G | 15G | 0-100%
Sources - MIL-HDBK-781 and "Vibration Analysis for Electronic Equipment", Dave S. Steinberg, John Wiley & Sons, 1988.

3.4.4 Application Guidelines. Every process has inputs as well as outputs. The outputs of the process cannot be better than the inputs will allow. The reliability of a product will be constrained by the reliability of the components used by the manufacturer. It is imperative to ensure that the suppliers to the development process meet the needs of the product. There are many ways to attempt to control suppliers. One approach is the use of specifications and audits. For example, ISO 9000 is a series of international standards on quality system management. Companies are certified as compliant to ISO 9000 by independent auditors. This assures potential customers that there is a working quality system in place with the elements considered essential to produce a quality product. Unfortunately, this alone does not guarantee the reliability of the parts purchased from them. Even a part which has demonstrated high reliability in one application may fail miserably in another. The vendor and product developer should work together. Each possesses different assets, as shown in Table 7.

Table 7. Vendor and Product Developer Assets
Vendor | Product Developer
Specialized test equipment | Knowledge of intended use
Part failure history | Application specific failure data
Knowledge of failure mechanisms | Environmental data
Control of part production process | Control of product production

These complementary assets make close collaboration necessary for a successful product development. This is more likely to succeed when the product developer has a long term relationship with a small number of suppliers, rather than "arms length" transactions with a host of vendors. Some manufacturers fear the possible loss of vendors, but a long term relationship is likely to help ensure that the chosen vendors will have enough business to survive.

The partnership between the product developer and supplier begins with selection of the supplier. Besides performance, form, fit, function, and price of the purchased item, inherent reliability in the planned use environment should also be a factor in selection. However, to ensure achieved reliability will be satisfactory, the initial selection should consider the following:
  • Does the supplier consider reliability as a critical part parameter?
  • Is reliability data on the component maintained?
  • Is sufficient analysis performed to find the root causes of failures of the component?
  • Is variability of the supplier's manufacturing processes controlled (e.g., through Statistical Process Control)?
  • Are all the parameters critical to the application cited in the component specification?
  • Does the supplier know the underlying failure mechanisms affecting the component?
  • Does the supplier attempt to obtain reliability data from its customers?
  • Is the impact on reliability considered by the supplier before making design or process changes?
  • Is there a willingness to share reliability information on components with the product developer?
Affirmative answers to these questions will ensure a good start to the partnership. The product manufacturer will need to define for the supplier the intended use and expected use environments. The planned manufacturing process will also be of interest. For example, if the component will be soldered into the product, details of the soldering process should be provided so the supplier can determine if it poses any hazard to the component.

Once a supplier is selected, each party should assure that they create no surprises for the other. Suppliers should advise the product developer of any significant changes in design or manufacturing processes that could affect form, fit or function, and the product developer should advise the supplier of any significant changes in the intended use or manufacturing environment.

During production and use of the product, details of all failures of the component should be fed back to the supplier. Failed components may also be returned for failure analysis by the supplier. The root cause of every failure should be determined by a cooperative effort and corrective action taken by either party, or both as necessary, to prevent its recurrence.

A product developer should spend some time with new suppliers to gain confidence that they understand the developer's needs. After the initial groundwork, a minimum of effort should be needed to keep the supplier base current. The net result will likely be that long term relationships will take less effort and produce better results.

3.5 Design of Experiments

3.5.1 Purpose. Design of Experiments (DOE) is a statistical approach to identify and improve factors that impact product performance. It can be used to determine which production process factors impact the achievement of reliability, and the required values of these factors to ensure adequate product reliability.

3.5.2 Benefit. It is often not obvious what factors impact reliability, by how much and how these factors may interact. For example, temperature and solder wave height may impact the reliability of printed circuit boards going through a wave solder machine. How significant each factor is, and at what values should they be set, are questions which require some testing to answer. A properly tailored DOE can address these questions in the most economical way by determining the contribution of, and any interaction between, factors which may degrade the inherent reliability of the product without testing all possible combinations of factors.

3.5.3 Timing. Design of Experiments can be done at any time in the product life cycle. However, there will be no benefit unless there is an opportunity to take advantage of the results. The most benefit will obviously come from experiments conducted before the product is finalized or the process is implemented. On the other hand, a clear idea of the product or process is necessary to avoid irrelevant testing. For example, testing at inappropriate stress levels, or incorrect identification of the critical product characteristics, will potentially result in process decisions based on bad information.

3.5.4 Application Guidelines. A general procedure for DOE is shown in Figure 2. As an example, a procedure to develop a DOE for a wave solder machine will be used.

Figure 2. DOE Process

Select Factors. While it is important to test all factors which are significant, each additional factor tested adds cost. It is good practice to convene a team of those people most familiar with the process and have them brainstorm a list of potential contributing factors, then analyze them and recommend a short list for test.

The factors to be tested should directly relate to the output parameter of interest. If the primary concern was the resonant frequency of the circuit on the board after soldering, a different set of factors would be chosen than if the main concern was reliability. Given a concern for reliability, the output parameter of interest might be failure rate, but this might require too lengthy a test to be practical. If the defect rate (i.e., the number of solder defects per board) can be considered as an indication of operational reliability, then the factors which impact defect rate should be selected as the DOE parameters of interest.

For this example, assume that a team has decided that temperature, wave height, flux and cleanliness of the board are the important factors. Note that cleanliness of the board may not be under the control of the solder process operator.

Select Test Levels. The experiment will consist of a series of tests, during which each factor must be set to at least two values. It is possible to use more than two values for any factor (e.g., test at five different temperatures), but each additional value tested will significantly lengthen the experiment and add complexity to the analysis. The two values selected should be far enough apart that the difference in their impacts can be observed, but close enough that the response is approximately linear between them.

Once the settings are selected, they are coded, which permits some computational shortcuts. One way is to assign one setting the code value "plus" (+) and the other, "minus" (-). For example, if solder temperature test points of 400 and 380 degrees have been selected, one temperature would be coded as "plus" (it does not matter which) and the other as "minus". For the presence and absence of flux, the presence might be coded "plus" and the absence, "minus" (or vice versa). The test matrix will be set up using the coded values. For this example, the test settings in Table 8 have been used.

Table 8. Test Settings
Factor | "Plus" Setting | "Minus" Setting
Temperature | 400 degrees | 380 degrees
Wave height | 12 mm | 10 mm
Flux | present | absent
Cleanliness | clean boards | dirty boards
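
The coding of settings can be sketched as a simple linear mapping. The sketch below assumes that "plus" is represented numerically as +1 and "minus" as -1 (a common convention that the text does not state explicitly); the function names `decode` and `encode` are illustrative only.

```python
def decode(coded: float, minus_setting: float, plus_setting: float) -> float:
    """Map a coded level in [-1, +1] to a physical setting on a linear scale."""
    midpoint = (plus_setting + minus_setting) / 2
    half_range = (plus_setting - minus_setting) / 2
    return midpoint + coded * half_range


def encode(setting: float, minus_setting: float, plus_setting: float) -> float:
    """Map a physical setting back to its coded level."""
    midpoint = (plus_setting + minus_setting) / 2
    half_range = (plus_setting - minus_setting) / 2
    return (setting - midpoint) / half_range


# Table 8 settings for temperature (degrees) and wave height (mm).
print(decode(+1, 380, 400))   # physical value of the "plus" temperature
print(decode(0, 10, 12))      # wave height halfway between the two settings
print(encode(390, 380, 400))  # coded level of a 390 degree setting
```

Only the continuous factors (temperature and wave height) can meaningfully take intermediate coded values; flux and cleanliness are limited to the two coded levels.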

Set Up a Test Matrix. There are various ways to set up a matrix of tests. One of these is the orthogonal array, which is simply a tool for structuring a test so that the effects of the test factors can be easily separated. Suppose it has been decided to test for the effects of temperature, wave height, and flux. A full factorial orthogonal array for three factors is diagrammed in Table 9.

Table 9. Full Factorial Orthogonal Array
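Tested factors: A, B, C. Inferred factors: AB, AC, BC, ABC.
Test | A | B | C | AB | AC | BC | ABC | Results
1 | + | + | + | + | + | + | + |
2 | - | + | + | - | - | + | - |
3 | + | - | + | - | + | - | - |
4 | - | - | + | + | - | - | + |
5 | + | + | - | + | - | - | - |
6 | - | + | - | - | + | - | + |
7 | + | - | - | - | - | + | + |
8 | - | - | - | + | + | + | - |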

Table 9 calls for eight tests. The three columns on the left represent the test factors (for the example, A will be temperature, B will be wave height and C, flux). Test number one would be run with all factors at the settings designated as "plus". Test number two would be run with temperature at the value designated as "minus" (i.e., 380 degrees) and the other factors at the "plus" values. The remaining tests would be run in a similar fashion.

The other columns are by-products of the test settings used in the first three columns, and represent the interactions between the factors tested. These inferred settings are orthogonal with the test settings and with each other, allowing an easy determination of the effect of any interactions.
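
The construction of a full factorial array such as Table 9 can be sketched in a few lines. The sketch assumes "plus" and "minus" are coded as +1 and -1, so that each inferred column is the product of the tested columns it is named after; it then checks that every pair of columns is orthogonal (their element-wise products sum to zero).

```python
from itertools import product

factors = ["A", "B", "C"]

# Enumerate all 2**3 runs. Reversing each tuple makes A alternate fastest,
# which reproduces the row order of Table 9.
runs = [dict(zip(factors, reversed(levels)))
        for levels in product([+1, -1], repeat=3)]

# Inferred (interaction) columns are products of the tested columns.
for run in runs:
    run["AB"] = run["A"] * run["B"]
    run["AC"] = run["A"] * run["C"]
    run["BC"] = run["B"] * run["C"]
    run["ABC"] = run["A"] * run["B"] * run["C"]

columns = ["A", "B", "C", "AB", "AC", "BC", "ABC"]


def dot(c1, c2):
    """Sum of element-wise products of two columns over all runs."""
    return sum(run[c1] * run[c2] for run in runs)


orthogonal = all(dot(c1, c2) == 0
                 for i, c1 in enumerate(columns)
                 for c2 in columns[i + 1:])
balanced = all(sum(run[c] for run in runs) == 0 for c in columns)

sign = {+1: "+", -1: "-"}
for number, run in enumerate(runs, start=1):
    print(number, " ".join(sign[run[c]] for c in columns))
print("orthogonal:", orthogonal, "balanced:", balanced)
```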

There are two important variations of the test matrix. A "saturated array" permits economy of testing when it can be reasonably assumed that there will be no interactions of interest. Table 9 shows that eight tests are needed for a full factorial experiment with three factors. Only four tests would be necessary in a saturated array, since one of the orthogonal columns ordinarily needed to calculate the interaction effects can be used instead to determine settings for one of the factors, as shown in Table 10.

Table 10. Saturated Array
Test | A | B | C | Results
1 | + | + | + |
2 | - | + | - |
3 | + | - | - |
4 | - | - | + |
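
The saturated array can be derived from a two-factor full factorial by reusing the AB interaction column for factor C. The following is a minimal sketch under the same +1/-1 coding assumption.

```python
from itertools import product

# Full factorial in A and B only: four runs, in the row order of Table 10.
runs = [{"A": a, "B": b} for b, a in product([+1, -1], repeat=2)]

# Reuse the AB interaction column to set the level of the third factor.
for run in runs:
    run["C"] = run["A"] * run["B"]

sign = {+1: "+", -1: "-"}
for number, run in enumerate(runs, start=1):
    print(number, sign[run["A"]], sign[run["B"]], sign[run["C"]])
```

Because C shares a column with the AB interaction, the effect of C cannot be distinguished from a real interaction between A and B. This is why the saturated array is suitable only when interactions can reasonably be assumed negligible.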

The second variation is used to examine the effects of uncontrollable factors. In this variation, results are obtained under different values of the uncontrolled factor (either by arranging control for the experiment or waiting for the desired value to become available). A common practice is to separate the uncontrolled factors from the controlled factors by putting the controlled factors in an "inner array" and the uncontrolled factors in an "outer array" as shown in Table 11.

Table 11. Array Including Uncontrolled Factors
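Controlled factors (inner array): A, B, C. Results are recorded for each level of the uncontrolled factor D (outer array).
Test | A | B | C | Results, D = + | Results, D = -
1 | + | + | + | |
2 | - | + | - | |
3 | + | - | - | |
4 | - | - | + | |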

Run the Tests. Using Table 11 as a test matrix, four test configurations would be run, with the temperature set to either 380 or 400 degrees, the wave height to 10 mm or 12 mm, and flux either used or not, in accordance with the array. The cleanliness of the boards would be noted for each run, so that results are recorded separately for clean and dirty boards. Table 12 gives some hypothesized results, which might be measured in defects per hundred boards.

Table 12. Sample Test Results
Test | A | B | C | D = + | D = -
1 | + | + | + | 0.3 | 0.3
2 | - | + | - | 1.1 | 1.6
3 | + | - | - | 0.7 | 1.2
4 | - | - | + | 1.9 | 1.9

Analyze the Results. The results of the tests are analyzed using linear regression techniques. The average result (test outcome) when a factor is set to its "plus" value is computed from all the results of the tests run with that factor at its "plus" value. From this is subtracted the average result when the factor was set to its "minus" value. The result of the subtraction, called delta (Δ), is the average difference in the value of the test outcome as the factor varies from "minus" to "plus". Assuming a linear relationship, this result can be used to predict the result of setting the factor at any value between "plus" and "minus". As an example of how this is done, the data in Table 12 will be used to analyze the test data applicable to clean boards only, as presented in Table 13.

Table 13. Partial Analysis of Test
Test | A | B | C | Results
1 | + | + | + | 0.3
2 | - | + | - | 1.1
3 | + | - | - | 0.7
4 | - | - | + | 1.9
Avg. + | (0.3 + 0.7) / 2 = 0.5 | (0.3 + 1.1) / 2 = 0.7 | (0.3 + 1.9) / 2 = 1.1 | ȳ = (0.3 + 1.1 + 0.7 + 1.9) / 4 = 1.0
Avg. - | (1.1 + 1.9) / 2 = 1.5 | (0.7 + 1.9) / 2 = 1.3 | (1.1 + 0.7) / 2 = 0.9 |
Δ | -1.0 | -0.6 | +0.2 |
y = ȳ + (ΔA / 2)A + (ΔB / 2)B + (ΔC / 2)C
y = 1.0 - 0.5A - 0.3B + 0.1C
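
The arithmetic of Table 13 can be reproduced with a short script. This is a sketch only: it assumes the +1/-1 coding of "plus" and "minus", and the names `design`, `delta` and `predict` are illustrative.

```python
# Saturated array of Table 10 with the clean-board (D = +) results of Table 12.
design = [
    {"A": +1, "B": +1, "C": +1},
    {"A": -1, "B": +1, "C": -1},
    {"A": +1, "B": -1, "C": -1},
    {"A": -1, "B": -1, "C": +1},
]
results = [0.3, 1.1, 0.7, 1.9]


def mean(values):
    return sum(values) / len(values)


grand_mean = mean(results)


def delta(factor):
    """Average result at "plus" minus average result at "minus"."""
    plus = mean([y for run, y in zip(design, results) if run[factor] > 0])
    minus = mean([y for run, y in zip(design, results) if run[factor] < 0])
    return plus - minus


# Each coefficient is half of delta because the coded scale spans two units.
coefficients = {factor: delta(factor) / 2 for factor in "ABC"}


def predict(a, b, c):
    """Predicted result at coded settings a, b, c (each between -1 and +1)."""
    return (grand_mean + coefficients["A"] * a
            + coefficients["B"] * b + coefficients["C"] * c)


print("mean:", round(grand_mean, 3))
for factor, coefficient in coefficients.items():
    print(factor, round(coefficient, 3))
```

With four runs and four fitted terms, the model reproduces each test result exactly; for instance, `predict(+1, +1, +1)` returns the 0.3 observed in test 1 (up to floating-point rounding).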

Calculate optimum settings. Since the test results in the example are defect rates, and the lowest defect rate is the desired output, the factors would be set to the values between "plus" and "minus" that result in the smallest value for y, the defect rate impacted by the factors. Factor C, the presence or absence of flux, can take only the values "plus" (present) or "minus" (absent). The other factors can take any value between "plus" and "minus", representing settings between the high and low values selected for the test. From the equation derived in Table 13, the optimum settings would be "plus" for factors A and B and "minus" for factor C. The expected defect rate at these settings would be: y = 1.0 - 0.5 - 0.3 - 0.1 = 0.1. Note that this result is lower than any of the test results shown in Table 12.
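
The search for the optimum can be sketched as a simple grid search over the coded settings, using the equation from Table 13. The grid step of 0.1 is an arbitrary choice for illustration, and "plus" and "minus" are again assumed to be coded as +1 and -1.

```python
from itertools import product


def predicted_defect_rate(a, b, c):
    """Model from Table 13, with factors in coded units."""
    return 1.0 - 0.5 * a - 0.3 * b + 0.1 * c


continuous_levels = [i / 10 for i in range(-10, 11)]  # -1.0, -0.9, ..., +1.0
flux_levels = [-1, +1]                                # flux is absent or present

# Evaluate every combination and keep the one with the lowest predicted rate.
best = min(product(continuous_levels, continuous_levels, flux_levels),
           key=lambda setting: predicted_defect_rate(*setting))

print("best coded settings (A, B, C):", best)
print("predicted defect rate:", round(predicted_defect_rate(*best), 3))
```

Because the model is linear in each factor, the minimum always lies at a corner of the tested range, so the grid search only confirms what the signs of the coefficients already indicate.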

Instead of defect rates, the test results could have been some parameter for which there is a desired nominal value. For example, the experiment result could have been the ratio of the electrical impedance of the board to a desired value. In this case, the settings should be optimized to produce an output of 1.0. There are various ways to do this. One is A = 0.2, B = 0, C = +1.0, which gives y = 1.0 - 0.1 - 0 + 0.1 = 1.0. Here A = 0.2 represents a temperature setting of 392 degrees, which corresponds to the 0.2 point on a linear scale from "minus" to "plus" (where "minus" represents 380 degrees and "plus" represents 400 degrees). Similarly, wave height is set to 11 mm, corresponding to the zero point on the scale, where "minus" represents 10 mm and "plus" represents 12 mm. "Flux used" is represented by the "plus" point on the scale for factor C. Other solutions also exist (for example, A = -0.2, B = 0, C = -1.0, which uses no flux).

So far, the factor of cleanliness has not been addressed. Cleanliness is believed to impact the defect rate of the board, but is out of the control of the process operator. To handle such factors, solutions are needed which are robust (i.e. useful over the range of expected conditions), rather than optimum for the nominal conditions. The analysis so far shows that the optimum solution to reduce defects was to not use flux in the solder, which resulted in a defect rate of 0.1. However, Table 14, the complete analysis, shows that using flux reduces the defects when the boards are dirty, and results in a defect rate of 0.3 with either clean or dirty boards when the other two factors are at their "plus" settings. If a process constraint is to operate with both clean and dirty boards, and both are common, the relatively slight degradation over the optimized defect rate for clean boards may or may not be a reasonable sacrifice for a robust solution that is good for both clean and dirty boards (i.e., using flux), depending on business considerations.

Table 14. Complete Analysis of Test
Test | A | B | C | D = + | D = -
1 | + | + | + | 0.3 | 0.3
2 | - | + | - | 1.1 | 1.6
3 | + | - | - | 0.7 | 1.2
4 | - | - | + | 1.9 | 1.9
when D = +
      y = 1.0 + (-.5) A + (-.3) B + (0.1) C
when D = -
      y = 1.25 + (-.5) A + (-.3) B + (-.15) C
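
The trade-off between an optimum and a robust setting can be made explicit by evaluating both equations of Table 14 at the candidate settings. The sketch below ranks the candidates by their worst-case defect rate; that ranking rule is an assumption made for illustration, since the text leaves the final choice to business considerations.

```python
def clean_boards(a, b, c):
    """Model for D = + (clean boards), coded units."""
    return 1.0 - 0.5 * a - 0.3 * b + 0.1 * c


def dirty_boards(a, b, c):
    """Model for D = - (dirty boards), coded units."""
    return 1.25 - 0.5 * a - 0.3 * b - 0.15 * c


# Temperature and wave height at "plus"; flux either present or absent.
candidates = {"flux": (+1, +1, +1), "no flux": (+1, +1, -1)}

summary = {}
for name, setting in candidates.items():
    clean = clean_boards(*setting)
    dirty = dirty_boards(*setting)
    summary[name] = {"clean": clean, "dirty": dirty, "worst": max(clean, dirty)}
    print(f"{name}: clean {clean:.2f}, dirty {dirty:.2f}, "
          f"worst case {max(clean, dirty):.2f}")

# Assumed criterion: prefer the setting with the smallest worst-case rate.
robust_choice = min(summary, key=lambda name: summary[name]["worst"])
print("smallest worst-case defect rate:", robust_choice)
```

Under these equations, omitting flux gives the lowest predicted rate on clean boards but a noticeably higher rate on dirty boards, while using flux gives the same predicted rate for both.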

Run Confirmation Test(s). There is always a danger that the test results reflect the influence of an unknown factor present during the tests. For this reason, it is always good practice to run a confirmation test at the optimized test setting to verify that the expected results are indeed achieved. This is especially important when a saturated array is used under the assumption that interactions between factors are not significant. If the assumption is wrong, a verification test should not give the expected results, and the analyst will know that more work needs to be done.

3.6 Failure Modes, Effects and Criticality Analysis (FMECA)

3.6.1 Purpose. Failure Modes and Effects Analysis (FMEA), and Failure Modes, Effects, and Criticality Analysis (FMECA) determine the effects of individual failure modes of every part or function in a product, or step in a process. With this knowledge, the most undesirable effects can be mitigated by redesign and appropriate maintenance procedures can be formulated to avoid undetected failures and reduce downtime, thus ensuring more reliable performance of the product for the customer.

3.6.2 Benefit. The systematic nature of FMEA and FMECA ensures that every part or function in the product, or step in a process, is considered in determining the effects of failure on the product or process. This comprehensive knowledge is the basis for ensuring reliable performance. Criticality analysis is designed to determine the relative priorities of proposed changes, thereby helping to ensure that inherent reliability is not degraded.

3.6.3 Timing. Because it is comprehensive, a part-level FMEA or FMECA can only be done when the parts list is complete (Design/Development phase). Prior to that, a functional FMEA/FMECA can be performed to assess the effect of functional failures on overall product performance. However, in order to permit corrective action, the analysis should be done before the design is finalized. Similarly, process FMEA should await process definition during the Concept/Planning and Design/Development phases, but must be done before the process is fixed. A software FMEA should await software flow charting to determine the modules, but should precede coding. In all cases, the role of FMEA/FMECA in ensuring reliable product performance is to update the analysis to reflect current design/process configurations as a means to preclude the introduction of failure modes that may ultimately degrade the inherent reliability of the product.

3.6.4 Application Guidelines. Failure Modes and Effects Analysis (FMEA) and Failure Modes, Effects and Criticality Analysis (FMECA) employ bottom-up logic in a systematic fashion to determine the effects of failures of the parts making up a system. There are also process FMEAs in which the parts list is replaced by a list of process steps, and software FMEAs in which the parts list is replaced by a list of software modules. The results of the analysis can identify reliability shortcomings that must be corrected. FMEAs performed during the original design help establish the inherent reliability of a product. Proposed design changes should be analyzed by revised FMEAs to ensure the inherent reliability is not degraded by the proposed changes. An FMECA is an FMEA with the addition of a criticality analysis used to set priorities for corrective action. Though systematic, FMEA does not include the effects of any factor other than parts (or the process or software equivalents). Human errors and external causes of failure are not considered.

The results of an FMEA provide a list of recommended corrective actions from a list of parts, as shown in Figure 3. This is true whether the parts are resistors, microcircuits, computers or printers, or process steps like soldering and cleaning, or software code modules. One approach is described in the following list of tasks. Each task results in a column of data to be entered on a worksheet. The data needed and the worksheet organizing it will change with the user's needs. For this illustration, the worksheet in Figure 4 would be appropriate.

Figure 3. Generic FMEA Approach


Parts List | Function | Failure Modes | Effect (Local) | Effect (End) | Severity | Recommended Action

Figure 4. Generic FMEA Worksheet

  • List the Parts/Functional Blocks. The left hand column of the FMEA worksheet is a complete list of the parts or functions comprising the product of interest.
  • List the Function of Each Part/Functional Block. In order to determine the effects of part failures, it is essential to know what the parts are intended to do.
  • List the Failure Modes of Each Part/Functional Block. For example, a resistor failure might be a short circuit, an open circuit or a change in resistance. Each of these may have a different effect on product performance.
  • Determine the Local Effect of the Failure Mode. For example, suppose we plan to add a microprocessor controlling input data to a product. One "stuck" microprocessor pin may result in wrong data put on line, while another "stuck" pin might lock the address bus.
  • Determine the End Effect. Continuing the above example, wrong data on line may result in incorrect sums on checks printed by a payroll system or garbled messages in a communications system, while a locked address bus might result in a system crash.
  • Determine the Severity. In a payroll system, a system crash may be preferable to the issuance of incorrect checks, while having some garbled messages may be preferable to a crash of a communications system. Hence, some measure of severity is useful in setting priorities for corrective action. This can be a ranking on a scale (e.g., a number between one and ten with one representing a negligible effect and ten representing a catastrophe). One procedure defines four categories of Severity as follows:
Category Number | Category Description | Category Characteristics
1 | Catastrophic | Causing death or physical damage to product/other equipment
2 | Critical | Causing severe injury, major property damage, or loss of product performance
3 | Marginal | Causing minor injury, minor property damage, or degradation of product performance
4 | Minor | Causing only unscheduled product replacement or repair
  • Recommend Corrective Action. The last column of an FMEA lists the recommended appropriate actions. This can be as simple as a statement that the proposed microprocessor should not be added to the product or as complex as a recommendation that error detection and correction capability be added to the circuitry.
Criticality Analysis. While severity is useful in setting priorities, it considers only the effect of a failure. Criticality analysis adds one or more dimensions to this. For example, one could consider both the severity and the probability of occurrence in calculating the criticality of an effect. The probability can be calculated if the part failure rate, the relative frequency of the different failure modes, and the conditional probability that the failure mode will cause a system failure can be reasonably estimated. Otherwise, the failure mode can be assigned to one of the following levels:

Criticality Level | Criticality Occurrence | Criticality Probability
A | Frequent | High probability of occurrence (>0.20) during product use
B | Reasonably Probable | Moderate probability of occurrence (>0.10, <0.20)
C | Occasional | Occasional probability of occurrence (>0.01, <0.10)
D | Remote | Unlikely probability of occurrence (>0.001, <0.01)
E | Extremely Unlikely | Probability of occurrence is essentially zero (<0.001)
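
The assignment of levels can be sketched as a threshold lookup. Two assumptions are made here that the text does not state: a probability falling exactly on a boundary is assigned to the lower level, and the calculated probability is taken as the simple product of the three estimates named above. The function names and the example figures are hypothetical.

```python
def criticality_level(probability: float) -> str:
    """Return the criticality level (A to E) for a probability of occurrence."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability > 0.20:
        return "A"  # frequent
    if probability > 0.10:
        return "B"  # reasonably probable
    if probability > 0.01:
        return "C"  # occasional
    if probability > 0.001:
        return "D"  # remote
    return "E"      # extremely unlikely


def mode_probability(part_failure_probability: float,
                     mode_ratio: float,
                     conditional_probability: float) -> float:
    """Assumed model: chance the part fails, in this mode, and the system fails."""
    return part_failure_probability * mode_ratio * conditional_probability


# Hypothetical estimates for a single failure mode.
p = mode_probability(0.3, 0.5, 0.5)
print(round(p, 3), criticality_level(p))
```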

The priority of corrective actions can be set by plotting the severity category against the probability of occurrence on a chart such as the one shown in Figure 5. The left hand scale can be a calculated probability of occurrence, or the level of probability of occurrence defined above. In either case, failure modes closest to the upper right hand corner of the chart are considered the most critical and should be assigned the highest priority for resolution.

Existing commercial FMECA procedures often calculate criticality by computing a risk priority number (RPN). This is the product of three figures called severity, occurrence and detectability. Severity is defined on a scale of one to ten with one representing a minor effect and ten representing a catastrophe. Occurrence is defined on a similar scale with one representing something that almost never happens and ten representing something that happens quite often. Detectability refers to the likelihood of noticing the onset of the effect. For example, a blowout due to tire wear should never happen because the user can see how much tread is left. Detectability is also rated on a scale with one representing effects that should never come as a surprise, and ten representing effects that are almost always a complete surprise. The three numbers are multiplied together, giving each failure mode an RPN between one and 1000. The higher the number, the greater the priority.
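
The RPN calculation and the resulting ranking can be sketched as follows. The failure modes borrow the microprocessor and tire examples used in the text, but every score shown is hypothetical and chosen only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int       # 1 = minor effect, 10 = catastrophe
    occurrence: int     # 1 = almost never happens, 10 = happens quite often
    detectability: int  # 1 = never a surprise, 10 = almost always a surprise

    def rpn(self) -> int:
        """Risk priority number: the product of the three scores."""
        for score in (self.severity, self.occurrence, self.detectability):
            if not 1 <= score <= 10:
                raise ValueError("each score must be between 1 and 10")
        return self.severity * self.occurrence * self.detectability


# Hypothetical scores for illustration only.
modes = [
    FailureMode("locked address bus", severity=6, occurrence=3, detectability=2),
    FailureMode("wrong data on line", severity=8, occurrence=3, detectability=7),
    FailureMode("tire blowout from tread wear", severity=9, occurrence=2,
                detectability=1),
]

# Highest RPN first: these modes get the highest priority for corrective action.
ranked = sorted(modes, key=FailureMode.rpn, reverse=True)
for mode in ranked:
    print(mode.rpn(), mode.name)
```

In this illustration the hard-to-detect failure mode ranks first even though it is not the most severe, which shows how detectability changes the priority order relative to severity alone.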

Figure 5. Determination of Corrective Action Priority

RPN is used in an FMEA standard published jointly by Ford, General Motors and Chrysler. A worksheet from the standard is shown in Figure 6.

Figure 6. Automotive Industry FMEA Worksheet


3.7 Failure Reporting, Analysis and Corrective Action System (FRACAS)

3.7.1 Purpose. A failure reporting, analysis and corrective action system (FRACAS) is the backbone of reliability assurance. It provides the data needed to identify deficiencies for correction to ensure that inherent reliability is not degraded.

3.7.2 Benefit. FRACAS provides information needed for the timely identification and correction of design errors, part or process problems or workmanship defects. All of these deficiencies preclude the achievement of the inherent reliability potential in the design, with the potential impact on cost that this entails. There can be significant direct costs in factory rework, scrap, or warranty service, and even greater indirect costs in reduced market share.

3.7.3 Timing. FRACAS requires a source of data before it can be implemented. Once hardware/software begins to become available, and definition and implementation of processes has begun, a working FRACAS should be in place and failure data collected from any tests and normal product use (Design/Development through Production/Manufacturing). The FRACAS should remain in use as long as the product is being supported by the manufacturer (i.e., through the Operation/Repair phases of the product). Customers may, and should, have their own FRACAS to identify operational reliability problems for correction during their use of the product.

3.7.4 Application Guidelines. An ideal FRACAS is shown in Figure 7.

Figure 7. An Ideal FRACAS
  1. Observation of the failure
  2. Complete documentation of the failure, including all significant conditions which existed at the time of the failure
  3. Failure verification, i.e., confirmation of the validity of the initial failure observation
  4. Failure isolation, localization to the lowest replaceable defective item within the product
  5. Replacement of the suspect defective item
  6. Confirmation that the suspect item is defective
  7. Failure analysis of the defective item
  8. Data search to uncover other similar failure occurrences and to determine the previous history of the defective item and similar related items
  9. Establishment of the root cause of the failure
  10. Determination, by an interdisciplinary team, of the necessary corrective action, especially any applicable redesign
  11. Incorporation of the recommended corrective action into development products
  12. Continuation of development tests
  13. Establishment of the effectiveness of the proposed corrective action
  14. Incorporation of effective corrective action into production equipment


Critical to a successful FRACAS is its database. It is particularly important in establishing the significance of a failure. For example, the failure of a capacitor in a reliability growth test becomes more important if the database shows similar failures in incoming inspection of the part and in the environmental tests performed. A pattern of failures shows that there is a reliability problem which will preclude achievement of the inherent reliability unless it is corrected. For this reason, all available sources of data should feed the FRACAS. Initial failure reports should document, as applicable:
  • Location of failure
  • Test being performed
  • Date and time
  • Part number and serial number
  • Model number
  • Failure symptom
  • Individual who observed failure
  • Circumstances of interest (e.g., occurred immediately after power outage)
The failure documentation should be augmented with the verification of failure at the product level (item 3 in Figure 7), and verification that the suspect part did indeed fail (item 6). The number and format of the failure reporting form should be determined by the producer to best meet its needs and any requirements which may be dictated by the customer.

Once the failure is isolated, the FRACAS database and failure analysis can be used to determine its root cause. Given the root cause, appropriate corrective action can be formulated, implemented and verified.
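
A failure report record and the data search for similar failures can be sketched as below. The field names follow the list of items an initial report should document, but the record layout, the matching rule (same part number) and all sample data are hypothetical; an actual FRACAS database would be designed around the producer's own needs.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FailureReport:
    location: str
    test: str
    observed_at: datetime
    part_number: str
    serial_number: str
    model_number: str
    symptom: str
    observer: str
    circumstances: str = ""
    failure_verified: bool = False          # item 3 in Figure 7
    part_confirmed_defective: bool = False  # item 6 in Figure 7


def similar_failures(database, report):
    """Return other reports in the database that involve the same part number."""
    return [r for r in database
            if r.part_number == report.part_number and r is not report]


# Hypothetical history: the same capacitor type has failed twice before.
database = [
    FailureReport("receiving dock", "incoming inspection",
                  datetime(2024, 1, 8, 9, 30), "CAP-0001", "SN-014",
                  "MODEL-X", "short circuit", "inspector 1"),
    FailureReport("test lab", "environmental test",
                  datetime(2024, 2, 2, 14, 0), "CAP-0001", "SN-037",
                  "MODEL-X", "short circuit", "technician 2"),
    FailureReport("test lab", "environmental test",
                  datetime(2024, 2, 5, 10, 15), "RES-0042", "SN-101",
                  "MODEL-X", "open circuit", "technician 2"),
]

new_report = FailureReport("test lab", "reliability growth test",
                           datetime(2024, 3, 1, 11, 45), "CAP-0001", "SN-052",
                           "MODEL-X", "short circuit", "technician 3",
                           circumstances="occurred immediately after power outage")
database.append(new_report)

history = similar_failures(database, new_report)
print(len(history), "earlier failures of", new_report.part_number)
for r in history:
    print(" ", r.test, "-", r.symptom)
```

A pattern like the one in this sample, where the same part has failed in several independent tests, is the kind of evidence that raises the significance of a single new failure.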

As an example, suppose an output transformer fails during the final test of an audio amplifier. The product failure is verified (a real problem exists) and the replaced part (the transformer) checks bad (the suspect part has indeed failed). Failure analysis shows that the cause of failure was a broken wire. A data search reveals similar failures have occurred in other units. Hence, the inherent reliability of the amplifier will be compromised until appropriate corrective action eliminates the root cause of the failure. The root cause could be the workmanship in installing the transformer or poor stress relief built into the part. The former can be corrected only by the amplifier manufacturer and the latter only by the audio transformer supplier. Physical analysis of the failure would be used to determine which is the root cause, and help reveal what can be done to correct the problem.

Failure analysis can be performed to various degrees, and usually requires some cooperation with the part supplier. The most critical failures (i.e., those that occur most often, are most expensive to repair, or threaten the user's safety) should receive the most in-depth analysis, perhaps including X-rays, scanning electron beam probing, etc., which typically requires specialized equipment. Where the producer does not elect to create a comprehensive failure analysis laboratory, outside independent laboratories can be utilized for these techniques.

3.8 Parts Obsolescence

3.8.1 Purpose. The parts used in a product are purchased to specifications designed to ensure their reliability, and from suppliers who can produce parts with the desired reliability. However, it often becomes unprofitable for a part supplier to continue production of a particular product line. When this happens, continued production of products using the parts and replacement of failed parts may require the use of parts with lesser reliability, resulting in a degradation in the inherent reliability of the products using them. In extreme cases, replacement parts at reasonable cost may not be available. Attention to parts obsolescence is intended to avoid such problems.

3.8.2 Benefit. By considering parts obsolescence as part of the overall product life cycle planning, it is possible to avoid the significant trouble and expense entailed in searching for replacement parts. The need for a replacement part that is no longer available on the market can be satisfied relatively cheaply and quickly when solutions to obsolescence are in place, or it can be addressed by time consuming and expensive crisis management actions when the unavailability of the part comes as a surprise.

3.8.3 Timing. Parts management starts in the Concept/Planning phase with a preferred parts list (PPL), which provides a description of parts which the designers may use. No other parts should be allowed, except when it is impossible to meet a performance/reliability requirement with the parts of the PPL. The PPL should be updated before each application to a new product development. This update should consider the obsolescence of all parts listed. Those which are likely to become difficult to obtain should be removed from the list. This action should be followed by the determination of an appropriate action to ensure that products currently using the part can continue to be supported.

Preferred parts lists should be reviewed periodically and individual parts listed should be re-evaluated at any sign of obsolescence (manufacturers discontinuing a production line, introduction of newer technology with significant advantages, feedback from buyers reporting difficulty with spare parts purchases, etc.).

3.8.4 Application Guidelines. The explosive advance of technology has resulted in component parts that are smaller, lighter, cheaper, more capable and more reliable. One bad effect of this is that every technological advance reduces the market for older parts, which ultimately go out of production. As a result, it may become impractical to obtain replacement parts for failed components of a product in use. Attention to parts obsolescence is meant to lessen the impact of diminishing parts availability.

There are many possible remedies to part obsolescence problems when they are identified early. The options decline as time passes. Some are:
  • Lifetime buy. When it is reasonable to assume the availability of a part will soon decrease, a good strategy might be to purchase enough spares to last the expected lifetime of the product. This assumes that there will be no degradation of parts in storage.
  • Substitution. If a newer part can be purchased with the same form, fit and function of the obsolete part, it can be directly substituted. The impact of the part substitution on inherent product reliability should still be assessed, however, to ensure reliable performance.
  • Redesign. To avoid the need for an obsolete part, a redesign of the product to eliminate it can be performed. This should be done at the lowest possible level of assembly, i.e., a board rather than an assembly of boards, an assembly rather than a module of many assemblies, etc. The new design can then be retrofitted to the product when the obsolete part fails. If the new design has other benefits (e.g., faster speed, more memory, etc.), it may be desirable to retrofit the product before the part fails, as a performance upgrade. The effect of the redesign on inherent product reliability should still be evaluated to avoid the potential of no longer meeting the customer's reliability needs.
3.9 Repair Strategy

3.9.1 Purpose. When a product fails, it is desirable to restore it to operation in a fast and economical manner. It is also important that the repair activity does not degrade the inherent reliability of the product. To achieve these ends, it is necessary to formulate an appropriate repair strategy.

3.9.2 Benefit. The painstaking effort to produce a reliable product can be lost if defects are introduced in the maintenance process. There are many ways in which this could happen. If maintenance requires a higher repair personnel skill level or more powerful test equipment than is actually available, attempts at repair may do more damage than good. A lack of guidance or inadequate repair procedures may cause maintenance errors that introduce latent defects into the product. Components designed for replacement at failure, exposed to "abnormal" repair scenarios, may be difficult to repair without inducing significant damage. A well conceived repair strategy will attempt to preclude the degradation of reliability, as well as provide the fastest and most economical restoration of service.

3.9.3 Timing. Repair strategy should be one of the first considerations used in the planning and design of a product. Therefore, it is one of the first efforts in the Concept/Planning phase of product development. It can be based on market surveys to determine the customers' needs, and should be redone if the needs change. Repair strategy and product design should be compatible.

3.9.4 Application Guidelines. The repair strategy should be formulated to respond to the basic questions:
  • Who? Who will be doing the repairs and what are their skill levels? The repair strategy should not require higher repair skills than those available, or the repair process will likely degrade reliability. For example, if soldering is required and the maintenance technician is not skilled at soldering, bad solder joints and parts damaged by overheating will be introduced into the product. A repair strategy for unskilled technicians could include repair by replacement of plug-in modules to reduce handling, built-in-test to eliminate the need to troubleshoot, and expert systems to guide the repair actions of the technician.
  • Where? Will repairs be done at the customer's site, the producer's plant or a third party location? What resources should they be expected to have? For example, an oscilloscope may be needed for tuning a radio after the repair of oscillator circuitry, but if the repair site does not have one, mis-tuned radios may be returned to customers. To prevent this, an oscilloscope may be built into the product, some other means of tuning used, or the need for tuning eliminated (by replacing failed oscillators with pre-tuned modules, for example).
  • How? Will the repair require special tools or skills? Will a maintenance manual be included with the product? The need for special tools should be avoided, as a lost tool means the product may be damaged during repairs made using improper tools. Note that a tool or skill not considered special by some customers may be special to others. For example, an automobile product intended for repair in a commercial garage (e.g., a door panel) may call for the use of a torque wrench, but one intended for repair by the automobile owner (e.g., wheel replacement) should not. The maintenance manual, if any, should match the skills of the user and the tools available.
  • What? Will components be designed for replacement or repair? At what level of assembly will replacement be preferred? Is this consistent with the customer's needs? When products are designed to be repaired by module replacement, but are used by customers who repair the modules rather than replace them, the achieved reliability is almost invariably degraded through induced damage. Such cases often arise when the user cannot wait for a replacement module to resume operation. Solutions include the encapsulation of the modules to preclude repair, on-site spares to permit continued operation of the product while awaiting replacement parts (including the provision of built-in spare modules), provision for expedited spares delivery (i.e., just-in-time), or the design of modules for repair by the available technicians and tools.
  • When? Is preventive maintenance needed? How often? When should periodic inspections be performed, if appropriate? The wearout of mechanical products and failures of electronic products that are not obvious (e.g., the corruption of data) can result in poor operational reliability. In non-critical cases these situations may be found by periodic inspection. For critical applications, means should be provided to make repair needs obvious. For example, vibration monitors can identify the extent of wear in some machines, and built-in-test or error detection circuitry can identify "implicit" failures in electronic products. Preventive maintenance schedules should fit into the customer's schedule. If not, preventive maintenance may be ignored, with resulting damage further degrading the achieved product reliability.
3.10 Test Strategy

3.10.1 Purpose. Test strategy is an established strategic plan for cost-effectively performing test measurements that add value to a particular product for its customers. Test strategy typically encompasses all testing done on a product, including those tests that will be used for ensuring product reliability.

3.10.2 Benefits. Even in the best programs, achieved reliability may be inadequate because of reliability problems which can be detected only through the failures that are caused during actual product use. Elimination of these problems before delivery of the product to its customers may require testing at the producer's plant. Testing is not a trivial expense. On the other hand, should corrective action be needed, a timely measurement can make the difference between an economical fix and one that is expensive or not feasible.

3.10.3 Timing. Initial program planning during the Concept/Planning phase should include a test strategy tailored to ensure reliability of the product. As the program progresses into Design/Development, changes in the program (e.g., a decision to develop an item rather than buy it off-the-shelf) should be reflected in changes to the test strategy. Every product design or process review should include a conscious decision to retain or revise the test strategy.

3.10.4 Application Guidelines. Tests to ensure reliable performance may include:
  • Environmental Stress Screening (ESS). These are tests that are performed to detect and remove workmanship and component defects before products are delivered to the customer.
  • Production Reliability Acceptance Tests (PRAT). These tests are performed to ensure that product reliability does not degrade during the production process.
The specific tests, and the motivation for applying them, depend on the circumstances of the program. The matrix of Table 15 relates program and product circumstances to their expected impact on the value of reliability test techniques to ensure reliable product performance.

Table 15. Test Strategy Planning Matrix to Ensure Reliable Performance
Reliability Test Technique | Program/Product Circumstances: New Dev. | COTS | Safety Critical | Dormancy | Long Life | Harsh Env. | S/W Dev.
ESS | ? | + | + | ? | ? | + | -
PRAT | ? | ? | + | ? | ? | ? | ?

A "plus" sign (+) indicates that the activity offers value to the program under that circumstance. A "minus" sign (-) means that the activity is probably not cost effective for that circumstance. A "question mark" (?) indicates that the activity may or may not add value for that circumstance, depending on the type of product. The circumstances considered are New Development (i.e., a product to be designed and built for the first time), COTS (an item available as a Commercial Off-the-Shelf product), Safety Critical (e.g., a nuclear plant control system), Dormancy (i.e., an item to be subjected to long periods in storage or otherwise unpowered), Long Life (an item likely to be in service for a relatively long time, such as the B-52 Aircraft), Harsh Environment (high shock, rapid thermal cycling, etc.), and S/W (Software) Development.

3.11 Environmental Stress Screening

3.11.1 Purpose. Latent defects in a product (e.g., weak parts, faulty workmanship, design flaws) will become failures under stress. Environmental Stress Screening (ESS) prevents the defects from degrading the inherent reliability of products delivered to the customer by subjecting the product to a regimen of stress designed to cause the defects to become failures at the factory, which can then be repaired before delivery.

3.11.2 Benefit. ESS is a way of ensuring or approaching the inherent reliability of a product in which defects are present. Other methods, such as a failure reporting, analysis and corrective action system (FRACAS), design of experiments (DOE), and statistical process control (SPC), are intended to prevent defects in the product by finding and eliminating the root causes. Until the root causes of the defects are eliminated, an effective ESS program is critical to making possible the shipment of defect-free products.

3.11.3 Timing. ESS is performed during the Production/Manufacturing phase. However, ESS can be applied not only to the manufacture and shipment of the final product to an external customer, but also to those components received by the manufacturer from its suppliers that are to be installed into the product.

3.11.4 Application Guidelines. ESS can be beneficial when performed by the parts vendor, the printed wiring board assembler, the subassembly producer, and/or the final product integrator. There are two factors which should be considered for using ESS:
  • At lower levels of product complexity, repairs are easier and less expensive. A failure during a part-level test is the least costly. For this type of test, it is possible to subject the part to higher stresses than it would receive when it is combined with other components that may not tolerate equivalent stresses at an assembly level. This makes ESS at low levels of assembly desirable. However -
  • Each level of assembly introduces defects. Solder errors, faulty assembly, insufficient tolerances, etc. should be found and corrected before delivery of the product. Hence, ESS is also desirable at progressively higher levels of assembly.
If the causes of all defects have been eliminated, ESS is no longer necessary, and may be stopped. An indicator of this is the lack of fallout (i.e., failures) during a properly designed ESS process. However, the lack of fallout could also indicate that the ESS is not effective in finding the type of defects present in the product. In this case, a change in the applied stresses is needed, preferably based on the analyzed characteristics of the end-use (i.e., customer) environment.

There are many guides to ESS, most derived from work done by the Institute of Environmental Sciences. The charts in Tables 16-18 are taken from the DoD Tri-Service Technical Brief 002-93-08, Environmental Stress Screening Guidelines.

Table 16 provides general guidance by comparing costs, risks and results of screening at various levels of assembly. ESS at the part level is not included.

Table 16. ESS Guidelines
ESS Conditions/Trade-offs Risks/Effects
Level of Assembly | Power Applied (Note 1) | I/O (Note 2) | Monitored (Note 3) | ESS Cost | Technical Risk | Results | Comments

Temperature Cycling
PWA | No | No | No | Low | Low | Poor | Conduct pre & post ESS functional test screen prior to conformal coating.
PWA | Yes | No | No | High | Lower | Better |
PWA | Yes | Yes | Yes | Highest | Lowest | Best |
Unit/Box | Yes | Yes | Yes | Highest | Lowest | Best | If circumstances permit ESS at only one level of assembly, implement at unit level.
Unit/Box | Yes | No | Yes/No | Lower | Higher | Good |
Unit/Box | No | No | No | Lowest | Highest | Poor |
System | Yes | Yes | Yes | Highest | See Comment | See Comment | Most effective ESS at system level is short duration random vibration to locate interconnect defects resulting from system integration.

Random Vibration
PWA | Yes | Yes | Yes | Highest | Low | Good | Random vibration is most effective at PWA level if: surface mount technology is utilized; the PWA has large components; the PWA is multilayer; the PWA cannot be effectively screened at higher assemblies.
PWA | Yes | No | Yes | High | High | Fair |
PWA | No | No | No | Low | Highest | Poor |
Unit/Box | Yes | Yes | Yes | Highest | Low | Good | Random vibration most effective at this level of assembly. Intermittent flaws most susceptible to power-on with I/O ESS. Power-on without I/O reasonably effective. Decision requires cost benefit trade-off.
Unit/Box | Yes | No | Yes | High | High | Fair |
Unit/Box | No | No | No | Low | Highest | Poor |
System | Yes | Yes | Yes | Low | Low | Good | Cost is relatively low because power and I/O normally present due to need for acceptance testing.
Notes:
1. Power applied - at PWA level of assembly, power on during ESS is not always cost effective
2. I/O - equipment fully functional, with normal inputs and outputs
3. Monitored - monitoring key points during screen to assure proper equipment operation

The ESS program of a product developer should be tailored to meet the objectives of the reliability program. For those manufacturers lacking sufficient data to guide ESS tailoring, the Tri-Service Technical Brief provides some ESS baseline profiles. These are not recommended values, but rather starting points for the development of unique profiles for individual products. Table 17 presents a baseline thermal cycling profile, and Table 18, a baseline vibration profile.

Table 17. Baseline ESS Thermal Cycling Profile
Characteristic (Note 1) | Level of Assembly: PWA (Note 2) | Unit (Note 3) | System
Temperature range of product | -50°C to +75°C | -40°C to +70°C | -40°C to +60°C
Temperature rate of change of product (Note 4) | 15°C/Minute to 20°C/Minute | 10°C/Minute to 20°C/Minute | 10°C/Minute to 15°C/Minute
Stabilization criterion (all levels) | Stabilization has occurred when the temperature of the slowest-responding element in the product being screened is within 15% of the specified high and low temperature extremes. Large magnetic parts should be avoided when determining that stabilization has occurred. (Note 4)
Soak time of product at temperature extremes after stabilization, if unmonitored | 5 Minutes | 5 Minutes | 5 Minutes
Soak time of product at temperature extremes after stabilization, if monitored (all levels) | Long enough to perform functional testing
Number of cycles | 20 to 40 | 12 to 20 | 12 to 20
Product condition (Note 5) | Unpowered/Powered | Powered, Monitored | Powered, Monitored
Notes:
  1. All temperature parameters pertain to the temperature of the unit being screened and not the chamber air temperature. The temperature parameters of the unit being screened are usually determined by thermocouples placed at various points on the unit being screened.
  2. PWA guidelines apply to individual PWAs and to modules, such as flow-through electronic modules consisting of one or two PWAs bonded to a heat exchanger.
  3. Unit guidelines apply to electronic boxes and to complex modules consisting of more than one smaller electronic module.
  4. It is up to the designer of the screening profile to decide which elements of the hardware (parts, solder joints, PWAs, connectors, etc.) must be subjected to the extreme temperatures in the thermal cycle. The temperature histories of the various elements in the hardware being screened are determined by means of a thermal survey.
  5. Power is applied during the low to high temperature excursion and remains on until the temperature has stabilized at the high temperature. Power is turned off on the high to low temperature excursion until stabilization at the low temperature. Power is also turned on and off a minimum of three times at temperature extremes on each cycle.
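
As a rough illustration of how the baseline values in Table 17 translate into chamber time, the sketch below estimates a lower bound on cycle duration from the temperature range, the rate of change, and the unmonitored soak time. The function name `min_cycle_minutes` is hypothetical, and the estimate deliberately ignores stabilization time, which depends on the thermal survey and cannot be derived from the table alone.

```python
def min_cycle_minutes(t_low_c, t_high_c, rate_c_per_min, soak_min):
    """Lower bound on one thermal cycle: two ramps plus a soak at each extreme.

    Stabilization time is NOT included (assumption: it must come from a
    thermal survey of the actual hardware).
    """
    span_c = t_high_c - t_low_c
    ramp_min = span_c / rate_c_per_min      # minutes for one temperature excursion
    return 2 * ramp_min + 2 * soak_min


# Baseline PWA values from Table 17: -50 C to +75 C, 20 C/minute, 5 minute soak.
per_cycle = min_cycle_minutes(-50, 75, 20, 5)

# 20 cycles is the low end of the PWA range in Table 17.
total_hours = 20 * per_cycle / 60
print(per_cycle, total_hours)
```

Under these assumptions the fastest baseline PWA screen still occupies the chamber for several hours, which is one reason the profile should be tailored rather than adopted unchanged.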

Table 18. Baseline ESS Random Vibration Profile
Characteristic | Level of Assembly: PWA (Note 1) | Unit | System
Overall Response Level (Note 2) | 6 gRMS | 6 gRMS | 6 gRMS
Frequency (Note 3) | 20 - 2000 Hz | 20 - 2000 Hz | 20 - 2000 Hz
Axes (Note 4) (sequentially or simultaneously) | 3 | 3 | 3
Duration, axes sequentially | 10 Minutes/Axis | 10 Minutes/Axis | 10 Minutes/Axis
Duration, axes simultaneously | 10 Minutes | 10 Minutes | 10 Minutes
Product condition | Unpowered (Powered if purchased as an end item deliverable or as a spare) | Powered, Monitored | Powered, Monitored
Notes:
Pure random vibration or quasi-random vibration are considered acceptable forms of vibration for the purpose of stress screening. The objective is to achieve a broad-band excitation.
  1. When random vibration is applied at the unit level, it may not be cost effective at the PWA level. However, PWAs manufactured as end item deliverables or spares may require screening using random vibration as a stimulus. However, at the system level, when a response survey indicates that the most sensitive PWA is driving the profile in a manner that causes some PWAs to experience a relatively benign screen, that PWA should be screened individually. Each PWA screened separately should have its own profile determined from a vibration response survey.
  2. The preferred power spectral density for 6 gRMS consists of 0.04 g²/Hz from 80 to 350 Hz with a 3 dB/octave roll off from 80 to 20 Hz and a 3 dB/octave roll off from 350 to 2000 Hz.
  3. Vibration input profile for each specific application should be determined by vibration response surveys which identify the correlation between input and structural responses. Higher frequencies are usually significantly attenuated at higher levels of assembly.
  4. Single axis or two axis vibration may be acceptable if data shows minimal flaw detection in the other axes.
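
The overall level quoted in Note 2 can be checked by integrating the power spectral density and taking the square root of the area. The sketch below does this for the profile described in Note 2, assuming that "roll off from 80 to 20 Hz" means the density falls by 3 dB per octave as frequency decreases from 80 Hz, and that each sloped segment is a straight line on log-log axes. The helper name `segment_area` is hypothetical.

```python
import math


def segment_area(f1, f2, psd1, slope_db_per_octave):
    """Area under a PSD segment from f1 to f2 (Hz) that starts at psd1 (g^2/Hz)
    and changes by a constant number of dB per octave."""
    m = slope_db_per_octave / (10 * math.log10(2))   # PSD is proportional to f**m
    if abs(m + 1) < 1e-12:                           # special case: PSD ~ 1/f
        return psd1 * f1 * math.log(f2 / f1)
    return psd1 * f1 / (m + 1) * ((f2 / f1) ** (m + 1) - 1)


PLATEAU = 0.04                                       # g^2/Hz between 80 and 350 Hz
m_up = 3 / (10 * math.log10(2))
psd_at_20 = PLATEAU * (20 / 80) ** m_up              # density at the 20 Hz corner

area = (segment_area(20, 80, psd_at_20, +3)          # rising segment
        + segment_area(80, 350, PLATEAU, 0)          # flat plateau
        + segment_area(350, 2000, PLATEAU, -3))      # falling segment
grms = math.sqrt(area)
print(round(grms, 2))
```

The result comes out at roughly 6 gRMS, consistent with the overall response level in Table 18. A tailored profile from a vibration response survey would replace the corner frequencies and plateau value used here.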

The ESS program should be constantly monitored for effectiveness in preventing defects from causing failures after delivery, and changed as required. All failures of equipment during ESS should be analyzed as to root cause and appropriate corrective action taken and verified. The ultimate goal is to eliminate the need for ESS by eliminating the causes of defects.

3.12 Production Reliability Acceptance Test (PRAT)

3.12.1 Purpose. A production reliability acceptance test (PRAT) is performed to detect any degradation in the inherent reliability of a product over the course of production. It is used to ensure that delivered products continue to meet customers' requirements and/or expectations.

3.12.2 Benefit. During production, any delay in finding a product reliability problem results in a proportionate number of dissatisfied customers complaining about degraded reliability. This is usually costly, and can often be disastrous. Companies have gone out of business because products were sold with a serious undiscovered reliability problem that became evident during customer use or because the product did not perform as reliably as advertised. PRAT is intended to minimize the impact of production reliability problems by providing timely warning, and the supporting data needed, for corrective action.

3.12.3 Timing. PRAT only takes place during the Production/Manufacturing phase of the product life cycle. Depending on the method used, PRAT can be periodic or continuous during production, and can be done on a sampling basis or on 100% of the products, depending on the nature of the product and its market environment.

3.12.4 Application Guidelines. There are at least four different ways of testing during production, each with certain advantages and penalties. These are:

Periodic repetition of an RQT. Many product developments include the demonstration of reliability before production begins, using a test with known and acceptable statistical risks of error. This test is known as a reliability qualification test (RQT). The simplest form of RQT is a test run in an environment simulating use conditions for a given period of time with a specified allowable number of failures. If the test is completed without exceeding the allowable number of failures, the product reliability is considered acceptable. Otherwise, it is considered unacceptable. The test time and allowable number of failures are chosen to satisfy two specified statistical risks. These are the "consumer's risk," which is the probability of accepting a product with an achieved reliability (expressed as a mean time between failures, or MTBF) that is considered unacceptable (θ1), and the "producer's risk," which is the probability of rejecting a product with a reliability (MTBF) that is considered acceptable (θ0). Based on an assumption of a constant failure rate, Table 19 presents the test times and allowable number of failures for a variety of producer's and consumer's risks, where the total test time is the sum of the test time of all products on test.

For low risks and/or high MTBFs, the test times indicated in Table 19 may be excessively long. For this reason, sequential tests are often used. These permit decisions to be made before the allowable number of failures or the scheduled test time have elapsed. Instead, if a combination of failures and elapsed time is more likely to occur when the test units have the unacceptable MTBF, as compared to units having the acceptable MTBF, a reject decision is made. Where a combination of failures and elapsed time is more likely to occur when the test units have the acceptable MTBF, an accept decision is made. Where neither condition is satisfied, the test continues to an arbitrary truncation point (used to ensure it will never run significantly longer than an equivalent fixed time test). Figure 8 illustrates a sequential test. Failures are plotted against time, and when the plot "escapes" the continue test region, a decision is made to accept or reject the product, as appropriate. A number of representative sequential tests are summarized in Table 20.

Table 19. Fixed Length Test Plans
Nominal Producer's Risk | Nominal Consumer's Risk | Discrimination Ratio θ0/θ1 | Test Duration (Multiples of θ1) | Reject (Equal or More Failures) | Accept (Equal or Less Failures)
10% | 10% | 1.5 | 45.0 | 37 | 36
10% | 20% | 1.5 | 29.9 | 26 | 25
20% | 20% | 1.5 | 21.5 | 18 | 17
10% | 10% | 2.0 | 18.8 | 14 | 13
10% | 20% | 2.0 | 12.4 | 10 | 9
20% | 20% | 2.0 | 7.8 | 6 | 5
10% | 10% | 3.0 | 9.3 | 6 | 5
10% | 20% | 3.0 | 5.4 | 4 | 3
20% | 20% | 3.0 | 4.3 | 3 | 2
30% | 30% | 1.5 | 8.0 | 7 | 6
30% | 30% | 2.0 | 3.7 | 3 | 2
30% | 30% | 3.0 | 1.1 | 1 | 0
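
The risks behind any row of Table 19 can be checked with a short calculation. Under the constant failure rate assumption stated in the text, the number of failures in a fixed total test time follows a Poisson distribution, so both risks are Poisson tail probabilities. The sketch below is one way to compute them; the function names `poisson_cdf` and `fixed_length_risks` are hypothetical, and the actual risks will differ slightly from the nominal ones because failure counts are integers.

```python
import math


def poisson_cdf(k, mean):
    """P(N <= k) for a Poisson-distributed failure count with the given mean."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i)
               for i in range(k + 1))


def fixed_length_risks(duration_theta1, accept_max, ratio):
    """Return (producer's risk, consumer's risk) for a fixed-length test plan.

    duration_theta1: total test time in multiples of theta1
    accept_max:      largest failure count that still accepts the product
    ratio:           discrimination ratio theta0/theta1
    """
    # Consumer's risk: accepting when the true MTBF equals theta1.
    consumers = poisson_cdf(accept_max, duration_theta1)
    # Producer's risk: rejecting when the true MTBF equals theta0.
    producers = 1 - poisson_cdf(accept_max, duration_theta1 / ratio)
    return producers, consumers


# The 10%/10% plan with a discrimination ratio of 2.0 from Table 19.
alpha, beta = fixed_length_risks(18.8, 13, 2.0)
print(f"producer's risk {alpha:.3f}, consumer's risk {beta:.3f}")
```

Both values should land near the nominal 10% for this plan. The same function can be used to see how a shorter test with fewer allowed failures trades test time for higher risk.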


Figure 8. Typical Sequential Test

Table 20. Sequential Test Plans
Nominal Producer's Risk | Nominal Consumer's Risk | Discrimination Ratio θ0/θ1 | Time to Accept Decision in MTBF (θ1 Multiples): Min | Exp (Note 1) | Max (Note 2)
10% | 10% | 1.5 | 6.6 | 25.95 | 49.5
20% | 20% | 1.5 | 4.19 | 11.4 | 21.9
10% | 10% | 2.0 | 4.40 | 10.2 | 20.6
20% | 20% | 2.0 | 2.80 | 4.8 | 9.74
10% | 10% | 3.0 | 3.75 | 6.0 | 10.35
20% | 20% | 3.0 | 2.67 | 3.42 | 4.5
30% | 30% | 1.5 | 3.15 | 5.1 | 6.8
30% | 30% | 2.0 | 1.72 | 2.6 | 4.5
Notes:
  1. Expected test time assumes a true MTBF is equal to θ0 (acceptable MTBF)
  2. Arbitrary truncation point
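
The accept, reject, or continue logic of a sequential test can be sketched with the classical Wald sequential probability ratio test for a constant failure rate. This is an assumption made for illustration: the plans summarized in Table 20 use adjusted boundaries and a truncation point, neither of which is modeled here, and the function name `sprt_decision` is hypothetical.

```python
import math


def sprt_decision(failures, total_time, theta0, theta1, alpha, beta):
    """Wald sequential probability ratio test for a constant failure rate.

    theta0: acceptable MTBF, theta1: unacceptable MTBF (same units as total_time)
    alpha:  producer's risk, beta: consumer's risk
    Returns "accept", "reject", or "continue". No truncation is applied.
    """
    # Log of (likelihood if MTBF = theta1) / (likelihood if MTBF = theta0).
    log_ratio = (failures * math.log(theta0 / theta1)
                 - (1 / theta1 - 1 / theta0) * total_time)
    reject_bound = math.log((1 - beta) / alpha)
    accept_bound = math.log(beta / (1 - alpha))
    if log_ratio >= reject_bound:
        return "reject"      # data are much more likely under theta1
    if log_ratio <= accept_bound:
        return "accept"      # data are much more likely under theta0
    return "continue"


# Illustrative values: theta1 = 100 h, theta0 = 200 h, 10% risks.
print(sprt_decision(0, 400, 200, 100, 0.10, 0.10))   # continue
print(sprt_decision(0, 450, 200, 100, 0.10, 0.10))   # accept
print(sprt_decision(5, 100, 200, 100, 0.10, 0.10))   # reject
```

With no failures, this unadjusted rule first accepts at about 4.4 multiples of θ1 for 10% risks and a discrimination ratio of 2.0, which is close to the minimum time listed for that plan in Table 20.
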
The simplest approach to PRAT, assuming that an RQT has been performed, is to repeat the RQT at intervals during the production run. The advantage is the use of a familiar test procedure which the product is known to have passed. A variation is to use an RQT with higher risks, which has the advantage of shorter test times. The disadvantage of this approach is somewhat subtle, in that the repetition of a test increases the chance of error. For example, assume an equipment has an inherent mean time between failures (MTBF) which gives it a 90% probability of passing a certain RQT. If two tests are scheduled during production, the probability that it will pass both is (0.90)^2 or 81%. If six tests are scheduled during a long production run, the probability that the product will pass all of the tests would be (0.90)^6 or 53%. Thus, a product which would pass an RQT 90% of the time would have only an even chance of passing six production reliability acceptance tests, with no change in its inherent reliability.
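
The arithmetic in this example generalizes to any number of repeated tests. The short sketch below assumes the tests are statistically independent, which the example implies but does not state; the function name `pass_all_probability` is hypothetical.

```python
def pass_all_probability(p_single, n_tests):
    """Probability of passing every one of n independent tests,
    each of which is passed with probability p_single."""
    return p_single ** n_tests


for n in (1, 2, 6):
    print(n, round(pass_all_probability(0.90, n), 2))
```

For a 90% single-test pass probability this gives 0.9, 0.81, and 0.53 for one, two, and six tests, matching the figures in the text.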

The all-equipment production reliability acceptance test. As the name implies, every production equipment is subjected to a specified number of hours on test, and the total test time and number of failures are used to determine rejection or continued acceptance of the product for shipment to customers/distributors. One possible test is illustrated in Figure 9.

Figure 9. All-Equipment Production Reliability Acceptance Test

During the test, time and failures are plotted, and as long as the plot remains within the accept and continue test region, the product is considered acceptable for shipment to customers. Should the plot enter the reject region, shipments should be stopped until the reliability problem is rectified. The plot is not allowed to enter the region below the boundary line. If the plot contacts the boundary line, it stays in place until the next failure occurs. Thus, the plot is never farther away from rejection than the distance between the boundary and reject lines. This is to ensure fast response to the appearance of a reliability problem.

The problem with the all-equipment test is that the probability of reaching a reject decision is a function of the total test time. If the total test time is short relative to the desired MTBF, it is easy to pass and even poor values of MTBF will pass too often. If the total test time is long, even acceptable values of MTBF will be rejected. For the test parameters of Figure 9, the probability of acceptance for various test lengths is as shown in Figure 10.

Figure 10. Probability of Acceptance for Various Test Lengths

Bayesian reliability testing. Bayesian reliability tests are based on the premise that data available before testing should be used with the test data to decide acceptability. The advantage of such tests is that a favorable prior (data before the test) provides a shorter test than less favorable data, and that a new prior is computed after each test based on the test results. Disadvantages are that, despite academic interest, there have been few practical uses made of Bayesian reliability tests, and no standard references exist.

Statistical process control. Statistical Process Control (SPC) has long been used as a means for controlling critical parameters of a product during production. Reliability can be handled by SPC as well as any other parameter. The SPC approach does not have the statistical problems of the other methods for production reliability measurement. However, when using it to measure failure rates, it may have a serious disadvantage. This is because sample sizes should be large enough so that any significant deviation from expected reliability will cause at least one failure in the sample. For products with very low failure rates (very high MTBFs), it may be impractical to use large enough samples. Currently, there is no problem in tracking such measures as defects per unit in large items produced in large numbers, such as automobiles.
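
The sample size concern can be made concrete with a small calculation. Assuming a constant failure rate (the same assumption used for Table 19), the probability of seeing at least one failure in T total unit-hours at an MTBF of θ is 1 - exp(-T/θ). The sketch below solves for T; the function name and the numbers in the example are hypothetical.

```python
import math


def unit_hours_for_detection(degraded_mtbf_hours, detect_probability):
    """Total unit-hours in the sample needed so that at least one failure
    occurs with the given probability, if the MTBF has dropped to
    degraded_mtbf_hours (constant failure rate assumed)."""
    return -degraded_mtbf_hours * math.log(1 - detect_probability)


# Hypothetical case: detect a drop to a 1,000 hour MTBF with 90% probability.
hours = unit_hours_for_detection(1000, 0.90)
print(round(hours))
```

The required hours scale directly with the MTBF being detected, which illustrates why very reliable products may need impractically large samples.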

3.13 Statistical Process Control (SPC)

3.13.1 Purpose. Statistical Process Control (SPC) is designed to ensure that a manufacturing process continues to produce products with no more than expected variation in critical parameters. It provides warning when some factor has caused the process to deviate from its intended operational limits.

3.13.2 Benefit. Many factors can affect a process. A bad lot of parts, an untrained worker, tool wear, etc., can each alter a process so that the resulting products are not as originally designed, potentially resulting in degraded inherent reliability. SPC provides a continual check on critical parameters and warns when they vary outside of expected, statistically-based limits.

3.13.3 Timing. SPC can only be implemented after a process is standardized, and is intended for long term processes, such as assembly lines in the Production/Manufacturing phase. The processes to which SPC will be applied, however, should be identified during the Concept/Planning and/or Design/Development phases of the product life cycle.

3.13.4 Application Guidelines. Measured parameters of sample products taken from a stable process will be distributed in a normal bell-shaped distribution with a mean equal to the process mean and standard deviation related to the process variation. It is generally assumed that normal variation will result in measurements varying randomly, within plus or minus three standard deviations from the mean (four standard deviations in the automotive industry). Measurements outside these control limits, or showing non-random patterns, indicate that the process has changed. Corrective action is then warranted. Each measurement (the mean of several sample products from one production lot) is recorded sequentially on a control chart marked with the desired mean value and the control limits, so that the randomness of the measurements, as well as their extreme values, can be observed.

Measured parameters can be variables, such as power or frequency; proportions, such as percent defective products; or rates, such as defects per system or failure rate. Each type of parameter will require a different method for computing process variability, resulting in a different equation for computing the standard deviation of the sample, and, in turn, the upper and lower control limits. Whether the sample size is constant or changing will also make a difference. Finally, when measuring variables, the range between the highest and lowest measurement is usually also of interest. Table 21 presents the information needed to set up control charts for any of these situations. Table 22 provides the additional data constants needed to calculate the control chart limits.

Table 22. Control Chart Constants
Constant Number of Samples (n)
2 3 4 5 6 7 8 9 10 12 15 20
A2 1.88 1.02 .73 .58 .48 .42 .37 .34 .31 .27 .22 .18
D3 0 0 0 0 0 .08 .14 .18 .22 .28 .35 .41
D4 3.27 2.57 2.28 2.11 2.00 1.92 1.86 1.82 1.78 1.72 1.65 1.59
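
The constants in Table 22 are the ones used in the conventional mean and range (X-bar and R) control charts for variables. Table 21 is not reproduced in this text, so the sketch below relies on the standard Shewhart formulas and assumes they match what Table 21 specifies. The function name `xbar_r_limits` and the example measurements are hypothetical.

```python
# Constants copied from Table 22, keyed by the number of samples n.
N = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20]
A2 = dict(zip(N, [1.88, 1.02, 0.73, 0.58, 0.48, 0.42, 0.37, 0.34, 0.31, 0.27, 0.22, 0.18]))
D3 = dict(zip(N, [0, 0, 0, 0, 0, 0.08, 0.14, 0.18, 0.22, 0.28, 0.35, 0.41]))
D4 = dict(zip(N, [3.27, 2.57, 2.28, 2.11, 2.00, 1.92, 1.86, 1.82, 1.78, 1.72, 1.65, 1.59]))


def xbar_r_limits(grand_mean, mean_range, n):
    """Return (lower, upper) control limits for the mean chart and range chart.

    grand_mean: average of the sample means
    mean_range: average of the sample ranges
    n:          number of items measured per sample
    """
    half_width = A2[n] * mean_range
    return {
        "mean": (grand_mean - half_width, grand_mean + half_width),
        "range": (D3[n] * mean_range, D4[n] * mean_range),
    }


# Hypothetical process: samples of 5, grand mean 50.0, average range 4.0.
limits = xbar_r_limits(50.0, 4.0, 5)
print(limits)
```

A sample mean or range falling outside these limits, or a non-random pattern inside them, would signal that the process has changed.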

As an example, suppose an aircraft manufacturer believes that the inherent reliability of his product, expressed in defects per aircraft found in final inspection, is 0.8. The manufacturer decides to apply SPC to ensure that production factors do not degrade this value. Since defects per aircraft is a rate, one of the rate formulas from Table 21 will be used. The data can be plotted by the number of aircraft inspected or by month. In either case, the centerline of the control chart (r or μ) will be 0.8. For a plot by aircraft, the control limits (from Table 21) would be:

      UCL = r + 3√r = 0.8 + 3√0.8 = 0.8 + 3(0.89) = 3.47

      LCL = r - 3√r = 0.8 - 3√0.8 = 0.8 - 3(0.89) = -1.87 = 0*

*negative control limits typically have no meaning and are set to zero.

Hence, so long as the defects per aircraft varied randomly and never exceeded 3.47 (since the number of defects is an integer, the limit would be 3) for any aircraft, the manufacturer would conclude that the process is in control and not degrading the inherent reliability. If either of these conditions was violated, the manufacturer would search for the root cause of the process problem.

Table 21. Process Control Charts

Should the manufacturer wish to plot a control chart by month rather than by aircraft, the control limits will vary depending on the number of aircraft produced each month.
From Table 21:

      UCL = μ + 3 (√μ / √n) = 0.8 + (3√0.8) / √n

      LCL = μ - 3 (√μ / √n) = 0.8 - (3√0.8) / √n

where n = sample size (aircraft produced in one month)

For a month in which 3 aircraft were produced:

      UCL = 0.8 + 3√0.8 / √3 = 0.8 + 3(0.89)/(1.73) = 2.34

      LCL = 0.8 - 3√0.8 / √3 = 0.8 - 3(0.89)/(1.73) = -0.74 = 0

If there were no more than 2.34 defects per aircraft (mean value) in a month when three aircraft were produced, the process would be considered in control.
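
Both sets of control limits in this example follow from the same rate formula, with n = 1 for a plot by aircraft. The sketch below reproduces them; the function name `rate_chart_limits` is hypothetical. Because it carries √0.8 at full precision rather than rounding to 0.89, the upper limits come out about 0.01 higher than the hand calculations above.

```python
import math


def rate_chart_limits(mean_rate, n=1):
    """Three-sigma control limits for a rate (defects per unit) averaged
    over n units. Negative lower limits are set to zero."""
    half_width = 3 * math.sqrt(mean_rate / n)
    upper = mean_rate + half_width
    lower = max(0.0, mean_rate - half_width)
    return lower, upper


per_aircraft = rate_chart_limits(0.8)          # plot by individual aircraft
per_month = rate_chart_limits(0.8, n=3)        # month with 3 aircraft produced
print([round(x, 2) for x in per_aircraft])
print([round(x, 2) for x in per_month])
```

Calling the function with each month's production count gives the varying limits needed for a chart plotted by month.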

3.14 Market Survey

3.14.1 Purpose. A market survey is an effort to determine the needs, desires and perceptions of a customer (or potential customer) for present or future products made by a supplier. For ensuring inherent reliability, market surveys are used to obtain feedback from customers on the achieved reliability of the product, and to identify any failure trends that may indicate that the inherent reliability of the product is being compromised.

3.14.2 Benefit. A product not meeting the needs of customers will not sell. One meeting their desires as well as their needs may have a competitive advantage over other products that just meet the essential needs. In reliability, as with all product parameters, the most important measure is the customer's subjective opinion. It is, therefore, essential to know the needs, desires and perceptions of the customer. In addition, customers often experience reliability problems unknown to the supplier. A market survey to identify these problems is essential to their correction.

3.14.3 Timing. A market survey should be performed in the Concept/Planning phase to determine the needs of potential customers. Another is needed in the Operation/Repair phase to obtain feedback on customer satisfaction and identify specific problems. In between, sufficient attention should be paid to track the progress of the product in meeting the needs of the customer in order to detect changes which may influence modifications in the product design or manufacturing processes.

3.14.4 Application Guidelines. A market survey in the Concept/Planning phase can be taken by a special effort to communicate with potential customers through a one-time mail or telephone contact with a list of relevant questions. General likes, dislikes, and the perception of the producer's standing among competitors can be obtained in this manner.

Specific questions might include the potential rating of the desirability of possible features of the product (e.g., On a scale of one to five, how much would you value an electric starter on a snowblower? How about a headlight on the machine?). The results, plus additional suggestions from the customers, can be used to create a product design, which will, in turn, be a factor in determining the inherent reliability. Other questions can help establish the minimum reliability and maintainability values acceptable to the customer (e.g., How many days of operation would you expect between servicing or repair of your snowblower?). Customer perceptions of the manufacturer's products can also be identified (e.g., On a scale of one to five with five being the best in the industry, how would you rate the reliability of our products?). All this baseline data can be used to establish an inherent reliability. Later market surveys can be used to help establish the achieved reliability for comparison purposes. These later surveys can include the use of questionnaires similar to the initial survey, but for maximum effectiveness in ensuring reliability, should encompass a variety of other activities.

Specific problems with current products and clues to new features of interest can be obtained by day to day customer service interaction. The information derived from direct contact with the customer should not be disregarded once the immediate problems are solved. A feedback loop between customer service offices and product designers should be established.

For reliability assurance, special efforts should be made to obtain information on failures at the customer's site. If the customer has a failure reporting, analysis and corrective action system, failure results should be obtained and reviewed on a routine basis, if possible. For customers who do not use a failure reporting system, a special data collection effort at the start of customer operations may be worthwhile.

When a feature (e.g., high MTBF) is known to be a customer concern, the supplier needs to know how well that concern is being addressed. Benchmarking is a survey of the performance of leaders in the field as a means to judge one's own efforts. It is, of course, valuable to compare oneself to the competition, but it is also valuable to make comparisons to non-competitors who excel at similar processes. For example, it is good if the field failure rate of a manufacturer's product compares favorably to that of its competitors. It is even better if it compares well to that of the best comparable product in the marketplace.

One form of market survey is the use of a trial or focus group. For example, a proposed maintenance procedure could be tested by letting a group of technicians from another part of the company try it. From their comments, the manufacturer should gain some feel for how customers will react to it, and some ideas for improvement.

3.15 Inspection

3.15.1 Purpose. Inspection is the examination of a product to ensure that it was produced in accordance with design and manufacturing specifications. Its purpose is to find, rather than prevent, defects. In a broad sense, however, it is intended to eliminate defects from delivered products. Information from inspections can, and should, be used to locate and correct the root causes of product defects.

3.15.2 Benefit. Manual inspection is an inefficient and expensive way to detect defects. However, the feedback from inspections is a necessary part of training workers and adjusting processes to reduce variability in the product. As the product improves, inspection becomes less important, and can be done on a sample basis, or performed by the production workers themselves.

3.15.3 Timing. If product inspection is to be used, it should be implemented at the appropriate process steps during the Production/Manufacturing phase. Full-time inspectors may be necessary until workers are sufficiently trained to inspect their own "products", particularly in those cases where manufacturing processes may not allow the workers to adequately inspect the product, or where confidence needs to be built in developing a new process or product. Inspections should be done at logical interfaces between processes. For example, a board may be cleaned, mounted with components, soldered and then integrated into a higher-level assembly. A good inspection point would ordinarily be between the soldering and the integration into the assembly. During the "tuning up" of a process, products should be 100% inspected by the workers who produce them and 100% inspected again by full-time inspectors. After a process has been found satisfactory, the full-time inspectors may reduce their inspections to sampling, or leave inspection to the production workers, who will continue a 100% inspection, preferably aided by automated techniques.

3.15.4 Application Guidelines. A pilot's pre-flight check is an inspection. So are the annual automobile safety inspections mandated by many states. A driver's glance at his tires to check tread wear is an inspection. All of these scenarios have obvious benefits in detecting something which is not normal or expected. None of these inspections is particularly tedious. However, the inspection of hundreds or thousands of products in a factory requires a sustained level of attention that is difficult to maintain. As a result, many errors are made. For this reason, effort spent on process control (trying to eliminate the need for inspection by preventing defects) is usually quite profitable. Another approach is "Poka-Yoke" (mistake proofing), which is essentially an automated inspection philosophy. It includes the use of guide pins of different sizes, so parts of an assembly cannot be put together incorrectly; error detection and alarms for automated indication of defects; limit switches to preclude error; and counters and checklists for the operator to use in assuring that nothing has been overlooked. This 100% automated inspection technique is part of a quality system created by Shigeo Shingo for the Toyota automobile production line. The system has three components:
  • Source inspection (the search for root causes of defects to be eliminated).
  • Poka-Yoke (the automated inspection for defects and a supplier of data for source inspection).
  • Immediate action (the production line stops when a problem occurs until it is fixed).
Inspections to remove (or prevent) defects should be done on 100% of the product. After process changes have removed the causes of defects, inspections may be performed on a sampling basis, to monitor the product rather than to precipitate defects. These inspections often apply statistical accept/reject criteria, where a reject decision results in a search for the root cause of a problem and/or the return to 100% inspection. Production reliability acceptance testing and statistical process control are forms of sampling with statistical accept/reject criteria. Another form of sampling plan can be created from the cumulative Poisson distribution equation. The cumulative Poisson technique determines the probability of "x" events or less occurring when "μ" events are expected in a sample, and is stated as follows:
$$P_x = \sum_{k=0}^{x} \frac{\mu^{k} e^{-\mu}}{k!}$$

The expected number of events (μ) equals the rate of occurrence of the event times the sample size. For example, if 5 defective parts were expected for each 1000 parts produced (a defect rate of .005 per part), a sample of 100 parts would be expected to contain 0.5 defects. Applying the Poisson equation with μ = 0.5 produces the following results:

Number of Defects in Sample (x) | Probability of Exactly x Defects in Sample (P) | Probability of x or Fewer Defects in Sample (Px)
0 | .6065 | .6065
1 | .3033 | .9097
2 | .0758 | .9855
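
As a check on the "exactly x" column, each entry can be worked by hand from the single Poisson term μ^x e^(-μ) / x! with μ = 0.5. The lines below are a worked sketch rounded to four places; the cumulative column is the running sum of these terms.

```latex
\begin{align*}
P(0) &= \frac{0.5^{0}\, e^{-0.5}}{0!} = e^{-0.5} \approx 0.6065 \\
P(1) &= \frac{0.5^{1}\, e^{-0.5}}{1!} = 0.5\, e^{-0.5} \approx 0.3033 \\
P(2) &= \frac{0.5^{2}\, e^{-0.5}}{2!} = 0.125\, e^{-0.5} \approx 0.0758
\end{align*}
```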

These data permit the creation of sampling plans based on the acceptable risk of rejecting a good lot of parts. For example, a sampling plan could be to inspect a sample of 100 parts and reject the lot if more than one defect is found. The data show that a sample of 100 parts with the expected defect rate would have one defect or fewer 90.97% of the time. Hence, we could reject lots whose samples have more than one defect, with less than a 10% risk (1 - .9097) of rejecting a good lot, where "good" means that the lot meets the expected defect rate of .005 per part. If there is confidence that the processes are in control, a trade-off might be to lower the risk of rejecting a good lot by allowing two defects per sample, which would reject a good lot with a probability of less than 2% (1 - .9855). This lower risk will make the inspection less sensitive to process changes that may cause higher defect rates in the product. Using the Poisson equation, sample sizes and accept/reject criteria can be selected to provide any desired risk of rejecting a good lot, given the expected defect rate.
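
The selection of an accept/reject criterion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed procedure: the function names (`poisson_pmf`, `poisson_cdf`, `acceptance_number`, `acceptance_probability`) are hypothetical, and the defect count in a sample is assumed to follow the Poisson distribution with mean equal to the defect rate times the sample size. The computed probabilities agree with the table to within rounding in the last digit.

```python
import math


def poisson_pmf(x: int, mu: float) -> float:
    """Probability of exactly x events when mu events are expected."""
    return math.exp(-mu) * mu ** x / math.factorial(x)


def poisson_cdf(x: int, mu: float) -> float:
    """Probability of x or fewer events when mu events are expected."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))


def acceptance_number(defect_rate: float, sample_size: int, max_risk: float) -> int:
    """Smallest allowed defect count c for which the risk of rejecting a
    lot at the expected defect rate, P(more than c defects), is at most
    max_risk. (Hypothetical helper for illustration.)"""
    if not 0.0 < max_risk < 1.0:
        raise ValueError("max_risk must be between 0 and 1")
    mu = defect_rate * sample_size  # expected defects in the sample
    c = 0
    while 1.0 - poisson_cdf(c, mu) > max_risk:
        c += 1
    return c


def acceptance_probability(defect_rate: float, sample_size: int, c: int) -> float:
    """Probability that a lot with the given defect rate passes a plan
    that allows at most c defects in the sample."""
    return poisson_cdf(c, defect_rate * sample_size)


if __name__ == "__main__":
    rate, n = 0.005, 100  # values from the example in the text
    for risk in (0.10, 0.02):
        c = acceptance_number(rate, n, risk)
        print(f"risk <= {risk:.0%}: accept lots with up to {c} defect(s) in the sample")

    # Sensitivity check: how often would each plan still accept a lot
    # whose defect rate has doubled to 0.01 (an assumed degraded process)?
    for c in (1, 2):
        p = acceptance_probability(0.01, n, c)
        print(f"allow {c} defect(s): degraded lot accepted with probability {p:.4f}")
```

With the example values, `acceptance_number` returns 1 for a 10% risk and 2 for a 2% risk, matching the two plans discussed in the text. The sensitivity loop illustrates the trade-off: allowing two defects lowers the risk of rejecting a good lot, but it also accepts a lot with twice the expected defect rate more often (about 92% of the time, versus about 74% when only one defect is allowed).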

Processes may also be inspected. For example, the ISO 9000 series of quality program standards and the Malcolm Baldrige National Quality Award provide criteria for the inspection of a company's quality management process by an outside agency. When an authorized inspector decides a company meets the criteria of ISO 9001, it is accepted that its quality management process is adequate in all of the elements described in that standard. An auditor for the Malcolm Baldrige Award uses its criteria to accept or reject the company as outstanding in the pursuit of quality. The criteria used in either case can be adapted by a company for a self-inspection of its quality processes, to find areas for improvement. Since these processes directly affect the achievement of a product's inherent reliability, these process inspections can be valuable tools for ensuring reliable product performance.



SECTION FOUR - REFERENCES

The references in Table 23 provide additional information on the subjects discussed in this Blueprint. The relationship between each reference and the sections of the Blueprint is indicated in the table for each source.


Table 23. References for Ensuring Reliable Performance (excerpt)
START 99-2: Performance-Based Requirements (PBRs)
Journal Article V10, N4: Program Managers Handbook - Common Practices to Mitigate the Risk of Obsolescence
Journal Article V12, N3: An Introduction to Task Analysis
Journal Article V12, N4: Improving Mission Performance & Reducing Total Ownership Cost
Journal Article V14, N3: Developing Highly Reliable and Safe Devices
Journal Article V8, N1: A Discussion of Software Reliability Modeling Problems
START 2005-2: Understanding Binomial Sequential Testing
START 00-3: Environmental Stress Screening
Journal Article V12, N1: Multivariable Testing (MVT)
Journal Article V11, N2: Five Key Ways to Improve Reliability
START 2005-1: Operating Characteristic (OC) Functions and Acceptance Sampling Plans
Journal Article V11, N4: Applying RCM Analysis to EA-6B Corrosion Failure Modes
START 2003-7: Reliability Estimations for the Exponential Life
START 2004-3: Censored Data
Journal Article V6, N3: Reliability Growth
Journal Article V8, N3: Environmental Stress Screening
START 2004-7: Derating
Journal Article V14, N2: Evaluating Soldering Irons for Pb-Free Assembly
Journal Article V13, N2: New DoD RAM Guide
Journal Article V11, N3: A Beginners Guide to HALT
START 99-4: Accelerated Testing
Journal Article V9, N3: Markov vs. FTA
Journal Article V8, N4: Tutorial: Test Risks, Confidence and OC Curves
START 2003-8: Use of Bayesian Techniques for Reliability
START 2002-4: Statistical Confidence
START 2002-2: Statistical Assumptions of an Exponential Distribution
START 2002-6: Empirical Assessment of Normal and Lognormal Distribution Assumptions
START 2003-3: Empirical Assessment of Weibull Distribution
START 2002-5: Graphical Comparisons of Two Populations
START 2003-5: Anderson-Darling: A Goodness of Fit Test for Small Samples Assumptions
START 2003-6: Kolmogorov-Simirnov: A Goodness of Fit Test for Small Samples
START 2003-4: The Chi-Square: a Large-Sample Goodness of Fit Test
Journal Article V10, N2: Statistics - A Reliability Engineer's Tool, Not Reliability Engineering
Journal Article V14, N1: Information Management for Systems Design for RMQSI
Journal Article V7, N4: Engineering Information Assurance into Information Systems
START 95-2: Parts Management Plan
Journal Article V10, N3: Additional Sources for Supply Chain Management
Journal Article V14, N4: Electronic Component Failure Rate Prediction
START 2004-2: The RMQSI Case - A Reasoned, Auditable Argument Supporting the Contention that a System Satisfies...
Journal Article V9, N1: New Guidance for Using Performance-Based Standards
START 00-1: Sustained Maintenance Planning
Journal Article V9, N4: Real-Time Prognostic Condition-Based Maintenance for High Value Systems
Journal Article V9, N2: Non-Normal Distributions in the Real World
START 2002-1: Application of the Poisson Distribution
START 2004-1: Combining Data
Journal Article V13, N1: Form, Fit, Function, and Interface - An Element of an Open System Strategy
START 01-2: Simulation-Based Acquisition (SBA)
START 01-4: Design for Maintainability (DFM)
Journal Article V6, N2: Cost As An Independent Variable (CAIV)
START 97-3: Reliability Design for Affordability
START 00-2: Flexible Sustainment