
Assessing Reliability Progress
About the RIAC Blueprints
The RIAC "Blueprints for Product Reliability" are a series of documents published by the Reliability Information Analysis Center (RIAC) to provide insight into, and guidance in applying, sound reliability practices. The RIAC is the Information Analysis Center chartered to be a centralized source of data, information and expertise in the subjects of reliability, maintainability and quality. While sponsored by the US Department of Defense (DoD), RIAC's charter addresses both military and commercial communities with the requirement to disseminate guidance information in these subjects. The Blueprints serve to provide information on those approaches to planning and implementing effective reliability programs based on experience, lessons learned, and state-of-the-art techniques. To make the Blueprints as useful as possible, the approaches and procedures are based on the best practices used by commercial industry and on the concepts documented in many of the now-rescinded military standards. The tree shown in Figure 1 depicts the Blueprints that make up the series (the shaded second tier box indicates this Blueprint).
In the government sector, and in particular the DoD, significant changes have been made regarding the acquisition of new products. Previously, by imposing standards and specifications, a DoD customer would require contractors to use certain analytical tools and methods, perform specific tests in a prescribed manner, use components from an approved list, and so forth. Current policy emphasizes the use of commercial technology as well as specifying "performance-based" requirements only, with suppliers left to determine how to best achieve them.
Figure 1. RIAC Blueprints for Product Reliability
Users of the RIAC Blueprints
The Blueprints are designed for use in both the government and private sectors. They address products ranging from completely new commercial consumer products to highly specialized military systems. The documents are written in a style that is easy to understand and implement whether the reader is a manager, design engineer or reliability specialist. In keeping with the new philosophy of the DoD, which is now similar to that of the private sector, the Blueprints do not provide a cookbook of reliability tasks that should be applied in every situation. Instead, some general principles are cited as the underpinnings of a sound reliability program. Then, many of the tasks and activities that support each principle are highlighted in detail sufficient for the user to determine if a task or activity is appropriate to his or her situation.
SECTION ONE - INTRODUCTION
Assessment is a broad term that includes all techniques used to determine a product design status or operational capability. Assessment can include analysis, modeling, simulation, and testing. The most significant elements are analysis and modeling, as either can be accomplished early in the design process at a much lower cost than simulation or testing. Analysis is also an effective tool in evaluating design trade-offs that will result in a better product. An analysis can estimate the impact of more cooling, extra redundancy, better components, or extreme operating conditions. The purpose of this Blueprint, Assessing Reliability Progress, is to describe a number of tasks that should be considered if a producer wishes to assess the status of inherent product design reliability before manufacturing of the product begins. Each task is presented in sufficient detail to describe to the user how to do the assessment, how to interpret the results and when the task should be performed. Examples and references are included as guides for the analyst. The discussion of each assessment technique will consider:
- Purpose (what)
- Benefit (why)
- Timing (when)
- Application guidelines (how)
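As a small illustration of the kind of early trade-off analysis described above (here, the impact of extra redundancy), the following is a minimal sketch rather than a prescribed method. It assumes independent failures and active redundancy, and every reliability value in it is hypothetical.

```python
from math import prod


def series(reliabilities):
    """Series model: every block must work, so R = product of the block reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Active redundancy with independent failures: R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical three-block product; the middle block is the weakest.
baseline = series([0.99, 0.95, 0.98])

# Design alternative: duplicate the weakest block.
with_redundancy = series([0.99, parallel([0.95, 0.95]), 0.98])

print(f"baseline reliability:        {baseline:.4f}")
print(f"with one redundant block:    {with_redundancy:.4f}")
```

A comparison like this costs little before hardware exists, which is why analysis and modeling are favored early in the design process.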
SECTION TWO - GENERAL CONSIDERATIONS FOR ASSESSING RELIABILITY PROGRESS
This section addresses assessment issues that should be considered if a continuous reliability program that updates the status of the design is needed.
2.1 The Goals of Reliability Assessment
Reliability is traditionally considered to be a performance attribute that is concerned with the probability of success and frequency of failures, and is defined as:
The probability that an item will perform its intended function under stated conditions, for either a specified interval or over its useful life.
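To make the definition concrete, the sketch below evaluates reliability over a specified interval. It assumes a constant failure rate (the exponential model), which is a common simplification and not something this Blueprint prescribes. The function name and the numbers are hypothetical.

```python
import math


def reliability(failure_rate_per_hour: float, interval_hours: float) -> float:
    """R(t) = exp(-lambda * t); valid only under a constant-failure-rate assumption."""
    if failure_rate_per_hour < 0 or interval_hours < 0:
        raise ValueError("failure rate and interval must be non-negative")
    return math.exp(-failure_rate_per_hour * interval_hours)


# Hypothetical item: mean time between failures of 10,000 hours,
# assessed over a 1,000 hour interval.
mtbf_hours = 10_000.0
r = reliability(1.0 / mtbf_hours, 1_000.0)
print(f"{r:.3f}")  # exp(-0.1), prints 0.905
```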
Reliability assessments are performed to assess design progress towards meeting customer needs. In addition, assessments of product design alternatives, options and changes can be performed to evaluate their impact on customer needs, schedule and costs. The assessment process should be considered an iterative one to review reliability progress throughout the product design and development phases. Each assessment should be thought of as one step in the design decision process.
Section Three of this Blueprint describes each assessment task, indicates the proper time for implementing the task and may include an example to aid in understanding its application. Table 1 identifies those tasks (historically classified as design, analysis and test) that have been proven to effectively assess the product design throughout the entire design cycle. Continuous evaluation of product reliability will add value for the customer by reducing the number of design problems and component defects.
Table 1. Reliability Tasks for Assessing Reliability Progress
| Type of Activity | Tasks and Description | Section |
|---|---|---|
| Design | Critical Item Control. Monitoring in-house and suppliers' activities to reduce the risk to product reliability from items identified as critical. Can include hardware and software. | 3.2 |
| Design | Design Reviews. Formal or informal independent evaluation and critique of a design to identify and correct hardware or software deficiencies. | 3.3 |
| Design | Supplier Control. Monitoring suppliers' activities to assure that purchased hardware and software will have adequate reliability. | 3.4 |
| Analysis | Design of Experiments (DOE). Systematically determining the impact of process and environmental factors on a desired product parameter, in order to reduce product variability by controlling the factors. | 3.5 |
| Analysis | Dormancy Analysis. Determination of the effects of expected periods of storage or other non-operating conditions on the reliability of the product. | 3.6 |
| Analysis | Durability Analysis. Determination of whether or not the mechanical strength of a product will remain adequate for its expected life. | 3.7 |
| Analysis | Failure Modes, Effects & Criticality Analysis (FMECA). Systematically determining the effects of part or software failures on the product's ability to perform its function. This task includes FMEA. | 3.8 |
| Analysis | Failure Reporting Analysis & Corrective Action System (FRACAS). A closed-loop system of data collection, analysis and dissemination to identify and correct failures of a product or process. | 3.9 |
| Analysis | Fault Tree Analysis (FTA). Using deductive (top-down) logic to determine the possible causes of a defined undesired operational result. | 3.10 |
| Analysis | Finite Element Analysis (FEA). Determining the mechanical stresses present in products through simulation by decomposing the product into simple elements. | 3.11 |
| Analysis | Life Cycle Planning. Determining reliability (and other) requirements by considering the impact over the expected useful life of the product. | 3.1 |
| Analysis | Parts Obsolescence. Analysis of the likelihood that changes in technology will make the use of a currently available part undesirable. | 3.12 |
| Analysis | Predictions. Estimation of reliability from available design, analysis or test data, or data from similar products. | 3.13 |
| Analysis | Sneak Circuit Analysis (SCA). Investigation to discover the existence of unintended signal paths in a product. | 3.14 |
| Analysis | Thermal Analysis. Analysis of the heat dissipations, transfer paths and cooling sources to determine if part/product temperatures are consistent with reliability needs. | 3.15 |
| Analysis | Worst Case Circuit Analysis (WCCA). Analysis of the effects of variability in the components of a product on the product's performance. | 3.16 |
| Test | Accelerated Life Testing. Testing at high stress levels over compressed time periods to draw conclusions about the reliability of a product under expected operating conditions, based on formulated correlation factors. | 3.18 |
| Test | Reliability Growth Test (RGT)/Test Analyze and Fix (TAAF). Testing a product to identify reliability deficiencies in order to eliminate their causes. | 3.19 |
| Test | Test Strategy. Determination of the most cost effective mix of tests for a product. | 3.17 |
2.2 Product Program Phases
Each product, from the simplest to the most complex, passes through a sequence of phases during its life cycle. The definitions of the phases vary among commercial companies, and within the military. Table 2 describes the sequence of general phases that will be used in this document to describe a product's life.
Table 2. Product Life Cycle Phases
| Concept/Planning | Design/Development | Production/Manufacturing | Operation/Repair | Wearout/Disposal |
|---|---|---|---|---|
| Formulate ideas, estimate resources and financial needs; identify risks & requirements; program objective | Identify and allocate needs and requirements; propose alternate approaches; design and test the product; develop manufacturing, operating, and repair/maintenance tasks | Refine and implement manufacturing procedures; finalize production equipment; establish quality processes; build & distribute the product | Implement operating, installation and training procedures; provide repair and maintenance service; repair warranty items; provide for performance feedback | Implement refurbishment and disposal tasks; resolve potential wearout issues |
What sometimes distinguishes one phase from the next is a decision milestone, sometimes referred to as a "gate." It represents a point in time where the program can go forward or stop. For many products, the phases may be abbreviated or combined. For example, the Concept/Planning and Design/Development phases may be combined under a compressed schedule for a new product that is simply an update or slightly modified version of an older, proven product. Reliability tasks for this type of program would concentrate only on the differences between the old and the modified product. As a result, the number of engineering tasks would be reduced. It is important to understand that tasks performed in one phase are often the result of the analysis, trade-offs and planning performed in an earlier phase. For example, trade-offs addressing approaches to manufacturing printed circuit boards would be performed during Design/Development, with the implementation of the process decision to follow during the Production/Manufacturing phase.
2.3 Task Selection Guide
The performance of any of the reliability tasks described in this Blueprint requires a
financial and schedule commitment by the product manufacturer. Therefore, selection
of the tasks should be on a value-added basis. Figure 2 shows some of the failure
causes that a product might experience and, for each cause, appropriate reliability
analysis techniques are indicated. For example, if a product is expected to be used by a
variety of operators and may be subjected to possible operator error, tasks such as fault
tree or sneak analysis should be considered to find and eliminate potential problems.
Using this figure, a manufacturer could establish an appropriate list of reliability
assessment tasks that will potentially enhance their product. Figure 3 was adapted from
an article in the ITEA Journal of Test and Evaluation to show which reliability tasks
result in the most design changes. As can be seen, thermal analysis is by far the most
effective task and should be considered if the operational environment is more severe
than a typical office environment.
Figure 2. Product Failure Causes and Assessment Techniques
Figure 3. Design Changes As A Result Of Analysis Type*
2.4 Tailoring Instructions
For most products, the customer's reliability needs are satisfied through sound design
practices, proper application of parts and components, and good manufacturing
processes. However, for complex products that involve many vendors and designers,
interim assessment of the progress may be needed as indicated in Table 3. This table
lists a number of techniques that are useful in assessing reliability progress and includes
guidance for their use.
Most of these techniques provide valuable means of
understanding a product's design strengths and weaknesses so that appropriate changes
can be implemented.
Table 3. Application Guidance for Assessing Reliability Progress
| Tasks | Application Guidance |
|---|---|
| Accelerated Life Testing | Effective on parts, components or assemblies to identify failure mechanisms and life limiting critical components. |
| Critical Item Control | Apply when safety margins, process procedures and new technology present risk to the production of the product. |
| Design of Experiments (DOE) | Use when process physical properties are known and parameter interactions are understood. Usually done in early design phases, it can assess the progress made in improving product or process reliability. |
| Design Reviews | Continuing evaluation process to ensure details are not overlooked. Should include hardware and software. |
| Dormancy Analysis | Use for products that have "extended" periods of non-operating time, unusual non-operating environmental conditions, or high cycle on-and-off periods. |
| Durability Analysis | Use to determine cycles to failure or determine wearout characteristics. Especially important for mechanical products. |
| Failure Modes, Effects and Criticality Analysis (FMECA) | Applicable to equipment performing critical functions (e.g., control systems) when the need to know consequences of lower level failures is important. |
| Failure Reporting Analysis and Corrective Action System (FRACAS) | Use when iterative tests or demonstrations are conducted on breadboard or prototype products to identify mechanisms and trends for corrective action. |
| Fault Tree Analysis (FTA) | Use for complex systems evaluation of safety and system reliability. Apply when the need to know what caused a hypothesized catastrophic event is important. |
| Finite Element Analysis (FEA) | Use for designs that are unproven with little prior experience/test data, that use advanced/unique packaging/design concepts, or will encounter severe environmental loads. |
| Life Cycle Planning | Use to strategize value-added mix of reliability analysis/test assessment techniques. |
| Parts Obsolescence | Use to determine need and risk of application of parts and lifetime buys. |
| Predictions | Use as a general means to develop goals, choose design approaches, select components, and evaluate stresses. |
| Reliability Growth Test (RGT)/Test Analyze and Fix (TAAF) | Use when technology or risk of failure is critical to the success of the product. These tests are costly in comparison to alternative analytical assessment techniques. |
| Sneak Circuit Analysis (SCA) | Apply to operating and safety critical functions. Important for space systems and others of extreme complexity. May be costly to apply. |
| Supplier Control | Apply when high volume or new technologies for parts, materials or components are expected. |
| Test Strategy | Use when critical technologies result in high risk of failure. |
| Thermal Analysis | Use for products with high power dissipation, or thermally sensitive aspects of design. Typical for modern electronics, particularly densely packaged products. |
| Worst Case Circuit Analysis (WCCA) | Use when the need exists to determine critical component parameter variation and environmental effects on circuit performance. |
The assessment methods chosen should be appropriate to the product under development and the operating environment expected. For example, a thermal analysis may not be needed for a product operated in an air conditioned office, but should be considered for a product operated in an outside unprotected environment. The methods chosen should represent a reasonable level of investment when compared to the value of the results. For nondevelopmental items, only methods that confirm suitability of the product to the intended environment and application should be considered. Table 4 contains a list of recommended tasks as a function of several product classifications as a starting point. Tasks can be added or deleted depending on the consequence of failure of the product and the customers' expectations.
Table 4. Assessment Tasks Tailored by Product Classification
| Assessment Tasks | Consumer | | Industrial | | | Military | | |
| | Product | Durable | Equipment | System | Structure | Equipment | System | Strategic |
|---|---|---|---|---|---|---|---|---|
| Accelerated Test | | | X | | | X | | |
| Critical Items | | | | | | | X | X |
| Design of Experiment | | | | | X | | | X |
| Design Review | | | | X | | | X | |
| Dormancy | | | | | | X | | |
| Durability | | | | | X | | | X |
| Failure Modes | | X | X | X | | X | X | X |
| Failure Reporting | | X | X | X | | X | X | X |
| Fault Tree Analysis | | | | X | X | | X | X |
| Finite Element Analysis | | | | | X | | | X |
| Life Cycle Planning | | | | | X | | X | X |
| Part Obsolescence | | | | | | | | X |
| Predictions | X | X | X | X | | X | X | X |
| Reliability Growth Test | | | | | | X | | |
| Sneak Circuit | | | | X | | | X | |
| Supplier Control | | | | X | | X | X | X |
| Test Strategy | | | | X | | | X | |
| Thermal Analysis | | X | X | X | | X | X | X |
| Worst Case Analysis | | | | | | | | X |
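Table 4 can be read as a lookup from product classification to a starting task list. The sketch below is one possible encoding, not part of the Blueprint. It assumes the eight sub-columns group as Consumer (Product, Durable), Industrial (Equipment, System, Structure) and Military (Equipment, System, Strategic); the names `COLUMNS`, `TABLE_4` and `starting_tasks` are hypothetical.

```python
# Column order follows Table 4. The grouping of the eight sub-columns under
# Consumer / Industrial / Military is an assumption made for this sketch.
COLUMNS = (
    "Consumer Product", "Consumer Durable",
    "Industrial Equipment", "Industrial System", "Industrial Structure",
    "Military Equipment", "Military System", "Military Strategic",
)

# One character per column: "X" where Table 4 marks the task, "." otherwise.
TABLE_4 = {
    "Accelerated Test":        "..X..X..",
    "Critical Items":          "......XX",
    "Design of Experiment":    "....X..X",
    "Design Review":           "...X..X.",
    "Dormancy":                ".....X..",
    "Durability":              "....X..X",
    "Failure Modes":           ".XXX.XXX",
    "Failure Reporting":       ".XXX.XXX",
    "Fault Tree Analysis":     "...XX.XX",
    "Finite Element Analysis": "....X..X",
    "Life Cycle Planning":     "....X.XX",
    "Part Obsolescence":       ".......X",
    "Predictions":             "XXXX.XXX",
    "Reliability Growth Test": ".....X..",
    "Sneak Circuit":           "...X..X.",
    "Supplier Control":        "...X.XXX",
    "Test Strategy":           "...X..X.",
    "Thermal Analysis":        ".XXX.XXX",
    "Worst Case Analysis":     ".......X",
}


def starting_tasks(classification: str) -> list[str]:
    """Return the tasks Table 4 marks for one product classification."""
    col = COLUMNS.index(classification)  # raises ValueError for unknown names
    return [task for task, flags in TABLE_4.items() if flags[col] == "X"]


# The result is only a starting point; add or delete tasks based on the
# consequence of failure and the customers' expectations.
print(starting_tasks("Consumer Durable"))
```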
SECTION THREE - TASKS FOR ASSESSING RELIABILITY PROGRESS
3.1 Life Cycle Planning
3.1.1 Purpose. Basic constraints on design practices include design life and operational and environmental profiles. Life cycle planning assesses the useful life characteristics of the product based on changes in or modifications to material, parts and processes. It also addresses the concepts of planning and implementing the required reliability design, analysis, test, and repair strategies to ensure that the customers' product life requirements are achieved.
3.1.2 Benefits. Life cycle planning provides an assessment of the "big picture" in determining how to most effectively (reliable performance over the life of the product) and most efficiently (minimize product cost) meet the long-term needs of the customer. Thorough life cycle planning means that product designers are aware of the imposed constraints (performance, reliability, cost and schedule) and will use only those value-added approaches which will meet those constraints.
3.1.3 Timing. Product life cycle characteristics need to be defined early in the Concept/Planning phase. Preferred design approaches are selected based on customer needs. When life limiting materials or parts are identified, control procedures need to be instituted as soon as possible to limit life cycle costs.
3.1.4 Application Guidelines. In assessing a product for reliability, life cycle planning activities should include the selection and analysis of materials, parts, components and software (and their respective suppliers) that will meet product life requirements. Tasks that can directly impact this aspect of product assessment, either through direct selection or trade studies, include:
- Environmental characterization
- Durability assessment
- Thermal analysis
- Design of experiments (DOE)
- Dormancy analysis
- Failure mode analysis
- Reliability predictions
- Finite element analysis (FEA)
Appropriate and effective application of these tasks will result in (1) a realistic
assessment of the conditions under which the product is expected to operate and (2) a
means of evaluating materials, parts and components as being suitable to withstand the
rigors of the end-use environment. Once the design approach has been selected, life
cycle planning can be extended to include those tasks which will assess progress
towards meeting the design reliability requirements, measure the level of achieved
inherent reliability and ensure that the inherent reliability of the product is not degraded
through subsequent production/manufacturing processes and customer use.
3.2 Critical Item Control
3.2.1 Purpose. The purpose of critical item control is to limit the negative reliability impact of using highly complex, advanced state-of-the-art parts and techniques in new or modified product designs.
3.2.2 Benefits. The ability to identify, assess and control critical items is imperative since these parts often drive unreliability. The benefits from implementing a controlled critical item process can include:
- Reduction in limited life items (components that wear out before normal end of life) in the product design
- Reduced electronic circuit sensitivity
- Multiple sources for components
- Special tests to assess the ability of critical parts to meet design constraints
3.2.3 Timing. A critical item control process should be started in the Concept/Planning phase, as this is the time that assessments and trade-offs in component technologies, sources, and process techniques can be accomplished with minimum impact on design and production costs and schedule. Waiting until the Design/Development or Production/Manufacturing phases will likely result in product design susceptibility to critical item problems.
3.2.4 Application Guidelines. Critical items are those items that have a significant impact on product reliability, performance, safety, availability or life cycle cost. Critical items often include high cost components, new technology, limited life items, reliability sensitive parts, single source or custom components and single failure points (failures that cause total loss of product operation).
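The categories listed above can be applied as a simple screen over a parts list to build the initial critical items list. The following is a minimal sketch under stated assumptions: the `Item` fields mirror the categories named in the text, while the class, function names and example parts are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class Item:
    """One line of a parts list, flagged against the critical-item categories."""
    name: str
    high_cost: bool = False
    new_technology: bool = False
    limited_life: bool = False
    reliability_sensitive: bool = False
    single_source_or_custom: bool = False
    single_failure_point: bool = False


def critical_reasons(item: Item) -> list[str]:
    """Names of every category flagged for this item (empty if not critical)."""
    return [f.name for f in fields(item) if f.name != "name" and getattr(item, f.name)]


def critical_item_list(items: list[Item]) -> dict[str, list[str]]:
    """Map each critical item to the reasons it was flagged."""
    return {i.name: critical_reasons(i) for i in items if critical_reasons(i)}


# Hypothetical parts list.
parts = [
    Item("power supply", limited_life=True, single_failure_point=True),
    Item("custom ASIC", new_technology=True, single_source_or_custom=True),
    Item("chip resistor"),
]
for name, reasons in critical_item_list(parts).items():
    print(name, "->", ", ".join(reasons))
```

Recording the reason alongside each item helps decide which control (review, supplier monitoring, screening, inspection, handling procedure) is appropriate for it.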
Control of the critical item is accomplished through design reviews, monitoring suppliers, testing or screening components, inspecting the materials, establishing handling procedures and documenting the results. A typical critical item control checklist is shown in Table 5.
Table 5. Critical Item Control Checklist
| Major Concern | Recommended Action |
|---|---|
| Have compensating features been considered for the design? | Consider features like safety margins, overstress testing, or fault tolerance |
| Have reliability improvements been considered? | Evaluate special stress tests, vendor quality procedures, alternate components, operating duty cycles |
| Are overly stringent tolerances for manufacturing or performance required? | Adopt alternate vendors or process procedures |
| Does the operating environment exceed design limits? | Include fault tolerant designs, safety margins and external changes (e.g., special cooling) |
| Are design reviews utilized to control critical items? | Standardize periodic reviews for management and engineering |
| Other factors to be considered: failures jeopardizing safety; restrictions on limited life; exceeding derating practices; single sources for parts; historically failure-prone items; single failure points that disrupt mission performance | A list of critical items and personnel responsible for controlling and reviewing procedures should be established |
3.3 Design Reviews
3.3.1 Purpose. Depending on the stage of product development, design reviews may be conducted for different reasons. Some of these reasons are to:
- Ensure that the product design is reliable.
- Assess product safety margins.
- Evaluate the ease of maintenance and inspection.
- Determine if the product is manufacturable.
- Review the allocation of design requirements and analyze the product for compliance.
- Discuss product interaction concerns, i.e., design to production, production to customer use, design to customer use.
- Challenge the design from various viewpoints, e.g., safety, environment, operation, human interface.
- Determine the shortfalls of the product and issues to be resolved.
- Evaluate concurrent engineering and manufacturing processes and procedures.
3.3.2 Benefits. The benefits of an organized design review include the detailed evaluation of the product to ensure that the design or production process is technically sufficient to meet the customers' requirements for performance, cost and quality. When properly performed, the design review will ensure that no specific area of concern has been overlooked, and that lessons learned from previous reviews have been investigated so that fewer deficiencies will reach the next phase of product development. Finding and solving concerns, errors and design faults through design reviews will result in fewer redesigns, lower production costs and increased life for the product.
3.3.3 Timing. Design reviews should be an on-going process in order to be effective. A continual assessment process ensures that details are not overlooked. Reviews at each stage of product design, development and production should be conducted before proceeding to the next phase. Some of the milestones that should be considered as potential review points are:
- Completion of customer requirement assessment (actual or derived).
- Completion of specification and requirement allocation process.
- Completion of initial design phase.
- Completion of final design phase.
- Completion of prototype testing.
- Completion of initial manufacturing phase.
3.3.4 Application Guidelines. Design reviews can be conducted at almost any point
within the design process to assess the design progress. If concurrent engineering
techniques are used, the reviews can become part of an on-going day-by-day assessment
process. Typical milestone points and some key characteristics are presented in Figure
4.
Figure 4. Potential Stages for Design Reviews
Formal Reviews. A formal review of product design concepts and design
documentation for hardware and software can be an important event in most product
development programs. If standard procedures are not explicitly stated by the customer or dictated by internal policy, the approach outlined in Figure 5 should be considered
for each design assessment:
Figure 5. Approach to Formal Design Reviews
To perform continued assessment of the product, a review team composed of actual
designers and independent evaluators needs to be assembled. Table 6 defines potential
review team participants and their responsibilities. In an ideal review situation, the
team would consist of the actual designer and an independent evaluator for each of the
engineering functions.
Table 6. Design Review Membership
| Member | Responsibilities |
|---|---|
| Product Engineer | Conduct the meeting, issue reports, assign problems, and close the loop. Substantiate design decisions, capabilities, tests, costs and schedules. |
| Electrical Engineer | Confirm the electrical capabilities and limitations of the design, such as overstress, operating restrictions, etc. |
| Mechanical Engineer | Evaluate design in terms of packaging, environment, handling, strength of material, etc. |
| Software Engineer | Ensure hardware-to-software operational compatibility; evaluate interfaces. |
| Manufacturing Engineer | Evaluate design in terms of manufacturing limitations, cost and schedule. |
| Quality Engineer | Substantiate the quality methods employed and implemented. |
| Reliability Engineer | Evaluate design for capability versus the customer need. |
| Human Factor Engineer | Identify man-machine interface capability and limitations. |
| Customer Representative | Request investigations, challenge the design, determine acceptability of design. |
Informal Reviews. These reviews are generally conducted to help the product designer assess the degree of maturity in the design process. Reliability verification of stresses, component failure rates, fault tolerant operation and modeling are provided for the purpose of evaluating and guiding the designer in specific areas of product reliability. They are usually conducted during the Concept/Planning phase, or very early in the Design/Development phase, when major product design changes may be considered.
From a reliability perspective, the assessment review should accomplish at least the
following:
- Detect conditions that degrade reliability.
- Provide assurance of meeting the customer's reliability needs.
- Ensure use of preferred components.
- Ensure design safety margins are adhered to.
- Ensure quality management is integrated into the process.
- Verify that stress analyses of components have been performed where needed.
- Confirm that fault tolerance or fail soft designs have been used for critical applications.
- Evaluate critical items and control procedures.
Examples of Design Review Checklists. Design review checklists include specific
questions that should be considered when a product is scheduled for review. Typical
checklists for a reliability review during the product Concept/Planning and
Design/Development phases are provided as examples in Tables 7 and 8.
Table 7. Concept/Planning Phase-Reliability Review Checklist
| Questions | Remarks |
|---|---|
| Product design concept meets minimum customer reliability expectations? | Reliability modeling, fault tolerance, component selection should be examined. |
| Safety margins are sufficient for operation? | Standard criteria for safety, fault tolerance, strength of materials should be reviewed. |
| Numerical reliability estimates meet allocated needs? | Cooling, quality, redundancy, parts count reduction and lower stress levels should be considered. |
| Product can operate in the expected environment? | Cooling, vibration, shock, packaging, components and stress are all examined. |
| Stress derating strategy for components is defined? | Derating criteria should be documented. |
| Critical components are identified? | Define, examine, analyze, and test components for criticality. |
| Limited life items are identified? | Inspection, handling, testing, and replacement techniques should be considered. |
| Test or operational data is available to ascertain product performance? | Evaluation technique, failure trends, operating environment should be examined. |
| Trade-off studies have been performed? | Includes reliability performance, better parts, cooling, power, speed, complexity and others. |
Table 8. Design/Development Phase-Reliability Review Checklist
| Questions | Remarks |
|---|---|
| Reliability design goals/objectives at each level achieved? | Allocations, models, predictions and tests are evaluated. |
| Performance indicators are included in the design? | Fault flags, software testing, built-in-test parameters need to be estimated. |
| Critical parts are identified? | Spares, maintenance, operating procedures need to be assessed. |
| Preferred parts and components selected? | Known capabilities and quality levels are needed. |
| Safety margins are sufficient for each component and subassembly? | Allocation of standard criteria is performed. |
| Derating of component stress is implemented? | Standard design levels for better performance considered. |
| Fault tolerance included in product design? | Fail soft conditions need to be evaluated. |
| Early failure and wearout problems identified? | Limit conditions, testing, and inspection criteria are defined. |
| Environmental conditions match the component profiles? | Extra cooling, stress reduction or better components are evaluated. |
| Failure modes for components are identified? | Failure mode analysis, test and historical data evaluated. |
| Single failure points and their impact on the product have been identified? | Failure mode criticality analysis needed; identifies areas for redundancy. |
| Software reliability impact has been assessed? | Code failures, design flaws, specification errors accounted for. |
| Adequate corrosion protection? | Environment and protection need to be evaluated. |
| Protection devices are included? | Fuses, circuit breakers, sprinklers need to be considered. |
3.4 Supplier Control
3.4.1 Purpose. The purpose of supplier control is to provide the producer with appropriate surveillance and management information to ensure that supplied items meet the requirements necessary for the overall product to meet its performance and reliability objectives. The control procedures should assess the status of part selection, quality of assembly, robustness of design, corrective and preventive reliability improvement activities and ability to meet needs of the customer.
3.4.2 Benefits. The benefits of a tightly controlled supplier program include:
- Improved product performance
- Reduced corrective and preventive actions
- Fewer interface design problems
- Shortened development and test times
3.4.3 Timing. A supplier control program is an on-going process that starts with the
selection of the vendors and continues through the manufacture of the product. Actions
like failure reporting would continue after hardware is built, but product improvement
could start as the design evolves.
3.4.4 Application Guidelines. Supplier control can be implemented at many different levels, from no controls to requirements for detailed records. For the "no control" situation, the product builder assumes most of the risk unless specific warranty requirements are agreed upon. Other control levels may include test or inspection of each item at the receiving dock. For more detailed control, past performance data, supplier quality procedures and failure corrective action documentation can be requested as part of a contract. Table 9 illustrates the basic tasks that should be considered for different supplier products.
Table 9. Product Types and Tasks Recommended
| Tasks/Product | Off-the-Shelf | New Development | High Volume | Critical Items |
|---|---|---|---|---|
| Warranty Contract | X | | X | |
| Failure Reporting | | X | | X |
| Corrective Actions | | X | | X |
| Product Improvements | | X | X | X |
| Process Improvements | | | X | X |
| Historical Data | X | | | X |
| Testing | | X (100%) | X (Sample) | X (100%) |
| Inspections | | X (Sample) | X (Sample) | X (100%) |
Warranty Contract - The term "warranty" is a promise or affirmation expressed or implied by a supplier regarding the nature, usefulness, or condition of the item, or the performance of services. The contract should identify the item, acceptable use conditions, and warrantor's liabilities. Four basic principles apply to most warranties:
- They are not free
- Items will still fail
- They do not ensure the quality of performance
- They define the minimum level of quality or performance
Commercial manufacturers include the cost of a warranty in the selling price of the
warranted item prior to sale, whereas government products have separate price factors for
warranties.
Failure Reporting - Failure reporting is the collection of all types of data, including
manufacturing tests, acceptance tests, burn-in tests, quality tests and warranty returns. The main objective of this task is to determine or define failure trends and problem
areas.
Corrective Actions - This task begins with the collection of failure data, then proceeds
with detailed failure analysis to determine the root failure causes. Based on failure
analysis and root cause determination, corrective action to prevent the failure
from recurring can be determined. The process concludes with the incorporation of
the corrective action into the failed items and testing to validate its effectiveness. A
failure corrective action flow and evaluation checklist is provided as Table 10.
Product Improvements - Many design and manufacturing procedures and methods can
improve the product reliability. These can include:
- Simplification of the product (reduce parts count)
- Stress reduction or strength enhancement
- Higher quality components
- Redundant or alternate paths of success
- Testing to eliminate defects
Process Improvements - Process improvements include factors such as automating manufacturing steps; reducing the number of processes; using statistical design of experiments; using quality councils or process action teams; or benchmarking the organization's performance against recognized leaders in the field.
Another approach is to apply statistical process control with the intention of limiting the variation in the process output. Identically manufactured parts will always vary in size, strength, defects and other factors. If the variation is too great, customers may not be satisfied. Representative control charts that can be used to assess process variation are shown in Table 11.
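As an illustration of how control limits bound process variation, the following is a minimal sketch of X-bar and R chart limits. It assumes subgroups of exactly five measurements and uses the standard Shewhart constants for that subgroup size; the measurement values and the function name `xbar_r_limits` are hypothetical and are not taken from Table 11.

```python
from statistics import mean

# Standard Shewhart control chart constants for subgroups of size 5.
A2, D3, D4 = 0.577, 0.0, 2.114

def xbar_r_limits(subgroups):
    """Return (center, lower, upper) limits for the X-bar chart and the R chart."""
    xbars = [mean(s) for s in subgroups]
    ranges = [max(s) - min(s) for s in subgroups]
    grand_mean = mean(xbars)
    r_bar = mean(ranges)
    xbar_limits = (grand_mean, grand_mean - A2 * r_bar, grand_mean + A2 * r_bar)
    r_limits = (r_bar, D3 * r_bar, D4 * r_bar)
    return xbar_limits, r_limits

# Hypothetical measurements (for example, a part dimension), five parts per subgroup.
subgroups = [
    [10.1, 9.9, 10.0, 10.2, 9.8],
    [10.0, 10.1, 9.9, 10.0, 10.0],
    [9.9, 10.2, 10.1, 9.8, 10.0],
]

xbar_limits, r_limits = xbar_r_limits(subgroups)

# Subgroups whose mean falls outside the X-bar limits would be flagged for review.
out_of_control = [i for i, s in enumerate(subgroups)
                  if not xbar_limits[1] <= mean(s) <= xbar_limits[2]]
print("X-bar limits:", xbar_limits)
print("R limits:", r_limits)
print("Out-of-control subgroups:", out_of_control)
```

A supplier's process data would replace the hypothetical subgroups; other subgroup sizes require different chart constants.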
Historical Data - The use of historical data to determine how parts have performed in the past is very important in the supplier selection and control process. These data represent the achieved reliability capabilities in actual operation. Determining the accuracy of the data is the main control problem. When reliability numbers are obtained from the supplier, the following supplemental information needs to be included, to authenticate the results:
- Operating environment parameters
- Power cycling characteristics
- Modes of operation
- Number of units comprising the database
- Number of test or operating hours
- Number of failures and/or interruptions to the part operation
- If and why failures are excluded
- Copies of test logs or reports (desirable)
Evaluation of the data should involve the determination of best case and worst case results. The best case analysis uses only those failures that are inherent in the design or manufacturing process, whereas the worst case analysis considers everything. The evaluation should also look at possible failure trends, that is, two or more similar events. If failure trends are identified, the manufacturer can either select another vendor or ask whether corrective action has been performed and verified.
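The best case and worst case evaluation can be sketched as two failure rate point estimates computed from the same supplier record. All numbers below are hypothetical and serve only to show the arithmetic; the split between inherent and excluded failures is assumed to come from the supplier's supplemental information.

```python
def failure_rate(failures, unit_hours):
    """Point estimate of failure rate in failures per hour."""
    return failures / unit_hours

# Hypothetical supplier record (illustrative values only).
units = 50                 # number of units comprising the database
hours_per_unit = 2_000     # operating hours accumulated per unit
inherent_failures = 3      # failures attributed to design or manufacturing
excluded_failures = 2      # failures the supplier excluded, with stated reasons

unit_hours = units * hours_per_unit

# Best case counts only inherent failures; worst case counts everything.
best_case = failure_rate(inherent_failures, unit_hours)
worst_case = failure_rate(inherent_failures + excluded_failures, unit_hours)

print(f"best case:  {best_case:.2e} failures/hour")
print(f"worst case: {worst_case:.2e} failures/hour")
```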
Table 10. Flow and Evaluation Checklist for Corrective Action (excerpt)
Table 11. Product and Process Control Charts (excerpt)
Testing - A test program should be considered when critical components with unknown
reliability history are being procured or safety concerns are prevalent. A growth test
that improves reliability over the long term is usually the most effective.
Inspection - Inspections are quality control procedures that the vendor implements internally or that the buyer may impose prior to accepting product delivery. Inspections are usually performed after the product has been manufactured, so they can only be used as a defect identification and removal process.
3.5 Design of Experiments (DOE)
3.5.1 Purpose. Experimental designs consist of a series of specific changes to the input parameters of a process or product in order to assess the corresponding change to the output. By applying design of experiments (DOE), the individual effects of a complex system of multiple factors can be studied simultaneously, thereby avoiding inefficient testing of one factor at a time. This approach is a scientific methodology that allows the manufacturer to better understand the product or process and how multiple inputs may affect its performance.
3.5.2 Benefits. Experimental design, when performed correctly, can result in the following product or process improvements:
- Improved performance
- Selection of less costly materials
- Reduced production costs
- Control of critical factors
- Shortened development time
- Reduced test time
- Relaxed design/process tolerances
- Higher levels of reliability
3.5.3 Timing. Design of experiments can be performed to influence product design at
any time from Concept/Planning through Production/Manufacturing. The techniques
can be applied to product design, process design, test, and production evaluation. As an
assessment tool, it can be used when the process physical properties are known and
parameter interactions are understood.
3.5.4 Application Guidelines. Because there are numerous competing design of
experiment strategies, including full factorial, fractional factorial, Plackett-Burman,
Box-Behnken and Taguchi orthogonal arrays, a detailed list of references is included in
Section Four to aid in proper technique selection. Each of the methods has its own
strengths and weaknesses that need to be considered based on the application. For the
purposes of this Blueprint, a general process for an orthogonal array will be discussed.
(1) The process starts by selecting the factors to be tested. This requires the development of a "short list" of significant factors, often determined through a team effort by "brainstorming" ideas. (2) After selecting the short list, controlling and non-controlling factors along with test settings need to be developed. Usually a high and a low setting is determined for each factor and they are coded "+" and "-". More than two settings could be necessary if the distribution of the factors is like the data in Figure 6, which required five settings. Even for a two setting factor, the range between the high and low factor values should be chosen carefully. (3) The next step is to set up an orthogonal array that permits the separation of effects. Table 12 shows a typical two factor array with two settings, along with the analysis equations to determine the average and expected outputs. The variables y1 through y4 are the test measurements based on the factor settings. For example, y1 represents a test utilizing a low setting for factor A, a low setting for B and a high setting for the AB factor interaction. Each of these tests is performed at least once, with repeats if uncontrolled conditions change. It should also be noted that test results can be biased by a factor or factors not tested. As a result, a confirmation test should be performed to verify or disprove the calculated optimum solution.
Figure 6. Selecting Test Settings
Table 12. Orthogonal Array
| Run | Factor A (Test Setting) | Factor B (Test Setting) | Interaction (By-Product) A*B | Results (Measured) |
|---|---|---|---|---|
| 1 | - | - | + | y1 |
| 2 | + | - | - | y2 |
| 3 | - | + | - | y3 |
| 4 | + | + | + | y4 |
| AVG- | (y1 + y3) / 2 | (y1 + y2) / 2 | (y2 + y3) / 2 | |
| AVG+ | (y2 + y4) / 2 | (y3 + y4) / 2 | (y1 + y4) / 2 | $\bar{y}$ = (y1 + y2 + y3 + y4) / 4 |
| Δ | (AVG+) - (AVG-) for each column | | | |

$\hat{y} = \bar{y} + (\Delta A / 2)A + (\Delta B / 2)B + (\Delta(A * B) / 2)(A * B)$
where
$\hat{y}$ = expected output
$\bar{y}$ = average output
ΔA = (AVG+) - (AVG-) values from column A in matrix
A = coded value of A (high setting = +1, low setting = -1)
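The Table 12 analysis equations can be sketched in a few lines of code. The measurements y1 through y4 below are hypothetical, and the function names `analyze` and `predict` are illustrative only. Because this two factor model has as many terms as runs, the prediction reproduces each measured value, which is a useful check on the sign conventions.

```python
# Coded settings (A, B) for runs 1 to 4 of Table 12; the A*B column is their product.
runs = [(-1, -1), (+1, -1), (-1, +1), (+1, +1)]

def analyze(y):
    """Return the average output and the delta (AVG+ minus AVG-) for A, B and A*B."""
    columns = {
        "A": [a for a, b in runs],
        "B": [b for a, b in runs],
        "AB": [a * b for a, b in runs],
    }
    y_bar = sum(y) / len(y)
    deltas = {}
    for name, signs in columns.items():
        avg_plus = sum(v for s, v in zip(signs, y) if s > 0) / 2
        avg_minus = sum(v for s, v in zip(signs, y) if s < 0) / 2
        deltas[name] = avg_plus - avg_minus
    return y_bar, deltas

def predict(y_bar, deltas, a, b):
    """Expected output for coded settings a and b (each +1 or -1)."""
    return (y_bar + deltas["A"] / 2 * a + deltas["B"] / 2 * b
            + deltas["AB"] / 2 * a * b)

y = [20.0, 30.0, 25.0, 45.0]  # hypothetical measurements y1..y4
y_bar, deltas = analyze(y)
print("average output:", y_bar)
print("deltas:", deltas)
print("predicted at A=+1, B=+1:", predict(y_bar, deltas, +1, +1))
```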
Example of a Fractional Factorial Design
An integrated circuit manufacturer had determined that a weak bond between a die and
an insulated substrate has resulted in many field failures. A designed experiment was
conducted to maximize bonding strength.
Step 1 - Determine Factors: It is not always obvious which factors are important. A
good way to select factors for a DOE is through organized "brainstorming". For our
example, a brainstorming session was conducted which identified four factors believed
to affect bonding strength: (1) epoxy type, (2) substrate material, (3) bake time, and (4)
substrate thickness.
Step 2 - Select Test Settings: Often, as with this example, only two test settings
("high" and "low") for each factor are identified. This is referred to as a two-level
experiment. (Design of Experiments techniques can be used for more than two-level
experiments.) The four factors and their associated high and low settings for the
example are shown in Table 13. The selection of high and low settings is arbitrary (e.g.
gold eutectic could be "+" and silver could be "-").
Table 13. Factors and Settings
| Factor | Low (-) | High (+) |
|---|---|---|
| A. Filled Epoxy Type | Gold | Silver |
| B. Substrate Material | Alumina | Beryllium Oxide |
| C. Bake Time (at 90°C) | 90 Min | 120 Min |
| D. Substrate Thickness | 0.025 in | 0.05 in |
The steps involved in performing an analysis of variance for this example are:
- 2A. Calculate Sum of Squares: The test data from Table 14 is used to calculate the sum of squares. For this particular experimental design, the sum of squares for the main factors and interactions are easily calculated. The calculation for factor A (filled epoxy type) is illustrated below.
Table 14. Interactions, Aliasing Patterns and Average "+" and "-" Values
| Treatment Combination | A or BCD | B or ACD | AB or CD | C or ABD | AC or BD | BC or AD | D or ABC | Bonding Strength* y |
|---|---|---|---|---|---|---|---|---|
| 1 | - | - | + | - | + | + | - | 73 |
| 2 | - | - | + | + | - | - | + | 88 |
| 3 | - | + | - | - | + | - | + | 81 |
| 4 | - | + | - | + | - | + | - | 77 |
| 5 | + | - | - | - | - | + | + | 83 |
| 6 | + | - | - | + | + | - | - | 81 |
| 7 | + | + | + | - | - | - | - | 74 |
| 8 | + | + | + | + | + | + | + | 90 |
| Avg (+) | 82 | 80.5 | 81.25 | 84 | 81.25 | 80.75 | 85.5 | |
| Avg (-) | 79.75 | 81.25 | 80.5 | 77.75 | 80.5 | 81 | 76.25 | |
| Δ = Avg (+) - Avg (-) | 2.25 | -0.75 | 0.75 | 6.25 | 0.75 | -0.25 | 9.25 | |
*The mean bonding strength calculated from this column is 80.875.
Sum of Sq. (Factor A) = [# of treatment combinations / 4] $[\text{Avg}(+) - \text{Avg}(-)]^2$
Sum of Sq. (Factor A) = $(8/4)(2.25)^2 = 10.125$ (see Table 16)
- 2B. Calculate Error: The sum of squares for the error in this example is set equal to the sum of the sum of squares values for the three two-way interactions (i.e., AB or CD, AC or BD, BC or AD). This is known as pooling the error. This error is calculated as follows: Perform a sum of squares analysis for the interactions AB or CD, AC or BD and BC or AD. The pooled error is determined by summing the results. Error = 1.125 + 1.125 + 0.125 = 2.375.
- 2C. Determine Degrees of Freedom: Degrees of freedom, denoted df, is the number of levels of each factor minus one. Degrees of freedom is always 1 for factors and interactions for a two level experiment. As shown in this simplified example, degrees of freedom for the error (dferr) is equal to 2 since there are 3 interaction degrees of freedom.
- 2D. Calculate Mean Square: Mean square equals the sum of squares divided by the associated degrees of freedom. Mean square for a two level, single replicate experiment is always equal to the sum of squares for all factors. Mean square for the error is equal to the sum of squares error term divided by 2 (2 is the df of the error).
- 2E. Perform F-Ratio Test for Significance: To determine the F ratio, divide the mean square of the factor by the mean square error. This resulting quotient is distributed according to the F distribution, and is compared to the value defining the critical region. F (α, dfF, dferr) represents the critical value of the distribution and can be found tabulated in most statistics books. If the F ratio is greater than the critical value (larger than could be expected by chance), then the null hypothesis (that the factors studied had no effect on the response) is rejected, and the factor is assumed to have a significant effect on the response variable. Alpha (α) represents the risk of concluding that a factor has an effect on the response when it actually does not. For this example, assuming a 10% risk, the critical value is F (.1,1,2) = 8.53.
As a word of caution, the above formulations are not intended for use in a cookbook
fashion. Proper methods for computing sum of squares, mean square, degrees of
freedom, etc., depend on the type of experiment being run and can be found in
appropriate design of experiments reference books.
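Steps 2A through 2E can be traced numerically with a short sketch built on the Table 14 data. It is specific to this two level, single replicate example and is not a general analysis of variance routine. The error degrees of freedom value is taken as stated in this example and held in the variable `DF_ERROR`, a name introduced here for illustration. Because the sketch carries the unrounded mean square error (1.1875), its F ratios can differ from tabulated values in the second decimal place.

```python
# Factor columns A, B, C for the eight treatment combinations in Table 14.
base = [(-1, -1, -1), (-1, -1, 1), (-1, 1, -1), (-1, 1, 1),
        (1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
y = [73, 88, 81, 77, 83, 81, 74, 90]  # bonding strength from Table 14

columns = {
    "A": [a for a, b, c in base],
    "B": [b for a, b, c in base],
    "C": [c for a, b, c in base],
    "D": [a * b * c for a, b, c in base],  # D shares its column with ABC
    "AB": [a * b for a, b, c in base],
    "AC": [a * c for a, b, c in base],
    "BC": [b * c for a, b, c in base],
}

def delta(signs):
    """Avg(+) minus Avg(-) for one column of the array."""
    plus = [v for s, v in zip(signs, y) if s > 0]
    minus = [v for s, v in zip(signs, y) if s < 0]
    return sum(plus) / len(plus) - sum(minus) / len(minus)

# 2A: sum of squares = (number of treatment combinations / 4) * delta^2
sum_sq = {name: (len(y) / 4) * delta(signs) ** 2
          for name, signs in columns.items()}

# 2B: pool the three two-way interaction columns into the error term
ss_error = sum_sq["AB"] + sum_sq["AC"] + sum_sq["BC"]

# 2C and 2D: error degrees of freedom as used in this example
DF_ERROR = 2
ms_error = ss_error / DF_ERROR

# 2E: F ratio for each main factor (each has 1 degree of freedom)
f_ratio = {name: sum_sq[name] / ms_error for name in "ABCD"}

for name in "ABCD":
    print(f"{name}: SS = {sum_sq[name]:.3f}, F = {f_ratio[name]:.2f}")
print(f"pooled error SS = {ss_error:.3f}, MS error = {ms_error:.4f}")
```

Each F ratio would then be compared with the tabulated critical value for the chosen risk level.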
Step 3 - Set Up An Appropriate Design Matrix: Investigating all possible
combinations of four factors, each at two levels, would require 16 (i.e., $2^4$)
experimental tests. This type of experiment is referred to as a full factorial. However,
in this example a half replicate fractional factorial with eight tests was used. This
decision was made to conserve time and resources.
The resulting design matrix is shown in Table 15. The order of the test runs is randomized to minimize the possibility of outside effects contaminating the data. For example, if the tests were conducted over several days while the temperature changed slightly, randomizing the various test trials would minimize the effects of room temperature on the experimental results. The matrix is orthogonal, which means that it has the correct balancing properties necessary for each factor's effect to be studied statistically independently of the others. Procedures for setting up orthogonal matrices can be found in any of the references cited.
Table 15. Orthogonal Design Matrix With Test Results
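| Treatment Combination | Random Trial Run Order | A | B | C | D | Bonding Strength (psi) y |
|---|---|---|---|---|---|---|
| 1 | 6 | - | - | - | - | 73 |
| 2 | 5 | - | - | + | + | 88 |
| 3 | 3 | - | + | - | + | 81 |
| 4 | 8 | - | + | + | - | 77 |
| 5 | 4 | + | - | - | + | 83 |
| 6 | 2 | + | - | + | - | 81 |
| 7 | 7 | + | + | - | - | 74 |
| 8 | 1 | + | + | + | + | 90 |

Mean $\bar{y} = \sum y_i / 8 = 647 / 8 = 80.875$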
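A half replicate design matrix like the one in Table 15 can be generated and checked for orthogonality with a short sketch. It assumes the generator D = A*B*C, which matches the sign pattern of the D column in Table 15; the function names and the fixed random seed are illustrative choices, so the shuffled run order will not match the order printed in the table.

```python
import itertools
import random

def half_fraction():
    """Eight runs of a two level, four factor half fraction with D = A*B*C."""
    return [(a, b, c, a * b * c)
            for a, b, c in itertools.product((-1, 1), repeat=3)]

def is_orthogonal(design):
    """True if every column is balanced and every pair of columns is uncorrelated."""
    cols = list(zip(*design))
    balanced = all(sum(col) == 0 for col in cols)
    uncorrelated = all(sum(x * z for x, z in zip(c1, c2)) == 0
                       for c1, c2 in itertools.combinations(cols, 2))
    return balanced and uncorrelated

design = half_fraction()

# Randomize the order in which the eight treatment combinations are run.
# The seed is fixed only so that this sketch is repeatable.
run_order = list(range(1, len(design) + 1))
random.Random(0).shuffle(run_order)

for combo, (order, settings) in enumerate(zip(run_order, design), start=1):
    signs = " ".join("+" if s > 0 else "-" for s in settings)
    print(f"treatment {combo}: run order {order}, A B C D = {signs}")
print("orthogonal:", is_orthogonal(design))
```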
Step 4 - Run The Tests: The eight test combinations are run randomly as defined by
the second column in the table. The run order is determined by a random number table
or any other type of random number generator. Resultant bonding strengths from the
testing are shown in Table 15.
Step 5 - Analyze The Results: This step involves performing statistical analysis to determine which factors and interactions have a significant effect on the bond strength. Table 14, shown previously, lists the interactions, aliasing patterns and average "+" and "-" values for this experiment. Not every interaction can be studied separately as a result of running only a fractional replicate. This loss of analysis capability is defined by the aliasing patterns in Table 14, and is considered the penalty for not performing a full factorial experiment (i.e., checking every possible combination of the factors). Aliases are defined as two or more effects that share the same numerical value. For example, the effect on the bond strength caused by "A or BCD" (column 2) cannot be attributed specifically to factor A or to the interaction of BCD. The assumption is usually made that the effects of higher order interactions such as BCD are negligible and the impact on the response variable was a result of the main factor. Aliasing patterns are unique to each experiment and must be evaluated for reasonableness. These procedures are described in many Design of Experiments textbooks. An analysis of variance is then performed to determine which factors had a significant effect on bonding strength. The results are shown in Table 16.
Table 16. Results of Analysis of Variance
| Source | Sum of Squares | Degrees of Freedom | Mean Square | F Ratio* | Significant Effect |
|---|---|---|---|---|---|
| Epoxy Type (A) | 10.125 | 1 | 10.125 | 8.52 | Yes |
| Substrate Material (B) | 1.125 | 1 | 1.125 | 0.95 | No |
| Bake Time (C) | 78.125 | 1 | 78.125 | 65.76 | Yes |
| Substrate Thickness (D) | 171.125 | 1 | 171.125 | 144.04 | Yes |
| A x B or C x D | 1.125 | 1 | -- | -- | -- |
| A x C or B x D | 1.125 | 1 | -- | -- | -- |
| B x C or A x D | 0.125 | 1 | -- | -- | -- |
| Error | 2.375 | 2 | 1.188 | -- | |
*Example Calculation: F = Mean Square / Mean Square Error = 10.125 / 1.188 = 8.52
Step 6 - Calculate Optimum Settings: From the analysis of variance, the factors A, C, and D were found to have the largest effect on the bond strength. In order to maximize the bonding strength response, the optimum settings can be determined by inspecting the following prediction equation:
$\hat{y} = \bar{y}$ (mean bonding strength) $+ (\Delta A / 2)A + (\Delta C / 2)C + (\Delta D / 2)D$
$\hat{y} = 80.875 + 1.125A + 3.125C + 4.625D$
Since A, C, and D are the only significant factors, they are the only ones found in the prediction equation. Further, because they all have positive coefficients they must be set at high to maximize bonding strength. Factor B, substrate material, did not significantly affect bonding strength, therefore the choice of material should be based on cost. An economic analysis should always be performed to ensure that all decisions resulting from designed experiments are cost-effective.
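As a quick check on the prediction equation, the sketch below evaluates it at the all-high and all-low coded settings. The coefficients are the half deltas for A, C and D from Table 14; the names `Y_BAR`, `HALF_EFFECTS` and `predict` are introduced here for illustration only.

```python
Y_BAR = 80.875                                        # mean bonding strength
HALF_EFFECTS = {"A": 1.125, "C": 3.125, "D": 4.625}   # delta / 2 for each factor

def predict(settings):
    """Expected bonding strength for coded settings (+1 = high, -1 = low)."""
    return Y_BAR + sum(HALF_EFFECTS[f] * settings[f] for f in HALF_EFFECTS)

best = predict({"A": +1, "C": +1, "D": +1})
worst = predict({"A": -1, "C": -1, "D": -1})
print(f"predicted strength, all significant factors high: {best}")
print(f"predicted strength, all significant factors low:  {worst}")
```

The all-high prediction of 89.75 is close to the 90 measured for treatment combination 8, where every factor was at its high setting.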
Step 7 - Perform Confirmation Test Run: Since there may be important factors not considered, the optimum settings must be verified by test. If a confirmation test supports the DOE results, the job is done. If not, new tests must be planned.
3.6 Dormancy Analysis
3.6.1 Purpose. The purpose of performing a dormancy analysis is to assess the effects of environmental storage parameters on product characteristics such as performance and lubrication. A well defined analysis will isolate problem areas that should be candidates for design or process change.
3.6.2 Benefits. Applying analysis early to a product that will experience extended nonoperating conditions will result in lower product life cycle costs, higher product reliability, reduction of experienced failure mechanisms and ultimate customer satisfaction. For those times when long storage periods are expected, a dormancy analysis can determine if periodic testing is necessary to ensure proper operation.
3.6.3 Timing. Historically, all storage and dormancy analyses resulted from the experience of taking a product off the shelf, attempting to operate it and finding that it had failed. Subsequent crash fix-it programs usually resulted in the correction of the immediate problem, after the fact, at high cost. The best time to assess nonoperating conditions is during the Design/Development phase, where parts and protective measures can be assessed. Planning for potential dormancy situations should be initiated during the Concept/Planning phase of the product, when its end use environment is beginning to be characterized.
3.6.4 Application Guidelines. When assessing a product for dormant or storage conditions, two levels of analysis or evaluation should be considered. The first level is the estimation of the product reliability under the given conditions. The second level is a physical evaluation of component characteristics for corrosion, material creep susceptibility and lubrication requirements.
Estimation of product reliability can be accomplished in several ways, including use of historical nonoperating data, part prediction using nonoperating failure rates, or by using conversion factors from operating part failure rates. The results of the analysis can be applied to the life cycle cost and other performance models. If the reliability estimate doesn't meet expected needs, then protective measures, such as containers, heating or cooling fixtures, humidity control, etc., need to be considered to improve the product robustness.
- Historical data. The process of determining product reliability from historical data involves the conversion of nonoperating times and the number of failures for each component into the product failure rate model. For example, if three components each had 100,000 hours of nonoperating time, component A and B had one failure each and component C had two failures, the resulting product reliability failure rate estimate would be:
product (failure rate) = A (failure rate) + B (failure rate) + C (failure rate)
= 1/100,000 + 1/100,000 + 2/100,000
= .00001 + .00001 + .00002
= .00004 failures/hour
- Part prediction. The types and quantities of parts need to be listed, then failure rates for each part determined from a data source. For example, RADC-TR-85-91 "Impact of Nonoperating Periods on Equipment Reliability" has nonoperating part failure rate algorithms. The individual part types, quantities and failure rates are multiplied and summed to arrive at a product value.
- Conversion factors. This method requires the establishment of a list of part types, quantities and the use of a conversion factor. The part type operating failure rate is determined from an acceptable source, then a conversion of operating to nonoperating failure rates is applied. The conversion factor could be as simple as a ten to one reduction, or it could be from a table of factors such as those shown in Table 17 obtained from the "Reliability Toolkit: Commercial Practices Edition".
Table 17. Dormant Conversion Factors (Multiply Operating Failure Rate By)
| Part Types | Ground Active To Ground Passive | Airborne Active To Airborne Passive | Airborne Active To Ground Passive | Naval Active To Naval Passive | Naval Active To Ground Passive | Space Active To Space Passive | Space Active To Ground Passive |
|---|---|---|---|---|---|---|---|
| Integrated Circuits | .08 | .06 | .04 | .06 | .05 | .10 | .30 |
| Diodes | .04 | .05 | .01 | .04 | .03 | .20 | .80 |
| Transistors | .05 | .06 | .02 | .05 | .03 | .20 | 1.00 |
| Capacitors | .10 | .10 | .03 | .10 | .04 | .20 | .40 |
| Resistors | .20 | .06 | .03 | .10 | .06 | .50 | 1.00 |
| Switches | .40 | .20 | .10 | .40 | .20 | .80 | 1.00 |
| Relays | .20 | .20 | .04 | .30 | .08 | .40 | .90 |
| Connectors | .005 | .005 | .003 | .008 | .003 | .02 | .03 |
| Circuit Boards | .04 | .02 | .01 | .03 | .01 | .08 | .20 |
| Transformers | .20 | .20 | .20 | .30 | .30 | .50 | 1.00 |
Example of a Satellite Receiver Conversion. To convert the reliability of an
operating satellite receiver to a nonoperating condition, determine the number of parts
by type and quantity, then multiply each by the respective operating failure rates
obtained from handbooks or experience data. The total operating failure rate for each
type is then converted using the conversion factors from Table 17. The dormant or
nonoperating estimate of reliability for the satellite receiver is determined in Table 18.
Table 18. Example of a Satellite Receiver Operating to Nonoperating Conversion
| Part Type | Quantity | Operating Failure Rate (per $10^6$ hours) | Failure Rate X Quantity (per $10^6$ hours) | Conversion Factor | Nonoperating Failure Rate (per $10^6$ hours) |
|---|---|---|---|---|---|
| Integrated Circuit | 100 | 0.06 | 6.00 | 0.1 | 0.60 |
| Diode | 100 | 0.001 | 0.10 | 0.2 | 0.02 |
| Transistor | 100 | 0.003 | 0.30 | 0.2 | 0.06 |
| Resistor | 100 | 0.002 | 0.20 | 0.5 | 0.10 |
| Capacitor | 100 | 0.001 | 0.10 | 0.2 | 0.02 |
| Switch | 10 | 0.05 | 0.50 | 0.8 | 0.40 |
| Transformer | 10 | 0.03 | 0.30 | 0.5 | 0.15 |
| Connectors | 10 | 0.08 | 0.80 | 0.02 | 0.02 |
| Circuit Board | 5 | 0.50 | 2.50 | 0.08 | 0.20 |
| Total Failure Rate (per $10^6$ hours) | | | 10.80 | | 1.57 |
| Mean-time-between-failure (hours) (1/failure rate) | | | 92,592 | | 636,942 |
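The Table 18 arithmetic can be reproduced with a short sketch. The part data are those listed in the table, and the conversion factors correspond to the "Space Active To Space Passive" column of Table 17. Each nonoperating row is rounded to two decimals, as the table does, so the totals agree with the printed values; that rounding step is an assumption about how the table was prepared.

```python
# (part type, quantity, operating failure rate per 1e6 hours, conversion factor)
parts = [
    ("Integrated Circuit", 100, 0.06, 0.10),
    ("Diode", 100, 0.001, 0.20),
    ("Transistor", 100, 0.003, 0.20),
    ("Resistor", 100, 0.002, 0.50),
    ("Capacitor", 100, 0.001, 0.20),
    ("Switch", 10, 0.05, 0.80),
    ("Transformer", 10, 0.03, 0.50),
    ("Connectors", 10, 0.08, 0.02),
    ("Circuit Board", 5, 0.50, 0.08),
]

# Operating failure rate: quantity times per-part rate, summed over part types.
total_operating = sum(qty * rate for _, qty, rate, _ in parts)

# Nonoperating failure rate: each row converted, then rounded as in the table.
total_nonoperating = sum(round(qty * rate * factor, 2)
                         for _, qty, rate, factor in parts)

# Rates are per 1e6 hours, so MTBF in hours is 1e6 divided by the rate.
mtbf_operating = 1e6 / total_operating
mtbf_nonoperating = 1e6 / total_nonoperating

print(f"operating:    {total_operating:.2f} per 1e6 h, MTBF {int(mtbf_operating):,} h")
print(f"nonoperating: {total_nonoperating:.2f} per 1e6 h, MTBF {int(mtbf_nonoperating):,} h")
```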
3.7 Durability Analysis
3.7.1 Purpose. The primary purpose of a durability analysis is to identify components and processes that exhibit "early" wearout failure, isolate the root cause and determine potential corrective actions.
3.7.2 Benefits. The benefits of an effective durability analysis are fewer failures experienced during the useful life of the product and greater customer satisfaction with the product. For the design team, the durability analysis provides detailed analytical models that assess the physical relationships between the product application and the operating environment.
3.7.3 Timing. Durability analysis should be performed whenever component or process problems are suspected or identified. Limitations that may inhibit the assessment include a lack of knowledge regarding material characteristics, environmental stress levels, product operating parameters and product use factors. Early application in the Design/Development phase is desirable for "critical components" or known problem areas, and planning for durability analysis should be performed in the Concept/Planning phase when material characteristic issues are suspected. If the problem areas cannot be efficiently defined, "shotgun" analysis is not recommended due to high costs.
3.7.4 Application Guidelines. Durability analysis focuses on identifying and solving design problems related to early product or materials wearout. This procedure is especially important for mechanical products where the assessment is performed by evaluating life-cycle loads and stresses, product architecture, material properties, and failure mechanisms. Figure 7 illustrates the concept of reliability, measured as a failure rate, and durability, measured as a time duration.
Figure 7. Reliability vs. Durability
The basic approach to durability analysis, which is applicable to either new or old technology, is outlined in Table 19.
Table 19. Basic Approach to Durability Analysis
| Step |
Discussion |
| 1. Define the operating and nonoperating life requirements |
Length of time or number of cycles expected or needed for both operating and nonoperating periods should be determined. |
| 2. Define the life environment |
Temperature, humidity, vibration and other parameters should be determined so that the load environment can be quantified and the cycle rates determined. For example, a business computer might expect a temperature cycle once each day from 60°F to 75°F ambient. This would quantify the maximum and minimum temperatures and a rate of one cycle per day. |
| 3. Identify the material properties |
Usually this involves determining material characteristics from a published handbook. If unique materials are being considered, then special test programs will be necessary. |
| 4. Identify potential failure sites |
Failure areas are usually assumed to fall into categories of new materials, products or technologies. Considerations should include high deflection regions, high temperature cycling regions, high thermal expansion materials, corrosion sensitive items, and test failures. |
| 5. Determine if a failure will occur within the time or number of cycles expected |
A detailed stress analysis using either a closed form or finite element simulation method should be performed. Either analysis will result in a quantifiable mechanical stress for each potential failure site. |
| 6. Calculate the component or process life |
Using fatigue cycle curves from material handbooks, estimate the number of cycles to failure. The following figure shows a typical fatigue curve for stress versus cycles to failure. Specific material fatigue data can be obtained from databases maintained by the Center for Information and Numerical Data Analysis and Synthesis (see reference section).
|
Example of a Durability Analysis. Determine the average failure rate of a pinion during the first 1,500 hours of operation given a speed of 90,000 revolutions per hour. The L10 life of the pinion is 450 x $10^6$ revolutions with a Weibull slope of 3.0. L10 life is the length of time that 90% of the pinions will meet or exceed during use before they fail. Table 20 illustrates the steps involved.
Table 20. Example of a Pinion Durability Analysis
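| Step | Parameters and Calculations |
|---|---|
| 1. Identify the pinion life characteristics | L10 = 450 x $10^6$ revolutions; Weibull slope (β) = 3.0; Speed = 90,000 revolutions/hour |
| 2. Convert L10 revolutions to hours | L10 (Hours) = (L10 Revolutions) / (Revolutions/Hour) = (450 x $10^6$) / 90,000 = 5,000 |
| 3. Determine the characteristic life using the Weibull cumulative distribution function | Characteristic Life = $L_{10} / [\ln(1/0.9)]^{1/\beta} = 5{,}000 / [\ln(1/0.9)]^{1/3} = 10{,}586$ hours |
| 4. Compute the failure rate for 1,500 hours | Failure Rate = $t^{\beta - 1} / (\text{characteristic life})^{\beta} = 1{,}500^{3-1} / 10{,}586^{3} = 1.9$ failures per $10^6$ hours |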
3.8 Failure Modes, Effects and Criticality Analysis (FMECA)
3.8.1 Purpose. The purpose of any Failure Modes, Effects and Criticality Analysis (FMECA) is to examine failure modes of components, functions, or processes and determine the impact these failures have on the product. The information developed is used for elimination of problems, evaluation of design corrective actions, and design of fault detection.
3.8.2 Benefits. The systematic nature of a failure mode analysis assures that every product-level failure effect above the level under evaluation is considered. The benefits of a systematic analysis include early highlighting of potential operational problems, making functional failures less critical, eliminating cascading failures and identifying critical items requiring control. This type of design analysis leads to a more reliable product design.
3.8.3 Timing. A part level FMECA can be initiated as soon as design and configuration information at that level becomes available. Analysis at higher levels, such as the functional level, can be initiated earlier in the development cycle. Assessment should continue throughout the product development cycle so that design changes and alternate approaches can be evaluated, and their effects accounted for and rectified, as appropriate.
3.8.4 Application Guidelines. Any product or process should be considered for a FMECA, especially if the item is needed for a critical function such as control systems, safety monitors, nuclear energy or flight controls. A general process flow for a FMECA is illustrated in Figure 8.
Figure 8. FMECA Flow Diagram
The process flow indicates the need for detailed data on components, interfaces, environments, process flows and operating modes. From these data the failure modes and effects of each part can be analyzed and documented. Recommended changes can be developed based on the documented results and corrective action instigated if deemed necessary.
Functional Approach. The functional Failure Modes and Effects Analysis (FMEA) approach is the preferred technique when design definition is incomplete, as in the early stages of design when specific hardware items cannot be uniquely identified. Two basic methods are typical, the FMEA procedure and the Criticality Analysis. The information required to perform a functional FMEA includes the identification of each product function (and its associated failure modes) for each functional output. A generic worksheet for an FMEA is illustrated in Figure 9. The worksheet columns are used as follows:
- Identify the product
- List the product functions
- Define the functional failure for each function
- Determine the failure modes for each functional failure cause
- Determine the function and product effects for each failure mode
- Estimate the severity of the failure mode (typically defined as catastrophic, critical, marginal or minor)
- Determine the cause that resulted in the failure
- Evaluate and recommend corrective actions
| Product: ______________ | Analyst: ____________ | Date: ____________ |

| Function | Failure Modes | Local Effect | End Effect | Severity | Cause | Action |
|---|---|---|---|---|---|---|
|  |  |  |  |  |  |  |
Figure 9. FMEA Worksheet
A criticality analysis, when desired, adds a relative measure of the consequence of each failure mode. Criticality analyses are difficult to perform for a functional FMEA because detailed failure data are usually lacking at this level. If failure data are available, criticality numbers are developed as follows:
Failure Mode Criticality Number = (α) x (frequency) x (hours or cycles) x (β)
where,
α = the percentage of occurrence of each failure mode
frequency = the rate of occurrence
β = the best estimate of the percentage of occurrence of the effects (probability that the effect will occur)
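As a small illustration of the criticality number, the sketch below multiplies the four factors for one failure mode. The numeric values are hypothetical and are not taken from this Blueprint, and α and β are entered as fractions between 0 and 1 rather than as percentages, which is an assumption about units.

```python
def failure_mode_criticality(alpha, frequency, operating_time, beta):
    """Failure mode criticality number.

    alpha          -- fraction of the item's failures that occur in this mode (0..1)
    frequency      -- rate of occurrence (failures per hour or per cycle)
    operating_time -- hours or cycles, in the same unit as the rate
    beta           -- probability that the failure effect occurs, given the mode (0..1)
    """
    for name, value in (("alpha", alpha), ("beta", beta)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction between 0 and 1")
    return alpha * frequency * operating_time * beta

# Hypothetical inputs, for illustration only
cm = failure_mode_criticality(
    alpha=0.30,            # 30% of failures are in this mode
    frequency=2.0e-6,      # failures per hour
    operating_time=1000.0, # hours
    beta=0.50,             # effect occurs half the time the mode occurs
)
print(f"criticality number = {cm:.2e}")
```

Because the number is only a relative measure, it is most useful for ranking failure modes against each other, not as an absolute probability.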
Hardware Approach. This assessment technique requires the availability of a list of individual items or parts, design drawings, block diagrams, a description of product operating modes and other factors. The specific failure modes of each item are identified, as well as the corresponding failure effects at the next higher level of assembly and, ultimately, at the product level. This technique is especially good for analysis of modified hardware, as unique subassemblies can be analyzed without resorting to a complete product analysis. The hardware approach is a bottom-up technique.
Example of a Hardware FMEA. A security system, used 12 hours per day, has a five volt regulator as shown in Figure 10. The system has two modes of operation, the scan mode and the alert mode. The primary product objective is to sound an alarm in case of intrusion. The product-level failures can be classified as Category I, loss of alarm; Category II, false alarm; Category III, degraded operation; or Category IV, no effect.
Figure 10. Security System 5VDC Regulator
Part of the detailed analysis is shown in Figure 11. The parts and failure modes are clearly identified, and effects are determined via analysis of the schematic drawing. Severity classes are determined so that compensating features can be considered if necessary to maintain the integrity of the product.

Figure 11. Failure Mode and Effects Analysis (excerpt)
Process Approach. A process FMEA is a different method for identifying potential or known process failure modes and providing problem follow-up and corrective action guidelines. The intent of the Process FMEA is to identify and correct known or potential failure modes that can occur during the product development process, prior to the first production run, particularly as a result of the product manufacturing and assembly processes. Once failure modes and causes have been determined, each failure mode is ranked similarly to the methods described and used previously. The Process FMEA has the greatest impact in the early stages of process design, before commitment to any machines, tools or facilities. Each process variable must be identified and analyzed for its potential modes of failure and recorded in the Process FMEA. Failure modes are determined by analysis of potential process flow problems that can occur during a production run.
Using a worksheet such as the one shown in Figure 12, the probability of each failure mode occurrence is ranked on a "1" to "10" scale and listed on the form. The absolute number of failure occurrences assigned to a ranking is at the discretion of the analyst, but must be consistent throughout the analysis. The severity of each potential failure effect is also ranked on a scale of "1" to "10" and recorded on the form. This factor represents the seriousness of a failure consequence to the end user. A defect detection factor, again ranging from "1" to "10", estimates the probability of detecting a defect before a part or component leaves the manufacturing or assembly area. This factor is also recorded on the form.

Figure 12. Process FMEA Worksheet (excerpt)
A risk priority number (RPN) is calculated for each potential failure mode by multiplying its occurrence, severity and detection rankings (RPN = S x O x D). Each RPN is listed on the form. Failure modes with the highest RPNs and occurrence rankings should be given priority for corrective action and change implementation.
Example of a Roof Installation. A process risk priority analysis for part of a roof installation (only the tasks for installing roll roofing, nailing shingles and installing flashing) is illustrated in Table 21.
Table 21. Roof Installation FMEA Example
| Task Description | Error | Severity (S) | Occurrence (O) | Detection (D) | Risk Priority Number (RPN) |
|---|---|---|---|---|---|
| Install 90# roll roofing | Not installed | 10 | 2 | 10 | 200 |
|  | Gap between Aluminum & roll roofing | 10 | 6 | 10 | 600 |
|  | Rippled | 7 | 7 | 6 | 294 |
|  | Punctured | 8 | 5 | 10 | 400 |
| Nailing shingles | Nails missing | 10 | 7 | 10 | 700 |
|  | Nails bent | 9 | 2 | 10 | 180 |
|  | Nails too short | 9 | 3 | 8 | 216 |
|  | Nails loose | 10 | 6 | 7 | 420 |
|  | Nails misplaced | 10 | 9 | 10 | 900 |
|  | Nails too deep | 10 | 7 | 7 | 490 |
| Install chimney flashing | Not installed | 10 | 1 | 2 | 20 |
|  | Loose | 8 | 4 | 3 | 96 |
|  | Too short | 8 | 6 | 9 | 432 |
In this example, the highest risk priority numbers belong to "nails misplaced" (900) and "nails missing" (700). These two items should be considered candidates for a process change that could include either training or additional inspection.
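The ranking can be reproduced with a few lines of code. The sketch below, using the severity, occurrence and detection rankings from Table 21, computes each RPN and sorts the failure modes so that the highest RPN (and, for ties, the highest occurrence ranking) comes first. The class and variable names are illustrative.

```python
from typing import NamedTuple

class FailureMode(NamedTuple):
    task: str
    error: str
    severity: int    # S, ranked 1 to 10
    occurrence: int  # O, ranked 1 to 10
    detection: int   # D, ranked 1 to 10

    @property
    def rpn(self) -> int:
        """Risk priority number: S x O x D."""
        return self.severity * self.occurrence * self.detection

ROLL = "Install 90# roll roofing"
NAIL = "Nailing shingles"
FLASH = "Install chimney flashing"

# Rankings taken from Table 21
modes = [
    FailureMode(ROLL, "Not installed", 10, 2, 10),
    FailureMode(ROLL, "Gap between aluminum and roll roofing", 10, 6, 10),
    FailureMode(ROLL, "Rippled", 7, 7, 6),
    FailureMode(ROLL, "Punctured", 8, 5, 10),
    FailureMode(NAIL, "Nails missing", 10, 7, 10),
    FailureMode(NAIL, "Nails bent", 9, 2, 10),
    FailureMode(NAIL, "Nails too short", 9, 3, 8),
    FailureMode(NAIL, "Nails loose", 10, 6, 7),
    FailureMode(NAIL, "Nails misplaced", 10, 9, 10),
    FailureMode(NAIL, "Nails too deep", 10, 7, 7),
    FailureMode(FLASH, "Not installed", 10, 1, 2),
    FailureMode(FLASH, "Loose", 8, 4, 3),
    FailureMode(FLASH, "Too short", 8, 6, 9),
]

# Highest RPN first; occurrence ranking breaks ties
ranked = sorted(modes, key=lambda m: (m.rpn, m.occurrence), reverse=True)

for m in ranked[:3]:
    print(f"{m.rpn:4d}  {m.task}: {m.error}")
```

The third entry in the ranking, the gap between the aluminum and the roll roofing (RPN 600), would be the next candidate for corrective action after the two nailing errors.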
3.9 Failure Reporting, Analysis and Corrective Action System (FRACAS)
3.9.1 Purpose. A Failure Reporting, Analysis and Corrective Action System (FRACAS) accumulates failure and corrective action information to assess progress in eliminating hardware, software and process failure modes and mechanisms. It should contain the detailed data necessary to identify design or process deficiencies for correction.
3.9.2 Benefit. FRACAS analysis provides information needed for the timely identification and correction of design errors, part problems, workmanship defects and/or manufacturing and administrative process errors. Continual tracking of data in FRACAS provides an assessment as to whether previous failure trends have been eliminated through corrective action.
3.9.3 Timing. FRACAS requires a source of data before it can be implemented. Once hardware/software begin to become available, and the definition and implementation of processes has begun, a working FRACAS should be in place and failure data collected by the manufacturer from any tests and operational usage (Design/Development through Production/Manufacturing). The FRACAS should remain in use as long as the product is being supported by the manufacturer (i.e., through the Operation/Repair phases of the product). Customers may, and should, have their own FRACAS to identify operational reliability problems for correction during their use of the product.
3.9.4 Application Guidelines. A comprehensive FRACAS closed-loop diagram is shown in Figure 13. Its steps are:
1. Observation of the failure
2. Complete documentation of the failure, including all significant conditions which existed at the time of the failure
3. Failure verification, i.e., confirmation of the validity of the initial failure observation
4. Failure isolation, localization to the lowest replaceable defective item within the product
5. Replacement of the suspect defective item
6. Confirmation that the suspect item is defective
7. Failure analysis of the defective item
8. Data search to uncover other similar failure occurrences and to determine the previous history of the defective item and similar related items
9. Establishment of the root cause of the failure
10. Determination, by an interdisciplinary design team, of the necessary corrective action, especially any applicable redesign
11. Incorporation of the recommended corrective action into development equipment
12. Continuation of development tests
13. Establishment of the effectiveness of the proposed corrective action
14. Incorporation of effective corrective action into production equipment
Figure 13. Generic Closed-Loop FRACAS
The key to a successful FRACAS is its database. This is particularly important in establishing the significance of a failure. For example, the failure of a capacitor in a reliability growth test becomes more significant if the database shows similar failures during incoming inspection of the part and in any environmental tests performed. For this reason, all available sources of data should feed the FRACAS. Initial failure reports should document, as applicable:
- Location of failure
- Test being performed
- Date and time
- Part number and serial number
- Model number
- Failure symptom
- Individual who observed failure
- Circumstances of interest (e.g., occurred immediately after power outage)
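One way to picture the database side of a FRACAS is as a collection of records carrying the fields listed above, which can then be searched for similar failures. The following is a minimal sketch only: the record layout, the part numbers and the sample entries are hypothetical, and a real FRACAS would use whatever format best meets the manufacturer's needs.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class FailureReport:
    location: str
    test: str            # test being performed when the failure occurred
    date_time: str
    part_number: str
    serial_number: str
    model_number: str
    symptom: str
    observer: str
    circumstances: str = ""  # e.g. "occurred immediately after power outage"

def failure_history(reports, part_number):
    """Count reports for one part number, grouped by the test in which they occurred."""
    return Counter(r.test for r in reports if r.part_number == part_number)

# Hypothetical database contents, for illustration only
database = [
    FailureReport("Receiving", "Incoming inspection", "1996-03-02 09:10",
                  "CAP-0001", "S001", "M-10", "Short circuit", "A. Smith"),
    FailureReport("Lab 2", "Environmental test", "1996-03-20 14:30",
                  "CAP-0001", "S014", "M-10", "Short circuit", "B. Jones"),
    FailureReport("Lab 1", "Reliability growth test", "1996-04-10 08:46",
                  "CAP-0001", "S027", "M-10", "Short circuit", "A. Smith"),
    FailureReport("Lab 1", "Reliability growth test", "1996-04-11 10:05",
                  "RES-0042", "S003", "M-10", "Open circuit", "C. Lee"),
]

history = failure_history(database, "CAP-0001")
print(dict(history))
```

In this made-up data set, the capacitor failure seen in the reliability growth test is accompanied by earlier failures at incoming inspection and in environmental testing, which is the pattern that makes a single failure more significant.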
The failure documentation should be augmented with the verification of failure (step 3 in Figure 13), and verification that the suspect part did indeed fail (step 6). The format of the failure reporting form should be determined by the manufacturer to best meet its needs for improving the product design reliability and assessing whether corrective actions have been effective.
Once the failure is isolated, the FRACAS database and failure analysis can be used to determine its root cause. Given the root cause, appropriate corrective action can be determined.
Failure analysis can be performed to various levels of detail, and may require coordination with the part supplier. The most critical failures (i.e., those that occur most often, are most expensive to repair, or threaten the user's safety) should receive in-depth analysis, perhaps including X-rays, scanning electron beam probing, etc., which require specialized equipment. Where the manufacturer does not have a comprehensive failure analysis laboratory, outside sources are available for use.
A sample failure reporting form that includes the minimum essential information to make corrective action decisions is shown in Figure 14.
| FAILURE REPORT FORM, XYZ COMPANY |
| Model #: Computer #6161 | Date of Occurrence: 10 April 96 |
| Time of Event: 0846 AM |
| Description of Event: | Computer failed to perform correct computation |
| Event Observed by: | P.C. Borde |
| Description of Repair: | Replaced Accumulator Board #2 |
| Product Repaired by: | Mike R. Sawft |
| Description of Failure Analysis: | Replaced part no. IC-8086 (Intel). Part was submitted for failure analysis, where it was determined that the failure cause was electrical overstress (root cause: electrostatic discharge). |
| Part Analyzed by: | J. Bush |
| Recommended Action: | Use electrostatic grounding clips during all maintenance actions. |
| Report Prepared by: | P. Tree | Report Date: 14 April 1996 |
Figure 14. Sample Failure Reporting Form
3.10 Fault Tree Analysis (FTA)
3.10.1 Purpose. Fault Tree Analysis (FTA) is a top down failure consequence assessment technique that is useful in identifying safety concerns so that product modifications can be made. When used in the design stage, the results of the analysis will identify the cause(s) of product failures which may then be eliminated through good design practice. Updating the FTA to reflect design changes will assess whether previous problems have been eliminated, or new problems have been introduced.
3.10.2 Benefits. When FTA is applied in the design stage, the benefits that can be derived include:
- Identification of single failure points
- Identification of safety concerns
- Evaluation of software and man-machine interfaces
- Evaluation of design change impacts
- Simplification of maintenance and trouble-shooting procedures
- Assessment of modifications or enhancements
3.10.3 Timing. An FTA can be performed as early as the product Concept/Planning phase; however, application in the early stages of Design/Development is the most informative. This technique is very good for assessing design progress in identifying the causes of failure in a product resulting from modifications and design corrective actions.
3.10.4 Application Guidelines. As a general assessment tool, FTA should be used for evaluation of complex products with regard to safety and reliability. This technique should be applied when the need to know what causes a hypothesized catastrophic event is important to the success of a product. Similar to a Failure Mode and Effects Analysis, FTA will identify major failure modes of the product resulting from lower level failures. The product design reliability can then be improved by eliminating the causes of those failures.
A basic Fault Tree Analysis relates an undesired event to possible causes through a tree-like network branching at "AND gates" and "OR gates." For example, Figure 15 shows a partial fault tree for the event that an automobile will not start. It shows the problem may be due to electrical or fuel problems and that one electrical problem could be the combination of a weak battery and an unheated garage on a cold day. Table 22 explains the symbology used.
Figure 15. Example FTA: Car Won't Start
Table 22. FTA Symbology
Cut Set Analysis. A cut set is a combination of basic events (the circles in Table 22) that results in the undesired event. When one basic event alone can cause the end event (a cut set of one element), it is referred to as a single point of failure. A minimum cut set is the smallest combination of events that will cause the end event; that is, no event can be removed from it without losing the ability to cause the end event. For example, the basic cut sets of Figure 16 are events (1 and 3), (2 and 4), (3) or (4). Since event 3 is a single point of failure, the cut set (1 and 3) is redundant. Since event 4 is also a single point of failure, the cut set (2 and 4) is also redundant. Hence, the minimum cut sets for Figure 16 are (3) or (4), two single points of failure. In a qualitative analysis of a fault tree, the smallest cut sets are given the most attention, with single points of failure considered first.
Figure 16. Fault Tree Analysis Problem
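The reduction from basic cut sets to minimum cut sets can be sketched in code: a cut set is dropped whenever another cut set is a proper subset of it. The sketch below applies this rule to the four cut sets quoted for Figure 16; the function name is illustrative.

```python
def minimal_cut_sets(cut_sets):
    """Keep only cut sets that do not contain a smaller cut set."""
    unique = {frozenset(c) for c in cut_sets}
    minimal = [s for s in unique if not any(other < s for other in unique)]
    # Smallest sets first, so single points of failure head the list
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))

# Basic cut sets quoted for Figure 16
basic = [(1, 3), (2, 4), (3,), (4,)]

minimal = minimal_cut_sets(basic)
single_points = [next(iter(s)) for s in minimal if len(s) == 1]

print("minimum cut sets:", [sorted(s) for s in minimal])
print("single points of failure:", single_points)
```

Sorting by size mirrors the qualitative practice described above, in which the smallest cut sets receive attention first.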
As a more detailed example, Figure 17 presents a fault tree cut set for a smoke detector which is designed to emit an alarm in the presence of smoke. Each failure mode and possible cause is indicated.
Quantitative Fault Tree Analysis. When the probability of each basic event can be estimated, it is possible to compute a number, called the criticality, from which the relative importance of the event can be determined. The criticality number is computed by multiplying the probability of the basic event happening by the conditional probability that, given the occurrence of the basic event, the end event will happen. For example, consider the fault tree presented in Figure 18.
| 1. Undetected Smoke Failure | 9. SCR Fails Off | 17. R4 Short |
| 2. No Signal to Alarm Assembly | 10. C5 Short | 18. D4 Short |
| 3. Faulty Alarm Assembly | 11. Horn Fails | 19. R9 Open |
| 4. No Voltage/Low Voltage at Q3 | 12. Faulty Connector | 20. C1 Short |
| 5. R6 Misadjusted | 13. Degraded Battery | 21. R8 Short |
| 6. No Smoke in Chamber | 14. Q4 Low Output Current | 22. R1 Open |
| 7. Component Failure | 15. Q3 Low Output Current | 23. Defective Sensor |
| 8. R13 Short or Open | 16. R6 Open |  |
Figure 17. Fault Tree for a Smoke Detector
Figure 18. Quantitative Fault Tree
The number under each basic event is the probability that it will occur. The conditional probability that the end event will occur is determined from probability theory. For example, to determine the criticality of event 1, multiply its probability of occurrence (.01) by the probability that the end event will occur, given that event 1 has happened. From Table 23, the end event (H) will occur when both events A and B occur. Hence, its probability is the product of the probability that A will occur and the probability that B will occur.
Since event A is connected to its causes (events 1 and 2) by an AND gate, the AND gate probability equation applies. When calculating the criticality of event 1, however, the event is assumed to have occurred and its probability will be set to 1.0 so the probability of event A, given event 1 has occurred, is simply the probability that event 2 will occur (.03).
Event B is connected by an OR gate to its causing events, so either event 3 or event 4 will cause event B. To calculate its probability, note that the probability of B occurring is one minus the probability that it will not occur, and that the probability of B not occurring is the product of the probability that event 3 will not occur times the probability that event 4 will not occur. Further, the probability that event 3 (or event 4) will not occur is one minus the probability that it will occur.
Note that when calculating the criticality of either event 3 or event 4, the probability of event B happening will be 1.0, since either event will cause event B, and the event whose criticality is being computed is assumed to have happened (i.e., has a probability of occurrence of 1.0).
Using the AND and OR gate equations, the criticality of each of the four basic events of Figure 18 can be computed. The results are given in Table 23, which shows that events 1 and 2 are the most critical, and event 3 is the least critical.
Table 23. FTA Criticality Results
| Basic Event (Xi) | P(Xi) | P(A/Xi) | P(B/Xi) | P(H/Xi) | Criticality P(Xi) [P(H/Xi)] |
|---|---|---|---|---|---|
| 1 | .01 | .03 | .09 | .0027 | .000027 |
| 2 | .03 | .01 | .09 | .0009 | .000027 |
| 3 | .04 | .0003 | 1.0 | .0003 | .000012 |
| 4 | .05 | .0003 | 1.0 | .0003 | .000015 |
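The criticality calculation can be sketched as follows, assuming the tree structure described in the text (end event H = A AND B, A = event 1 AND event 2, B = event 3 OR event 4) and assuming the basic events are statistically independent. Function names are illustrative. The sketch carries P(B) = 0.088 without rounding, so the values for events 1 and 2 come out slightly lower than the table entries, which are based on the rounded value .09.

```python
def p_and(*probs):
    """AND gate: all inputs must occur (independent events)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def p_or(*probs):
    """OR gate: one minus the probability that no input occurs."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur

def p_end_event(p):
    """H = A AND B, with A = 1 AND 2 and B = 3 OR 4."""
    a = p_and(p[1], p[2])
    b = p_or(p[3], p[4])
    return p_and(a, b)

# Basic event probabilities from Table 23
base = {1: 0.01, 2: 0.03, 3: 0.04, 4: 0.05}

def criticality(event):
    """P(Xi) times P(H | Xi), setting the event's own probability to 1.0."""
    given = dict(base)
    given[event] = 1.0
    return base[event] * p_end_event(given)

for event in sorted(base):
    print(f"event {event}: criticality = {criticality(event):.3e}")
```

The ordering matches the discussion of Table 23: events 1 and 2 tie for the highest criticality and event 3 has the lowest.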
3.11 Finite Element Analysis (FEA)
3.11.1 Purpose. Simulation techniques are very effective checks of mechanical and thermal robustness of product designs prior to production. Finite Element Analysis (FEA) is a simulation technique, usually computer implemented, that estimates material response to loads or environmental disturbances. The analysis can be used to assess the potential for thermal or mechanical failure in reaction to the expected loads, or assessment of failures resulting from testing.
3.11.2 Benefits. The benefits of a FEA are the early discovery of life limiting material deficiencies and the uncovering of excessive environmental load conditions. With the identification of the deficiency, either more robust components or better environment isolation techniques can be introduced to reduce the load's impact on the product design. This analysis can be performed before product manufacturing to uncover problems, after design changes to detect weaknesses, or after problem areas have been determined through testing.
3.11.3 Timing. The most effective FEA occurs when the product or item is developed to the point where the material and design properties can be clearly defined. Since FEAs are time consuming and costly, the items to be analyzed should be selected very carefully. When used as an assessment tool, failure trends or problem areas would be potential candidates for the analysis.
3.11.4 Application Guidelines. A FEA is the breakdown of a product into one or more elements that can be represented by mathematical models of an idealized structure. Each structure is represented by a grid of node points with interconnections. Without the use of computers to solve these models, the technique is restricted to the most simple or ideal problems. With the use of high speed digital computers, the scope of this analysis has been expanded to analyze complex items such as a liquid cooled high powered traveling wave tube (TWT) for thermal displacement of internal components relative to the tube envelope. With the use of a computer, a solution can be obtained by combining individual elements into an idealized structure for which conditions of equilibrium and compatibility are satisfied.
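To make the ideas of elements, node points and assembly concrete, the following is a minimal sketch of a one-dimensional finite element model: an axial bar, fixed at one end and pulled at the other, divided into equal two-node elements. The material properties, load and function names are hypothetical and chosen only for illustration; a practical analysis such as the TWT case would rely on a dedicated FEA package.

```python
def assemble_bar(n_elements, k):
    """Global stiffness matrix for a chain of two-node bar elements of stiffness k."""
    n_nodes = n_elements + 1
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    for e in range(n_elements):
        i, j = e, e + 1          # nodes shared with neighbors enforce compatibility
        K[i][i] += k
        K[i][j] -= k
        K[j][i] -= k
        K[j][j] += k
    return K

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

# Hypothetical bar: modulus (Pa), cross-section (m^2), length (m), tip load (N)
E, A, L, F = 70e9, 1e-4, 1.0, 1000.0
n_elements = 4
k = E * A / (L / n_elements)     # axial stiffness of one element

K = assemble_bar(n_elements, k)

# Boundary condition: node 0 is fixed, so drop its row and column
K_reduced = [row[1:] for row in K[1:]]
loads = [0.0] * (n_elements - 1) + [F]   # equilibrium: load applied at the free end

u = solve(K_reduced, loads)      # displacements of nodes 1..n
print(f"tip displacement = {u[-1]:.6e} m (closed form: {F * L / (E * A):.6e} m)")
```

For this simple case the finite element result agrees with the closed-form answer F L / (E A), which is a useful sanity check before applying the same assembly idea to geometries that have no exact solution.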
Application of a FEA is especially appropriate for products that use advanced or unique packaging or design concepts. The types of problems that can be analyzed include mechanical stress analysis, heat transfer, fluid flow, vibration and elasticity.
The most difficult and time consuming part of a FEA is establishing the detailed mathematical models and boundary conditions. Therefore, selection of items to be analyzed should be performed very carefully. Selection criteria should include:
- New materials or technologies
- Severe environmental load conditions
- Critical thermal or mechanical constraints
- Failure trends
The general steps to be followed in performing a FEA are presented in Table 24.
Table 24. Steps for Performing a Finite Element Analysis
| Step |
Comments |
Circuit Board Example |
| 1. Idealize the product into an analyzable form |
Potential forms used to develop a coarse FEA “mesh” are:
- 2-dimensional model (noncomplex items like beams, trusses, thin shells)
- 3-dimensional model (brick elements)
Boundary and load conditions are defined for the materials, environment and structural support |
 |
| 2. Reduce the coarse mesh to a small area (or single device) to determine more accurate stress information |
Several methods exist, including the “direct method”, in which the product is simplified so it can be described by ordinary differential equations. The direct method is effective if exact equation solutions are available, but not effective for irregular geometries which require nonlinear solutions. |
.jpg) |
| 3. Use deterministic life analysis using stress and cycles to failure |
Stress may be temperature induced deflection or vibration displacement. |
.jpg) |
| 4. Determine a probability of success based on the statistical distribution of failures resulting from the stress and cycles |
|
.jpg) |
3.12 Part Obsolescence
3.12.1 Purpose. The part obsolescence task has a twofold purpose:
- Assess design changes or alternative technologies to minimize the use of obsolete, or soon to be obsolete, parts and materials and their sources or suppliers.
- Select parts or materials with alternate sources or suppliers to replace potentially obsolete parts and diminishing sources.
Successful management of parts obsolescence requires continuous assessment to ensure parts and materials availability to support the product over its life cycle.
3.12.2 Benefits. The benefits of parts obsolescence management include continued part availability and the use of preferred manufacturing processes. The continued use of an obsolescence management process will result in efficiently implementing alternate courses of action when appropriate, such as lifetime buys, substitute parts or investment in new technology.
3.12.3 Timing. Part and vendor obsolescence management should be a basic part of company operating/design/manufacturing procedures (i.e., best commercial practices) implemented during all phases of product development and should essentially be product independent. Implementation prior to the start of the Design/Development phase will ensure reliable product operation and adequate repair support.
3.12.4 Application Guidelines. The ability to guarantee part and material availability during product design, manufacturing and field service encompasses two areas of concern. One factor that can limit product availability is obsolescence, which occurs when parts that are required for product manufacture or support are no longer manufactured because there is insufficient market demand. It is common to have products and systems whose lifetimes are greater than the life cycle of part technologies.
The second factor that must be considered is the potential for diminishing sources, causing parts that are not technically obsolete to become difficult to obtain. This can be the result of the manufacturer experiencing limited orders, downsizing market conditions, market instability, or a business decision to exit the market for a particular technology or device. Regardless of the reason, the part is unavailable, and the effect is essentially the same as if the part had become obsolete.
When end-of-life parts are identified, despite the proactive management of parts and vendors to alleviate or minimize device obsolescence and diminishing sources problems, short and long term solutions are required. The short term solution begins when a device is unavailable and typically is resolved through a part replacement. The long term solution ensures future product producibility. When implementing a fix for a specific obsolescence problem, the long and short term solutions may be different. The short term fix usually involves seeking alternate vendors or lifetime buys, whereas the long term solution could involve investing in new technology or a redesign of the product to allow use of readily available parts and technology.
Early identification of part/vendor end-of-life status provides an opportunity to select an acceptable solution that will minimize the impact on manufacturing. External sources such as the Defense Logistics Agency/Defense Electronic Supply Center, Government
Figure 19. Part Obsolescence Solution Flowchart
3.13 Predictions
3.13.1 Purpose. One purpose for reliability prediction is to assess the product design progress and to provide a quantitative basis for selection among competing approaches or components. Predictions are a cost effective way to quickly analyze basic applications and stresses on all components.
3.13.2 Benefits. Prediction methods assess progress in meeting design goals, achieving component or part derating levels, identifying environmental concerns and controlling critical items. In addition, prediction results can be used to rank design problem areas and assess trade study results.
3.13.3 Timing. Predictions should be an on-going activity that starts with the initial design concept and the selection of parts and materials, and continues through alternate design approaches, redesigns, and corrective actions. Each prediction should provide a better estimate of product reliability as better information on the product design approach becomes available. Later predictions during the Design/Development phase evaluate stress and life limiting constraints, as well as identify design problem areas.
3.13.4 Application Guidelines. The first step in the selection and application of a prediction technique is to determine what class of failure is being considered, i.e., early defects, random events or wearout failure. Figure 20 shows the basic elements of a failure rate vs. time curve.
Figure 20. Failure Rate vs. Time
The second step is to choose the methodology used to predict the product reliability. There are many forms of reliability prediction, and Table 25 provides an overview of the major types. The classes of failure are "early defects", which assumes a decreasing product failure rate; "random events", which assumes a constant failure rate; and "wearout", which assumes an increasing product failure rate.
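One common way to picture the three classes, though by no means the only one, is the Weibull hazard function, whose shape parameter β gives a decreasing failure rate when β < 1, a constant rate when β = 1 and an increasing rate when β > 1. The sketch below is an illustration under that assumption; the parameter values and names are hypothetical.

```python
def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate of a two-parameter Weibull distribution."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

def trend(beta, eta=1000.0, t_early=100.0, t_late=900.0):
    """Classify the failure rate trend by comparing an early and a late time."""
    early = weibull_hazard(t_early, beta, eta)
    late = weibull_hazard(t_late, beta, eta)
    if abs(late - early) <= 1e-12 * max(early, late):
        return "constant"
    return "increasing" if late > early else "decreasing"

# Hypothetical shape parameters, one per class of failure
classes = {"early defects": 0.5, "random events": 1.0, "wearout": 3.0}

for name, beta in classes.items():
    print(f"{name:14s} beta = {beta}: failure rate is {trend(beta)}")
```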
Table 25. Reliability Prediction Methodologies
| Methodology | Early Defects | Random Events | Wearout | Description |
|---|---|---|---|---|
| Empirical | √ | √ |  | Typically relies on observed failure data to quantify part-level empirical model variables. Premise is that valid failure rate data is available. |
| Translation | √ | √ |  | Translates a reliability prediction based on an empirical model to an estimated field reliability value. Implicitly accounts for some factors affecting field reliability not explicitly accounted for in the empirical model. |
| Physics-of-Failure |  |  | √ | Models each failure mechanism for each component, individually. Component reliability is determined by combining the probability density functions associated with each failure mechanism. |
| Similar Item Data | √ | √ |  | Based on empirical reliability design data from products similar to the one being analyzed. Product similarity should include complexity, maturity, manufacturing processes, design processes, function, and intended use environment. Uses specific product predecessor data. |
| Generic System Level Models | √ | √ |  | Based on empirical reliability field failure rate data on similar products operating in similar environments. Uses generic data from other organizations. |
| Test or Field Data | √ | √ | √ | Product in-house test data is used to extrapolate estimated field reliability of the product. |
| Software Estimate | √ |  |  | Most prediction methods rely on estimating the number of initial defects (program errors) and the rate of removal. |
The third step in setting up a prediction program is to select methodologies that are effective for a given product time phase. Table 26 illustrates the points in the product life cycle when different techniques should be considered.
Table 26. Reliability Prediction Technique Alternatives
| Time | Level | Example | Techniques |
|---|---|---|---|
| Conceptual Design | Product or System | Computer | Similar Item Data; Generic System Level; Translation; Test or Field Data; Empirical Parts Count |
| Preliminary Design | Assembly or Component | Processor | Similar Item Data; Generic System Level; Empirical Parts Count; Software Estimate |
| Final Design | Circuit or Part | Microcircuit | Empirical Stress Analysis; Test or Field Data; Physics of Failure |
| Testing | Component to Product | Power Supply | Test or Field Data; Physics of Failure |
Each methodology has numerous models or data sources for predicting reliability. Table 27 presents a brief summary of the methodologies and some of the data sources.
Table 27. Methodologies and Model/Data Source
| Methodology | Source of Model |
|---|---|
| Empirical | Part Count Method: MIL-HDBK-217; Bellcore; British Telecom. Part Stress Analysis: MIL-HDBK-217; British Telecom; French CNET |
| Translation | Empirical to Field Reliability: Reliability Toolkit: Commercial Practices Edition; RADC-TR-89-299 "Reliability & Maintainability Operational Parameter Translation" |
| Physics of Failure | Prediction Based on Failure Mechanisms: RADC-TR-90-72 "Reliability Analysis/Assessment of Advanced Technologies"; CINDAS Data (Center for Information and Numerical Data Analysis and Synthesis) |
| Similar Item Data | Use existing similar products that have time-to-failure information, i.e., failures, cycles, operating time, storage time |
| Generic System Level Model | Other sources of similar item data such as: Reliability Toolkit: Commercial Practices Edition; NPRD-95 "Nonelectronic Parts Reliability Data" |
| Test or Field Data | Use existing product test or field information, adjusted by the environment if different |
| Software Estimate | Models that could be considered include: Musa Model; Putnam's Time Axis; Exponential Model |
Parts Count Prediction. The parts count method is generally used to analyze electronic circuits in the early design phase, when the number and type of parts in each class (such as capacitor, resistor, transistor, microcircuit, etc.) are known and the overall design complexity is likely to change appreciably during later phases of design/development. The method starts with a listing of the part types and their expected quantities. Reliability data is then taken from source books such as MIL-HDBK-217, "Reliability Prediction of Electronic Equipment." For each part type, the failure rate, quantity and adjustment factors are multiplied, and the results are summed over all part types to determine the product failure rate. This method assumes that the times-to-failure of the parts are exponentially distributed. The general expression for a product failure rate using this method is:
$$\lambda_{product} = \sum_{i=1}^{n} N_i \, \lambda_{Gi} \, \pi_{Ai}$$
where,
| $\lambda_{product}$ | = | Total failure rate (failures per unit time) |
| $\lambda_{Gi}$ | = | Generic failure rate for the ith generic part |
| $\pi_{Ai}$ | = | Adjustment factor for the ith generic part (quality factor, temperature factor, environmental factor) |
| $N_i$ | = | Quantity of ith generic part |
| $n$ | = | Number of different generic part categories |
Example of a Parts Count Prediction. An electronic receiver is analyzed using the parts count method. The part types and quantities are indicated in Table 28. The part failure rate data was obtained from field experience data for a ground mobile (GM) environmental condition. An adjustment to an airborne inhabited cargo (AIC) environment is needed. What is the estimated reliability of the receiver in terms of mean-time-between-failure (MTBF)?
Table 28. Electronic Receiver Reliability Parts Count Analysis
| Device | Quantity | GM Failure Rate (Failures/$10^6$ Hrs.) | Adjustment* Factor GM to AIC | Component Type Failure Rate (Failures/$10^6$ Hrs.) |
|---|---|---|---|---|
| Microcircuit | 25 | 0.06 | (1/1.4) = 0.71 | 1.07 |
| Diode | 50 | 0.001 | (1/1.4) = 0.71 | 0.04 |
| Transistor | 25 | 0.002 | (1/1.4) = 0.71 | 0.04 |
| Resistor | 100 | 0.002 | (1/1.4) = 0.71 | 0.14 |
| Capacitor | 100 | 0.008 | (1/1.4) = 0.71 | 0.57 |
| Switch | 25 | 0.02 | (1/1.4) = 0.71 | 0.36 |
| Relay | 10 | 0.40 | (1/1.4) = 0.71 | 2.84 |
| Transformer | 2 | 0.05 | (1/1.4) = 0.71 | 0.07 |
| Connector | 3 | 1.00 | (1/1.4) = 0.71 | 2.13 |
| Circuit Board | 1 | 0.70 | (1/1.4) = 0.71 | 0.50 |
| Totals ($\lambda_T$) | | | | 7.76 |

$MTBF_{Total} = 1 / \lambda_T = 1 / (7.76 \times 10^{-6}) = 128{,}866$ hours
*Environmental adjustment factor source "Reliability Toolkit: Commercial Practices Edition", page 176
The product reliability is determined by multiplying the quantity of each part type by its failure rate, then adjusting the failure rate from GM to AIC environmental conditions. The failure rate results of the parts are then summed to determine the product failure rate.
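A minimal Python sketch of the Table 28 arithmetic follows. The names `PARTS`, `GM_TO_AIC` and `parts_count_failure_rate` are chosen here for illustration, and the rounded adjustment factor of 0.71 is taken directly from the table.

```python
# Parts count sketch using the Table 28 receiver data.
# Each entry: (device, quantity, GM failure rate in failures per 10^6 hours).
PARTS = [
    ("Microcircuit", 25, 0.06),
    ("Diode", 50, 0.001),
    ("Transistor", 25, 0.002),
    ("Resistor", 100, 0.002),
    ("Capacitor", 100, 0.008),
    ("Switch", 25, 0.02),
    ("Relay", 10, 0.40),
    ("Transformer", 2, 0.05),
    ("Connector", 3, 1.00),
    ("Circuit Board", 1, 0.70),
]

GM_TO_AIC = 0.71  # rounded (1/1.4) adjustment factor, as listed in Table 28


def parts_count_failure_rate(parts, adjustment):
    """Sum quantity * generic failure rate * adjustment over all part types."""
    return sum(qty * rate * adjustment for _, qty, rate in parts)


lambda_total = parts_count_failure_rate(PARTS, GM_TO_AIC)  # failures per 10^6 hours
mtbf_hours = 1e6 / lambda_total

print(f"Total failure rate: {lambda_total:.3f} failures per 10^6 hours")
print(f"MTBF: {mtbf_hours:,.0f} hours")
```

Because this sketch sums unrounded row values, it gives a total of about 7.74 failures per $10^6$ hours rather than the 7.76 obtained in Table 28 by summing rows already rounded to two decimals. The resulting MTBF differs from the tabulated value by well under one percent.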
Part Stress Prediction. The part stress analysis method is used in the detailed Design/Development phase when individual part level information and design stress data is available. The method requires the use of defined models that include electrical and mechanical stress factors, environmental factors, duty cycles, etc. Each of these factors should be known, or be capable of being determined, so that the effects of those stresses on the parts' failure rates can be evaluated. Table 29 shows several major factors which influence device reliability.
Table 29. Major Influence Factors on Device Reliability
| Device Type | Influence Factors |
|---|---|
| Integrated Circuits | Temperature; Complexity; Supply Voltage |
| Semiconductors | Temperature; Power Dissipation; Breakdown Voltage; Material |
| Resistors | Temperature; Power Dissipation; Type |
| Capacitors | Temperature; Voltage; Type |
| Inductive Devices | Temperature; Current; Voltage; Insulation |
| Switches and Relays | Current; Contact Power; Type |
A typical empirical mathematical model is illustrated as follows (using a ceramic trimmer capacitor as an example):
$$\lambda_p = \lambda_b \, \pi_T \, \pi_C \, \pi_V \, \pi_Q \, \pi_E$$
where,
| $\lambda_p$ | = | Trimmer capacitor failure rate |
| $\lambda_b$ | = | Base failure rate (laboratory failure rate in the absence of dynamic stresses) |
| $\pi_T$ | = | Temperature factor |
| $\pi_C$ | = | Capacitance factor |
| $\pi_V$ | = | Voltage stress factor |
| $\pi_Q$ | = | Quality factor |
| $\pi_E$ | = | Environmental factor (accounts for dynamic stresses in the end-user environment) |
A stress-temperature failure rate plot for this example is shown in Figure 21. As can be seen from the plot, the failure rate increases as the temperature goes up, or as the applied stress (voltage) increases.
Figure 21. Trimmer Ceramic Capacitor Failure Rates/Stress Plot
Predictions from Test or Field Data Analysis. Development programs often make use of existing equipment (or assembly) designs, or designs adapted to a particular application. If this situation exists, Table 30 summarizes the necessary characteristics of the data needed for reliability analyses.
Table 30. Use of Existing Reliability Data
| Information Required | Product Field Data | Product Test Data | Piece Part Data |
|---|---|---|---|
| Data collection time period | X | X | X |
| Number of operating hours per product | X | X | |
| Total number of part hours | | | X |
| Total number of observed maintenance actions | X | | |
| Number of "no defect found" maintenance actions | X | | |
| Number of induced maintenance actions | X | | |
| Number of "hard failure" maintenance actions | X | | |
| Number of observed failures | | X | X |
| Number of relevant failures | | X | X |
| Number of nonrelevant failures | | X | X |
| Failure definition | | X | X |
| Number of products or parts to which data pertains | X | X | X |
| Similarity of product of interest to product for which data is available | X | X | |
| Environmental stress associated with data | X | X | X |
| Type of testing | | X | |
| Field data source | X | | |
Similar Item Prediction. This method starts with the collection of past experience data on similar products. The data is evaluated for form, fit and function (FFF) compatibility with the new product. If the product is an item that is undergoing a minor enhancement, the collected data will provide a good basis for comparison to the new product. Small differences in operating environment or conditions can be accounted for. If the product does not have a direct similar item, then lower level similar circuits can be compared. In this case, data for components or circuits are collected and a product reliability value is calculated. The general expression for product reliability calculated from its constituent components using the similar item method is:
$$R_p = R_1 \cdot R_2 \cdots R_n$$
where,
| $R_p$ | = | Product reliability |
| $R_1, R_2, \ldots, R_n$ | = | Component reliabilities |
Example of a Similar Item Prediction. A new computer product is composed of a processor, a display, a modem and a keyboard. The new product is expected to operate in a 40°C environment. Data on similar components was located and is shown in the second column of Table 31. The similar item data is for a unit operating in a 20°C environment. What mean-time-between-failure can be expected for the new system if a 30% reliability improvement (as a result of improved technology) is expected?
Table 31. Reliability Analysis Similar Item
| Items | Similar Data MTBF (Hrs.) | Temperature* Factor | Improvement Factor | New Product MTBF (Hrs.) |
|---|---|---|---|---|
| Processor | 5,000 | 0.8 | 1.30 | 5,200 |
| Display | 15,000 | 0.8 | 1.30 | 15,600 |
| Modem | 30,000 | 0.8 | 1.30 | 31,200 |
| Keyboard | 60,000 | 0.8 | 1.30 | 62,400 |
| System | 3,158 | | | 3,284 |

*Temperature conversion factor source "Reliability Toolkit: Commercial Practices Edition", page 176
Each component MTBF is corrected for the change in temperature from 20°C to 40°C. Technology improvements were also included and the product mean-time-between-failure (MTBF) was calculated as the reciprocal of the summed component failure rates:
$$MTBF_p = \frac{1}{\sum_i \lambda_i}$$
where,
| $MTBF_p$ | = | mean-time-between-failure of the product |
| $\lambda_i$ | = | failure rate of the ith component = $1 / MTBF_i$ |
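The Table 31 figures can be reproduced with a short Python sketch. It assumes a series model in which component failure rates add, so the system MTBF is the reciprocal of the summed reciprocals of the component MTBFs. The helper name `series_mtbf` and the dictionary names are illustrative only.

```python
# Similar item prediction sketch using the Table 31 data.
SIMILAR_MTBF_HOURS = {
    "Processor": 5_000,
    "Display": 15_000,
    "Modem": 30_000,
    "Keyboard": 60_000,
}
TEMPERATURE_FACTOR = 0.8   # 20 deg C to 40 deg C correction from Table 31
IMPROVEMENT_FACTOR = 1.30  # expected 30% technology improvement


def series_mtbf(mtbfs):
    """System MTBF for components in series: 1 / sum(1 / MTBF_i)."""
    return 1.0 / sum(1.0 / m for m in mtbfs)


# Adjust each similar-item MTBF to the new product's conditions.
new_mtbf = {
    name: mtbf * TEMPERATURE_FACTOR * IMPROVEMENT_FACTOR
    for name, mtbf in SIMILAR_MTBF_HOURS.items()
}

system_similar = series_mtbf(SIMILAR_MTBF_HOURS.values())
system_new = series_mtbf(new_mtbf.values())

print(f"Similar item system MTBF: {system_similar:,.0f} hours")
print(f"New product system MTBF:  {system_new:,.0f} hours")
```

The two system values round to the 3,158 and 3,284 hours listed in the last row of Table 31.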
Software Reliability Prediction. Predicting software reliability is difficult because software failures arise from software faults resulting from missing, extra or defective lines of code. The time to failure often depends on the execution speed of the computer and size of the program. A software growth model mathematically summarizes a set of assumptions about the phenomenon of software failure. A general form is as follows:
Initial Software Failure Rate Model:
$$\lambda_o = \frac{r_i \, K \, W_o}{I} \quad \text{failures per computer second}$$
where,
| $r_i$ | = | Host processor speed (instructions/sec) |
| $K$ | = | Fault exposure ratio, which is a function of program data dependency and structure (default = $4.2 \times 10^{-7}$) |
| $W_o$ | = | Estimate of the initial number of faults in the program (default = 6 faults / 1000 lines of code) |
| $I$ | = | Number of object instructions, which is determined by the number of source lines of code times the expansion ratio, given below: |

| Programming Language | Expansion Ratio |
|---|---|
| Assembler | 1 |
| Macro Assembler | 1.5 |
| C | 2.5 |
| COBOL | 3 |
| FORTRAN | 3 |
| JOVIAL | 3 |
| Ada | 4.5 |
Software Reliability Growth:
$$\lambda(t) = \lambda_o \, e^{-\beta t}$$
where,
| $\lambda(t)$ | = | Software failure rate at time t (failures per computer second) |
| $\lambda_o$ | = | Initial software failure rate |
| $t$ | = | Computer execution time (seconds) |
| $\beta$ | = | $B (\lambda_o / W_o)$ (decrease in failure rate per failure occurrence) |

where,
| $B$ | = | Fault reduction factor (default = .955) |
| $W_o$ | = | Initial number of faults in the software program (default = 6 faults per 1,000 lines of code) |
Example: Reliability Prediction of an Ada Software Program. Estimate the initial software failure rate and the failure rate after growth testing for 40,000 seconds of computer execution time at 2 million instructions per second (2 MIPS). The software is a 20,000 line Ada program.
| $r_i$ | = | 2 MIPS = 2,000,000 instructions/sec |
| $K$ | = | $4.2 \times 10^{-7}$ |
| $W_o$ | = | (6 faults/1000 lines of code) (20,000 lines of code) = 120 faults |
| $I$ | = | (20,000 source lines of code) (4.5) = 90,000 instructions |
| $\lambda_o$ | = | [(2,000,000 inst./sec) ($4.2 \times 10^{-7}$) (120 faults)] / [90,000 inst.] = .00112 failures/computer second |
| $\beta$ | = | $B (\lambda_o / W_o)$ = (.955) [(.00112 failures/sec) / 120 faults] = $8.91 \times 10^{-6}$ failures/sec |
| $\lambda(40{,}000)$ | = | .00112 $e^{-[(8.91 \times 10^{-6}\ \text{failures/sec})(40{,}000)]}$ = .000784 failures/computer second |
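The Ada example can be checked with a brief Python sketch of the two expressions above. The function names `initial_failure_rate` and `failure_rate_at` are illustrative assumptions, and the defaults are the ones quoted in the text.

```python
import math


def initial_failure_rate(r_i, K, W0, I):
    """lambda_o = r_i * K * W_o / I, in failures per computer second."""
    return r_i * K * W0 / I


def failure_rate_at(t, lam0, W0, B=0.955):
    """lambda(t) = lambda_o * exp(-beta * t), with beta = B * lambda_o / W_o."""
    beta = B * lam0 / W0
    return lam0 * math.exp(-beta * t)


source_lines = 20_000
W0 = 6 / 1000 * source_lines   # 120 initial faults
I = source_lines * 4.5         # Ada expansion ratio gives 90,000 object instructions

lam0 = initial_failure_rate(r_i=2_000_000, K=4.2e-7, W0=W0, I=I)
beta = 0.955 * lam0 / W0
lam_40000 = failure_rate_at(40_000, lam0, W0)

print(f"Initial failure rate:  {lam0:.5f} failures/computer second")
print(f"beta:                  {beta:.3e}")
print(f"Rate after 40,000 sec: {lam_40000:.6f} failures/computer second")
```

The growth test in this example reduces the failure rate to roughly 70% of its initial value.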
Physics-of-Failure Prediction. A physics-of-failure prediction looks at individual failure mechanisms such as electromigration, solder joint cracking, die bond adhesion, etc., to estimate the probability of device wearout within the useful life of the product. This analysis requires detailed knowledge of all material characteristics, geometries, and environmental conditions. Specific models for each failure mechanism are available from a variety of reference books. A typical model for bond pad/die shear fatigue is illustrated below, where the dependent coefficients are determined through the use of published manuals on material characteristics.
$$t_{50} = A_2 \, (K_2 \, \Delta T)^{n_2} \qquad (1.2)$$
where,
| $t_{50}$ | = | Mean-time-to-failure (hrs.) |
| $A_2$ | = | Pad material dependent coefficient |
| $K_2$ | = | Die material dependent coefficient |
| $n_2$ | = | Wire material dependent coefficient |
| $\Delta T$ | = | Temperature change at bond pad and die (°C) |
Example of a Physics-of-Failure Prediction for Wire Bond. Using a standard microcircuit in a ground fixed application, determine the mean number of cycles to failure for an aluminum bond wire. Given the equation:

where,
| $N_f$ | = | the number of cycles to fatigue failure |
| $A_1$ | = | $3.93 \times 10^{-10}$ (Aluminum fatigue stress in bending loading) |
| $n_1$ | = | -5.134 (Aluminum fatigue stress in axial loading) |
| $r$ | = | $1.6 \times 10^{-2}$ (Wire size) |
| $\alpha_w$ | = | $22.3 \times 10^{-6}$ (Aluminum wire property) coefficient of expansion |
| $\alpha_s$ | = | $4.67 \times 10^{-6}$ (Silicon substrate property) coefficient of expansion |
| $\Delta T$ | = | 55 (temperature °C) |
| $N_f$ (flex) | = | $2.36 \times 10^{16}$ cycles |
3.14 Sneak Circuit Analysis
3.14.1 Purpose. The purpose of performing a sneak circuit analysis is to find and fix each sneak failure cause to improve the product design. Iterative sneak analysis will assess progress in identifying and eliminating problems that may unexpectedly occur during normal product operation or through incorporation of design modifications and "fixes". The only preventive measure for identifying these sneak circuits is an in-depth manual or computer-aided circuit analysis.
3.14.2 Benefits. Finding and correcting design flaws before selling or using a product will enhance customer satisfaction. In addition, reassessments should be performed every time a design change is introduced. With the development of automated tools, all computer-aided designs can be checked almost as easily as a text document can be spell-checked. These tools increase the scope of application significantly. Specific benefits include:
- Detection of hidden failures
- Prevention of costly redesigns
- Verification of circuit interface integrity
- Ensuring high reliability
- Avoidance of litigation resulting from undiscovered sneak paths
3.14.3 Timing. To maximize the benefit of sneak circuit analysis on a product design, an automated design analysis should be performed as the computer aided design progresses through the product Design/Development phase. This procedure will allow the designer to correct flaws "on the fly" without significant schedule or cost impact. Updates should be performed any time the product or process is changed.
3.14.4 Application Guidelines. The first step in a sneak circuit analysis is for the personnel performing the analysis to understand the definitions and causes of sneak circuits. These are briefly summarized as follows:
Definitions
- Sneak Circuit: A condition which causes the occurrence of an unwanted function, or inhibits a desired function, even though all components are performing properly
- Sneak Timing: Unexpected interruption or enabling of a function due to a switching fault. Usually occurs within a single function timing plan, with little influence from unrelated functions.
- Sneak Paths: Unintended control or power paths connecting system functions that enable them to interact with each other. Usually occurs between unrelated functions that are tied to common power, ground, or control mechanisms.
- Sneak Indications: Incorrect or ambiguous specification of sensors that do not clearly define their purpose or methods of operation. Usually impacts a product through inaccurate measurements during product operations.
- Sneak Labels: Incorrect or ambiguous documentation of designs or production drawings which lead to conflicting interpretations of their purpose. Usually impacts a product through manufacturing flaws.
- Sneak Clues: Design rules, guidelines, and insights applied to topographical patterns by sneak circuit analysis specialists to identify potential sneak conditions. A sneak clue is often proprietary information that is constantly updated to account for new technologies and design methods.
- Topographical Patterns: Forms used to model system networks that enable analysts to apply sneak clues used in performing a sneak circuit analysis.
Causes of Sneak Circuits
- Complexity of design
- Interfaces between distinct functions
- Inadequate understanding of the product design
- Integration of multiple functions
- Design constraints (i.e., volume, weight, or power)
In performing a sneak analysis, the first step is the selection of an analysis technique. Table 32 illustrates three types of common sneak circuit analysis techniques.
Table 32. Sneak Circuit Analysis Techniques
| Type of Analysis | Characteristics |
|---|---|
| Sneak Path: A methodical investigation of all possible circuit paths in an electrical/electronic product. | Used primarily for detecting sneak circuits in hardware products and systems, such as power distribution, control, switching networks, and analog circuits. The analysis is based on known topological similarities of sneak circuits in these types of products. |
| Digital Sneak Circuit: An analysis of digital hardware networks for sneak conditions, operating modes, timing races, logical errors, and inconsistencies. | Depending on product complexity, digital sneak analysis may involve the use of sneak path analysis techniques, manual or graphical analysis, computerized logic simulators, or computer-aided design circuit analysis. |
| Software Sneak Path: An adaptation of sneak path analysis to computer program coding logical flows. | Used to analyze software logical flows by comparing their topologies to those containing known sneak path conditions. |
After selecting a technique, the second step in the application of a sneak circuit analysis is the understanding of topological patterns. These patterns are the keystones of hardware and software analysis. Typical topographical patterns for software and hardware products are illustrated in Figures 22 and 23, respectively.
Figure 22. Software Topographs
The third step is to determine what areas need assessment. The areas for which sneak analysis is typically recommended include safety critical functions, high levels of circuit interface or complexity, or power supply inputs.
The fourth step in the sneak circuit analysis is to transform the product schematic diagrams into network tree diagrams. Finally, the analyst will attempt to identify the basic topological patterns, as shown in Figures 22 and 23, within the network trees. If one is identified, then design corrections can be determined.
Figure 23. Hardware Topographs
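The path enumeration that underlies a sneak path analysis can be illustrated with a small Python sketch. It covers only the "investigate all possible circuit paths" part of the technique, not the matching of topographical patterns or the application of sneak clues. The network, node names and list of intended paths are hypothetical, and the network is treated as an undirected graph for simplicity.

```python
# Hypothetical network: two switched supplies feed two loads, with a
# cross-tie element X (for example, a shared indicator) between them.
EDGES = [
    ("V1", "SW1"), ("SW1", "A"), ("A", "LoadA"), ("LoadA", "GND"),
    ("V2", "SW2"), ("SW2", "B"), ("B", "LoadB"), ("LoadB", "GND"),
    ("A", "X"), ("X", "B"),
]


def build_graph(edges):
    """Build an undirected adjacency list from a list of node pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def simple_paths(graph, start, goal, path=None):
    """Enumerate every path from start to goal that visits no node twice."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found = []
    for neighbor in graph[start]:
        if neighbor not in path:
            found.extend(simple_paths(graph, neighbor, goal, path))
    return found


graph = build_graph(EDGES)
paths = simple_paths(graph, "V1", "GND")

# Paths the designer intended; anything else is flagged for review.
INTENDED = [["V1", "SW1", "A", "LoadA", "GND"]]
suspect_paths = [p for p in paths if p not in INTENDED]

for p in suspect_paths:
    print("Review possible sneak path:", " -> ".join(p))
```

In this hypothetical network the enumeration finds a second path from supply V1 to ground through the cross-tie and LoadB, which is the kind of unintended power path an analyst would then examine against the switch states.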
Example of a Sneak Indicator Problem. If an indicator circuit depends upon the operation of the monitored function (Figure 24a), improper or unexpected operation of the function may inhibit the indicator circuitry. Figure 24a shows that if the heating element has failed open, the monitoring lamp will indicate that the power is off when, in fact, the power is still available at the supply side of the heating element. This misleading indication is a safety hazard to service personnel. A solution is to design a circuit with the indicator and the heating element in parallel, as shown in Figure 24b.
Figure 24. Sneak Indicator Problem and Solution
3.15 Thermal Analysis
3.15.1 Purpose. The general purpose of a thermal analysis is to determine the adequacy of the thermal design of the product in its intended use environment. This analysis is a relatively cost effective way to assess thermal characteristics from the component to the product level.
3.15.2 Benefits. Thermal analysis is a useful tool in characterizing product temperature profiles without resorting to testing. The most important benefits are:
- Estimating operating temperatures of parts and components to assess compliance with customer needs/requirements
- Determining thermal expansion of materials to aid in material selection based on their coefficient of expansion
- Identifying hot spot areas or parts exceeding allowable limits for design/part selection trade-offs
- Optimizing the product thermal design to maximize inherent reliability
3.15.3 Timing. Evaluation of the thermal design starts during the Concept/Planning phase and is an on-going assessment. Each change or modification to the product design requires a re-evaluation to ensure that the gross thermal conditions are controlled for the expected environment and type of product.
3.15.4 Application Guidelines. In order to assess reliability progress, thermal estimates of assemblies and parts are needed. The specific target areas requiring assessment are those products with high power dissipation, thermally sensitive components, extreme operating environments or high package density. Thermal analysis can range from a simple calculation performed with pocket calculators to solving complex problems through the use of sophisticated computer programs. The three basic levels of thermal analysis are preliminary, intermediate and detailed. As the design progresses, the information required for effective thermal analysis increases and the associated level of detail also increases. Table 33 shows the characteristics of different basic thermal analysis approaches.
Table 33. Thermal Analysis Approach Characteristics
| Characteristic | Preliminary Analysis | Intermediate Analysis | Detailed Analysis |
|---|---|---|---|
| Time Phase | Conceptual design | Preliminary design | Final design |
| Accuracy | Gross product estimates | Component estimates | Exact part values |
| Dissipation | Rough part count | Refined part count | Detailed part count |
| Thermal Resistance | Rough estimate | Drawings; data; possible tests | Detailed drawings data; tests |
| Hot Spots | Critical areas | Components identified; solutions considered | Parts identified; changes initiated |
| Sink Temperatures | Rough estimates | Preliminary calculations | Precise calculations |
Preliminary Analysis. This type of analysis is generally performed during the early phases of a product development to explore alternate concepts and to allocate cooling needs. Many assumptions need to be made, but the results do not have to be highly accurate. Gross estimates are usually sufficient to decide which thermal design approach is most appropriate. Analysis is usually performed using calculators and handbooks. To describe this stage of the analysis, a five-node thermal model of a transistor on a circuit board in an enclosure is shown in Figure 25. Each of the resistance paths would be estimated using handbook values so that an estimate of the junction temperature of the transistor could be made.
Figure 25. Five-Node Transistor-Board-Enclosure
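A preliminary estimate of this kind amounts to adding temperature rises along the heat flow path. The Python sketch below assumes the five nodes form a single series chain (junction, case, board, enclosure, ambient); the actual network of Figure 25 may include parallel paths. The power dissipation and thermal resistances are hypothetical placeholders, not handbook values.

```python
# Series thermal-resistance sketch for a preliminary junction temperature estimate.
# All numeric values are hypothetical and serve only to show the arithmetic.
AMBIENT_C = 40.0   # ambient (heat sink) temperature, deg C
POWER_W = 2.0      # power dissipated by the transistor, W

# Thermal resistances in deg C per watt, ordered from ambient back to the junction.
RESISTANCES = [
    ("enclosure", 2.0),   # enclosure to ambient
    ("board", 3.0),       # board to enclosure
    ("case", 10.0),       # case to board
    ("junction", 5.0),    # junction to case
]


def node_temperatures(ambient_c, power_w, resistances):
    """Accumulate the temperature rise power * R across each series resistance."""
    temps = {"ambient": ambient_c}
    current = ambient_c
    for node, r_c_per_w in resistances:
        current += power_w * r_c_per_w
        temps[node] = current
    return temps


temps = node_temperatures(AMBIENT_C, POWER_W, RESISTANCES)
junction_c = temps["junction"]

for node, t in temps.items():
    print(f"{node:>9}: {t:.1f} deg C")
```

With these placeholder values the junction temperature is the ambient temperature plus the power multiplied by the total resistance of 20 °C/W. In a real preliminary analysis each resistance would be estimated from handbook data.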
Intermediate Analysis. The intermediate analysis is performed when the design is beginning to be refined. Part lists are being established and individual component temperature values can be calculated. Most of these analyses are done with the aid of thermal models and computer programs. As the design progresses, changes can be evaluated for thermal impact on the inherent product reliability.
Detailed Analysis. This analysis is performed as detailed design information becomes available, including drawings, part specifications, and material properties. Accurate temperature predictions at any level can be obtained. This level of analysis relies on the use of detailed thermal models, typically having thousands of nodes; therefore, the use of high speed computers and sophisticated software programs is required.
Example. Three part temperature analysis techniques are described in Figures 26 through 28. Figure 26 is an estimate of part temperature for free-convection and radiation cooled equipment using sea-level ambient air as the heat sink. Figure 27 is an estimate of a non-card mounted part temperature using forced air-impingement cooling at sea-level. Figure 28 estimates the temperature of a circuit board mounted part using forced-air impingement cooling at sea-level.
Figure 26. Free Convection to Air Ambient and Radiation
Figure 27. Air Impingement, Non Circuit Board Mounted
Figure 28. Air Impingement, Circuit Board Mounted
3.16 Worst Case Circuit Analysis (WCCA)
3.16.1 Purpose. A Worst Case Circuit Analysis (WCCA) technique is used to assess progress in desensitizing the product to extreme environmental conditions or component parameter changes as the product is used.
3.16.2 Benefits. The benefits associated with conducting a worst case circuit analysis are that it:
- Identifies parts exceeding component derating limit guidelines/requirements
- Analyzes circuits for design faults
- Identifies components that may be overstressed
- Provides a realistic estimate of true worst case performance
- Provides information on possible life limiting conditions and components
- Exposes failures that may be safety risks
3.16.3 Timing. Due to the need for detailed information on the design, materials, parts and processes, this analysis technique is not recommended for application during the product Concept/Planning phase. The best time would be after the initial design review (early Design/Development), with appropriate updates as the product design changes to determine critical component parameter variation and environmental effects on circuit performance. The further along in the design and development phase that WCCA is performed, the more expensive it will be to introduce changes to the design.
3.16.4 Application Guidelines. There are a number of techniques for performing a WCCA, each with its own advantages and disadvantages. Three techniques, the extreme value analysis, the root sum squared and the Monte Carlo analysis, are described. To perform an analysis quickly and accurately, a computer circuit analysis program that is compatible with computer-aided design tools has a decided advantage.
Extreme Value Analysis. Extreme value analysis (EVA) is an analysis of a circuit output with all variables set to the worst possible values. For example, the output of the amplifier circuit shown in Figure 29 will vary as the parameters of its components vary away from nominal. To do an extreme value analysis, the worst case expected values of each component, in both directions from nominal, should be used.
For example, maximum and minimum amplifier gains are calculated with all the components at their extreme values in the direction which would increase the gain and again with all components at their extreme values in the direction which would decrease the gain. The calculated values are then compared to the specified limits to evaluate the robustness of the circuit. If the gain is within specified limits when the components are at extreme values, part variation should be no problem in normal operation.
Figure 29. Amplifier Circuit
Example of an EVA Analysis of an Amplifier. Extreme value analysis is the easiest of the numerical techniques to apply, as shown here for the amplifier circuit of Figure 29.
Given:
• Resistor tolerance = 5%; $R_I$ = 1 kΩ; $R_F$ = 10 kΩ
• Gain = $V_{out} / V_{in}$ = $R_F / R_I$
• Required gain = 9.0 or greater
Under worst case conditions:
• Minimum Gain = $R_F$ (min) / $R_I$ (max) = [$R_F$ - (.05 $R_F$)] / [$R_I$ + (.05 $R_I$)] = (10,000 - 500) / (1,000 + 50) = 9,500 / 1,050 = 9.05
• Maximum Gain = $R_F$ (max) / $R_I$ (min) = [$R_F$ + (.05 $R_F$)] / [$R_I$ - (.05 $R_I$)] = (10,000 + 500) / (1,000 - 50) = 10,500 / 950 = 11.05
Based on the calculations, the worst case minimum gain still meets the requirement (i.e., 9.05 is greater than 9.0).
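The extreme value calculation can be written as a short Python sketch. The function name `eva_gain` is an illustrative assumption; the component values and the 5% tolerance are those of the example.

```python
def eva_gain(rf_nominal, ri_nominal, tolerance):
    """Return (minimum, maximum) gain R_F / R_I with both resistors at their extremes."""
    gain_min = rf_nominal * (1 - tolerance) / (ri_nominal * (1 + tolerance))
    gain_max = rf_nominal * (1 + tolerance) / (ri_nominal * (1 - tolerance))
    return gain_min, gain_max


REQUIRED_GAIN = 9.0
gain_min, gain_max = eva_gain(rf_nominal=10_000, ri_nominal=1_000, tolerance=0.05)
meets_requirement = gain_min >= REQUIRED_GAIN

print(f"Minimum gain: {gain_min:.2f}")
print(f"Maximum gain: {gain_max:.2f}")
print(f"Meets required gain of {REQUIRED_GAIN}: {meets_requirement}")
```

Changing the tolerance argument shows how quickly the margin disappears: the minimum gain in this example exceeds the requirement by only about 0.05.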
Root Sum Squared. Root Sum Squared (RSS) analysis recognizes that it is rare for all parameters of a part to simultaneously drift to extreme values. While some variation is biased in a single direction, other changes vary randomly in direction, sometimes helping to compensate for bias variations and sometimes adding to the bias. For example, the initial value of a capacitor will likely vary in a manner described by a normal curve whose mean is the nominal value. The extreme values of this distribution are ordinarily taken as the values at plus and minus three standard deviations from the mean value (the points between which 99.7% of the values will lie). In RSS analysis, the extreme value of each random variation is squared, the resulting values added, and the square root taken of the total. The resulting value is the maximum variation expected due to random factors. This is added to the bias variations to calculate the maximum and minimum worst cases. The process is illustrated in Table 34.
Table 34. Root Sum Squared Analysis of a Capacitor
| Parameters: Capacitance | Bias (%) Neg. | Bias (%) Pos. | Random (%) |
|---|---|---|---|
| Initial Tolerance at 25°C | -- | -- | 20 |
| Low Temperature (-20°C) | 28 | -- | -- |
| High Temperature (+80°C) | -- | 17 | -- |
| Other Environments (Hard Vacuum) | 20 | -- | -- |
| Radiation (10KR, $10^{13}$ N/cm²) | -- | 12 | -- |
| Aging | -- | -- | 10 |
| TOTAL VARIATION | 48 | 29 | $\sqrt{(20)^2 + (10)^2} = 22.4$ |
The worst case minimum value of capacitance would be the nominal value minus the negative bias variations, minus the random variation, or:
| Worst case minimum | = | Nominal (1 - bias variation - random variation) |
| | = | Nominal (1 - .48 - .224) = Nominal (1 - .704) |

The worst case maximum value would be the nominal value plus the positive bias variation, plus the random variation, or:
| Worst case maximum | = | Nominal (1 + bias variation + random variation) |
| | = | Nominal (1 + .29 + .224) = Nominal (1 + .514) |
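The Table 34 calculation can be sketched in Python as follows. The list names are illustrative; the percentages are those tabulated for the capacitor.

```python
import math

# Percent variations from Table 34.
NEGATIVE_BIAS = [28, 20]   # low temperature, hard vacuum
POSITIVE_BIAS = [17, 12]   # high temperature, radiation
RANDOM = [20, 10]          # initial tolerance, aging

# Random variations combine as the root of the sum of squares.
random_rss = math.sqrt(sum(x ** 2 for x in RANDOM))

# Bias variations add directly; results are multipliers on the nominal value.
worst_min = 1 - sum(NEGATIVE_BIAS) / 100 - random_rss / 100
worst_max = 1 + sum(POSITIVE_BIAS) / 100 + random_rss / 100

print(f"Random (RSS) variation: {random_rss:.1f}%")
print(f"Worst case minimum: {worst_min:.3f} x nominal")
print(f"Worst case maximum: {worst_max:.3f} x nominal")
```

Adding the two random terms directly would give 30% instead of 22.4%, which is the difference between an extreme value treatment and the RSS treatment of the random contributions.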
Monte Carlo. Monte Carlo analysis requires a probability density function for all variations in parameters. Through random selection, values are assigned to each part in the circuit and the output parameter computed. This is repeated many times and the distribution of the results represents the expected distribution of circuits in the field.
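As a hedged illustration, the Python sketch below applies a Monte Carlo analysis to the gain of the Figure 29 amplifier. It assumes each resistor value is normally distributed about its nominal value with the 5% tolerance taken as three standard deviations; that distribution, the trial count and the random seed are assumptions made here, since no probability density functions are given for the circuit.

```python
import random


def monte_carlo_gain(n_trials=10_000, rf_nominal=10_000.0, ri_nominal=1_000.0,
                     tolerance=0.05, seed=1):
    """Sample resistor values and return the list of resulting gains R_F / R_I."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    sigma = tolerance / 3.0     # assumption: tolerance limit = 3 standard deviations
    gains = []
    for _ in range(n_trials):
        rf = rf_nominal * (1 + rng.gauss(0, sigma))
        ri = ri_nominal * (1 + rng.gauss(0, sigma))
        gains.append(rf / ri)
    return gains


gains = monte_carlo_gain()
mean_gain = sum(gains) / len(gains)
fraction_ok = sum(g >= 9.0 for g in gains) / len(gains)

print(f"Mean gain: {mean_gain:.3f}")
print(f"Lowest / highest sampled gain: {min(gains):.3f} / {max(gains):.3f}")
print(f"Fraction of circuits with gain >= 9.0: {fraction_ok:.4f}")
```

Under these assumptions nearly all simulated circuits meet the required gain, and the spread of sampled gains is typically narrower than the extreme value limits because both resistors rarely sit at opposite extremes at the same time.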
Factors to be Evaluated. In the process of performing a WCCA analysis, each component type has associated parameters which exhibit sensitivity to stress conditions and contribute to overstressed component conditions. Table 35 shows some common component parameters that should be evaluated as part of a thorough WCCA.
Table 35. Typical Component Factors to be Evaluated
| Component Type | Factors to be Evaluated |
|---|---|
| Integrated Circuits (Linear/Digital) | Power Dissipation; Applied Voltage (VCC); Common Mode Voltage; Loading; Fan-In/Fan-Out; Differential Input Voltage; Min/Max Input Voltage |
| Transistors | Applied Voltage (Vce, Vbe); Base/Collector Current; Power Dissipation; Forward/Reverse Bias |
| Magnetic Components | Max Induction Levels (Saturation)/Losses; Reset Conditions/Drive Imbalance; Winding-to-Winding Voltages; "Hot Spot" Temperature |
3.17 Test Strategy
3.17.1 Purpose. A test strategy is the plan for performing tests that add value to a particular product or system for the customer. The test strategy will identify which tests are appropriate, and at what level, to reflect a realistic assessment of the reliability of the product.
3.17.2 Benefits. A test strategy is intended to verify the achievement of product goals, determine shortcomings needing corrective action, and identify opportunities for improvement. A product specific test strategy is needed to assess design progress and adequacy. Part of the test strategy should include judicious application of value-added and cost effective tests to quantitatively assess design decisions and changes on long-term product reliability. Program budgets and schedules cannot ignore test costs. Hence, a test strategy must be an integral part of program planning and management.
3.17.3 Timing. Initial program planning during the product Concept/Planning phase must include a test strategy, particularly for those elements of test that will support the product design. As the program progresses, changes in the program (e.g. a decision to develop an item rather than buy it off-the-shelf) should be assessed for necessary changes in the test strategy. A test strategy, then, is needed at the start of a project, and may be subject to change as the product evolves through Design/Development. Every program review should include a conscious decision to retain or revise the test strategy.
3.17.4 Application Guidelines. The matrix of Table 36 relates program and product circumstances to their expected impact on the test techniques applied to the product during the early design stages. A "plus" sign (+) indicates that the test offers value to the program under that circumstance. A "minus" sign (-) means that the test is probably not cost effective for that circumstance. A "question mark" (?) indicates that the test may or may not add value for that circumstance, depending on the product. The circumstances considered are New Development (i.e., a product to be designed and built for the first time), COTS (an item available as a Commercial-Off-the-Shelf product), Safety Critical (e.g., a nuclear plant control system), Dormancy (i.e., an item to be subjected to long periods in storage or otherwise unpowered), Long Life (i.e., an item likely to be in service for a relatively long time, like a commercial passenger aircraft), Harsh Environment (e.g., high shock, rapid thermal cycling, etc.), and S/W (Software) development. To assess reliability progress, the selection process for testing has to consider the risk of failure and criticality of the technology. Application should be limited to extraordinary circumstances.
Table 36. Test Techniques for Assessing Reliability
| Reliability Test Technique (rows) / Program/Product Circumstances (columns) | New Dev. | COTS | Safety Critical | Dormancy | Long Life | Harsh Env. | S/W Dev. |
|---|---|---|---|---|---|---|---|
| Accelerated Life Tests | ? | ? | + | ? | + | - | - |
| Design of Experiment | + | - | + | - | - | + | - |
| Growth Test | + | - | + | ? | ? | + | + |
| Test Analyze & Fix | + | ? | + | ? | ? | + | + |
Example of Test Strategy for Assessment. A new communication project for an unsheltered operating environment is under development and is utilizing off-the-shelf components with some new technology. A new high density power supply is being evaluated for use. What test strategy is appropriate?
One possible test strategy for assessing reliability progress could include accelerated testing to determine the reliability impact of using "new" technology parts. Also, a test, analyze and fix period could be used for longer term assessment of the power supply, including correction of demonstrated deficiencies in the design that would limit its inherent reliability.
3.18 Accelerated Life Testing
3.18.1 Purpose. The purpose of accelerated life testing is to determine or verify product performance in an expedient manner by using a variety of high environmental or electrical stress levels, singly or in combination. The expected life span can then be determined in a shortened test time. Assessment of these data can result in definitive selection and application procedures for critical components. Design changes that result in the selection of new components or the redesign of existing assemblies would also identify potential candidates for accelerated life assessment.
3.18.2 Benefits. The major reason for using accelerated life testing is to reduce product test time, resulting in schedule and cost benefits. This type of testing often identifies design and manufacturing deficiencies, exposes dominant failure mechanisms and quantifies the relationship between stress and performance. This testing is effective on parts, components, or assemblies in identifying failure mechanisms and life limiting critical components.
3.18.3 Timing. Accelerated testing can be performed at any phase of a product development, provided the hardware is available. The Concept/Planning phase is the best time for accelerated testing as alternate design concepts, part types and material technologies can be considered before design or manufacturing processes are solidified. As an assessment tool, testing of design changes or alternate procedures can be performed to ensure that the customers' reliability needs will continue to be met.
3.18.4 Application Guidelines. There are many accelerated test approaches, some targeted to very specific technologies, others developed for broader applications. Constant stress testing is commonly defined by one or more stress factors, such as temperature, vibration, voltage, humidity, etc., at specific stress levels. The stress levels are predetermined and are usually well above the normal operating conditions for the product. The test items are divided into groups, one group for each stress level. For example, if voltage was the stress parameter for a capacitor, two or more groups could be tested at 110% and 120% rated voltage and the results extrapolated to the operating voltage. The test groups are operated under the defined stress condition for a predetermined time, usually governed by the program budget. Step stress testing differs in that the test items are exposed to progressively higher stress levels in a sequential manner.
The test program starts near the upper limit of the normal operating environment with all units tested together at the same stress level. After a planned interval of time, the stress is increased to the next level. The stepping procedure is continued until all test units have failed or the planned number of steps has been performed.
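The constant stress extrapolation described above for the capacitor can be sketched numerically. The example below is only an illustration: it assumes the inverse power law relationship (life proportional to 1/stress^n), and the two median lives are hypothetical numbers, not test data from this Blueprint.

```python
import math


def inverse_power_exponent(stress_a, life_a, stress_b, life_b):
    """Exponent n of the assumed relation life = C / stress**n,
    solved from two (stress, life) observations."""
    return math.log(life_a / life_b) / math.log(stress_b / stress_a)


def extrapolate_life(stress_ref, life_ref, stress_target, n):
    """Project life at stress_target from a reference point on the same line."""
    return life_ref * (stress_ref / stress_target) ** n


# Hypothetical median lives (hours) for groups tested at 110% and 120%
# of rated voltage; stress is expressed as a fraction of rated voltage.
n = inverse_power_exponent(1.10, 2000.0, 1.20, 1000.0)
life_at_rated = extrapolate_life(1.10, 2000.0, 1.00, n)

print(f"exponent n = {n:.2f}")
print(f"projected median life at rated voltage = {life_at_rated:.0f} hours")
```

With only two stress levels the line is determined exactly and gives no indication of fit quality; testing a third level would allow the assumed relationship to be checked.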
A typical accelerated test program for assessment would include:
- Planning. The planning aspect of accelerated testing is very important in determining what parts or higher level assemblies to consider, what environmental conditions to apply and what electrical stress levels to use. Some factors that should be kept in mind during the test planning are:
- Test units to be assessed must be identical to those considered for the final product
- Only one accelerating stress should be applied; other factors should be held constant
- Stress levels must be defined such that the precipitated failure modes are the same as those that would occur under normal operating conditions
- Accelerating stress levels should not exceed maximum component design limits
- Designing. In order to develop accurate and legitimate accelerated test models, the stress levels must be near or overlap the normal operating range. By having overlapping envelopes, extrapolations of test reliability results can be performed using empirical stress models, as opposed to theoretical models. An example of overlapping is shown in Figure 30.
Figure 30. Overlapping Stress Environments
- Modeling. A number of models can be considered in evaluating accelerated testing results. Some of the more widely used models are:
- Arrhenius Model - Used for electronics, this model predicts exponential increases in the rate of a given reaction with temperature.
- Eyring Model - This model also determines the relationship of temperature as the accelerating parameter for an exponential life distribution.
- Inverse Power Law Model - Used for non-thermal accelerating stresses, where the underlying life distribution is Weibull.
- Analyzing. Table 37 illustrates two different methods for analyzing the results of accelerated tests.
Table 37. Two Methods for Analyzing Accelerated Test Data
| Method | Characteristics | Steps |
| Probability Plot | Operational performance (e.g., time before failure) of nearly all electronic and electromechanical products can be described by either the lognormal or Weibull probability density functions (pdf). The pdf describes how the percentage of failures is distributed as a function of operating time. | (1) Rank the failure times from first to last for each level of test stress (non-failed test unit times are at the end of the list). (2) For each failure time, rank i, calculate its plotting position: $P = 100\,(i - 0.5)/n$, where n = total number of items on test at that level. (3) Plot P versus the failure time for each failure at each stress level on appropriate graph paper (i.e., lognormal or Weibull). (4) Visually plot lines through each set (level of stress) of points. Lines should be plotted in parallel, giving the heaviest weight to the data set with the most failures. (5) If lines do not plot reasonably parallel, investigate failure modes. |
| Relationship Plot | Constructed on an axis that describes unit performance as a function of stress. Two of the most commonly assumed relationships are inverse power and Arrhenius. | (1) On a representative scaled graph (e.g., Arrhenius paper), plot the 50% points determined from the probability plot for each test stress. (2) Plot a single line through the 50% points, projecting beyond the upper and lower points. (3) Locate the intersection of the plotted line and the normal stress value. This point, read from the time axis, represents the time at which 50% of the units will fail while operating under normal conditions. (4) Plot the time determined in Step 3 on the probability plot, drawing a line through this point parallel to the one previously drawn. (5) The resulting line represents the distribution of failures as they occur at normal levels of stress. |
Example of Probability and Relationship Plots. The Arrhenius model describes the effect of temperature on a given electronic failure mechanism. For semiconductor devices, this model is widely used because of its simplicity and reasonable accuracy. Figure 31 illustrates the accelerating effect temperature has on the reaction rate for two activation energy conditions, 0.9 and 0.4 electron volts. The goal is to find the improvement factor for reliability if the junction temperature of a semiconductor is lowered from 95°C to 75°C, given two failure mechanisms: electromigration (activation energy of 0.9 electron volts) and bond fatigue (activation energy of 0.3 electron volts). From the figure, the temperature acceleration portion of the device failure rate is located for the electromigration failure mechanism, and a 9 times improvement (90 to 10) is indicated. If the bond fatigue failure mechanism is considered, the improvement factor is only 1.25 (1.0 to 0.8).
Figure 31. Temperature Influence on Reliability
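As a rough numerical companion to the graphical reading, the textbook Arrhenius rate ratio between two temperatures can be computed directly. This is a sketch under stated assumptions: it uses the standard form exp[(Ea/k)(1/T_low - 1/T_high)] with temperatures in kelvin, and the function name is illustrative. Values read graphically from Figure 31 depend on how that chart is constructed and scaled, so the computed ratios are not expected to reproduce the chart readings.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_factor(activation_energy_ev, t_low_c, t_high_c):
    """Ratio of reaction rate at t_high_c to the rate at t_low_c
    (temperatures in degrees Celsius) under the Arrhenius model."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_low_k - 1.0 / t_high_k
    )
    return math.exp(exponent)


# Activation energies quoted in the example text; 95 C lowered to 75 C.
for mechanism, ea in (("electromigration", 0.9), ("bond fatigue", 0.3)):
    factor = arrhenius_factor(ea, 75.0, 95.0)
    print(f"{mechanism}: rate ratio 95 C / 75 C = {factor:.2f}")
```

The sketch shows the qualitative point of the example: the higher the activation energy of a failure mechanism, the more a given temperature reduction improves its failure rate.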
Example of a Graphical Analysis. The database to be analyzed by graphical methods is a ten unit test, with the results as indicated in Table 38.
Table 38. Test Data Collected
| Time to Failure (Hours) |
Rank (i) |
P |
| 575 |
1 |
5 |
| 695 |
2 |
15 |
| 872 |
3 |
25 |
| 1250 |
4 |
35 |
| 1291 |
5 |
45 |
| 1402 |
6 |
55 |
| 1404 |
7 |
65 |
| 1713 |
8 |
75 |
| 1741 |
9 |
85 |
| 1893 |
10 |
95 |
The data points are plotted on Figure 32 and the analysis indicates that the mean is about 1,000 hours. A 90% confidence interval indicates that most units will fail before 2,000 hours and very few will fail before 600 hours.
Figure 32. Lognormal Plot of Test Results
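The plotting positions of Table 38 can be reproduced in a few lines, and a least-squares line can stand in for the line drawn by eye on lognormal paper. This is a minimal sketch, assuming all ten units failed (no censored times) and a lognormal life distribution; the function names are illustrative, and a numerical fit will generally differ somewhat from values read off a hand-drawn plot.

```python
import math
from statistics import NormalDist

# Times to failure (hours) from Table 38.
failure_hours = [575, 695, 872, 1250, 1291, 1402, 1404, 1713, 1741, 1893]


def plotting_positions(n):
    """P = 100 (i - 0.5) / n for ranks i = 1..n."""
    return [100.0 * (i - 0.5) / n for i in range(1, n + 1)]


def fit_lognormal(times):
    """Least-squares line of ln(time) against the standard normal quantile
    of each plotting position. Returns (median_hours, sigma)."""
    ordered = sorted(times)
    z = [NormalDist().inv_cdf(p / 100.0) for p in plotting_positions(len(ordered))]
    y = [math.log(t) for t in ordered]
    z_mean = sum(z) / len(z)
    y_mean = sum(y) / len(y)
    slope = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y)) / sum(
        (zi - z_mean) ** 2 for zi in z
    )
    intercept = y_mean - slope * z_mean
    # On lognormal paper the 50% point is exp(intercept); the slope is sigma.
    return math.exp(intercept), slope


positions = plotting_positions(len(failure_hours))
median_hours, sigma = fit_lognormal(failure_hours)
print("plotting positions:", positions)
print(f"fitted median = {median_hours:.0f} hours, sigma = {sigma:.2f}")
```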
3.19 Reliability Growth Testing (RGT)/Test, Analyze and Fix (TAAF)
3.19.1 Purpose. A test conducted specifically to monitor improvements in reliability by finding and fixing deficiencies is called a reliability growth test, which has as its basis a less formal test, analyze and fix program. A growth test provides an estimate of what the current product reliability is, and can be used to assess the impact of design changes and corrective actions on the reliability growth rate of the product.
3.19.2 Benefits. RGT/TAAF can be used to prevent reliability problems on new products, and to improve existing products with inadequate reliability. Dedicated reliability growth tests can prevent the delivery of unsatisfactory products to the customer, avoiding repair/replacement costs and customer dissatisfaction.
3.19.3 Timing. Growth tests require prototype samples to test and time to formulate and implement changes based on the test results, so they should be considered in the latter stages of Design/Development. This testing should precede any qualification tests, which, if performed, should serve to demonstrate that the growth program was satisfactory. Many manufacturers perform growth testing in lieu of demonstration testing, letting the measurements from the growth test provide assurance that adequate levels of reliability have been achieved.
3.19.4 Application Guidelines. As an assessment tool, RGT/TAAF should be used when technology or risk of failure is critical to the success of the product. The question of how long a growth test must run to meet a desired reliability goal is addressed by reliability growth theories. The two most widely used methodologies are the Duane and the AMSAA growth models.
Duane Model. The first theory of reliability growth was developed by James T. Duane, who noted that the reliability of products in development tests, as measured by failure rate, plotted as a straight line against cumulative test time (the total test time obtained by adding the time on all units) on log-log paper. The characteristics of the cumulative and instantaneous failure rates of the Duane model are presented in Table 39.
Table 39. Duane Model - Cumulative and Instantaneous Failure Rates
| | Characteristics | General Form | Example |
| Cumulative Failure Rate | Includes effects of all failures, including those whose root cause has been eliminated through corrective action implementation and verification. Pessimistic indicator of the current product failure rate. | $\lambda_{cum} = K\,T^{-\alpha}$, where $\alpha$ = growth rate, $K$ = initial failure rate, $T$ = test time | Assume the initial failure rate (K) is 0.01 failures per hour, the growth rate (α) is equal to 0.5, and the elapsed test time (T) is 1,000 hours. The cumulative failure rate at 1,000 hours is: $\lambda_{cum} = (0.01)(1000)^{-0.5} \approx (0.01)(0.03) = 0.0003$ failures per hour |
| Instantaneous Failure Rate | Represents the failure rate expected at a particular time. Defined as the rate of change of the number of failures as a function of time. | $\lambda_{inst} = K\,(1 - \alpha)\,T^{-\alpha}$, where $\alpha$ = growth rate, $K$ = initial failure rate, $T$ = test time | Assume the initial failure rate (K) is 0.01 failures per hour, the growth rate (α) is equal to 0.5, and the elapsed test time (T) is 1,000 hours. The instantaneous failure rate at 1,000 hours is: $\lambda_{inst} = (0.01)(1 - 0.5)(1000)^{-0.5} \approx (0.01)(0.5)(0.03) = 0.00015$ failures per hour |
Instantaneous failure rate plotted against cumulative test time is also a straight line on log-log paper, parallel to the cumulative failure rate plot. An example is shown in Figure 33.
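The two Duane expressions of Table 39 are simple enough to evaluate directly. The sketch below uses the table's example inputs; the function names are illustrative. Because the table rounds 1000^-0.5 to 0.03, the unrounded results are slightly larger than the tabulated 0.0003 and 0.00015.

```python
def duane_cumulative_rate(k, alpha, t):
    """Cumulative failure rate: K * T**(-alpha)."""
    return k * t ** (-alpha)


def duane_instantaneous_rate(k, alpha, t):
    """Instantaneous failure rate: K * (1 - alpha) * T**(-alpha)."""
    return k * (1.0 - alpha) * t ** (-alpha)


# Inputs from the Table 39 example.
k, alpha, t = 0.01, 0.5, 1000.0
cum = duane_cumulative_rate(k, alpha, t)
inst = duane_instantaneous_rate(k, alpha, t)
print(f"cumulative    = {cum:.6f} failures per hour")
print(f"instantaneous = {inst:.6f} failures per hour")
```

The ratio of the two rates is always (1 - α), which is why the two lines are parallel on log-log paper.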
To predict how long a growth test must run to achieve a desired failure rate (or MTBF), the plot of the instantaneous failure rate (or MTBF) can be extended until it intersects the desired value, with the corresponding cumulative test time read from the x-axis. Alternatively, the data points can be fitted to a straight line and the intersection point calculated. The equations required to apply this methodology are shown in Table 40. The three equations define the least-squares line through the data points.
Figure 33. Example Duane Growth Plot
Table 40. Equations for Calculating Duane Growth Parameters of Reliability
| $Y = C_1 + C_2 X$ | Equation for a straight line, where Y = log of the cumulative failure rate, $C_1$ = log of K (initial failure rate), $C_2$ = $-\alpha$ (slope of line), X = log of the cumulative test time |
| $C_2 = \dfrac{\sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)/n}{\sum X_i^2 - \left(\sum X_i\right)^2/n}$ | Equation to compute slope, where $X_i$ = log of individual failure time, $Y_i$ = log of cumulative failure rate at the $X_i$ failure time, n = number of recorded failures |
| $C_1 = \bar{Y} - C_2 \bar{X}$ | Equation to compute intercept, where $\bar{Y}$ = mean value of $Y_i$, $\bar{X}$ = mean value of $X_i$ |
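The straight-line fit described by Table 40 can be sketched as an ordinary least-squares regression of log cumulative failure rate on log cumulative test time. The failure times below are hypothetical, constructed to follow K = 0.1 and α = 0.5 exactly so the fit can be checked; the function names are illustrative, and base-10 logs are an arbitrary choice.

```python
import math


def fit_duane(failure_times):
    """failure_times: cumulative test time at each failure, ascending.
    Returns (K, alpha) from a least-squares line in log-log coordinates."""
    xs = [math.log10(t) for t in failure_times]
    # Cumulative failure rate at the i-th failure is i / (cumulative time).
    ys = [math.log10((i + 1) / t) for i, t in enumerate(failure_times)]
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    c2 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)  # slope
    c1 = sum_y / n - c2 * sum_x / n                                # intercept
    return 10.0 ** c1, -c2


def time_to_reach(k, alpha, target_inst_rate):
    """Cumulative test time at which K (1 - alpha) T**(-alpha) equals the target."""
    return (k * (1.0 - alpha) / target_inst_rate) ** (1.0 / alpha)


# Hypothetical data: j-th failure at 100 * j**2 hours.
failure_times = [100.0 * j * j for j in range(1, 9)]
k_fit, alpha_fit = fit_duane(failure_times)
hours_needed = time_to_reach(k_fit, alpha_fit, 0.001)
print(f"K = {k_fit:.4f}, alpha = {alpha_fit:.3f}")
print(f"test time to reach 0.001 failures per hour = {hours_needed:.0f} hours")
```

Real test data will scatter about the line, so the fitted parameters and the projected test time should be treated as estimates.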
Planning the length of a growth test before data is available requires the estimation of (K) and (α). These are best obtained from experience of the manufacturer in past growth programs. Historically, (α) has ranged upward from about 0.3, with 0.6 being a reasonable estimate of the maximum growth that could be realistically expected. The initial reliability implied by (K) has been observed to be as low as 10% of the predicted reliability, but this does not account for current technology, such as computer aided design techniques, which effectively start the growth process when the product has only a conceptual existence.
Duane plots can be made using MTBF rather than failure rate as the parameter of interest. Since MTBF = 1/(λ), the log of the reciprocal of the failure rate is used for the Y-axis, and the plot goes up with time (slope is positive).
AMSAA Growth Model. The U.S. Army Materiel Systems Analysis Activity (AMSAA) modeled growth as a non-homogeneous Poisson process with the equations given in Table 41.
Table 41. AMSAA Growth Model Characteristics
| | General Form | Example |
| Cumulative Failure Rate | $\lambda_{cum} = \lambda\,T^{\beta - 1}$, where $\beta$ = growth rate, $\lambda$ = initial failure rate, $T$ = test time | Assume the initial failure rate (λ) is 0.01 failures per hour, the growth rate (β) is equal to 0.5, and the elapsed test time is 1,000 hours. The cumulative failure rate at 1,000 hours is: $\lambda_{cum} = (0.01)(1000)^{0.5 - 1} = (0.01)(1000)^{-0.5} \approx 0.0003$ failures per hour |
| Instantaneous Failure Rate | $\lambda_{inst} = \lambda\,\beta\,T^{\beta - 1}$, where $\beta$ = growth rate, $\lambda$ = initial failure rate, $T$ = test time | Assume the initial failure rate (λ) is 0.01 failures per hour, the growth rate (β) is equal to 0.5, and the elapsed test time is 1,000 hours. The instantaneous failure rate at 1,000 hours is: $\lambda_{inst} = (0.01)(0.5)(1000)^{0.5 - 1} \approx (0.01)(0.5)(0.03) = 0.00015$ failures per hour |
| The parameters (λ) and (β) are estimated from the maximum likelihood formulas: |
| $\beta = \dfrac{N}{\sum_{i=1}^{N} \ln\left(T / X_i\right)}$ | Equation to compute slope, where N = number of recorded failures, T = total test time, $X_i$ = time at which an individual failure occurs |
| $\lambda = N / T^{\beta}$ | Equation to compute intercept, where N = number of recorded failures, T = total test time, β = computed slope |
Given these two parameters, the instantaneous failure rate equation can be used to estimate the time required to achieve a given failure rate. The AMSAA model also plots as a straight line on log-log paper for both cumulative and instantaneous failure rates.
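The estimation and projection steps can be sketched as follows. This is an illustration only: it assumes the commonly cited maximum likelihood estimates for a test stopped at a fixed total time T, namely β = N / Σ ln(T / Xi) and λ = N / T^β, and the failure times are hypothetical. The function names are not from any library.

```python
import math


def fit_amsaa(failure_times, total_time):
    """Maximum likelihood estimates (lambda, beta) for a test stopped at total_time."""
    n = len(failure_times)
    beta = n / sum(math.log(total_time / x) for x in failure_times)
    lam = n / total_time ** beta
    return lam, beta


def amsaa_instantaneous_rate(lam, beta, t):
    """Instantaneous failure rate: lambda * beta * T**(beta - 1)."""
    return lam * beta * t ** (beta - 1.0)


def time_to_reach(lam, beta, target_rate):
    """Test time at which the instantaneous rate equals target_rate.
    Meaningful only when beta < 1, i.e., when reliability is growing."""
    return (target_rate / (lam * beta)) ** (1.0 / (beta - 1.0))


# Hypothetical cumulative failure times (hours) in a 1,000 hour test.
failure_times = [35.0, 110.0, 240.0, 390.0, 610.0, 820.0]
total_time = 1000.0

lam, beta = fit_amsaa(failure_times, total_time)
current_rate = amsaa_instantaneous_rate(lam, beta, total_time)
hours_needed = time_to_reach(lam, beta, current_rate / 2.0)
print(f"lambda = {lam:.4f}, beta = {beta:.3f}")
print(f"current instantaneous rate = {current_rate:.5f} failures per hour")
print(f"test time to halve that rate = {hours_needed:.0f} hours")
```

With so few failures the estimates carry wide uncertainty, so a projection of this kind is best read as a planning aid.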
SECTION FOUR - REFERENCES
The references in Table 42 provide additional information on the subjects discussed in this Blueprint. The relationships between the reference and sections within the Blueprint are indicated in the table for each source.
Table 42. References for Assessing Reliability Progress