All data requires expert evaluation before sets from different conditions are combined. All reliability analysts strive to gather as much data on a product as possible to make as realistic an assessment of the acceptability of the product as possible. Consequently, they often want to combine predicted, test, and operating data for current, new, modified, and similar products. Problems are encountered when non-homogeneous (heterogeneous) data are combined to represent a new product. One of the important engineering tasks for improved reliability is identifying design or product defects prior to production and operation. Data analysis is one of the methods used to determine the shortcomings of the process, which is why accurate and proper data combinations are necessary. The purpose of this START sheet is to introduce some of the techniques and pitfalls of data combination.
Background
Combining similar data sets for the purpose of establishing confidence intervals, estimating or forecasting values, modeling data or establishing distributions (goodness of fit), is very appealing. For as we all know, larger data sets provide more information, allowing us to obtain better estimates and more refined values. But this is only true if the information is consistent, of good quality, and comes from similar populations.
Unfortunately, many analysts end up combining information that resembles "apples and oranges," for the data may be similar only in appearance. For example, "field data" from a particular device may be combined with its laboratory data. However, if these data are from different conditions and developed from different levels of product maturity, combining these data may be counterproductive.
Combining such data introduces additional "noise" into the data set. The extra noise increases the variance and therefore also increases the uncertainty and the width of the confidence interval. In such cases, combining the data sets is worse than analyzing them separately.
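One way to see why mixing dissimilar populations inflates the variance is the standard variance decomposition for a two-component mixture. The symbols below are introduced here only for illustration and do not appear in the original text: w_1 and w_2 = 1 - w_1 are the fractions of observations drawn from each source, with means \mu_1, \mu_2 and variances \sigma_1^2, \sigma_2^2.

```latex
\operatorname{Var}(X) \;=\; w_1\sigma_1^2 + w_2\sigma_2^2 \;+\; w_1 w_2\,(\mu_1 - \mu_2)^2
```

The last term vanishes only when the two means coincide, so any difference between the sources adds directly to the spread of the combined data.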
Consider the field-to-laboratory data problem previously mentioned. What are the environmental conditions involved? Are these data from a device that is operating in the jungle, in the Arctic, in the desert, near the sea, or during winter or summer? Such environmental conditions will undoubtedly affect the reliability and life of most devices, due to the effects of heat, corrosion, contaminants, etc.
Then, consider the problem of operational stress. Under what conditions was this device operating? Are these data from training or normal missions, or rather from stressed and unusual conditions such as combat? Certainly, these operational conditions will also affect the reliability and life of most devices.
Also consider maintenance conditions. Under peacetime conditions, preventive maintenance (PM) is performed on a regular basis and a product may be out of use until a part is obtained. Under combat conditions, a system will receive PM as close as possible to schedule, but PM may be deferred if combat so requires. And if parts are required, they may be cannibalized from other systems, because the urgent need for the system to operate overrides any other consideration at such a time.
Implementation
To correctly combine several data sets, the analyst must perform an in-depth analysis of each set under consideration. That is, using an Exploratory Data Analysis (EDA) approach (introductory data analysis via tabular, graphical, and descriptive statistics), one assesses the data characteristics. This assessment establishes whether the population appears symmetric and unimodal, or skewed. Then, prospective statistical distributions for the parent population are established and estimates of the parameters are determined.
Using the estimated parameters, we then establish theoretical and qualitative differences and similarities between environments, operational profiles, product maturity, methods of testing, and other factors. These differences and similarities are identified via confidence intervals and hypothesis tests for the parameters, such as the mean, variance, median, etc. Finally, in-depth statistical analyses on each data set (such as analysis of variance, analysis of covariance, regression modeling, goodness-of-fit tests, etc.) are performed to establish and quantify any statistical difference between the sets.
As a result of the aforementioned analyses, we combine only those data sets that do not show large statistical differences between associated distributions and their parameters, and where other similarities can be established. For example, data sets from different laboratory tests, that appear to come from the same distribution, such as a normal distribution, with the same mean and variance, when the tests are performed on similar devices in approximately equal time epochs, may be combined.
A summary of the implementation procedure is:
Perform an EDA analysis [see References 1, 7, and 9]
Perform graphical analysis [see Reference 8]
Perform goodness of fit analysis [see Reference 11]
Perform analysis of variance [see Reference 4]
Perform regression analysis [see References 2 and 3]
Quantify statistical differences [see References 5, 6, and 10]
In what follows, we present several numerical examples of the described analysis methodology, including the types of statistical procedures used to assess whether data sets are similar and can be combined, or are different and cannot be combined. Then, we discuss how different organizations have combined data in the past, and the consequences of such data combining.
Similar Sets: Example on Combining Data
The first example is the analysis of a small data set (denoted Ex3.dat) taken from the program RECIPE User's Guide [Reference 5]. The data consist of 11 tensile strength measurements, taken at two different fixed levels of temperature (75°F and -67°F) and from the same production batch. In order to determine if these two subgroups differ, and must be analyzed separately, or whether they are similar, and can be pooled together, an exploratory data analysis (EDA) must be performed. If the data can be combined as a single set, the analyst will have more information and can thus obtain better results (such as tighter confidence intervals).
The eleven values of this data set are shown in Table 1.
Table 1. Tensile Strength Observations (Two Sets)
Row   Temp (°F)   Strength
1     75          328.117
2     75          334.767
3     75          347.783
4     75          346.266
5     75          338.731
6     75          340.815
7     -67         343.586
8     -67         334.175
9     -67         348.661
10    -67         356.323
11    -67         344.152
The eleven tensile strength values constitute two well-defined subgroups based on different temperatures. The analysis starts by obtaining the descriptive statistics by subgroups, as well as for the pooled data set. The values in Table 2 are determined.
Table 2. Analysis Statistics
Set (Temp)   N    Mean     Median   StDev   Min      Max      Q1       Q3
ex3 (75)     6    339.41   339.77   7.33    328.12   347.78   333.10   346.65
ex3 (-67)    5    345.38   344.15   8.07    334.17   356.32   338.88   352.49
ex3 (tot)    11   342.13   343.59   7.92    328.12   356.32   334.77   347.78
where temperature is in degrees Fahrenheit, N is the sample size, STDEV is the standard deviation, MIN and MAX are the minimum and maximum values in the sample, and Q1 and Q3 are the first and third quartiles (signaling the 25th and 75th sample percentiles).
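The descriptive statistics above can be reproduced with a short script. The following is a minimal sketch using only the Python standard library; the variable and function names are illustrative, and the quartile convention (`method="exclusive"`, which interpolates at position p(n + 1)) is an assumption about how the tabulated quartiles were computed.

```python
import statistics

# Tensile strength observations transcribed from Table 1, keyed by temperature (°F).
groups = {
    "75": [328.117, 334.767, 347.783, 346.266, 338.731, 340.815],
    "-67": [343.586, 334.175, 348.661, 356.323, 344.152],
}
groups["pooled"] = groups["75"] + groups["-67"]


def describe(values):
    """Return the summary statistics reported in Table 2 for one sample."""
    # Assumption: quartiles interpolate at position p * (n + 1).
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
    }


summary = {name: describe(values) for name, values in groups.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Computing the subgroup and pooled summaries side by side makes it easy to see whether pooling shifts the center or widens the spread.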
Proceeding with the EDA, a Box and Whisker plot of the data, by temperature and with combined data, is developed and presented in Figure 1.
Figure 1. Boxplot of the Data
These box plots suggest that there may be a difference between the group measurements, when subdivided by temperature. To verify this, we obtain the scatter plot of the data, which is shown in Figure 2.
Figure 2. Scatter Plot of the Data
The scatter plot also suggests that tensile strength means may differ, when considered by temperature. However, variation within the two groups (variance) seems homogeneous. These two issues need to be investigated further, analytically. This investigation is performed by assessing whether or not the underlying populations follow a normal distribution.
Using the Anderson-Darling (AD) GoF test [Reference 11], we analyze both subgroups (and the combined data) to determine whether each can be assumed normal. If they can, and if they also share the same mean and variance, then the two sets can be combined and analyzed as a single group. Results of these analyses are presented in Figures 3, 4, and 5.
Figure 3. GoF Test for 1st Group
Figure 4. GoF Test for 2nd Group
Figure 5. GoF Test for Combined Group
The p-values obtained from the AD GoF tests for the two temperature groups are both greater than α = 0.05. Therefore, both temperature groups can be assumed normally distributed. Let's verify that their combination can also be assumed normally distributed. The results obtained from the AD Goodness of Fit test are summarized as follows:
Data Set                 p-value   Decision
75°F (First Group)       0.83      normality assumed
-67°F (Second Group)     0.70      normality assumed
Combined Temp data set   0.93      normality assumed
Since the p-value is higher than α = 0.05, we cannot reject normality for the combined data set.
Next, the two small normal samples are compared. First, the Fisher F-test is used to compare the two variances, and then the Student t-test is used to compare the means. If neither test rejects its hypothesis of equality (i.e., the means and variances of the two groups are equal), then the data can be pooled, and we can safely assume that both samples come from normal populations with the same mean and the same variance. On the other hand, if the data fail either of these two tests, the groups must be analyzed separately.
First, test the hypothesis that both temperature group variances are equal ($\sigma_1 = \sigma_2$). This is done with the F test statistic, which is defined as the ratio of the two sample variances. The two sample variances, $S_1^2 = (7.33)^2 = 53.7289$ and $S_2^2 = (8.07)^2 = 65.1249$, are obtained from the table of descriptive statistics. For practical reasons, we always place the larger variance in the numerator:
$$F = S_2^2 / S_1^2 = 65.12 / 53.73 = 1.21$$
Fisher's test requires the upper critical value F(n1 = 5, n2 = 6) = 5.99 (where n1 and n2 are the corresponding sample sizes, which determine the F test degrees of freedom). Critical values can be found in any standard F-table. The F test statistic obtained above (F = 1.21) is smaller than the upper critical value (5.99). Hence, we cannot reject the assumption of equal variances at a significance level (or test error) of α = 0.05. Therefore, we consider that both temperature groups have the same variance.
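The variance comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, the critical value 5.99 is simply copied from the text rather than recomputed, and the degrees of freedom returned follow the usual n - 1 convention for looking up an F-table entry.

```python
import statistics

# Tensile strength observations from Table 1.
group_75 = [328.117, 334.767, 347.783, 346.266, 338.731, 340.815]
group_m67 = [343.586, 334.175, 348.661, 356.323, 344.152]

# Upper critical value quoted in the text; it is not recomputed here.
F_CRITICAL = 5.99


def variance_ratio(sample_a, sample_b):
    """Return (F, df_num, df_den) with the larger sample variance on top."""
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 divisor)
    var_b = statistics.variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


f_stat, df_num, df_den = variance_ratio(group_75, group_m67)
equal_variances_plausible = f_stat < F_CRITICAL
print(f"F = {f_stat:.2f} with ({df_num}, {df_den}) degrees of freedom")
print("Equal variances plausible:", equal_variances_plausible)
```

Placing the larger variance in the numerator means only the upper tail of the F distribution has to be consulted.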
Next, the Student t-test is used to assess the equality of the two group means (i.e., μ1 = μ2). This test is applicable when both samples are small. In our example, the samples come from independent and normal populations having equal variances, the two sample averages (or means) are $\bar{x} = 339.41$ and $\bar{y} = 345.38$, the sample sizes are n1 = 6 and n2 = 5, and the "pooled" sample standard deviation ($S_{pool}$) is 7.67.
Therefore, the Student t-test statistic is as follows:
$$t = \frac{\bar{x} - \bar{y}}{S_{pool}\sqrt{1/n_1 + 1/n_2}} = \frac{339.41 - 345.38}{7.67\sqrt{1/6 + 1/5}} = -1.29$$
The two-sample Student t-test for the comparison of the two temperature means thus yields a statistic value of t = -1.29. Such a t-test result has a p-value (probability of erroneous rejection) of 0.23, larger than the error level α = 0.05.
Since the test p-value is larger than α = 0.05, we cannot reject the null hypothesis that the group means are equal (μ1 = μ2). Since the two temperature groups are normally distributed, with the same mean and variance, we can safely assume that both samples come from the same normal population and can be pooled. Any performance measure of interest can now be obtained, assuming that the pooled data come from the normal population with parameters μ = 342.13 and σ = 7.92.
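The comparison of means can be sketched the same way. The code below is a minimal illustration, not the original analysis script: the function name is hypothetical, and the critical value 2.262 (two-sided 5% point of Student's t with 9 degrees of freedom) is taken from a standard t-table as an assumption instead of computing the p-value. Because it works from the unrounded observations, the statistic may differ from the reported value in the last digit.

```python
import math
import statistics

# Tensile strength observations from Table 1.
group_75 = [328.117, 334.767, 347.783, 346.266, 338.731, 340.815]
group_m67 = [343.586, 334.175, 348.661, 356.323, 344.152]

# Assumption: two-sided 5% critical value of Student's t with 9 degrees of
# freedom, taken from a standard t-table.
T_CRITICAL = 2.262


def pooled_t(sample_x, sample_y):
    """Return (t, pooled standard deviation, degrees of freedom)."""
    n_x, n_y = len(sample_x), len(sample_y)
    df = n_x + n_y - 2
    pooled_var = (
        (n_x - 1) * statistics.variance(sample_x)
        + (n_y - 1) * statistics.variance(sample_y)
    ) / df
    s_pool = math.sqrt(pooled_var)
    diff = statistics.mean(sample_x) - statistics.mean(sample_y)
    t_stat = diff / (s_pool * math.sqrt(1 / n_x + 1 / n_y))
    return t_stat, s_pool, df


t_stat, s_pool, df = pooled_t(group_75, group_m67)
means_differ = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f}, pooled SD = {s_pool:.2f}, df = {df}")
print("Reject equality of means:", means_differ)
```

Comparing |t| against a tabulated critical value leads to the same decision as comparing the p-value against α.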
Dissimilar Sets: Example on Not Combining Data
Data set Ex5.dat is taken from the RECIPE Statistical Software Program User's Guide [Reference 5] (and discussed in Section 8.3.7.9 of [Reference 6]). The data set consists of 15 tensile strength observations from five sequentially produced batches from two different manufacturers, as shown in Table 3.
Table 3. Tensile Strength (Two Sets, Five Subsets)
Row   Manufacturer   Batch   Strength
1     1              1       75.8
2     1              1       78.4
3     1              1       82.0
4     1              2       68.8
5     1              2       70.9
6     1              2       73.5
7     1              3       74.5
8     1              3       74.8
9     1              3       78.8
10    2              4       81.3
11    2              4       87.7
12    2              4       89.0
13    2              5       88.2
14    2              5       91.2
15    2              5       94.2
The first step is to determine whether the Ex5.dat set is homogeneous (same kind), so that we can use all of it to obtain measures of central tendency, dispersion, and other parameters that characterize it. The underlying distribution and its parameters need to be assessed to ascertain whether there are outliers in the combined data set.
If data sets are homogeneous, we can and want to combine them. If data are heterogeneous (that is, vary by manufacturer, by batch, or by both), then combination is ill-advised. If such variation exists, we also want to know the reasons for this variation (i.e., if they are caused by a time trend or by some factor such as the manufacturer). This additional information can be used to validate or forecast a tensile value. The descriptive statistics for the data are shown in Table 4.
Table 4. Descriptive Statistics for Strength Data
N    Mean    Median   STDEV   Min     Max     Q1      Q3
15   80.61   78.80    7.860   68.80   94.20   74.50   88.20
Note: N is the sample size of the data set, STDEV is the standard deviation, Min and Max are the minimum and maximum values in the sample, and Q1 and Q3 are the first and third quartiles (signaling the 25th and 75th percentiles of the population).
The next step is to plot the data in various ways (pooled, by individual groups, etc.) to obtain a first diagnostic about how the data sets are similar or about how they differ. This is done in the Box and Whiskers plot (known for short as boxplot) as shown in Figure 6 for the combined data.
Figure 6. Box Plot of the Combined Data
The boxplot of the combined data set (Figure 6) shows as a "box" the values comprising the central 50% of the data (between Q1 and Q3). The "plus" sign inside this box is the sample median, and the lines (whiskers) cover the lower and upper 25%, extending to the minimum and maximum sample values, respectively. This boxplot suggests a flat and symmetric distribution, with heavy tails. The median and mean are close and the data are spread out, as shown by the extended upper and lower quartiles. Redoing the boxplot by manufacturer subset, as shown in Figure 7, some reasons for the data variability become apparent.
Figure 7. Box Plot by Manufacturer
The descriptive statistics, obtained by manufacturer, are shown in Table 5.
Table 5. Descriptive Statistics for Strength Data
Manuf.   N   Mean    Median   StDev   Min     Max     Q1      Q3
1        9   75.28   74.80    4.07    68.80   82.00   72.20   78.60
2        6   88.60   88.60    4.30    81.30   94.20   86.10   91.95
The boxplots and the descriptive statistics suggest that there are differences in the tensile strengths of the two (manufacturers) groups, which is also clearly apparent in the scatter plot shown in Figure 8, as manufacturer two has higher tensile strength values.
Figure 8. Scatter Plot for Strength Data by Manufacturer
Exploring further, the data from each manufacturer were broken down by batch. This evaluation, shown in Figure 9, confirms that a difference exists between the two manufacturers. In addition, there are batch differences within each manufacturer's units. It is apparent from Figure 9 that the batches differ by manufacturer.
Figure 9. Box Plots by Subgroups
An Anderson-Darling (AD) Goodness of Fit (GoF) test for normality [see Reference 11] was performed for the entire data set. Results indicate an AD = 0.33 with a p-value = 0.47, much higher than the significance level α below which normality would be rejected as a plausible data distribution (generally α = 0.05). The boxplots do not suggest the presence of outliers. An AD GoF test was then performed for each manufacturer, obtaining AD values of 0.15 and 0.28, respectively, with p-values of 0.93 and 0.50. With such results, we cannot reject normality, a distributional assumption required for comparing groups via two-sample t-tests and Analysis of Variance (ANOVA).
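For readers who want to see how such a test statistic is formed, the following is a minimal sketch of an Anderson-Darling normality check on the 15 pooled observations, using only the Python standard library. It is not the software used for the original analysis. The function names are illustrative, and the small-sample adjustment and the piecewise p-value approximation are the commonly published D'Agostino and Stephens formulas, adopted here as an assumption.

```python
import math
import statistics

# All 15 tensile strength observations from Table 3.
strength = [75.8, 78.4, 82.0, 68.8, 70.9, 73.5, 74.5, 74.8, 78.8,
            81.3, 87.7, 89.0, 88.2, 91.2, 94.2]


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def anderson_darling_normal(values):
    """Return (A2, approximate p-value) for normality with estimated mean and SD."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    z = sorted((v - mean) / sd for v in values)
    total = 0.0
    for i in range(1, n + 1):
        # ln F(z_i) + ln(1 - F(z_{n+1-i})), with 1 - F(z) computed as F(-z).
        total += (2 * i - 1) * (
            math.log(normal_cdf(z[i - 1])) + math.log(normal_cdf(-z[n - i]))
        )
    a2 = -n - total / n
    # Assumed small-sample adjustment and p-value approximation.
    a2_adj = a2 * (1.0 + 0.75 / n + 2.25 / n ** 2)
    if a2_adj >= 0.6:
        p = math.exp(1.2937 - 5.709 * a2_adj + 0.0186 * a2_adj ** 2)
    elif a2_adj >= 0.34:
        p = math.exp(0.9177 - 4.279 * a2_adj - 1.38 * a2_adj ** 2)
    elif a2_adj > 0.2:
        p = 1.0 - math.exp(-8.318 + 42.796 * a2_adj - 59.938 * a2_adj ** 2)
    else:
        p = 1.0 - math.exp(-13.436 + 101.14 * a2_adj - 223.73 * a2_adj ** 2)
    return a2, p


a2, p_value = anderson_darling_normal(strength)
print(f"AD = {a2:.2f}, p-value = {p_value:.2f}")
```

A large p-value here only means that normality cannot be rejected; it does not show that the pooled data are homogeneous.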
First, the two device manufacturer data sets are compared via a two-sample t-test. The two group variances are very similar as shown by the following descriptive statistics and hence, are assumed equal.
Manuf.   N   Mean    StDev   SE Mean
1        9   75.28   4.07    1.4
2        6   88.60   4.30    1.8
The t-test yields a p-value of 0.00 (to two decimal places), which is less than α = 0.05, and a 95% confidence interval of (-18.1, -8.6) for the difference between the two manufacturers' means (i.e., manuf-1 minus manuf-2). These results show, with 95% confidence, that the mean tensile strength of the second manufacturer's material is between 8.6 and 18.1 units higher than that of the first.
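The confidence interval for the difference in means can be sketched as follows. This is an illustration only: the function name is hypothetical, equal variances are assumed as in the text, and the critical value 2.160 (two-sided 5% point of Student's t with 13 degrees of freedom) is taken from a standard t-table as an assumption.

```python
import math
import statistics

# Tensile strength observations from Table 3, split by manufacturer.
manufacturer_1 = [75.8, 78.4, 82.0, 68.8, 70.9, 73.5, 74.5, 74.8, 78.8]
manufacturer_2 = [81.3, 87.7, 89.0, 88.2, 91.2, 94.2]

# Assumption: two-sided 5% critical value of Student's t with 13 degrees of
# freedom (9 + 6 - 2), taken from a standard t-table.
T_CRITICAL_13_DF = 2.160


def pooled_mean_difference_ci(sample_x, sample_y, t_critical):
    """Return (lower, upper) limits for mean(x) - mean(y), assuming equal variances."""
    n_x, n_y = len(sample_x), len(sample_y)
    pooled_var = (
        (n_x - 1) * statistics.variance(sample_x)
        + (n_y - 1) * statistics.variance(sample_y)
    ) / (n_x + n_y - 2)
    std_error = math.sqrt(pooled_var * (1 / n_x + 1 / n_y))
    diff = statistics.mean(sample_x) - statistics.mean(sample_y)
    margin = t_critical * std_error
    return diff - margin, diff + margin


lower, upper = pooled_mean_difference_ci(
    manufacturer_1, manufacturer_2, T_CRITICAL_13_DF
)
print(f"95% CI for mean difference: ({lower:.1f}, {upper:.1f})")
```

An interval that excludes zero corresponds to rejecting equality of the two means at the same α.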
Within each of the two manufacturers, graphical analysis shows some differences between batches. Exploring them analytically via ANOVA shows that, for manufacturer 1, there is a statistically significant difference between batches (p-value = 0.032, less than 0.05), most visibly between batches 1 and 2. It is quite apparent in the ANOVA graph, Figure 10.
Source   DF   SS       MS      F      p
bat-1    2    90.74    45.37   6.48   0.032
Error    6    42.00    7.00    --     --
Total    8    132.74   --      --     --
Figure 10. ANOVA Results
For manufacturer 2, however, the two batches appear to come from the same population (p-value = 0.152 larger than 0.05). This result, see Figure 11, suggests that their production process is more homogeneous (controlled) than that of manufacturer 1. Further investigation with more batches is recommended.
Source   DF   SS      MS      F      p
bat-2    1    40.6    40.6    3.12   0.152
Error    4    52.0    13.0    --     --
Total    5    92.5    --      --     --
Figure 11. ANOVA Within Manufacturer
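The sums of squares and F statistics in the two ANOVA tables can be reproduced with a small one-way ANOVA routine. This is a minimal sketch with illustrative names; it computes only the F statistic, and the p-value would still have to be read from an F-table or obtained from a statistics package.

```python
import statistics

# Batches from Table 3, grouped by manufacturer.
manufacturer_1_batches = [[75.8, 78.4, 82.0], [68.8, 70.9, 73.5], [74.5, 74.8, 78.8]]
manufacturer_2_batches = [[81.3, 87.7, 89.0], [88.2, 91.2, 94.2]]


def one_way_anova(groups):
    """Return the between/within sums of squares, degrees of freedom, and F."""
    all_values = [value for group in groups for value in group]
    grand_mean = statistics.mean(all_values)
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2 for group in groups
    )
    ss_within = sum(
        (value - statistics.mean(group)) ** 2 for group in groups for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f": f_stat,
    }


anova_1 = one_way_anova(manufacturer_1_batches)
anova_2 = one_way_anova(manufacturer_2_batches)
print("Manufacturer 1:", {k: round(v, 2) for k, v in anova_1.items()})
print("Manufacturer 2:", {k: round(v, 2) for k, v in anova_2.items()})
```

Running the same routine per manufacturer keeps the between-manufacturer difference from masking the batch-to-batch effect.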
We now plot the sequentially obtained batch means, versus time (Figure 12).
Figure 12. Sequentially Obtained Batch Means
The scatter plot shows an increasing trend among the sequential batches as time increases. We perform a regression analysis on the time series of the combined data set. Statistically significant results (i.e., small p-values) are presented in Table 6. The regression equation is: strength = 68.6 + 3.99 batch.
Regression analysis shows an effect of time on the mean strength of the batch. The index of fit is R² = 55%; that is, the regression model explains over half of the data variation. The regression tests are highly significant (p-values of 0.002 or less) and therefore suggest that the data sets are not homogeneous. Again, there are two important caveats to be made. First, we must examine the residual plot to verify that all regression model assumptions are met. But second, and more importantly, we need to ask ourselves: Do these results have a sound engineering basis? Do they make engineering sense?
Table 6. Regression Analysis
Predictor   Coef     StDev    t-ratio   p
Constant    68.647   3.306    20.77     0.000
batch       3.9867   0.9967   4.00      0.002
s = 5.459   R-sq = 55.2%   R-sq(adj) = 51.7%
Analysis of Variance
Source       DF   SS       MS       F       P
Regression   1    476.81   476.81   16.00   0.002
Error        13   387.40   29.80
Total        14   864.21
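The fitted line and index of fit can be checked with an ordinary least-squares calculation. The sketch below uses illustrative names and treats the batch number as the time index, as the text does; it reproduces only the coefficients and R-squared, not the standard errors or p-values.

```python
# Batch number (time index) and tensile strength for the 15 observations in Table 3.
batch = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]
strength = [75.8, 78.4, 82.0, 68.8, 70.9, 73.5, 74.5, 74.8, 78.8,
            81.3, 87.7, 89.0, 88.2, 91.2, 94.2]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns R-squared too."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return intercept, slope, r_squared


intercept, slope, r_squared = fit_line(batch, strength)
print(f"strength = {intercept:.1f} + {slope:.2f} batch, R-sq = {r_squared:.1%}")
```

A significant slope flags a time trend, but it cannot by itself separate a batch effect from a manufacturer effect, since the later batches all come from the second manufacturer.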
In conclusion, the data sets do not appear homogeneous and should not be analyzed in a combined fashion, but separately by manufacturer. Even more importantly, the analysis shows that there are some differences even within each manufacturer.
Example of Field to Predicted Data
We now discuss, using some specific examples, several important issues regarding the field operating data and reliability prediction models. These issues are particularly relevant when dealing with reliability prediction, test, and life data.
For example, in the late 1980s, a study was conducted that compared predicted and field MTBFs in an attempt to quantify the uncertainty associated with the mentioned reliability predictions. This study was a "snapshot" in which both predicted and field MTBF system data was analyzed.
Because of the fragmented nature of the part and environmental data used in this study, and the fact that it was often necessary to interpolate or extrapolate from the available data when developing new models, statistical confidence intervals associated with the overall (combined) model results are greatly compromised. In addition to the variability associated with developing the models, there is human variability in making prediction and judgment assumptions about the inclusion or exclusion of field failures and about failure definitions. As a result, the validity of the confidence interval assumptions, and therefore of the associated confidence levels, can be seriously questioned.
The original data used to develop the confidence intervals were based on approximately 200 reliability predictions performed during the 1970s and 1980s and documented in a study sponsored by Rome Air Development Center (RADC) entitled "Reliability and Maintainability Operational Parameter Translation II," RADC-TR-89-299. It should also be remembered that the predictions performed on these 200 systems were developed a number of years ago, by a wide range of individuals, under many different assumptions. In addition, at that time, operating modes and other factors may have been very different from what they are today, which is why careful evaluation before combining data sets is so critical.
Field MTBFs used in the study introduce more variability with a wide range of operating hours, failure counts and maintenance policies for each system. Therefore, the study results could very well be different if reconstructed today using a statistical analysis approach as presented in this report. It serves only to provide a notion of the variability possible across a wide range of systems, companies, individuals and field maintenance policies that need to be analyzed. The results could be much better, say, if a single experienced reliability engineer were applying a standard prediction tool over a long period of time, and there was like consistency in field failure counting practices. But such information is not available.
Part failure models in MIL-HDBK-217, Telcordia, PRISM®, and other reliability prediction techniques are based on part data from numerous sources, environments and time epochs. Complete models are never developed under a single study contract, and the failure data do not come from a single source. For example, all MIL-HDBK-217 environmental factors were developed under study efforts separate from the one in which the part failure models were developed. Statistical studies for combining these data were never performed, so incompatibilities in data sets were never identified.
In addition, adding vendor and field failure rate data to the combination results in a mixed prediction that may or may not represent the "new" design. Outside data sources are usually units or components that have been previously developed; they can be similar to the new design but may use different technologies and, hence, have an indeterminable correlation to the "new" design.
Concerns for Further Study
Several important caveats regarding combining data from several sources to develop statistical models, in general, and regression models, in particular, were discussed. The two most important caveats are that (1) data should only be combined when the engineering and statistical analysis support such combinations, and that (2) the statistical model should always follow reality, not the other way around. If care is not taken, an engineer might end up modeling the data and not the problem.
Also, several statistical procedures (t and F tests, AD GoF test, regression and ANOVA) have been described in some detail, and implemented, in the analysis of the illustrative data. Some procedures (e.g., AD) are discussed in-depth in other RIAC START sheets [11]. Others (e.g., t and F tests, ANOVA and regression) have been referenced in other bibliographic sources, many of them developed by and available at the RIAC [7, 8, 9, and 10]. In addition, these topics will be, in the near future, also discussed in detail in other RIAC START sheets.
Bibliography
1. Box, G.E.P. and W.G. Hunter, Statistics for Experimenters, John Wiley, NY, 1978.
2. Chatterjee, S. and B. Price, Regression Analysis by Example, John Wiley, NY, 1977.
3. Draper, N. and H. Smith, Applied Regression Analysis, John Wiley, NY, 1980.
4. Dixon, W.J. and F.J. Massey, Introduction to Statistical Analysis, McGraw Hill, NY, 1983.
5. Vangel, M.G., A User's Guide to RECIPE: A FORTRAN Program for Determining One Sided Tolerance Limits for Mixed Models with Two Components of Variance, Vers. 1.0, National Institute of Standards and Technology, SED, July 1994.
6. MIL-HDBK-5, Metallic Materials and Elements.
7. Coppola, A., Practical Statistical Tools for Reliability Engineers, RIAC, 1999.
8. Romeu, J.L. and C. Grethlein, A Practical Guide to Statistical Analysis of Material Property Data, AMPTIAC, 2000.
9. Sadlon, R.J., Mechanical Applications in Reliability Engineering, RIAC, 1993.
10. Romeu, J.L., Confidence Bounds for System Reliability, RIAC SOAR-4, Spring 1985.
Dr. Jorge Luis Romeu has over thirty years of statistical and operations research experience in consulting, research, and teaching. He was a consultant for the petrochemical, construction, and agricultural industries. Dr. Romeu has also worked in statistical and simulation modeling and in data analysis of software and hardware reliability, software engineering, and ecological problems.
Dr. Romeu has taught undergraduate and graduate statistics, operations research, and computer science in several American and foreign universities. He teaches short, intensive professional training courses. He is currently an Adjunct Professor of Statistics and Operations Research for Syracuse University and a Practicing Faculty of that school's Institute for Manufacturing Enterprises.
For his work in education and research and for his publications and presentations, Dr. Romeu has been elected Chartered Statistician Fellow of the Royal Statistical Society, Full Member of the Operations Research Society of America, and Fellow of the Institute of Statisticians. Romeu has received several international grants and awards, including a Fulbright Senior Lectureship and a Speaker Specialist Grant from the Department of State, in Mexico. He has extensive experience in international assignments in Spain and Latin America and is fluent in Spanish, English, and French.
Romeu is a senior technical advisor for reliability and advanced information technology research with Alion Science and Technology, previously IIT Research Institute (IITRI). Since joining Alion in 1998, Romeu has provided consulting for several statistical and operations research projects. He has written a State of the Art Report on Statistical Analysis of Materials Data, designed and taught a three-day intensive statistics course for practicing engineers, and written a series of articles on statistics and data analysis for the AMPTIAC Newsletter and RIAC Journal.
Bruce Dudley is a senior engineer for Alion Science and Technology and serves as an advisor to the Reliability Analysis Center. In this capacity, he has developed new guides for defining reliability programs through the publication of the "Blueprints for Product Reliability," assisted in revising design handbooks for reliability, edited new products such as the "Introduction to Software Reliability," and designed accelerated reliability test programs for commercial products.
Before joining Alion, he was a member of the staff at the Air Force Laboratory for 34 years, where he was responsible for developing reliability and maintainability engineering techniques.
Mr. Dudley holds a bachelor's degree in electronic engineering from Worcester Polytechnic Institute. He is a member of IEEE.