Putting the Problem in Perspective
Assume that we are given the task of estimating the mean life of a device. We may provide a simple point estimator, the sample mean, which by itself conveys nothing about the precision of the estimate. Or we may provide a more useful estimator: the confidence interval (CI). This latter estimator consists of two values, the CI lower and upper limits, such that the interval covers the unknown mean life, μ, with a prescribed coverage probability (1 - α). For example, we say that the mean life of a device is between 90 and 110 hours with probability 0.95 (that is, there is a 95% chance that the interval 90 to 110 covers the device's true mean life, μ).
The accuracy of CI estimators depends on the quality and quantity of the available data. However, we also need a statistical model that is consistent with and appropriate for the data. For example, to establish a CI for the Mean Life of a device we need, in addition to sufficiently good test data, to know or assume a statistical distribution (e.g., Normal, Exponential, Weibull) that actually fits these data and problem.
Every parametric statistical model is based upon certain assumptions that must be met for it to hold true. In our discussion, and for the sake of illustration, consider only two possibilities: that the lives (times to failure) are Normally distributed (Figure 1) and that they are Exponentially distributed (Figure 2). The figures were obtained using 2000 data points generated from each of these two distributions, with the same mean = 100 (and, for the Normal, with a Standard Deviation of 20).
Figure 1. Normal Distribution of Times to Failure
Figure 2. Exponential Distribution of Times to Failure
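Data sets like the ones behind Figures 1 and 2 can be generated along the following lines. This is only a minimal sketch using Python's standard `random` module; the seed and the variable names are illustrative assumptions, not part of the original study.

```python
import random
import statistics

random.seed(1)  # arbitrary seed, used only to make the run reproducible

N = 2000        # simulated lives per distribution (as stated in the text)
MEAN = 100.0    # common mean life
SIGMA = 20.0    # standard deviation of the Normal case

normal_lives = [random.gauss(MEAN, SIGMA) for _ in range(N)]
# expovariate() takes the rate (1 / mean), not the mean itself
exponential_lives = [random.expovariate(1.0 / MEAN) for _ in range(N)]

for name, lives in (("Normal", normal_lives),
                    ("Exponential", exponential_lives)):
    print(name,
          "mean=%.1f" % statistics.mean(lives),
          "median=%.1f" % statistics.median(lives),
          "stdev=%.1f" % statistics.stdev(lives))
```

Both samples have a mean near 100, but the Exponential sample has a median well below its mean, which is the right skew visible in Figure 2.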
There are practical consequences of data fitting one or the other of these two different distributions. "Normal lives" are symmetric about 100 and concentrated in the range of 40 to 160 (three standard deviations on each side of the mean, which comprises about 99.7% of the population). "Exponential lives", on the other hand, are right-skewed, with a relatively large proportion of device lives much smaller than 40 units and a small proportion of device lives larger than 200 units.
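The proportions mentioned above can be checked directly from the two theoretical distributions. The sketch below assumes the same parameters as the figures (mean 100, and σ = 20 for the Normal); the function and variable names are illustrative.

```python
import math
from statistics import NormalDist

MEAN, SIGMA = 100.0, 20.0

normal = NormalDist(MEAN, SIGMA)

def exp_cdf(x, mean=MEAN):
    """CDF of an Exponential distribution with the given mean."""
    return 1.0 - math.exp(-x / mean)

# Normal: share of lives within three standard deviations of the mean
p_normal_40_160 = normal.cdf(160) - normal.cdf(40)
# Normal: share of lives below 40
p_normal_below_40 = normal.cdf(40)
# Exponential: share of lives below 40 and above 200
p_exp_below_40 = exp_cdf(40)
p_exp_above_200 = 1.0 - exp_cdf(200)

print("Normal, 40 to 160:      %.4f" % p_normal_40_160)
print("Normal, below 40:       %.4f" % p_normal_below_40)
print("Exponential, below 40:  %.4f" % p_exp_below_40)
print("Exponential, above 200: %.4f" % p_exp_above_200)
```

Under these assumptions about 99.7% of Normal lives fall between 40 and 160, while roughly 33% of Exponential lives fall below 40 and about 13.5% exceed 200.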
To highlight the consequences of choosing the wrong distribution, consider a sample of n = 10 data points (Table 1). We will obtain a 95% CI for the mean of these data, using two different distribution assumptions: Exponential and Normal.
Table 1. Small Sample Data Set
5.950
119.077
366.074
155.848
30.534
20.615
15.135
3.590
103.713
120.859
The statistic "sample average", x̄ = 94.14, will follow a different sampling distribution, according to whether the Normal or the Exponential distribution is assumed for the population. Hence, the data will be processed twice, each time using a different formula. This, in turn, will produce two different CIs that will exhibit different coverage probabilities.
Normal Assumption. If the original device lives are assumed to be Normally distributed (with σ = 20), the 95% CI for the device mean life μ, based on the Normal distribution, is:
(81.7, 106.5)
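The interval above is consistent with the standard known-σ interval, x̄ ± z·σ/√n. The text does not state which formula was used, so the following is a sketch under that assumption; it reproduces the limits to the precision shown.

```python
import math
from statistics import NormalDist, mean

# Table 1 data
data = [5.950, 119.077, 366.074, 155.848, 30.534,
        20.615, 15.135, 3.590, 103.713, 120.859]

SIGMA = 20.0   # assumed known population standard deviation
CONF = 0.95    # prescribed coverage probability (1 - alpha)

n = len(data)
xbar = mean(data)
# Two-sided critical value of the standard Normal (about 1.96 for 95%)
z = NormalDist().inv_cdf(1 - (1 - CONF) / 2)
half_width = z * SIGMA / math.sqrt(n)
normal_ci = (xbar - half_width, xbar + half_width)

print("x-bar = %.2f" % xbar)
print("95%% CI (Normal, sigma known) = (%.1f, %.1f)" % normal_ci)
```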
Exponential Assumption. If, however, the device lives are assumed Exponential, then the 95% CI for the mean life θ, based on the Exponential, is:
(55.11, 196.3)
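For a complete (uncensored) Exponential sample, a standard interval for the mean is (2T/χ²(1-α/2; 2n), 2T/χ²(α/2; 2n)), where T is the sum of the observed lives. The text does not state its formula, so the sketch below assumes this one; it reproduces the limits above to within rounding. The helper functions are illustrative: they use the closed-form chi-square CDF for even degrees of freedom and a simple bisection, so that only the standard library is needed.

```python
import math

# Table 1 data
data = [5.950, 119.077, 366.074, 155.848, 30.534,
        20.615, 15.135, 3.590, 103.713, 120.859]
CONF = 0.95

def chi2_cdf_even(x, df):
    """Chi-square CDF for even df: 1 - exp(-x/2) * sum_{k<df/2} (x/2)^k / k!"""
    if x <= 0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 1.0          # k = 0 term
    for k in range(1, df // 2):     # k = 1 .. df/2 - 1
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_quantile_even(p, df):
    """Invert the CDF by bisection (the CDF is monotone increasing)."""
    lo, hi = 0.0, 10.0 * df + 100.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

n = len(data)
total_time = sum(data)              # T, total observed life
alpha = 1 - CONF
df = 2 * n                          # 20 degrees of freedom for n = 10

q_hi = chi2_quantile_even(1 - alpha / 2, df)
q_lo = chi2_quantile_even(alpha / 2, df)
lower = 2 * total_time / q_hi
upper = 2 * total_time / q_lo

print("95%% CI (Exponential) = (%.1f, %.1f)" % (lower, upper))
```

Note that this interval is not symmetric about the sample mean, unlike the Normal-based one.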
We leave the details of obtaining these two specific statistics or "formulas" for another paper.
Since in reality the ten data points come from the Exponential, only the CI (55.11, 196.3) is correct, and its coverage probability (95%) is the one prescribed. Had we erroneously assumed Normality, the CI obtained under this assumption, for this small sample, would have been incorrect. Moreover, its true coverage probability (confidence) would be unknown, and every policy derived under such an unknown probability would be at risk.
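In practice the true distribution is not known, which is why the coverage of the wrong interval cannot be stated. In a simulation, however, the truth can be fixed, and the effect seen directly. The sketch below is an illustration only: it assumes the lives are truly Exponential with mean 100, draws many samples of n = 10, and counts how often each interval covers the true mean. The seed, the number of trials and the names are arbitrary choices, and the chi-square quantiles for 20 degrees of freedom are taken from standard tables.

```python
import math
import random
from statistics import NormalDist

random.seed(2024)        # arbitrary seed for reproducibility

TRUE_MEAN = 100.0        # true mean of the simulated Exponential lives
N = 10                   # sample size, as in Table 1
SIGMA = 20.0             # sigma wrongly assumed under Normality
TRIALS = 20000           # number of simulated samples

Z = NormalDist().inv_cdf(0.975)
HALF_WIDTH = Z * SIGMA / math.sqrt(N)
# 2.5% and 97.5% quantiles of chi-square with 2N = 20 degrees of freedom
CHI2_LO, CHI2_HI = 9.5908, 34.1696

normal_hits = 0
exp_hits = 0
for _ in range(TRIALS):
    sample = [random.expovariate(1.0 / TRUE_MEAN) for _ in range(N)]
    total = sum(sample)
    xbar = total / N
    # Interval built under the (wrong) Normal assumption
    if xbar - HALF_WIDTH <= TRUE_MEAN <= xbar + HALF_WIDTH:
        normal_hits += 1
    # Interval built under the (correct) Exponential assumption
    if 2 * total / CHI2_HI <= TRUE_MEAN <= 2 * total / CHI2_LO:
        exp_hits += 1

normal_cov = normal_hits / TRIALS
exp_cov = exp_hits / TRIALS
print("Estimated coverage, Normal-based CI:      %.3f" % normal_cov)
print("Estimated coverage, Exponential-based CI: %.3f" % exp_cov)
```

Under these assumptions the Exponential-based interval covers the true mean close to 95% of the time, while the Normal-based interval with σ = 20 falls well short of its nominal 95%.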
This example illustrates and underlines how important it is to establish the validity (or at least the strong plausibility) of the underlying statistical distribution of the data.