Practical Methods to Verify Normal Assumptions

In this section we discuss several empirical and practical methods for assessing the validity of two important and widely used distributions: the Normal and Lognormal. We illustrate this validation process via the life test data in Table 3. This sample (n = 45) was taken from the Normal (20, 7.6) process that generated Figure 1, presented in Section 2.

Table 3. Large Sample Life Data Set (sorted)
 6.1448 6.6921 6.7158 7.7342 9.6818 12.3317 12.5535 13.0973 13.6704 14.0077 14.7975 15.3237 15.5832 15.7808 15.7851 16.2981 16.3317 16.8147 16.886 17.5166 17.5449 17.9186 18.5573 18.8098 19.2541 19.5172 19.7322 21.9602 23.2046 23.2625 23.7064 23.9296 24.8702 25.2669 26.1908 26.9989 27.4122 27.7297 28.0116 28.2206 28.5598 29.5209 30.008 31.2306 32.5446

In our data set, two distribution assumptions need to be verified or assessed: (1) that the data are independent and (2) that they are identically distributed as a Normal.

The assumption of independence implies that randomization (sampling) of the population of devices (and of other influencing factors) must be performed before placing them on test. For example, device operators, times of operation, weather conditions, locations of the devices in warehouses, etc. should be randomly selected, so that the sample is representative of the characteristics and contexts in which the devices will normally operate.

To assess the Normality of the data, we use informal methods based on the properties of the Normal distribution. These are appropriate for the practicing engineer, since they are largely intuitive and easy to implement.

To assess data, we must first obtain their descriptive statistics (Table 4). Then, we analyze and plot the raw data in several ways, to check (empirically but efficiently) if the Normality assumption holds.

There are a number of useful and easy to implement procedures, based on well-known statistical properties of the Normal distribution, which help us to informally assess this assumption. These properties are summarized in Table 5.

Table 4. Descriptive Statistics of Data in Table 3
Statistics Normal Sample
N 45
Mean 19.50
Median 18.56
Std. Dev. 7.05
Minimum 6.14
Maximum 32.54
Q1 15.06
Q3 25.73

where the Mean is the average of the data and the Standard Deviation is the square root of the sample variance:

S² = [Σ (x_i − x̄)²] / (n − 1)
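These descriptive statistics can be recomputed directly from the sorted data in Table 3. The sketch below uses only the Python standard library; the variable names are ours:

```python
import statistics

# Sorted life test data from Table 3 (n = 45)
data = [6.1448, 6.6921, 6.7158, 7.7342, 9.6818, 12.3317, 12.5535,
        13.0973, 13.6704, 14.0077, 14.7975, 15.3237, 15.5832, 15.7808,
        15.7851, 16.2981, 16.3317, 16.8147, 16.886, 17.5166, 17.5449,
        17.9186, 18.5573, 18.8098, 19.2541, 19.5172, 19.7322, 21.9602,
        23.2046, 23.2625, 23.7064, 23.9296, 24.8702, 25.2669, 26.1908,
        26.9989, 27.4122, 27.7297, 28.0116, 28.2206, 28.5598, 29.5209,
        30.008, 31.2306, 32.5446]

mean = statistics.fmean(data)     # sample mean x̄
median = statistics.median(data)  # middle value (rank 23 of 45)
s = statistics.stdev(data)        # sample standard deviation, divisor n - 1

print(f"n = {len(data)}, mean = {mean:.2f}, median = {median:.2f}, s = {s:.2f}")
```

Running this reproduces the Mean, Median, and Standard Deviation reported in Table 4.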

Table 5. Some Properties of the Normal Distribution
1. Mean, median, and mode coincide; hence, their sample values should also be close.
2. Graphs should suggest that the distribution is symmetric about the mean.
3. About 68% of the data should be within one standard deviation of the mean.
4. About 95% of the data should be within two standard deviations of the mean.
5. Only about 0.3% of the data should be beyond three standard deviations of the mean.
6. Plots of the Normal probabilities and Normal scores should be close to linear.
7. Regressions of these probability and score plots should yield a unit slope.

First, from the descriptive statistics in Table 4, we observe that the sample Mean (19.50) and Median (18.56) are close to each other; the Standard Deviation is 7.05. This closeness of mean and median supports Normality via Property 1 of Table 5.

The distribution looks symmetric about mean = 19.5, as suggested by the following Box Plot (plot of minimum, Q1, median, Q3, and maximum). Observe how the centered 50% of the data (between Q1 = 15.06 and Q3 = 25.73) is dispersed about the mean.

The histogram (Figure 3) also suggests symmetry about the Mode = 18 (the center of the interval with the highest frequency in Figure 3). By Property 2 of Table 5, this again supports the validity of the Normal distribution.

Figure 3. Histogram of the Normal Data Set (Mode is 18)

The interval defined by one standard deviation about the mean, (μ − σ, μ + σ) = (19.5 − 7.05, 19.5 + 7.05) = (12.45, 26.55), includes 29 values (in ranks 7 to 35 of sorted Table 3), representing 64% of the total data set (close to the expected 68%). The interval (μ − 2σ, μ + 2σ) = (5.4, 33.6) includes the values in ranks 1 to 45 (i.e., all the data), representing 100% of the data set (close to the expected 95%). There are zero values beyond (μ − 3σ, μ + 3σ) = (−1.65, 40.65), consistent with the expectation that only about 0.3% of values fall outside this interval. All these results support Properties 3 to 5 of Table 5.
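These interval counts can be checked mechanically. The sketch below recounts the coverage of the one-, two-, and three-standard-deviation intervals, using the Table 3 data and the rounded estimates from Table 4 (the function name is ours):

```python
# Sorted life test data from Table 3 (n = 45)
data = [6.1448, 6.6921, 6.7158, 7.7342, 9.6818, 12.3317, 12.5535,
        13.0973, 13.6704, 14.0077, 14.7975, 15.3237, 15.5832, 15.7808,
        15.7851, 16.2981, 16.3317, 16.8147, 16.886, 17.5166, 17.5449,
        17.9186, 18.5573, 18.8098, 19.2541, 19.5172, 19.7322, 21.9602,
        23.2046, 23.2625, 23.7064, 23.9296, 24.8702, 25.2669, 26.1908,
        26.9989, 27.4122, 27.7297, 28.0116, 28.2206, 28.5598, 29.5209,
        30.008, 31.2306, 32.5446]

mean, s = 19.5, 7.05  # rounded sample estimates from Table 4

def within(k):
    """Count data points inside the interval (mean - k*s, mean + k*s)."""
    lo, hi = mean - k * s, mean + k * s
    return sum(lo < x < hi for x in data)

for k in (1, 2, 3):
    n_in = within(k)
    print(f"within {k} std. dev.: {n_in} of {len(data)} "
          f"({100 * n_in / len(data):.0f}%)")
```

The empirical percentages can then be compared with the Normal expectations of Properties 3 to 5 (about 68%, 95%, and 99.7%).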

In the Probability Plot, the Normal probability P_I is plotted vs. I/(n + 1), where I is the rank of the point in the sorted sample, I = 1, ..., 45. Each P_I is obtained by calculating the Normal probability of the corresponding failure data X_I, using the sample mean (19.5) and standard deviation (7.05). For example, for the first (I = 1) sorted (smallest) data point, 6.15:

P_19.5,7.05(6.15) = Normal[(6.15 − 19.5) / 7.05] = Normal(−1.89) = 0.029

This probability is then plotted against the corresponding I/(n + 1) value, 1/46 = 0.0217, and so on, until all sorted sample elements I = 1, ..., 45 have been plotted.

When the population is Normal, the Probability Plot (Figure 4) follows an upward linear trend with unit slope. Hence, the linear regression of the Normal probabilities on the plotting positions I/(n + 1) should also reflect this one-to-one relation by yielding a slope close to 1:

NormProb = -0.0228 + 1.01 NormRank

Predictor   Coef       Std. Dev.   T       P
Constant    -0.02282   0.01192     -1.91   0.062
NormRank    1.00783    0.02076     48.54   0.000

S = 0.03933   R-Sq = 98.2%   R-Sq(adj) = 98.2%

The regression Index of Fit (R² = 98.2%) is very high (close to 100%). Also, the P-value (0.000) for the T-test (T = 48.54) of the NormRank regression coefficient is very small, confirming a linear trend. The regression coefficient (slope) itself, 1.00783, is close to unity, suggesting the Normal as the statistical distribution of the data. This unit value of the regression slope serves as the litmus test of this graphical approach to assessing Normality.

Figure 4. Plot of Normal Probability (P_I) vs. I/(n + 1); I = 1, ..., n; Close to Linear, as Expected from a Normal

The Normal scores X_I are the percentiles corresponding to the values I/(n + 1), for I = 1, ..., n, calculated under the Normal distribution (using mean = 19.5 and standard deviation = 7.05). For our example, the first I/(n + 1) is 1/46 = 0.0217 and the smallest data point is 6.15:

P_19.5,7.05(X_I) = Normal[(X_I − 19.5) / 7.05] ≈ I / (n + 1) = 0.0217

→ Percentile(0.0217) = −2.02 = (X_I − 19.5) / 7.05

Solving the above equation for X_I, we obtain the first (I = 1) Normal score:

X_1 = −2.02 × 7.05 + 19.5 = −14.24 + 19.5 = 5.26

These Normal scores are then plotted vs. their corresponding sorted data values (Figure 5). In the above example, score 5.26 is plotted against 6.15 (the smallest data point), and so on, for I = 1, ..., n. When the data come from a Normal distribution, the Normal scores plot is close to a straight line (Property 6).
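The Normal scores follow the same recipe in reverse, via the inverse CDF. A sketch using the standard library's NormalDist (the small difference from the hand-worked 5.26 comes from rounding the percentile to −2.02 above):

```python
from statistics import NormalDist

# Sorted life test data from Table 3 (n = 45)
data = [6.1448, 6.6921, 6.7158, 7.7342, 9.6818, 12.3317, 12.5535,
        13.0973, 13.6704, 14.0077, 14.7975, 15.3237, 15.5832, 15.7808,
        15.7851, 16.2981, 16.3317, 16.8147, 16.886, 17.5166, 17.5449,
        17.9186, 18.5573, 18.8098, 19.2541, 19.5172, 19.7322, 21.9602,
        23.2046, 23.2625, 23.7064, 23.9296, 24.8702, 25.2669, 26.1908,
        26.9989, 27.4122, 27.7297, 28.0116, 28.2206, 28.5598, 29.5209,
        30.008, 31.2306, 32.5446]

n = len(data)
dist = NormalDist(mu=19.5, sigma=7.05)  # rounded estimates from Table 4

# Normal score for rank I: the Normal(19.5, 7.05) percentile of I/(n + 1)
scores = [dist.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

print(f"first score X_1 = {scores[0]:.2f} "
      f"(plotted vs. smallest data point {data[0]:.2f})")
```

Each score is then paired with the sorted data value of the same rank, as in Figure 5.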

Figure 5. Plot of the Normal Scores vs. the Sorted Real Data Values, Close to Linear

We regress the Normal scores vs. the corresponding data values. If the data come from the Normal distribution, this regression should also yield a unit slope:

NormScore = 0.487 + 1.00 NormSamp

Predictor   Coef      Std. Dev.   T       P
Constant    0.4872    0.4554      1.07    0.291
NormSamp    1.00042   0.02199     45.50   0.000

S = 1.028   R-Sq = 98.0%   R-Sq(adj) = 97.9%

An Index of Fit of R² = 98.0% and a regression coefficient of 1.00042, together with the Normal Probability and Normal Scores plots, suggest that the assumption of a Normal distribution is reasonable.