
Chapter 23
One Factor Analysis of Variance
This Chapter extends the results of Chapter 19 from two populations to an arbitrary number of populations.
23.1 The Hypotheses
Chapter 19 considers the problem of comparing a numerical response from two populations. Often in science, one wants to compare more than two populations. Let k > 2 be the (integer, of course) number of populations to be compared.
Let’s review what we developed in Chapter 19. Our goal was to compare two populations, denoted by population 1 and population 2. (Recall that if we want to study these populations individually, we can use the methods presented in Chapters 17 and 18.) We denote the mean of population 1 [2] by µ1 [µ2]. We make the decision to compare the populations by comparing their means, either through testing or estimation.
In the current chapter we have k populations, which we call: population 1, population 2, . . . , population k.
The means of these populations are denoted by µ1, µ2, . . . , µk, respectively.
For testing, the null hypothesis in Chapter 19 is H0: µ1 = µ2.
In this chapter, we generalize this null in the obvious way: H0: µ1 = µ2 = . . . = µk.
For example, if k = 5, then our null hypothesis is H0: µ1 = µ2 = µ3 = µ4 = µ5.
Specifying the alternative hypothesis is a bit tricky. In Chapter 19, there were three options for the alternative hypothesis:

Figure 23.1: The µ2 versus µ1 coordinate system, showing the line µ2 = µ1 together with the region µ1 > µ2 below the line and the region µ1 < µ2 above it.
H1: µ1 > µ2; H1: µ1 < µ2; or H1: µ1 ≠ µ2. Recall also that the first two of these sometimes are referred to as one-sided alternatives and the last of these is the two-sided alternative.
A nice feature of the Chapter 19 problem is that because we have only two populations and two means we can actually graph the space of possible values of the means, and I do so in Figure 23.1. The horizontal [vertical] axis represents the possible values of µ1 [µ2].
In Figure 23.1 I have drawn the line that corresponds to our Chapter 19 null hypothesis—the line for which µ1 = µ2. To emphasize: this line is the graph of all the pairs of means that make the null hypothesis correct.
(Note that my drawing of the line might suggest to you that the means are restricted to be nonnegative numbers. That is not my intent. In math, lines and the coordinate system go on forever, although they rarely (never?) do in science.)
The line µ2 = µ1 divides the coordinate system into two pieces: below and to the right of the line corresponds to µ1 > µ2; above and to the left of the line corresponds to µ1 < µ2. This division justifies the idea of allowing the alternative to be > or <, and it also explains why it is common to refer to these as one-sided alternatives.
I will not draw a picture for the case when k = 3 populations are being compared. Having lived your whole life in a three-dimensional world, you know what it looks like! The idea is that the graph of the null hypothesis, µ1 = µ2 = µ3, is a line and, in three dimensions, a line does not divide the entire space into two pieces in any natural way.
The best we can do is to divide the non-null part of the three dimensional space of means into

six regions:

    µ1 > µ2 > µ3    µ2 > µ1 > µ3    µ3 > µ1 > µ2
    µ1 > µ3 > µ2    µ2 > µ3 > µ1    µ3 > µ2 > µ1

Even the above (somewhat nightmarish) listing is not complete; for example, µ1 = µ2 > µ3 is not part of the null and would need to be included in any listing of non-null regions. I won’t attempt to list all of the non-null regions, not because it is too difficult, but because it is tedious and not worthy of your time. The point is: for k = 3 (or larger), there is no satisfactory way to have one-sided alternatives.
As a result, perforce, the only alternative we consider is the rather bland:

H1: not the null.

Even this statement requires care. Again, in the case with k = 3, the null states that all three means are equal. The alternative does not say that all three means are different; for example, if µ1 = µ2 = 10 and µ3 = 8, then the null is false and the alternative is true, but it is not the case that the means are three distinct numbers.

23.2 Notation for Data
We begin with the notation for the data we will collect. We will use the 25th letter of our alphabet to denote the response; the upper case Y to denote a random variable and the lower case y to denote a realization (observed value) of the random variable. Sadly, I will need to introduce you to the dreaded double subscripts.
Recall that we have k populations for comparison, with k > 2. The first subscript on Y [y] denotes the population from which it came:
• Y1 [y1] for all random variables [realizations] from the first population;
• Y2 [y2] for all random variables [realizations] from the second population;
• and so on, until we get to
• Yk [yk] for all random variables [realizations] from the kth, last, population. (Please forgive our awkward terminology; we call the last population the kth, even if k = 3 and we get the unusual 3th instead of the more common—and, literally, correct—third.)
The second subscript is a counter; in particular,
• Y1,1 is the first random variable to be observed from population 1;
• Y1,2 is the second random variable to be observed from population 1; and so on.
• Y2,1 is the first random variable to be observed from population 2;

• Y2,2 is the second random variable to be observed from population 2; and so on.
• In general, Yi,j is the jth random variable to be observed from population i.
I want to draw your attention to something above. When it comes to notation, math-types can be incredibly lazy. Above—and in this chapter—I will separate our subscripts by commas; this is not commonly done. Textbooks I have seen that were written by mathematicians—and sadly, nearly all that were written by statisticians—do not use commas; they write Y22 and Yij instead of my more cumbersome Y2,2 and Yi,j. The virtue of their method is obvious—it is easy; appealing to the laziness that is so attractive to so many of our species. So, why do I object and do it differently?
There is a simple reason. Suppose that k > 10 and you find the expression Y111 in one of these books I abhor. What exactly do we have? Is it the eleventh observation from population 1? Or the first observation from population 11? Without commas, or some other device, who knows? In fairness—with my tongue planted firmly in my cheek—I must admit that I do not criticize theoretical mathematicians who write Yij because, frankly, they have no intention of ever replacing i and j with actual integers!
The next bit of notation we need is the obvious generalization of n1 and n2 from Chapter 19: namely,
• For i = 1, 2, . . . , k, the number of observations taken from population i is denoted by ni.
This presentation is, perhaps, getting too abstract. Let me illustrate the above ideas with some numbers. I choose to do this by taking on the role of Nature and generating data from distributions (populations) of my choosing. I have decided to have k = 4 populations, with sample sizes
n1 = n2 = n3 = n4 = 5.
I obtained my data from the website
https://www.random.org/gaussian-distributions/
and the data are presented in Table 23.1. Let me take a few moments to explain the entries in this table. Before I do so, I will present a restriction that I will follow in this chapter.
Note: In all examples, practice problems and homework in this chapter I will restrict attention to the case in which all sample sizes are the same; i.e.,
n1 = n2 = . . . = nk.
This makes the presentation simpler, and gentler, which I find especially attractive for a fairly complicated chapter that appears at the end of a long course. The results you learn in this chapter can be extended to the (very common) situation in which the sample sizes differ, but the methods become messier.
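If you would rather generate analogous data in software than use the random.org site, here is a minimal Python sketch using numpy. To be clear, this is not the source of the numbers in Table 23.1 (those came from random.org); it simply draws independent Normal samples with the population means 160, 140, 120, and 100, common variance 400, and sample sizes of 5 that I use below, and the seed is arbitrary.

    import numpy as np

    rng = np.random.default_rng(seed=23)   # any seed will do; this one is arbitrary
    mus = [160, 140, 120, 100]             # population means mu_1, ..., mu_4
    sigma = np.sqrt(400)                   # common standard deviation (sigma^2 = 400)
    n_i = 5                                # common sample size n_1 = ... = n_4

    # samples[i - 1][j - 1] plays the role of Y_{i,j}: one row per population.
    samples = [np.round(rng.normal(loc=mu, scale=sigma, size=n_i)) for mu in mus]
    for i, sample in enumerate(samples, start=1):
        print(f"population {i}:", sample)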
The portion of the table to the left of the double vertical lines presents the observations, yi,j for the various choices of i and j. For example, reading from the table we see that the third observation from the second population is y2,3 = 143. Next, let’s look at the right portion of the table.

Table 23.1: Generated data. The data are independent random samples from four Normal populations. The population means are µ1 = 160; µ2 = 140; µ3 = 120; and µ4 = 100. All populations have the same variance, σ² = 400.

                            j
    i         1    2    3    4    5  ||  yi,+   ȳi,·    s²i   (ni − 1)s²i
    1       132  190  139  176  158  ||   795  159.0  595.0          2380
    2       189  139  143  139  130  ||   740  148.0  548.0          2192
    3       136  112  127  152  143  ||   670  134.0  235.5           942
    4       128   82   88   72  105  ||   475   95.0  484.0          1936
    Total     −    −    −    −    −  ||  2680      −      −          7450

First, we compute the sums across each row. We obtain, for example,

y1,+ = 132 + 190 + 139 + 176 + 158 = 795.

The ‘+’ in the subscript indicates that we have summed over that subscript, in the case of this table, over the second subscript. Next, we divide each value of yi,+ by its sample size, ni, to obtain the entries in the column headed ȳi,·. Note that when we calculate these means, we change the second subscript from + to ·. The idea is that the bar signifies the mean, which implies that we have summed the numbers and, hence, there is no need for the + to remind us of summing. I admit that the notation is messy and arbitrary, which is one of the reasons I don’t like teaching this material!
The next column presents the variance, s²i, for each sample of data. You might ask, “Hey Bob, why do you present the variances instead of the standard deviations?” This is a fair question. I present the variances because I want the sums of squares, which appear in the last column, under the heading (ni − 1)s²i.

My desire for these sums of squares will be explained in the Appendix to this chapter.

Note that my use of the term sum of squares is consistent with Chapters 21 and 22 because, for example,

(n1 − 1)s²1 = (n1 − 1) × [ Σj (y1,j − ȳ1,·)² / (n1 − 1) ] = Σj (y1,j − ȳ1,·)²,

which is the sum of squared error of the observations in the first sample.
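If you want to verify the entries in Table 23.1, here is a short Python sketch that recomputes each row's total, mean, sample variance, and sum of squares from the raw observations; the printed labels are just my own shorthand for the column headings.

    import numpy as np

    data = {1: [132, 190, 139, 176, 158],
            2: [189, 139, 143, 139, 130],
            3: [136, 112, 127, 152, 143],
            4: [128,  82,  88,  72, 105]}

    print(" i   y_i,+   ybar_i,.   s_i^2   (n_i - 1)s_i^2")
    for i, obs in data.items():
        y = np.array(obs, dtype=float)
        n_i = len(y)              # sample size from population i
        total = y.sum()           # y_{i,+}, the row total
        mean = y.mean()           # ybar_{i,.}, the row mean
        var = y.var(ddof=1)       # s_i^2, the sample variance (divide by n_i - 1)
        ss = (n_i - 1) * var      # the sum of squared errors for sample i
        print(f"{i:2d}  {total:6.0f}  {mean:9.1f}  {var:6.1f}  {ss:14.0f}")
    # The four sums of squares total 7450, matching the last entry in Table 23.1.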

23.3 Assumptions and the P-value
The F distribution is named in honor of Sir Ronald A. Fisher, who also gave us the test in Chapter 8. It is described at
http://en.wikipedia.org/wiki/F-distribution.

This entry is very technical; here is what is sufficient for our purposes.
• The F distribution has a pdf with support on the nonnegative real numbers. (This is technical-speak for the fact that if a random variable has an F distribution, then it can never assume negative values.)
• An F distribution has two parameters, d1 and d2, and is written F(d1, d2).
• Each parameter is called a number of degrees of freedom and is equal to a positive integer.
• The order of the parameters matters; for example, the F(3,5) distribution is not the same as the F(5,3) distribution.
If I want to refer to, say, the F(5,10) distribution, I will say,
The F distribution with its first degrees of freedom equal to 5 and its second degrees of freedom equal to 10.
An alternative way to say this is
The F distribution with numerator degrees of freedom equal to 5 and denominator degrees of freedom equal to 10.
The reason behind the use of the terms numerator and denominator is explained in the Appendix at the end of this chapter.
Technical Note: If one has a random variable, call it T, whose distribution is the t-distribution with m degrees of freedom, then its square, the random variable T², has the F(1,m) curve for its sampling distribution.
If you read the caption to Table 23.1 you will know where this section is going. Below are the mathematical assumptions that underlie the results of this chapter.
1. The data from each population are the results of observing i.i.d. trials.
2. The samples from the different populations are independent.
3. For i = 1, 2, . . . , k, the pdf of population i is the Normal curve with unknown mean µi and unknown variance σ2. Note that the populations are allowed to have different means, but they all must have the same variance. In the language of Chapter 19, the various populations are congruent Normal curves.
Of these assumptions, I would state that the first two are the most important. Next comes the assumption of a common variance in the k populations. The assumption of Normal populations allows the mathematical statistician to conclude that an F-distribution gives exact P-values, but as we learned in Chapter 17 quite often our approximations are good even if the population is not a Normal curve. I am not saying that the assumption of Normal populations is totally unimportant; rather it is complicated and subtle and beyond the scope of this one brief chapter.

As stated earlier in this chapter, we want to test the null hypothesis:

H0: µ1 = µ2 = . . . = µk,

versus the alternative:

H1: The null hypothesis is false.

The test statistic is denoted by F and takes on values f. On the assumption that the null hypothesis is correct, the sampling distribution of the test statistic F is equal to the F distribution with parameters,
d1 = k − 1, and d2 = n − k.

In the above, n denotes the total number of observations in the study. Thus, obviously,

n = n1 + n2 + . . . + nk.

Given our earlier notation, it is natural to refer to this sum as n+. Thus, n = n+ and, given my restriction that all of the ni’s equal the same number, n = kn1.
For my generated data in Table 23.1, we have

k − 1 = 4 − 1 = 3 and n − k = 20 − 4 = 16.

Thus, the sampling distribution of the test statistic is the F(3,16) pdf.
I am not going to give you the formula for the test statistic F. I will discuss it in the optional Appendix at the end of this chapter, but feel free to ignore that material. I will say, however, that we obtain the P-value by calculating the area to the right of the observed value f. Note that I am giving you only one rule for finding the P-value, not the three rules that have been so popular in these notes. The reason? The rule is a function of the choice of alternative and in this chapter there is only one allowable choice for the alternative. Hence, one rule.
I will illustrate these ideas with the generated data in Table 23.1. Because I don’t give you the formula for the test statistic F, I need to give you a website that will do the computations for you. Fortunately, you are familiar with the website; it is our old friend vassarstats:

http://vassarstats.net/

Please go to the vassarstats website. In the left margin, locate the item ANOVA (fourth from the bottom of the list) and click on it. This action will take you to a new page; click on the first item listed:

One-Way ANOVA for up to five samples.

Stating the obvious, you will note that if you have a value of k that exceeds five, then you will not be able to use vassarstats.
After the aforementioned clicking you will be taken to a page headed

One-Way Analysis of Variance for Independent or Correlated Samples.


Table 23.2: Vassarstat’s ANOVA Summary for the generated data presented in Table 23.1. (Slightly edited.)

                     ANOVA Summary
    Source          SS    df        MS      F   P-value
    Treatment   11,710     3  3,903.33   8.38    0.0014
    Error        7,450    16    465.62
    Total       19,160    19

Scroll down to the box labeled Setup. In the box labeled Number of samples in analysis type the value of k. Because I plan to illustrate the site with my generated data, I type ‘4.’ Next, I click on the box labeled Independent Samples and I am ready to enter my data!
After entering my data, I clicked on Calculate and lots of numbers appear! Some of the numbers are familiar to you, because you saw them in Table 23.1:
The various values of the sample sizes (ni = 5), totals (yi,+), means (ȳi,·), and variances (s²i).
Scrolling farther down, we find what vassarstats labels the ANOVA Summary and what has traditionally been called the ANOVA Table. The ANOVA Summary is reproduced in Table 23.2.
We see that the observed value of the test statistic is 8.38. The P-value is the area under the F(3,16) pdf to the right of 8.38; vassarstats tells us that this area is 0.001413.
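If you would like to verify the vassarstats output in software, scipy has a one-way ANOVA routine that reproduces the F statistic and the P-value in Table 23.2; this is only a check and does not print the full summary table.

    from scipy import stats

    y1 = [132, 190, 139, 176, 158]
    y2 = [189, 139, 143, 139, 130]
    y3 = [136, 112, 127, 152, 143]
    y4 = [128,  82,  88,  72, 105]

    # One-way ANOVA for k = 4 independent samples.
    result = stats.f_oneway(y1, y2, y3, y4)
    print(result.statistic, result.pvalue)   # approximately 8.38 and 0.0014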
For completeness, let me provide you with two websites that will compute areas under an F pdf. The first is an option on an old friend of ours:
http://stattrek.com/online-calculator/f-distribution.aspx.
(We use stattrek for binomial, Poisson and t curve probabilities.) The second option is:
http://www.danielsoper.com/statcalc3/calc.aspx?id=39
Both of these sites have the annoying feature of giving areas to the left; this is strange because statisticians (almost) always want areas to the right! In any event, if you go to either of these sites and type 3 and 16 for the degrees of freedom (Be careful! Don’t reverse them) and f = 8.38 (x = 8.38 on the second site), the site will give you the area under the F(3,16) curve to the left of 8.38. The stattrek site gives the answer 0.999 and the daniel . . . site gives the much more precise value 0.99858741. If you subtract either of these from one—to convert area to the left to area to the right—you will obtain, except for round-off error, the P-value given by vassarstats.
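The subtract-from-one step can also be done with scipy's F distribution functions. A minimal sketch, assuming the observed value f = 8.38 with 3 and 16 degrees of freedom:

    from scipy import stats

    f_obs, d1, d2 = 8.38, 3, 16
    left = stats.f.cdf(f_obs, dfn=d1, dfd=d2)   # area to the left, as the two sites report
    print(1 - left)                             # area to the right: about 0.0014
    print(stats.f.sf(f_obs, dfn=d1, dfd=d2))    # sf() returns the right-tail area directly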
23.4 Estimation after Testing
I will continue to use our artificial data to illustrate the various confidence interval formulas.

Figure 23.2: The four sample means for the generated data in Table 23.1, plotted as labeled points on a number line running from 80 to 180. Recall that ȳ1,· = 159; ȳ2,· = 148; ȳ3,· = 134; and ȳ4,· = 95.
The picture, and variations on it, in Figure 23.2 will prove useful in this chapter. We now have a visual representation of the relative sizes of the four sample means.
One of the major advantages of analyzing artificially generated data is that we can play the roles of both researcher and Nature. As Nature, I know the population means:

µ1 = 160; µ2 = 140; µ3 = 120; and µ4 = 100.

Thus, we can see that the sample means overestimate the population means in populations 2 and 3, and underestimate the population means in populations 1 and 4, although the discrepancy in population 1 seems trivial.
Researchers sometimes overlook the importance of the sample variances. As Nature, I know that the population variance is σ² = 400 in every population, but the sample variances,

s²1 = 595;  s²2 = 548;  s²3 = 235.5;  and  s²4 = 484,

seem to vary wildly. The variation seems less wild if we switch to our favored standard deviations:

s1 = 24.4; s2 = 23.4; s3 = 15.3; and s4 = 22.0.

The mathematical justification for our analyses relies on the assumption that all k populations have a common value for the population variance. This is troublesome for researchers for two reasons:

1. As this example illustrates, even when the assumption is true, the sample variances frequently will appear to be wildly different.

2. While there is a formal hypothesis test for the null hypothesis that the population variances have a common value—called a chi-squared test—the P-values from this test are notoriously inaccurate if the populations are not Normal curves. (A sketch of one such check appears below.)
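These notes do not name a particular procedure, but one commonly used chi-squared-type test for a common variance is Bartlett's test. Purely as an illustration, here is how it can be run in Python on the four generated samples; keep the warning in item 2 in mind, since its P-value is unreliable when the populations are not Normal curves.

    from scipy import stats

    y1 = [132, 190, 139, 176, 158]
    y2 = [189, 139, 143, 139, 130]
    y3 = [136, 112, 127, 152, 143]
    y4 = [128,  82,  88,  72, 105]

    # Bartlett's test of the null hypothesis that all four population variances are equal.
    stat, p = stats.bartlett(y1, y2, y3, y4)
    print(stat, p)   # a large P-value is consistent with a common variance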

23.4.1 Confidence Intervals for a Population Mean
This subsection sounds like something you learned in Chapter 17; it is, with a new twist.

Let’s consider the first population. As a researcher, we know that it has an unknown mean, denoted by µ1. We have five observations from this population, with values given in Table 23.1; reproduced, and sorted, below:
132, 139, 158, 176, 190.
Recall also our summary statistics:

n1 = 5, ȳ1,· = 159, and s1 = 24.4.

With n1 − 1 = 5 − 1 = 4 degrees of freedom, the values of t∗ for 95% and 99% confidence are 2.776 and 4.604, respectively. Thus, the 95% confidence interval estimate of µ1 is
159.00 ± 2.776(24.4/√5) = 159.00 ± 30.29.

Similarly, the 99% confidence interval estimate of µ1 is 159.00 ± 50.24.
Although it is difficult to tell what anything means with generated data, it does seem that these half-widths, 30.29 and 50.24, are extremely large. This turns out not to be a problem because nobody actually uses the method I just showed you!
Looking at Table 23.1 again, we see that my intervals above are based on using

s1 = √s²1 = √595 = 24.4

as our estimate of the population standard deviation σ.
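For readers who prefer software to tables of t∗, here is a minimal Python sketch of the interval just computed; it reproduces the half-widths 30.29 and 50.24, up to rounding.

    import numpy as np
    from scipy import stats

    y1 = np.array([132, 190, 139, 176, 158], dtype=float)
    n1, ybar1, s1 = len(y1), y1.mean(), y1.std(ddof=1)   # 5, 159.0, about 24.4

    for conf in (0.95, 0.99):
        t_star = stats.t.ppf(1 - (1 - conf) / 2, df=n1 - 1)   # 2.776 and 4.604
        half_width = t_star * s1 / np.sqrt(n1)
        print(f"{conf:.0%} interval for mu_1: {ybar1:.2f} +/- {half_width:.2f}")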

Now we come to the key point.
Based on our model, the k = 4 populations have identical values for σ and, hence, Table 23.1 presents four independent estimates of σ², namely the four values of s²i. It turns out that, mathematically, the best way to combine these four values is given by the MSE entry in the ANOVA Table in Table 23.2, which is 465.62. As a result, we estimate the population standard deviation, σ, by

s = √MSE = √465.62 = 21.58.

In the current situation, s = 21.58 is a bit smaller than s1 = 24.4, but this is not why we prefer it. We prefer it because the estimate s has 16 degrees of freedom, four degrees from each of the four samples. With 16 degrees of freedom, the values of t∗ for 95% and 99% confidence are 2.120 and 2.921, both of which are much smaller than the earlier values of 2.776 and 4.604.
With the improved estimate of σ and the quadrupling of the degrees of freedom, we now obtain
159.00 ± 2.120(21.58/√5) = 159.00 ± 20.46

as the 95% confidence interval estimate of µ1. This new half-width, 20.46, is 32.5% smaller than the previous half-width, 30.29. This is an important improvement.
Note: Be careful. In the above computation we use 16 degrees of freedom to obtain the value of t∗ but, remember, we still divide s by √5 because our sample mean is still based on 5 observations.
The new improved 99% confidence interval estimate of µ1 is 159.00 ± 28.19. This new half-width is 43.9% smaller than the old half-width, 50.24.
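Here is the same kind of sketch for the improved interval, assuming the MSE of 465.62 and the 16 error degrees of freedom taken from Table 23.2.

    import numpy as np
    from scipy import stats

    mse, df_error = 465.62, 16     # MS for Error and its df, from Table 23.2
    s = np.sqrt(mse)               # pooled estimate of sigma, about 21.58
    ybar1, n1 = 159.0, 5           # first sample's mean and sample size

    for conf in (0.95, 0.99):
        t_star = stats.t.ppf(1 - (1 - conf) / 2, df=df_error)   # 2.120 and 2.921
        half_width = t_star * s / np.sqrt(n1)                   # still divide by sqrt(5)
        print(f"{conf:.0%} interval for mu_1: {ybar1:.2f} +/- {half_width:.2f}")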
