# The variance of the variance of samples from a finite population


Eungchun Cho, Kentucky State University∗

Moon Jung Cho, Bureau of Labor Statistics

John Eltinge, Bureau of Labor Statistics

Key Words: Sample Variance; Randomization Variance; Polykays; Moments of Finite Population

Abstract. A direct derivation of the randomization variance of the sample variance $V(\bar{x})$ and related formulae is presented. Examples for the special case of a uniformly distributed population are given.

1 Introduction

Introductory courses in the randomization approach to survey inference generally begin with a relatively parsimonious development based on without-replacement selection of simple random samples. For an arbitrary finite population, one establishes the unbiasedness of the sample mean $\bar{x}$ for the corresponding finite population mean; evaluates the randomization variance of the sample mean; and develops an unbiased estimator, $V(\bar{x})$, of this randomization variance. One subsequently uses similar developments for related randomized designs, e.g., stratified random sampling and some forms of cluster sampling. In applications of this material to practical problems, it is often important to evaluate $V(V(\bar{x}))$, the variance of the variance estimator. For example, some cluster sample designs may be considered problematic if the resulting $V(\bar{x})$ is unstable, i.e., has an unreasonably large variance.

This note presents a relatively simple, direct derivation of the randomization variance of $V(\bar{x})$ and related quantities. This derivation is pedagogically appealing because it builds directly on the standard whole-sample approach used in introductory texts like Cochran [1], and does not require students to work with the more elaborate "polykay" approach used by Tukey [5] and Wishart [8].

∗[email protected]


2 Functions on Simple Random Samples

Consider a finite population of $N$ numbers $A = [a_1, a_2, \ldots, a_N]$. Let $L_{n,A}$ be the list of all possible samples of $n$ elements selected without replacement from $A$:

$$L_{n,A} = [S_1, S_2, \ldots, S_\alpha], \qquad \alpha = \binom{N}{n} = \frac{N!}{n!\,(N-n)!} \tag{1}$$

One selects a without-replacement simple random sample of size $n$ from $A$ by selecting one element of $L_{n,A}$ in such a way that each sample $S_j$ has probability $1/\alpha$ of being selected. Consider a function $f$ on $L_{n,A}$; that is, $f$ assigns each sample $S \in L_{n,A}$ a value $f(S)$. Two prominent examples of $f$ are the sample mean and the sample variance:

$$\bar{a}(S) = \frac{1}{n} \sum_{a_i \in S} a_i \tag{2}$$

$$v(S) = \frac{1}{n-1} \sum_{a_i \in S} \{a_i - \bar{a}(S)\}^2 \tag{3}$$

Evaluation of the randomization properties of $f(S)$ for $S \in L_{n,A}$ is conceptually straightforward. For example, $E\{f(S)\}$, the expected value of $f(S)$, is obtained by computing its arithmetic average over the $\binom{N}{n}$ equally likely samples in $L_{n,A}$:

$$E\{f(S)\} = \binom{N}{n}^{-1} \sum_{S \in L_{n,A}} f(S) \tag{4}$$

$V\{f(S)\}$, the variance of $f(S)$, is defined to be the expectation of the squared deviation $[f(S) - E\{f(S)\}]^2$:

$$V\{f(S)\} = \binom{N}{n}^{-1} \sum_{S \in L_{n,A}} [f(S) - E\{f(S)\}]^2 \tag{5}$$
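For small populations these definitions can be evaluated by brute force. The sketch below is our own illustration (the helper name `randomization_mean_variance` is not from the paper); it enumerates $L_{n,A}$ with `itertools.combinations` and applies equations (4) and (5) to the sample-variance function $v$:

```python
from itertools import combinations
from statistics import variance  # sample variance with divisor n - 1

def randomization_mean_variance(f, A, n):
    """Enumerate L_{n,A} and return (E{f(S)}, V{f(S)}) per equations (4)-(5)."""
    values = [f(S) for S in combinations(A, n)]  # alpha = C(N, n) values
    e = sum(values) / len(values)                      # equation (4)
    v = sum((x - e)**2 for x in values) / len(values)  # equation (5)
    return e, v

A = [2.0, 5.0, 7.0, 11.0]
e, v = randomization_mean_variance(variance, A, 2)
print(e)  # 14.25, which equals V(A), as equation (7) below predicts
print(v)
```

Since `variance` here is the usual $n-1$ divisor sample variance, `e` reproduces the unbiasedness result of equation (7), and `v` is exactly the quantity this paper studies.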

3 The Variance of the Sample Variance

Routine arguments (e.g., Cochran [1, Theorems 2.1, 2.2 and 2.4]) show

$$E\{\bar{a}(S)\} = \bar{A} \tag{6}$$

$$E\{v(S)\} = V(A) \tag{7}$$

$$V\{\bar{a}(S)\} = \left(1 - \frac{n}{N}\right) \frac{V(A)}{n} \tag{8}$$

where $\bar{a}(S)$ is the mean of the sample $S$, $v(S)$ is the variance of the sample $S$, $\bar{A} = \sum_{i=1}^{N} a_i / N$ is the mean of $A$, and

$$V(A) = \frac{1}{N-1} \sum_{i=1}^{N} \left(a_i - \bar{A}\right)^2 \tag{9}$$

is the full ﬁnite-population analogue of the sample variance v(S). The principal task in this paper is to obtain a relatively simple expression (formula) for the variance of v(S),

$$V\{v(S)\} = \binom{N}{n}^{-1} \sum_{S \in L_{n,A}} \{v(S) - V(A)\}^2 \tag{10}$$

in terms of the $a_i$'s in the underlying population. The formula is useful for evaluating the variance of the variance when straightforward computation from the definition is practically impossible due to combinatorial explosion.

4 The Main Result

The following theorem gives a formula for the variance of the sample variance under simple random sampling without replacement.

Theorem. Let $A = [a_1, a_2, \ldots, a_N]$ be a list of $N$ numbers ($N \geq 4$). Let $L_{n,A}$ be the list of all possible samples of $n$ numbers selected without replacement from $A$ ($2 \leq n \leq N-1$),

$$L_{n,A} = [S_1, S_2, \ldots, S_\alpha]$$

where $\alpha = \binom{N}{n} = N!/\{n!\,(N-n)!\}$, $S_i \subset A$, and $|S_i| = n$. Let $V_n$ be the list of the sample variances $v(S_i)$ for $S_i \in L_{n,A}$,

$$V_n = [v(S_1), v(S_2), \ldots, v(S_\alpha)]$$

Then $E[v(S) - E\{v(S)\}]^2$, the variance of the variances of all the samples of $A$ of size $n$, is given by

$$V\{v(S)\} = C_1 \sum_{i=1}^{N} a_i^4 + C_2 \sum_{i \neq j} a_i^3 a_j + C_3 \sum_{i<j} a_i^2 a_j^2 + C_4 \sum_{\substack{i \neq j,\, i \neq k \\ j<k}} a_i^2 a_j a_k + C_5 \sum_{i<j<k<l} a_i a_j a_k a_l$$

where

$$C_1 = \frac{N-n}{N^2\, n}$$

$$C_2 = -4\,\frac{N-n}{N^2 (N-1)\, n}$$

$$C_3 = 2\,\frac{N(N-1)\left\{(n-1)^2+2\right\} - n(n-1)\left\{(N-1)^2+2\right\}}{N^2 (N-1)^2\, n\,(n-1)}$$

$$C_4 = -4\,\frac{N(N-1)(n-2)(n-3) - n(n-1)(N-2)(N-3)}{N^2 (N-1)^2 (N-2)\, n\,(n-1)}$$

$$C_5 = 24\,\frac{N(N-1)(n-2)(n-3) - n(n-1)(N-2)(N-3)}{N^2 (N-1)^2 (N-2)(N-3)\, n\,(n-1)}$$
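The closed form can be checked against direct enumeration over $L_{n,A}$. Below is our own sketch (function names are ours), with the coefficients entered as they follow from the intermediate formulas in the proof sketch; for $A = [1, 2, 3, 4]$ and $n = 2$ both sides give $37/18$, which also agrees with Corollary 1 below:

```python
from itertools import combinations
from statistics import variance  # sample variance with divisor n - 1

def var_of_sample_variance(A, n):
    """Closed-form V{v(S)} from the theorem, with sums over distinct indices."""
    N = len(A)
    s4 = sum(a**4 for a in A)
    s31 = sum(A[i]**3 * A[j] for i in range(N) for j in range(N) if i != j)
    s22 = sum(A[i]**2 * A[j]**2 for i, j in combinations(range(N), 2))
    s211 = sum(A[i]**2 * A[j] * A[k]
               for i in range(N)
               for j, k in combinations(range(N), 2)
               if i != j and i != k)
    s1111 = sum(A[i] * A[j] * A[k] * A[l]
                for i, j, k, l in combinations(range(N), 4))
    C1 = (N - n) / (N**2 * n)
    C2 = -4 * (N - n) / (N**2 * (N - 1) * n)
    C3 = 2 * (N*(N-1)*((n-1)**2 + 2) - n*(n-1)*((N-1)**2 + 2)) \
        / (N**2 * (N-1)**2 * n * (n-1))
    C4 = -4 * (N*(N-1)*(n-2)*(n-3) - n*(n-1)*(N-2)*(N-3)) \
        / (N**2 * (N-1)**2 * (N-2) * n * (n-1))
    C5 = 24 * (N*(N-1)*(n-2)*(n-3) - n*(n-1)*(N-2)*(N-3)) \
        / (N**2 * (N-1)**2 * (N-2) * (N-3) * n * (n-1))
    return C1*s4 + C2*s31 + C3*s22 + C4*s211 + C5*s1111

def var_of_sample_variance_brute(A, n):
    """V{v(S)} by enumerating every sample, per equation (10)."""
    vs = [variance(S) for S in combinations(A, n)]
    e = sum(vs) / len(vs)
    return sum((v - e)**2 for v in vs) / len(vs)

print(var_of_sample_variance([1.0, 2.0, 3.0, 4.0], 2))        # 37/18
print(var_of_sample_variance_brute([1.0, 2.0, 3.0, 4.0], 2))  # 37/18
```

The brute-force version costs $\binom{N}{n}$ evaluations of $v$, while the closed form needs only the five moment-type sums, which is the practical point of the theorem.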

Sketch of Proof. The proof consists of determining the coefficients of all the fourth-degree terms $a_i^4$, $a_i^2 a_j^2$, $a_i^2 a_j a_k$, $a_i^3 a_j$, and $a_i a_j a_k a_l$ that appear in the summation. The summations in the formula are arranged so that all like terms are combined and thus appear only once; for example, $a_i a_j a_k a_l$ appears only for indices in increasing order $i < j < k < l$. Recall that $\alpha$ is the number of without-replacement samples $S$ of $A$ of size $n$, and the variance of the variances of the samples of size $n$ is

$$\frac{1}{\alpha}\sum_{S} \{v(S) - V(A)\}^2 = \frac{1}{\alpha}\sum_{S} v(S)^2 - V(A)^2$$

where the sums are taken over all the samples $S$ in $L_{n,A}$. For the second term, $V(A)^2$, it follows that

$$V(A)^2 = \frac{\displaystyle\sum_{i} a_i^4 + 2\sum_{i<j} a_i^2 a_j^2}{N^2} - 4\,\frac{\displaystyle\sum_{i \neq j} a_i^3 a_j + \sum_{\substack{i \neq j,\, i \neq k \\ j<k}} a_i^2 a_j a_k}{N^2 (N-1)} + 4\,\frac{\displaystyle\sum_{i<j} a_i^2 a_j^2 + 2\sum_{\substack{i \neq j,\, i \neq k \\ j<k}} a_i^2 a_j a_k + 6\sum_{i<j<k<l} a_i a_j a_k a_l}{N^2 (N-1)^2}$$

Similarly, for the term $\sum_{S} v(S)^2$ it follows that

$$v(S)^2 = \frac{\displaystyle\sum_{a_i \in S} a_i^4 + 2\sum_{\substack{a_i, a_j \in S \\ i<j}} a_i^2 a_j^2}{n^2} - 4\,\frac{\displaystyle\sum_{\substack{a_i, a_j \in S \\ i \neq j}} a_i^3 a_j + \sum_{\substack{a_i, a_j, a_k \in S \\ i \neq j,\, i \neq k,\, j<k}} a_i^2 a_j a_k}{n^2 (n-1)} + 4\,\frac{\displaystyle\sum_{\substack{a_i, a_j \in S \\ i<j}} a_i^2 a_j^2 + 2\sum_{\substack{a_i, a_j, a_k \in S \\ i \neq j,\, i \neq k,\, j<k}} a_i^2 a_j a_k + 6\sum_{\substack{a_i, a_j, a_k, a_l \in S \\ i<j<k<l}} a_i a_j a_k a_l}{n^2 (n-1)^2}$$

Each term in $\sum_{S \in L_{n,A}} v(S)^2$ is counted once for every sample that contains the corresponding elements: a given element lies in $\binom{N-1}{n-1}$ samples, a given pair in $\binom{N-2}{n-2}$ samples, a given triple in $\binom{N-3}{n-3}$, and a given quadruple in $\binom{N-4}{n-4}$. Hence

$$\sum_{S \in L_{n,A}} v(S)^2 = \frac{\displaystyle\binom{N-1}{n-1}\sum_{i} a_i^4 + 2\binom{N-2}{n-2}\sum_{i<j} a_i^2 a_j^2}{n^2} - \frac{\displaystyle 4\binom{N-2}{n-2}\sum_{i \neq j} a_i^3 a_j + 4\binom{N-3}{n-3}\sum_{\substack{i \neq j,\, i \neq k \\ j<k}} a_i^2 a_j a_k}{n^2 (n-1)} + \frac{\displaystyle 4\binom{N-2}{n-2}\sum_{i<j} a_i^2 a_j^2 + 8\binom{N-3}{n-3}\sum_{\substack{i \neq j,\, i \neq k \\ j<k}} a_i^2 a_j a_k + 24\binom{N-4}{n-4}\sum_{i<j<k<l} a_i a_j a_k a_l}{n^2 (n-1)^2}$$

Dividing by $\alpha = \binom{N}{n}$ and simplifying the binomial coefficients leads to

$$\frac{1}{\alpha}\sum_{S \in L_{n,A}} v(S)^2 = \frac{\displaystyle\sum_{i} a_i^4}{nN} + 2(n-1)\,\frac{\displaystyle\sum_{i<j} a_i^2 a_j^2}{n(N-1)N} - 4\,\frac{\displaystyle\sum_{i \neq j} a_i^3 a_j}{n(N-1)N} - 4(n-2)\,\frac{\displaystyle\sum_{\substack{i \neq j,\, i \neq k \\ j<k}} a_i^2 a_j a_k}{n(N-2)(N-1)N} + 4\,\frac{\displaystyle\sum_{i<j} a_i^2 a_j^2}{n(n-1)(N-1)N} + 8(n-2)\,\frac{\displaystyle\sum_{\substack{i \neq j,\, i \neq k \\ j<k}} a_i^2 a_j a_k}{n(n-1)(N-2)(N-1)N} + 24(n-2)(n-3)\,\frac{\displaystyle\sum_{i<j<k<l} a_i a_j a_k a_l}{n(n-1)(N-3)(N-2)(N-1)N}$$

Substituting the expressions for $V(A)^2$ and $\frac{1}{\alpha}\sum_{S} v(S)^2$ into the expression for the variance of the sample variances leads to the result of the theorem.

The formula for $V\{v(S)\}$ becomes considerably simpler for a population with a discrete uniform distribution on a finite interval; in this case $A$ is a finite arithmetic sequence. This occurs, for example, in the important special case of equal-probability systematic sampling (Cochran [1], Chapter 8).

Corollary 1. Let $A = [1, 2, \ldots, N]$, $N \geq 3$. Let $S$, $L_{n,A}$ and $v(S)$ be as in the theorem above. Then the variance of the sample variances is

$$V\{v(S)\} = \frac{N(N+1)(N-n)\,(2nN + 3n + 3N + 3)}{360\, n(n-1)} \tag{11}$$
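Corollary 1 is easy to confirm numerically for small $N$; e.g., for $N = 4$ and $n = 2$ both the formula and the definition give $37/18$. A minimal check (our own sketch; the names `corollary1` and `brute_force` are ours):

```python
from itertools import combinations
from statistics import variance  # sample variance with divisor n - 1

def corollary1(N, n):
    """Right-hand side of equation (11) for A = [1, 2, ..., N]."""
    return N * (N + 1) * (N - n) * (2*n*N + 3*n + 3*N + 3) / (360 * n * (n - 1))

def brute_force(N, n):
    """V{v(S)} for A = [1, ..., N] straight from the definition, equation (10)."""
    vs = [variance(S) for S in combinations(range(1, N + 1), n)]
    e = sum(vs) / len(vs)
    return sum((v - e)**2 for v in vs) / len(vs)

print(corollary1(4, 2))   # 37/18, about 2.0556
print(brute_force(4, 2))  # the same value, by enumeration of the 6 samples
```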

For a more general arithmetic sequence we have:

Corollary 2. Let $A = [a_0,\, a_0 + d,\, \ldots,\, a_0 + (N-1)d]$, $N \geq 3$. Then the variance of $v(S)$ is

$$V\{v(S)\} = \frac{N(N+1)(N-n)\,(2nN + 3n + 3N + 3)}{360\, n(n-1)}\, d^4$$

Corollary 3. Let $A$ be the list of $N$ numbers uniformly spaced on the interval $[1/N, 1]$, $A = \left[\frac{1}{N}, \frac{2}{N}, \ldots, \frac{N-1}{N}, \frac{N}{N}\right]$, $N \geq 3$. Then the variance of $v(S)$ is

$$V\{v(S)\} = \frac{\left(1 + \frac{1}{N}\right)\left(1 - \frac{n}{N}\right)\left(2n + 3 + \frac{3n}{N} + \frac{3}{N}\right)}{360\, n(n-1)} \tag{12}$$

$V\{v(S)\}$ approaches $(2n+3)/\{360\, n(n-1)\}$ as $N$ approaches $\infty$. For the two simplest special cases, $n = 2$ and $n = N-1$, we have

$$V\{v(S)\} = \frac{\left(1 + \frac{1}{N}\right)\left(1 - \frac{2}{N}\right)\left(7 + \frac{9}{N}\right)}{720}, \qquad \text{if } n = 2 \tag{13}$$

and

$$V\{v(S)\} = \frac{\left(1 + \frac{1}{N}\right)\left(1 + \frac{2}{N}\right)}{180\,(N-1)(N-2)}, \qquad \text{if } n = N-1 \tag{14}$$
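Equation (12), and hence the two specializations above, can likewise be confirmed by enumeration. A short sketch of our own (`corollary3` and `brute_force` are names of our choosing):

```python
from itertools import combinations
from statistics import variance  # sample variance with divisor n - 1

def corollary3(N, n):
    """Right-hand side of equation (12) for A = [1/N, 2/N, ..., 1]."""
    return ((1 + 1/N) * (1 - n/N) * (2*n + 3 + 3*n/N + 3/N)
            / (360 * n * (n - 1)))

def brute_force(N, n):
    """V{v(S)} for A = [1/N, ..., 1] straight from equation (10)."""
    A = [i / N for i in range(1, N + 1)]
    vs = [variance(S) for S in combinations(A, n)]
    e = sum(vs) / len(vs)
    return sum((v - e)**2 for v in vs) / len(vs)

# n = 2 and n = N - 1 are the special cases of equations (13) and (14)
print(corollary3(6, 2), brute_force(6, 2))
print(corollary3(6, 5), brute_force(6, 5))
```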

References

1. W. G. Cochran, *Sampling Techniques* (3rd ed.), John Wiley, 1977.
2. R. L. Graham, D. E. Knuth and O. Patashnik, *Concrete Mathematics*, Addison-Wesley, 1989.
3. J. W. Tukey, Some sampling simplified, *Journal of the American Statistical Association*, 45 (1950), 501-519.
4. J. W. Tukey, Keeping moment-like sampling computation simple, *The Annals of Mathematical Statistics*, 27 (1956), 37-54.
5. J. W. Tukey, Variances of variance components: I. Balanced designs, *The Annals of Mathematical Statistics*, 27 (1956), 722-736.
6. J. W. Tukey, Variances of variance components: II. Unbalanced single classifications, *The Annals of Mathematical Statistics*, 28 (1957), 43-56.
7. J. W. Tukey, Variance components: III. The third moment in a balanced single classification, *The Annals of Mathematical Statistics*, 28 (1957), 378-384.
8. J. Wishart, Moment coefficients of the k-statistics in samples from a finite population, *Biometrika*, 39 (1952), 1-13.

Eungchun Cho, Kentucky State University∗ Moon Jung Cho, Bureau of Labor Statistics

John Eltinge, Bureau of Labor Statistics

Key Words: Sample Variance; Randomization Variance; Polykays; Moments of Finite Population

Abstract A direct derivation of the randomization variance of the sample variance V (x) and related formulae are presented. Examples of the special cases of uniformly distributed population are given.

1 Introduction

Introductory courses in the randomization approach to survey inference generally begin with a relatively parsimonious development based on withoutreplacement selection of simple random samples. For an arbitrary ﬁnite population one establishes the unbiasedness of the sample mean x for the corresponding ﬁnite population mean; evaluates the randomization variance of the sample mean; and develops as unbiased estimator, V (x) for this randomization variance. One subsequently uses similar developments for related randomized designs, e.g., stratiﬁed random sampling and some forms of cluster sampling. In applications of this material to practical problems, it is often important to evaluate V (V (x)), the variance of the variance estimator. For example, some cluster sample designs may be considered problematic if the resulting V (x) is unstable, i.e., has an unreasonably large variance.

This note presents a relatively simple, direct derivation of the randomization variance of V (x) and related quantities. This derivation is pedagogically appealing because it builds directly on the standard whole sample approach used in introductory texts like Cochran [1], and does not require students to work with the more elaborate ”polykay” approach used by Tucky [5], and Wishart [8],

∗[email protected]

1

2 Functions on Simple Random Samples

Consider a ﬁnite population of N numbers A = [a1, a2, . . . , aN ]. Let Ln,A be

the list of all possible samples of n elements selected without replacement from

A.

N

N!

Ln,A = [S1, S2, . . . , Sα] , α = n = n!(N − n)! (1)

One selects a without-replacement simple random sample of size n from A by selecting one element from Ln,A in such a way that each sample Sj has probability 1/α of being selected. Consider a function f on Ln,A, that is, f assigns a each sample S ∈ Ln,A a value f (S). Two prominent examples of f are the sample mean and the sample variance:

1 a(S) = n ai (2)

ai ∈S

1 v(S) =

{ai − a(S)}2

(3)

n−1

ai ∈S

Evaluation of the randomization properties of f (S) for S ∈ Ln,A is conceptu-

ally straightforward. For example, E{f (S)}, the expectated value of f (S), is obtained by computing its arithmetic average taken over the Nn equally likely samples in Ln,A:

1

E{f (S)} = N

f (S)

(4)

n S∈Ln,A

V {f (S)}, the variance of f (S), is deﬁned to be the expectation of the squared deviations [f (S) − E{f (S)}]2 ,

1 V {f (S)} =

[f (S) − E{f (S)}]2

(5)

N

n S∈Ln,A

3 The Variance of the Sample Variance

Routine arguments (e.g., Cochran [1, Theorems 2.1, 2.2 and 2.4]) show

E{a(S)} = A

(6)

E{v(S)} = V (A)

(7)

V {a(S)} = 1 − Nn V (A) (8) n

where a(S) is the mean of the sample S, v(S) is the variance of the sample S,

N

A = ai/N , the mean of A, and

i=1

1N

2

V (A) = N − 1 ai − A (9)

i=1

is the full ﬁnite-population analogue of the sample variance v(S). The principal task in this paper is to obtain a relatively simple expression (formula) for the variance of v(S),

1 V {v(S)} =

{v(S) − V (A)}2

(10)

N

n S∈Ln,A

in terms of ai’s in the underlying population. The formula will be useful for estimating the variance of the variance when the straight forward computation by the deﬁnition is practically impossible due the combinatorial explosion.

4 The Main Result

The following theorem gives a formula for the variance of the sample variance

under simple random sampling without replacement.

Theorem. Let A = [a1, a2, . . . , aN ] be a list of N numbers, (N ≥ 4). Let Ln,A be the list of all possible samples of n numbers selected without replacement from

A, ( 2 ≤ n ≤ N − 1 ).

Ln,A = [S1, S2, . . . , Sα]

where α =

N n

= N !/n!(N − n)!, Si ⊂ A, and |Si| = n Let Vn be the list of the

sample variance v(Si) for each Si ∈ Ln,A,

Vn = [v(S1), v(S2), . . . , v(Sα)]

Then E [v(S) − E{v(S)}]2 , the variance of the variances of all the samples of A of size n, is given by

N

V {v(S)} = C1 a4i + C2 a3i aj + C3 a2i a2j +

i=1

i=j

i

+ C4

a2i aj ak + C5

aiaj akal

i=j,i=k

j

i

where

N −n C1 = N 2n

N −n C2 = −4 N 2 (N − 1) n

N (N − 1)((n − 1)2 + 2) − n(n − 1)((N − 1)2 + 2)

C3 = 2

N 2 (N − 1)2 n (n − 1)

N (N − 1)(n − 2)(n − 3) − n(n − 1)(N − 2)(N − 3)

C4 = 4

N 2 (N − 1)2 (N − 2) n (n − 1)

N (N − 1)(n − 2)(n − 3) − n(n − 1)(N − 2)(N − 3)

C5 = 24

N 2 (N − 1) (N − 2) (N − 3)2 n (n − 1)

Sketch of Proof. The proof involves determining the coeﬃcients of all the fourth degree terms ai4, ai2aj 2, ai2aj ak, ai3aj , and aiaj akal that appears in the summation. The summations in the formula are such that all like terms are combined thus appear only once. For example, aiajakal appears only for the indices arranged in increasing order i < j < k < l. Recall that α is the number of without-replacement samples S of A of size n and the variance of the variances of the samples S of A of size n is

{v(S) − V (A)}2 α

= S v(S)2 − V (A)2 α

where the sum is taken over all the samples S in Ln,A. For the second term, S∈Ln,A V (A)2. it follows that

V (A)2 =

a4i + 2 a2i a2j

i=j a3i aj + i=j,i=k a2i aj ak

i

i

j

N2

− 4

N 2 (N − 1)

+

i

+4

j

N 2 (N − 1)2

Similarly, for the term S v(S)2 it follows

a4i + 2

a2i a2j

ai ∈S

i

i=j a3i aj + i=j,i=k,j

v(S)2 =

ai,aj ∈S

− 4 ai,aj ∈S

ai,aj ,ak ∈S

+

n2

n2 (n − 1)

i

a2i a2j + 2

a2i aj ak + 6

i=j,i=k,j

i

aiajakal

ai,aj ∈S

ai,aj ,ak ∈S

ai,aj ,ak ,al∈S

+ 4

n2 (n − 1)2

Each term in

v(S)2 =

S∈Ln,A

S v(S)2 is transformed to give

N −1 n−1

a4i + 2

N −2 n−2

a2i a2j

i i

n2

4 Nn−−22 i=j a3i aj + 4 Nn−−33 i=j,i=k a2i aj ak

−

j

+

n2 (n − 1)

4 Nn−−22 i

+

j

n2 (n − 1)2

Simpliﬁcation of the binomial coeﬃcients leads to,

1

v(S)2 =

α

S∈Ln,A

a4i

a2i a2j

a3i aj

i + 2 (n − 1) i

− 4 i=j

nN

n (N − 1) N n (N − 1) N

a2i aj ak

i=j,i=k

− 4 (n − 2)

j

+

n (N − 2) (N − 1) N

a2i a2j

a2i aj ak

i=j,i=k

+ 4 i

n (n − 1) (N − 1) N

n (n − 1) (N − 2) (N − 1) N

aiaj akal

+ 24 (n − 2) (n − 3)

i

n (n − 1) (N − 3) (N − 2) (N − 1) N

Substitution of the expressions of V (A)2 and of v(S)2 into the expression of the variance of the samples S of A leads to the result of the theorem.

The formula for V {v(S)} becomes considerably simpler for the population of a discrete uniform distribution on a ﬁnite interval. In this case, A is a ﬁnite arithmetic sequence. This occurs, for example, in the important special cases of equal-probability systematic sampling (Cochran [1], Chapter 8). Corollary 1. Let A = [1, 2, . . . , N ], N ≥ 3. Let S, Ln,A and v(S) be as in the theorem above. The variance of the sample variances

N (N + 1)(N − n) (2 nN + 3n + 3N + 3)

V {v(S)} =

(11)

360 n(n − 1)

For a more general arithmetic sequence, Corollary 2. Let A = [a0, a0 + d, . . . , a0 + (N − 1) d], N ≥ 3. Then the variance of v(S),

V {v(S)} = N (N + 1)(N − n) (2 nN + 3n + 3N + 3) d4 360 n(n − 1)

Corollary 3. Let A be a list of numbers uniformly distributed on the interval [1/N, 1], A = N1 , N2 , . . . , NN−1 , NN , N ≥ 3. The variance of v(S),

V {v(S)} = (1 + N1 )(1 − Nn ) 2 n + 3 + 3Nn + N3 (12) 360 n(n − 1)

V {v(S)} approaches 2 n + 3/360 n (n − 1) as N approaches ∞. For simplest two special cases when n = 2 and n = N − 1, we have

V {v(S)} = (1 + N1 )(1 − N2 )(7 + N9 ) , if n = 2 (13) 720

and

V {v(S)} = (1 + N1 )(1 + N2 ) , if n = N − 1 (14) 180

References

1. W. G. Cochran, Sampling Techniques (3rd ed. ), John Wiley, 1977. 2. R. L. Graham, D. E. Knuth and O. Patashnik, Concrete Mathematics, Addison-Wesley, 1989. 3. J. W. Tukey, Some sampling simpliﬁed, Journal of the American Statistical Association, 45 (1950), 501-519. 4. J. W. Tukey, Keeping moment-like sampling computation simple, The Annals of Mathematical Statistics, 27 (1956), 37-54. 5. J. W. Tukey, Variances of variance components: I. Balanced designs. The Annals of Mathematical Statistics, 27 (1956), 722-736. 6. J. W. Tukey, Variances of variance components: II. Unbalanced single classiﬁcations. The Annals of Mathematical Statistics, 28 (1957), 43-56. 7. J. W. Tukey, Variance components: III. The third moment in a balanced single classiﬁcation. The Annals of Mathematical Statistics, 28 (1957), 378-384. 8. J. Wishart, Moment Coeﬃcients of the k-Statistics in Samples from a Finite Population. Biometrika, 39 (1952), 1-13.