MATH 365, Elementary Statistics

Lesson 6: Sampling Distribution

Introduction

6.1 Central Limit Theorem and Sampling Distribution of the Proportion

Introduction

The sample mean x̄ that we computed in the previous chapters is, in fact, the observed value of a random variable X̄. Similarly, the sample variance s² that we computed before is the observed value of a random variable S². Each time you collect a sample, the computed sample mean x̄ is the value of the random variable X̄ for that sample. This is explained in the following example.

Example. Suppose we want to study the height distribution of the U.S. population. We collect a sample of size n = 1713. We regard the height x_i of the i^th individual in this sample as the observed value of a random variable X_i. Here X_i denotes the height of the i^th member of the sample, which could be the height of any person in the whole U.S. population. When we finish collecting data, we have n measurements

x₁, x₂, …, x_n.

They are, respectively, the observed values of n random variables

X₁, X₂, …, X_n.

We (re)define the sample mean X̄ as the random variable

X̄ = (X₁ + X₂ + … + X_n)/n.

We also (re)define the sample variance S² as the random variable

S² = (1/(n-1)) ∑_{i=1}^{n} (X_i - X̄)².

So, the sample mean we computed before in Lesson 2 is a value of X̄.
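
The two definitions above can be checked numerically. The following is a minimal Python sketch, assuming a small made-up list of heights (the values and the names `heights`, `sample_mean`, and `sample_variance` are illustrative, not data from this lesson). It computes the observed sample mean and the observed sample variance with the n - 1 denominator.

```python
from statistics import mean, variance

# Hypothetical heights in inches; illustrative values, not data from the lesson.
heights = [64.0, 70.5, 67.2, 61.8, 72.1, 68.4]

def sample_mean(xs):
    """Observed sample mean: (x_1 + ... + x_n) / n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Observed sample variance with the n - 1 denominator."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

x_bar = sample_mean(heights)
s_squared = sample_variance(heights)

# The standard library's variance() uses the same n - 1 convention.
print(x_bar, s_squared, variance(heights))
```

A different sample of heights would give different numbers, which is exactly the sense in which the sample mean and sample variance are random variables.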

We also say that X₁, X₂, … , X_n is a sample from the population X = height of an American. We assume that our sampling was done with replacement. Such a sample has the following properties.

Let X = height of an American, and let the mean of X be μ and its variance be σ². Then X is called the parent or the population random variable. Also, μ and σ² are called the population mean and the population variance.
Each sample member X_i has the same distribution as X. So, the mean of X_i is μ and the variance of X_i is σ².
The sample members X₁, X₂, …, X_n are all mutually independent.
The distribution of X̄ is called the sampling distribution of X̄.
Theorem. The mean of the sample mean X̄ is the population mean μ, that is,

E(X̄) = E(X) = μ.

The variance of the sample mean X̄ is given by

Var(X̄) = σ²/n.

So, the standard deviation of X̄, denoted by σ_X̄, is given by

σ_X̄ = σ/√n.
Definition. The standard deviation σ_X̄ is also called the standard error.
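
The theorem can be verified exactly on a small population. The sketch below assumes a toy population, one roll of a fair die, which is not part of this lesson. It lists every ordered sample of size n drawn with replacement and computes the mean and variance of the sample mean with exact fractions. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumed toy population: one roll of a fair die, values 1..6 equally likely.
population = [1, 2, 3, 4, 5, 6]

def population_mean_var(values):
    """Exact mean and variance of an equally likely finite population."""
    n = len(values)
    mu = Fraction(sum(values), n)
    var = sum((Fraction(v) - mu) ** 2 for v in values) / n
    return mu, var

def sample_mean_mean_var(values, n):
    """Exact mean and variance of the sample mean over all ordered
    samples of size n drawn with replacement (all equally likely)."""
    means = [Fraction(sum(s), n) for s in product(values, repeat=n)]
    m = sum(means) / len(means)
    v = sum((x - m) ** 2 for x in means) / len(means)
    return m, v

mu, sigma2 = population_mean_var(population)
for n in (1, 2, 3):
    m, v = sample_mean_mean_var(population, n)
    # Columns: n, mean of the sample mean, its variance, and sigma^2 / n.
    print(n, m, v, sigma2 / n)
```

In each row the mean of the sample mean equals the population mean, and the last two columns agree, as the theorem states. Exhaustive listing is only practical for tiny populations and small n; it is used here because it involves no approximation.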

Remark. In the above discussion, we have assumed that the sampling was done with replacement. That means that each time a sample member is drawn, it is put back before we select the next member. A member could, therefore, appear more than once. Although this may seem unnatural, a repeat is unlikely when we are working with a large population, and sampling with replacement is the most natural model from the statistical point of view. (How often would one receive calls twice for the same poll?)

The type of sampling where we do not place back the item selected before we select the next one is called sampling without replacement. Although many textbooks have a lengthy discussion of this concept, we will not emphasize it. All our samples are drawn with replacement and have the above properties.

6.1 Central Limit Theorem and Sampling Distribution of the Proportion

Central Limit Theorem

Suppose X₁,X₂, …,X_n is a sample from a population X with mean μ and variance σ².

Assume n is large.

Then:

1) The sample mean X̄ is, approximately, distributed as

N(μ, σ_X̄)

where σ_X̄ = σ/√n.

2) So, approximately,

P(a < X̄ < b) = P(L < Z < R)

where L = (a-μ)/σ_X̄ and R = (b-μ)/σ_X̄. That is,

P(a < X̄ < b) = P((a-μ)/(σ/√n) < Z < (b-μ)/(σ/√n)).

If the parent population X is Normal, then 1) and 2) are exact.
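
The probability formula above can be turned into a short calculation. This is a minimal Python sketch, assuming Python 3.8 or later for `statistics.NormalDist`. The function name `clt_probability` and the numbers in the example are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def clt_probability(a, b, mu, sigma, n):
    """Approximate P(a < sample mean < b) for a sample of size n
    from a population with mean mu and standard deviation sigma."""
    standard_error = sigma / sqrt(n)
    left = (a - mu) / standard_error    # L in the text
    right = (b - mu) / standard_error   # R in the text
    z = NormalDist()                    # standard normal Z
    return z.cdf(right) - z.cdf(left)

# Hypothetical numbers: the standard error is 15/6 = 2.5.
mu, sigma, n = 100.0, 15.0, 36
print(clt_probability(97.5, 102.5, mu, sigma, n))   # P(-1 < Z < 1)
```

For a one-sided probability such as "the sample mean is above a", one option is to pass `float("inf")` as b, so that the right-hand term becomes 1.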

Sampling Distribution of the Proportion

Suppose you are conducting a poll to determine the proportion p (or percentage) of people in favor of a certain presidential candidate. You interview a randomly selected sample of n voters. Then you let X be the number of people among these n voters who are in favor of the candidate. Then X/n is the proportion in this sample that are in favor of the candidate. We use this sample proportion X/n as an estimate for the proportion of the entire voter population that are in favor of the candidate. This is the number X/n that the pollsters report on TV every evening before the election.

Here p is the proportion of voters that are in favor of the candidate. So, X is a B(n,p) random variable. We have already seen (section 5.3 in lesson 5) that, approximately, X follows a N(μ, σ) distribution, where μ = np, σ = √(np(1-p)). Dividing X by n divides both the mean and the standard deviation by n. From this it follows that the sample proportion X/n, approximately, has

N(p, σ_{X/n}) distribution
where σ_{X/n} = √(p(1-p)/n).

In fact, the same could be derived from the central limit theorem. Let

Y=1 if success
Y=0 if failure

Here by "success" we mean that the voter is in favor of the candidate. Then Y is a Bernoulli(p) random variable, the mean of Y is p, and the variance of Y is p(1-p). The response of each voter in the sample can thus be represented as a random variable as follows:

X_i=1 if the i^th sample member is a success
X_i=0 if the i^th sample member is a failure

Then X₁, X₂, …, X_n is a sample from the Y-population, and the sample proportion

X/n = X̄ = (X₁+X₂+…+X_n)/n

is the sample mean. So, by the CLT, the sample proportion X̄ = X/n, approximately, has

N(p, σ_{X/n}) distribution

where σ_{X/n} = √(p(1-p)/n).

The final formulas regarding the sample proportion X̄ = X/n are as follows:

The mean μ and the standard deviation σ of X̄ = X/n are given by

μ = p,   σ = σ_{X/n} = √(p(1-p)/n).

So, approximately,

P(a < X/n < b) = P(L < Z < R)

where L = (a-μ)/σ and R = (b-μ)/σ. That is,

P(a < X/n < b) = P((a-p)/σ_{X/n} < Z < (b-p)/σ_{X/n}).
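
The formulas for the sample proportion can be sketched in Python in the same way. The function name `proportion_probability` and the poll numbers below are hypothetical. The sketch assumes n is large enough for the normal approximation to be reasonable, and Python 3.8 or later for `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def proportion_probability(a, b, p, n):
    """Approximate P(a < X/n < b) when X is B(n, p) and n is large."""
    sigma = sqrt(p * (1 - p) / n)       # standard deviation of X/n
    z = NormalDist()                    # standard normal Z
    return z.cdf((b - p) / sigma) - z.cdf((a - p) / sigma)

# Hypothetical poll: true proportion 0.5 and 100 voters, so sigma = 0.05.
print(proportion_probability(0.45, 0.55, 0.5, 100))   # P(-1 < Z < 1)
print(proportion_probability(0.40, 0.60, 0.5, 100))   # P(-2 < Z < 2)
```

Because the standard deviation shrinks like 1/√n, quadrupling the number of voters halves the spread of the sample proportion around p.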

Remark. The same formulas apply whenever you are estimating a proportion of successes p. Some examples are the proportion of defective items, the proportion of people in favor of capital punishment, and the proportion of immigrants.

Remark. The normal approximation of the sample proportion given above is not really different from the normal approximation of the binomial random variable (section 5.3). The only difference is the way we use them. In section 5.3, we used continuity correction. For large n, the continuity correction is negligible and has very little effect on the answer.
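
The effect of the continuity correction can be examined numerically. The sketch below assumes p = 0.5 and a cutoff about one standard deviation above the mean; both choices are illustrative. It compares the exact binomial probability P(X ≤ k) with the normal approximation with and without the 0.5 shift. All function names are hypothetical.

```python
from math import exp, floor, lgamma, log, sqrt
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ B(n, p), summing the pmf in log space."""
    total = 0.0
    for j in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                   + j * log(p) + (n - j) * log(1 - p))
        total += exp(log_pmf)
    return total

def normal_cdf_approx(k, n, p, correction):
    """Normal approximation to P(X <= k), with or without the 0.5 shift."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    shift = 0.5 if correction else 0.0
    return NormalDist(mu, sigma).cdf(k + shift)

def correction_gap(n, p):
    """Difference the continuity correction makes at a cutoff about one
    standard deviation above the mean."""
    k = floor(n * p + sqrt(n * p * (1 - p)))
    return normal_cdf_approx(k, n, p, True) - normal_cdf_approx(k, n, p, False)

for n in (20, 200, 2000):
    k = floor(n * 0.5 + sqrt(n * 0.25))
    # Columns: n, cutoff k, exact, corrected approximation, uncorrected.
    print(n, k, binomial_cdf(k, n, 0.5),
          normal_cdf_approx(k, n, 0.5, True),
          normal_cdf_approx(k, n, 0.5, False))
```

Under these assumptions the gap between the corrected and uncorrected approximations shrinks roughly like 1/√n, which is why the correction matters for small n and can be dropped for large n.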

Problems on 6.1: Central Limit Theorem and Sampling Distribution of the Proportion

Problems on Central Limit Theorem:

Exercise 6.1.1. It is known that the tuition paid per semester by students in a university has a distribution with mean $2,050 and standard deviation $310. If 64 students are interviewed, what is the approximate probability that the sample mean tuition paid will be above $2,060?
Solution

Exercise 6.1.2. The monthly water consumption X per household in a subdivision in Kansas City has a normal distribution with mean 15000 gallons and standard deviation 3000 gallons. What is the probability that the mean consumption of the 44 households in the subdivision will exceed 16000 gallons?
Solution

Exercise 6.1.3. According to some data, the annual Kansas wheat export X has a mean of 733 million dollars and a standard deviation of 163 million dollars. What is the probability that total Kansas wheat exports over the next 10 years will exceed 8040 million dollars?
Solution

Problems on Population Proportion:

Exercise 6.1.4. According to a report entitled "Pediatric Nutrition Surveillance" published by the Centers for Disease Control (CDC), 18 percent of children younger than two had anemia in 1997. On a particular day in that year, a pediatrician examined 180 children.

What is the expected (sample) proportion of children with anemia?
What is the variance of the sample proportion of children with anemia?
What is the probability that the proportion will exceed 0.20?

Solution

Exercise 6.1.5. On one day during an impeachment hearing, it is claimed that 75 percent of eligible voters think the President should not be impeached. Suppose we interview 700 voters. Assuming the above, what is the probability that the sample proportion of voters who do not think the President should be impeached

is less than .73?
is less than .70?
is less than .60?

Solution
