MATH 365, Elementary Statistics

Lesson 6 : Elements of Sampling Distribution

Satya Mandal


Due Date: Visit the homework site.

6.1 Sampling Distribution

The goal of this course has been to develop methods that use sample statistics t (or T) to estimate population parameters θ. For example, to estimate the mean weight μ (the parameter) of the fish population in the nearest lake, you may catch a sample of fish, compute the mean weight x̄ (the statistic) of this sample, and declare it an estimate of the population mean μ.

Since t would only be an estimate of θ, there would be an error ε = |t - θ|. We would like this error ε to be small, that is, within a specified tolerance ρ. Further, we would like the error ε to be within the tolerance ρ more often than not. For example, we may require that the error ε be within our tolerance at least 90 percent of the time (among all our trials).

In fact, our estimate t is a variable number and it varies each time we take a sample. We denote this variable by T. Whenever we have a sample, T has a value T=t.

Indeed, T is a random variable on the sample space of all possible samples. Therefore, T has a probability distribution. Since T depends on samples, its probability distribution is called a sampling distribution. When we say that "the error ε should be within our tolerance ρ at least 90 percent of the time", we mean that

P(|T - θ| ≤ ρ) ≥ .90.

In particular, the sample means x̄ of numerical data that we computed in Lesson 2 would be the observed values of a random variable X̄, corresponding to the sample data we had. Similarly, the sample variances s² that we computed in Lesson 2 would be the observed values of a random variable S². Each time you collect a sample (or data), the computed sample mean x̄ (respectively, the variance s², standard deviation s) would be the value of the random variable X̄ (respectively, S², S) for that sample.

Example. Suppose we want to study the height distribution of the U.S. population. Let X represent the height of a randomly selected member of the U.S. population.

We collect a sample of size n. The sample would be n numbers

x₁, x₂, …, x_n

representing the height of n individuals.

We shall consider the height x_i of the i^th individual as the observed value of a random variable X_i.
Here X_i denotes the height of the i^th member of the sample, which could be the height of anybody from the population.
Therefore, these n measurements

x₁, x₂, …, x_n

are, respectively, the observed values of n random variables

X₁, X₂, …, X_n.

We (re)define the sample mean X̄ as the random variable

X̄ = (X₁ + X₂ + … + X_n)/n.

We also (re)define the sample variance S² as the random variable

S² = [1/(n - 1)] ∑_{i=1}^{n} (X_i - X̄)².

So, the sample means that we computed before in Lesson 2 are the values of the random variable X̄.

We will consider the sampling distribution of the sample mean X̄. (In Lesson 7, we will briefly mention the sampling distribution of the sample variance S².)
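As a small illustration of the two definitions, the following sketch computes one observed value of the sample mean and of the sample variance. The five heights are made-up numbers used only for illustration; they do not come from any data set in this course.

```python
import statistics

# Hypothetical heights (in inches) of n = 5 sampled individuals.
heights = [64.0, 70.0, 68.0, 72.0, 66.0]
n = len(heights)

# Sample mean: (x1 + x2 + ... + xn) / n
x_bar = sum(heights) / n

# Sample variance: sum of squared deviations from x_bar, divided by n - 1
s_squared = sum((x - x_bar) ** 2 for x in heights) / (n - 1)

# Cross-check against the standard library, which also divides by n - 1.
library_variance = statistics.variance(heights)

print(x_bar, s_squared, library_variance)
```

A different sample would give different numbers, which is exactly why the sample mean and the sample variance are treated as random variables.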

Sampling Types

There are many ways to do sampling. Most commonly discussed among them are

Sampling without replacement,
Sampling with replacement.

Sampling without replacement is the type of sampling where, whenever a sample member is selected, that member is excluded from subsequent selections. It is analogous to selecting n balls from a box of N balls: balls are selected one by one and are not put back in the box before subsequent selections. This type of sampling rules out the possibility of selecting a member more than once. For small populations, the possibility of selecting a member twice may be significant, so sampling without replacement would be appropriate.

Sampling with replacement is the type of sampling where each selection is made without any regard to previous selections. In other words, each time a sample member is drawn, it is placed back in the population before the next selection is made. This way, each selection is made from the same whole population. A member could, therefore, be selected more than once. This may seem unnatural, but when working with large populations a repeated selection is unlikely, and this type of sampling is the most natural from the statistical point of view. (How often would one receive calls twice for the same poll?) We will only consider sampling with replacement.
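The two sampling types can be tried out with Python's standard `random` module. This is only a sketch: the population of ten labelled members and the sample size are hypothetical choices made for illustration.

```python
import random

random.seed(0)  # fixed seed so that the run is repeatable

population = list(range(1, 11))  # hypothetical population of N = 10 labelled members
n = 5                            # sample size

# Without replacement: a member can appear at most once in the sample.
without_replacement = random.sample(population, n)

# With replacement: every draw is made from the same whole population,
# so the same member may be selected more than once.
with_replacement = random.choices(population, k=n)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

With a population this small, repeats in the second sample are common; with a very large population they become rare, which is the point made above.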

Properties

Let X (like height) be a random variable with mean μ and variance σ². Let X₁, X₂, …, X_n be a sample from the X-population. We assume that the sampling was done with replacement. Such a sample has the following properties.

X would be called the parent population or the population random variable. Also, μ and σ² are called the population mean and variance.
Each sample member X_i has the same distribution as X. So, the mean of X_i is μ and the variance of X_i is σ².
The sample members X₁, X₂, …, X_n are all mutually independent. (In fact, one has to ensure that they are drawn independently.)
The distribution of X̄ is called the sampling distribution of X̄.
Theorem. The mean of the sample mean X̄ is the population mean μ, that is

E(X̄) = E(X) = μ.

The variance of the sample mean X̄ is given by

Var(X̄) = σ²/n.

So, the standard deviation of X̄, denoted by σ_X̄, is given by

σ_X̄ = σ/√n.
Definition. The standard deviation σ_X̄ is also called the standard error.
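The theorem can be checked empirically by simulation. The sketch below is not part of the lesson: it assumes, for illustration only, that the parent population is one roll of a fair die (so μ = 3.5 and σ² = 35/12), draws many samples of size n = 25 with replacement, and compares the mean and variance of the resulting sample means with μ and σ²/n.

```python
import random
import statistics

random.seed(365)  # fixed seed so the run is repeatable

# Assumed parent population: one roll of a fair die.
mu = 3.5
sigma_sq = 35 / 12

n = 25           # sample size
trials = 20_000  # number of samples drawn

# Each entry is the sample mean of one sample of n rolls.
sample_means = [
    sum(random.randint(1, 6) for _ in range(n)) / n
    for _ in range(trials)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print("mean of sample means:    ", round(mean_of_means, 4), "theory:", mu)
print("variance of sample means:", round(var_of_means, 4), "theory:", round(sigma_sq / n, 4))
```

The simulated values should land close to the theoretical ones, though not exactly on them, since the simulation itself is a sample.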

6.2 Central Limit Theorem

The following theorem describes the sampling distribution of the sample mean X̄. It is called the Central Limit Theorem (CLT).

Theorem (CLT). Suppose X₁, X₂, …, X_n is a sample from a population X with mean μ and variance σ².

Assume n is large. Then

1) The sample mean X̄ is, approximately, distributed as

N(μ, σ_X̄)

(normal with mean μ and standard deviation σ_X̄), where σ_X̄ = σ/√n.

2) Therefore, approximately,

P(a < X̄ < b) = P(L < Z < R)

where L = (a - μ)/σ_X̄ and R = (b - μ)/σ_X̄. That is,

P(a < X̄ < b) = P( (a - μ)/(σ/√n) < Z < (b - μ)/(σ/√n) ).

Further, if the parent population X is Normal, then 1) and 2) are exact.

Standard Error and Precision

The standard error σ_X̄ = σ/√n decreases to zero as the sample size n increases. Because of this, while estimating the mean μ, the sample mean X̄ can simultaneously achieve precision and level of confidence (i.e., the probability of a given precision) by increasing the sample size n. The following animation is a demonstration of the same.

Animation 6.2.1
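The same idea can be seen numerically. The sketch below uses the CLT approximation to compute P(|X̄ - μ| ≤ ρ) for several sample sizes. The values σ = 10 and ρ = 1 are hypothetical, chosen only to make the arithmetic easy to follow, and `prob_within` is a helper name introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

sigma = 10.0  # hypothetical population standard deviation
rho = 1.0     # hypothetical tolerance for the error |X-bar - mu|

Z = NormalDist()  # standard normal distribution

def prob_within(n):
    """Approximate P(|X-bar - mu| <= rho) by the CLT for sample size n."""
    standard_error = sigma / sqrt(n)
    z = rho / standard_error
    return Z.cdf(z) - Z.cdf(-z)

for n in (25, 100, 400, 1600):
    print(n, round(sigma / sqrt(n), 3), round(prob_within(n), 4))
```

Each row shows the sample size, the standard error, and the approximate probability of being within the tolerance. As n grows the standard error shrinks and the probability climbs toward 1.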

Problem Solving: The Central Limit Theorem (CLT) would be used to compute approximate probabilities for the sample mean X̄. This would be similar to the normal approximation to the Binomial (Section 5.3). The steps we follow would be

The mean μ_X̄ = μ of X̄ remains the same as the population mean μ.
First, compute the standard deviation σ_X̄ = σ/√n of X̄.
Standardize.
Use the normalcdf function of TI-84.
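For readers without a TI-84, the steps above can be sketched in Python. This is one possible way to do it, not the method of the course: `sample_mean_prob` is a hypothetical helper name, and `statistics.NormalDist` from the standard library plays the role of normalcdf. The numbers in the usage line are made up.

```python
from math import sqrt, inf
from statistics import NormalDist

def sample_mean_prob(mu, sigma, n, a=-inf, b=inf):
    """Approximate P(a < X-bar < b) using the CLT."""
    se = sigma / sqrt(n)   # standard deviation (standard error) of X-bar
    left = (a - mu) / se   # standardize the lower limit
    right = (b - mu) / se  # standardize the upper limit
    z = NormalDist()       # standard normal; replaces normalcdf
    return z.cdf(right) - z.cdf(left)

# Hypothetical usage: mu = 100, sigma = 20, n = 64, P(X-bar < 105).
p = sample_mean_prob(mu=100, sigma=20, n=64, b=105)
print(round(p, 4))
```

On the calculator a bound such as 5 or -5 stands in for infinity; here the infinite limits are passed directly, so answers may differ from calculator output only far beyond the fourth decimal place.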

Problems on 6.2: Central Limit Theorem

Exercise 6.2.1. It is known that the tuition paid per semester by students in a university has a distribution with mean $2,050 and standard deviation $310. If 64 students are interviewed, what is the approximate probability that the sample mean tuition paid will be above $2,060?

Solution:
Here the population mean μ = 2,050 and the population standard deviation σ = 310. The sample size n = 64.
First step is to compute the mean
μ_X̄ = μ = 2050
and the standard deviation
σ_X̄ = σ/√n = 310/√64 = 38.75
Let X = tuition paid by a student, and let X̄ be the sample mean tuition. Then, the distribution of X̄ is, approximately, N(2050, 38.75).

Now "X̄ will be above 2060" means "X̄ > 2060".
P(2060 < X̄)
= P([2060 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
≈ P([2060 - μ]/σ_X̄ < Z)
= P([2060 - 2050]/38.75 < Z)
= P(.2581 < Z) = normalcdf(.2581, 5) = .3982
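The arithmetic of Exercise 6.2.1 can be double-checked with a few lines of Python. This is only a sketch that uses `statistics.NormalDist` in place of the calculator's normalcdf; the variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 2050, 310, 64

se = sigma / sqrt(n)             # standard error of the sample mean
z = (2060 - mu) / se             # standardized value of 2060
prob = 1 - NormalDist().cdf(z)   # upper-tail probability P(Z > z)

print(se, round(z, 4), round(prob, 4))
```

The computed probability should agree with the worked solution to about four decimal places.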

Exercise 6.2.2.
The monthly water consumption X per household in a subdivision in Kansas City has a normal distribution with mean 15000 gallons and standard deviation 3000 gallons. What is the probability that the mean consumption of the 44 households in the subdivision will exceed 16000 gallons?

Solution:
Here the population mean μ = 15000 and the population standard deviation σ = 3000. The sample size n = 44.
First step is to compute the mean
μ_X̄ = μ = 15000
and the standard deviation
σ_X̄ = σ/√n = 3000/√44 = 452.2670
Let X = monthly water consumption of a household, and let X̄ be the sample mean consumption. Since the parent population is normal, the distribution of X̄ is N(15000, 452.2670).

Now "X̄ will exceed 16000" means "X̄ > 16000".
P(16000 < X̄)
= P([16000 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
= P([16000 - μ]/σ_X̄ < Z)
= P([16000 - 15000]/452.2670 < Z)
= P(2.2111 < Z) = normalcdf(2.2111, 5) = .0135

Exercise 6.2.3. In a class of more than a thousand students, the instructor announced after a test that the mean score was μ = 77 points and the standard deviation was σ = 24 points. You took a sample of 81 students. What would be the approximate probability that the sample mean would be less than 80?

Solution:
Here the population mean μ = 77 and the population standard deviation σ = 24. The sample size n = 81.
First step is to compute the mean
μ_X̄ = μ = 77
and the standard deviation
σ_X̄ = σ/√n = 24/√81 = 2.6667
Let X = points scored by a student, and let X̄ be the sample mean score. Then, the distribution of X̄ is, approximately, N(77, 2.6667).

Now "the sample mean would be less than 80" means "X̄ < 80".

P(X̄ < 80)
= P([X̄ - μ]/σ_X̄ < [80 - μ]/σ_X̄)
≈ P(Z < [80 - μ]/σ_X̄)
= P(Z < [80 - 77]/2.6667)
= P(Z < 1.1250) = normalcdf(-5, 1.1250) = .8697

Exercise 6.2.4. The salary X of the university professors in a state has mean μ = $65,000 and standard deviation σ = $14,000. You collect a sample of 75 professors. What is the probability that the sample mean salary of these 75 professors would be above $60,000?

Solution:
Here the population mean μ = 65000 and the population standard deviation σ = 14000. The sample size n = 75.
First step is to compute the mean
μ_X̄ = μ = 65000
and the standard deviation
σ_X̄ = σ/√n = 14000/√75 = 1616.5808
Let X = salary of a professor, and let X̄ be the sample mean salary. Then, the distribution of X̄ is, approximately, N(65000, 1616.5808).

Now "X̄ would be above 60,000" means "X̄ > 60000".

P(60000 < X̄)
= P([60000 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
≈ P([60000 - μ]/σ_X̄ < Z)
= P([60000 - 65000]/1616.5808 < Z)
= P(-3.0929 < Z) = normalcdf(-3.0929, 5) = .9990

Exercise 6.2.5. The time X that a child spends watching TV on weekends has a normal distribution with mean μ = 330 minutes and standard deviation σ = 95 minutes. You sample 50 kids in a school. What is the probability that the sample mean time X̄ that these kids watch TV on a weekend will be less than 300 minutes?

The Following Problems are Posed in terms of the Total

Exercise 6.2.6. The weight X of fish in a lake has mean μ = 12 pounds and standard deviation σ = 4.5 pounds. Suppose you catch 150 fish. What is the probability that total weight of fish will be less than 1900 pounds?

Solution:
Here the population mean μ = 12 and the population standard deviation σ = 4.5. The sample size n = 150.
First step is to compute the mean
μ_X̄ = μ = 12
and the standard deviation
σ_X̄ = σ/√n = 4.5/√150 = .3674
Let X = weight of a fish, and let X̄ be the sample mean weight. Then, the distribution of X̄ is, approximately, N(12, .3674).

The problem is posed in terms of the Total weight of all the fish. The sample mean X̄ = Total/n.

Now "Total weight will be less than 1900 pounds" means that "Total < 1900".
This means "X̄ = Total/n < 1900/n = 1900/150 = 12.6667".
P(X̄ < 12.6667)
= P([X̄ - μ]/σ_X̄ < [12.6667 - μ]/σ_X̄)
≈ P(Z < [12.6667 - μ]/σ_X̄)
= P(Z < [12.6667 - 12]/.3674)
= P(Z < 1.8146) = normalcdf(-5, 1.8146) = .9652

( Well, you are fairly sure (96 percent sure) that you did not catch 1900 pounds. )
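A problem posed in terms of the Total can be handled in two equivalent ways: convert the Total to a sample mean, as in the solution above, or standardize the Total directly. The second route relies on a standard fact not stated in this lesson, namely that the Total of n independent draws has mean nμ and standard deviation σ√n. The sketch below, with variable names chosen for illustration, shows that both routes give the same answer for Exercise 6.2.6.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12, 4.5, 150
total_limit = 1900

Z = NormalDist()  # standard normal distribution

# Route 1: convert the Total to a sample mean, as in the worked solution.
z_mean = (total_limit / n - mu) / (sigma / sqrt(n))
prob_via_mean = Z.cdf(z_mean)

# Route 2: standardize the Total directly (mean n*mu, st. dev. sigma*sqrt(n)).
z_total = (total_limit - n * mu) / (sigma * sqrt(n))
prob_via_total = Z.cdf(z_total)

print(round(prob_via_mean, 4), round(prob_via_total, 4))
```

The two z-values are algebraically identical, so the choice of route is only a matter of convenience.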

Exercise 6.2.7. The amount X of water used when a person takes a shower has mean μ = 30 gallons and standard deviation σ = 16 gallons. Suppose 36 people take a shower in a swimming pool facility. What is the probability that a total of more than 900 gallons of water will be used by these 36 people?

Solution:
Here the population mean μ = 30 and the population standard deviation σ = 16. The sample size n = 36.
First step is to compute the mean
μ_X̄ = μ = 30
and the standard deviation
σ_X̄ = σ/√n = 16/√36 = 2.6667
Let X = water used when a person takes a shower, and let X̄ be the sample mean.
Then, the distribution of X̄ is, approximately, N(30, 2.6667).

The problem is posed in terms of the Total amount of water used. The sample mean X̄ = Total/n.

Now "a total of more than 900 gallons of water will be used" means that "Total > 900".
This means "X̄ = Total/n > 900/n = 900/36 = 25".

P(25 < X̄)
= P([25 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
≈ P([25 - μ]/σ_X̄ < Z)
= P([25 - 30]/2.6667 < Z)
= P(-1.8750 < Z) = normalcdf(-1.8750, 5) = .9696

( Well, you are fairly sure (97 percent sure) that more than 900 gallons will be used up. )

Exercise 6.2.8. The waiting time for the campus bus has a mean μ= 7 minutes and the standard deviation σ = 2 minutes. A student used the bus 120 times in a month. What is the probability that the student would have waited more than 900 minutes during the whole month?

Solution:
Here the population mean μ = 7 and the population standard deviation σ = 2. The sample size n = 120.
First step is to compute the mean
μ_X̄ = μ = 7
and the standard deviation
σ_X̄ = σ/√n = 2/√120 = .1826
Let X = waiting time for the bus, and let X̄ be the sample mean waiting time.
Then, the distribution of X̄ is, approximately, N(7, .1826).

The problem is posed in terms of the Total waiting time. The sample mean X̄ = Total/n.

Now "a total of more than 900 minutes will be spent" means that "Total > 900".
This means "X̄ = Total/n > 900/n = 900/120 = 7.5".

P(7.5 < X̄)
= P([7.5 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
≈ P([7.5 - μ]/σ_X̄ < Z)
= P([7.5 - 7]/.1826 < Z)
= P(2.7382 < Z) = normalcdf(2.7382, 5) = .0031

( Well, the chances are fairly low that you will spend that kind of time waiting for the bus. )

Exercise 6.2.9. According to some data, the annual Kansas wheat export X has a mean 733 million dollars and standard deviation 163 million dollars. What is the probability that over the next 10 years Kansas wheat exports will exceed 8040 million dollars?

Solution:
Here the population mean μ = 733 and the population standard deviation σ = 163. The sample size n = 10.
First step is to compute the mean
μ_X̄ = μ = 733
and the standard deviation
σ_X̄ = σ/√n = 163/√10 = 51.5451
Let X = annual Kansas wheat export, and let X̄ be the sample mean over the 10 years. Then, the distribution of X̄ is, approximately, N(733, 51.5451).

The problem is posed in terms of the Total export in 10 years. Note that the sample mean X̄ = Total/n.

Now "Total export will exceed 8040" means that "Total > 8040".
This means "X̄ = Total/n > 8040/n = 8040/10 = 804".
P(804 < X̄)
= P([804 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
≈ P([804 - μ]/σ_X̄ < Z)
= P([804 - 733]/51.5451 < Z)
= P(1.3774 < Z) = normalcdf(1.3774, 5) = .0842

6.3 Sampling Distribution of the Sample Proportion

Suppose you are a statistical quality control (SQC) officer in a lamp factory. Your job would include estimating the proportion p of defective lamps. When you test a lamp, it is a Bernoulli(p)-trial. Correspondingly, a Bernoulli(p) random variable X is defined as follows:

X=1 if success (i. e. defective)
X=0 if failure (i. e. not defective)

The goal of this course would also be to develop methods to estimate p, which is the parameter of this Bernoulli(p) random variable X. From Lesson 4, the mean μ and standard deviation σ of X are given by

μ = p,  σ = √(p(1-p)).

As usual, we will use a sample mean to estimate the mean μ = p. So, we take a sample X₁,X₂, …, X_n of size n from this Bernoulli( p) population and X_i represents the outcome of testing the i^th lamp as follows:

X_i=1 if i^th sample is a success (i.e. the i^th sample is defective)
X_i=0 if i^th sample is a failure (i.e. the i^th sample is not defective).

An estimator of the mean μ = p would be the sample mean

X̄ = (X₁ + X₂ + … + X_n)/n = T/n, where we write T = X₁ + X₂ + … + X_n.

Since X_i is 1 or 0 according as the i^th trial is a success or a failure (i.e., the i^th sample lamp is defective or not),

T = X₁ + X₂ + … + X_n = the total Number of Successes in these n trials

and the sample mean

X̄ = T/n = the Sample Proportion of Success (i.e., the proportion of defective lamps) in these n trials.

To estimate p by the Sample Proportion of Success X̄, knowledge of its sampling distribution would be required. By CLT, when n is large, the distribution of the sample proportion of success X̄ is approximately

N(p, σ_X̄) where σ_X̄ = √(p(1-p)/n).

There is obviously nothing special about testing lamps and estimating the proportion p of defective lamps produced in the factory. The above applies to any situation of Bernoulli(p)-trials. Other examples would be estimating (1) the proportion p of the voter population who favor a particular candidate, (2) the proportion p of the population who have asthma, (3) the proportion p of the seeds of a variety that germinate in a particular situation, (4) the proportion p of the population who benefit from a particular vaccine, (5) the proportion p of the population who live beyond 70.

The following theorem summarizes the above discussion and describes the sampling distribution of the sample proportion of success.

Theorem. Let p be the proportion of a population with a certain attribute. Out of a sample of size n, suppose T have the attribute (that is, T is the number of successes) and X̄ = T/n is the proportion of success. If n is large and p is not too close to 0 or 1, then the distribution of the sample proportion of success X̄ is approximately

N(p, σ_X̄) where σ_X̄ = √(p(1-p)/n). (Obviously, the mean of X̄ is μ_X̄ = p.)

Therefore, approximately,

P(a < X̄ < b) = P(L < Z < R)

where L = (a - p)/σ_X̄ and R = (b - p)/σ_X̄.

Remark. As we discussed in Lesson 4, the number of successes T has a Binomial(n, p)-distribution. We also used the normal approximation to the Binomial in Section 5.3. The normal approximation of the sample proportion of success X̄ given above is not really different from the normal approximation of the binomial random variable T. The main difference is that we have an eye to use the distribution of X̄ to estimate p. For problem solving, another difference is that we ignore continuity correction. For large n, the effect of continuity correction is small.
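To get a feel for how much is lost by ignoring continuity correction, one can compare the normal approximation with the exact Binomial probability. The sketch below does this for made-up values p = 0.30, n = 200 and the event "more than 70 successes" (that is, sample proportion above 0.35). These numbers are assumptions for illustration and do not come from the lesson.

```python
from math import comb, sqrt
from statistics import NormalDist

p = 0.30  # hypothetical population proportion
n = 200   # hypothetical sample size
k = 70    # event of interest: T > k, i.e. sample proportion > k/n

# Normal approximation from the theorem, without continuity correction.
se = sqrt(p * (1 - p) / n)
normal_approx = 1 - NormalDist().cdf((k / n - p) / se)

# Exact Binomial(n, p) probability that T is k+1, k+2, ..., n.
exact = sum(comb(n, t) * p**t * (1 - p)**(n - t) for t in range(k + 1, n + 1))

print(round(normal_approx, 4), round(exact, 4))
```

The two printed values should be close but not identical; the gap is what continuity correction would partly remove, and it tends to shrink as n grows.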

Problems on 6.3: Sample Proportion

Exercise 6.3.1. According to a report entitled "Pediatric Nutrition Surveillance" published by the Centers for Disease Control (CDC), 18 percent of the children younger than two had anemia in 1997. On a particular day in that year, a pediatrician examined 180 children. What is the probability that the proportion of these children with anemia will exceed 0.20? ( Equivalently, find the probability that the number T of children with anemia would exceed 180*.2 = 36.)

Solution:
Here the population proportion p = .18 and the sample size n = 180.
First step is to compute the mean
μ_X̄ = p = .18
and the standard deviation of X̄

σ_X̄ = √(p(1-p)/n) = √(.18(1 - .18)/180) = .028636
(We take up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of patients with anemia.
The distribution of X̄ is, approximately, N(.18, .028636).

Now "X̄ will exceed 0.20" means "X̄ > .20".

P(.20 < X̄)
= P([.20 - p]/σ_X̄ < [X̄ - p]/σ_X̄)
≈ P([.20 - p]/σ_X̄ < Z) [ The Standardization Step. ]
= P([.20 - .18]/.028636 < Z)
= P(.6984 < Z) = normalcdf(.6984, 5) = .2425

Exercise 6.3.2. In 1998, the House of Representatives impeached President Clinton. As a part of the political discourse, numerous polls were conducted and reported. One poll claimed that 75 percent of eligible voters thought the President should not be impeached. Suppose 700 voters were interviewed. Assuming the claim, what would be the probability that less than 72 percent (in this sample of 700) would have thought the President should not be impeached? ( Equivalently, find the probability that fewer than .72*700 = 504 voters would have thought the President should not be impeached.)

Solution:
Here the population proportion p = .75 and the sample size n = 700.
First step is to compute the mean
μ_X̄ = p = .75
and the standard deviation of X̄
σ_X̄ = √(p(1-p)/n) = √(.75(1 - .75)/700) = .016366
(We take up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of voters who thought that the President should not be impeached.
The distribution of X̄ is, approximately, N(.75, .016366).

Now "less than 72 percent would have thought the President should not be impeached" means that "X̄ < .72".
P(X̄ < .72)
= P([X̄ - p]/σ_X̄ < [.72 - p]/σ_X̄)
≈ P(Z < [.72 - p]/σ_X̄) [ The Standardization Step. ]
= P(Z < [.72 - .75]/.016366)
= P(Z < -1.8331) = normalcdf(-5, -1.8331) = .0334
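The computation in Exercise 6.3.2 can be reproduced in Python as a check. This is a sketch with variable names chosen for illustration; `statistics.NormalDist` again stands in for normalcdf, and the values p = .75, n = 700 and the cutoff .72 are taken from the exercise statement.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.75, 700
cutoff = 0.72

se = sqrt(p * (1 - p) / n)    # standard deviation of the sample proportion
z = (cutoff - p) / se         # standardized cutoff
prob = NormalDist().cdf(z)    # lower-tail probability P(Z < z)

print(round(se, 6), round(z, 4), round(prob, 4))
```

Because Python carries more digits than the hand computation, the z-value may differ from the worked solution in the last decimal place, while the probability should agree to about four decimals.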

Exercise 6.3.3. It is believed that the proportion of voters (in a county) who vote by absentee ballot is p = .22. You sample 725 voters. Compute the approximate probability that the sample proportion of absentee votes will exceed 25 percent. ( Equivalently, find the probability that the number of absentee votes will exceed 725*.25 = 181.25.)

Solution:
Here the population proportion p = .22 and the sample size n = 725.
First step is to compute the mean
μ_X̄ = p = .22
and the standard deviation of X̄

σ_X̄ = √(p(1-p)/n) = √(.22(1 - .22)/725) = .015385
(We take up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of absentee votes.
The distribution of X̄ is, approximately, N(.22, .015385).

Now "the sample proportion of absentee votes will exceed 25 percent" means "X̄ will exceed 0.25". That means "X̄ > .25".

P(.25 < X̄)
= P([.25 - p]/σ_X̄ < [X̄ - p]/σ_X̄)
≈ P([.25 - p]/σ_X̄ < Z) [ The Standardization Step. ]
= P([.25 - .22]/.015385 < Z)
= P(1.9500 < Z) = normalcdf(1.9500, 5) = .0256

Exercise 6.3.4. It is believed that 35 percent of the population in a county shop in a health food market. If you sample 800 individuals, what would be the approximate probability that the sample proportion of those who shop in a health food market exceeds 40 percent? ( Equivalently, find the probability that the number T of those who shop in a health food market would exceed 800*.40 = 320.)

Solution:
Here the population proportion p = .35 and the sample size n = 800.
First step is to compute the mean
μ_X̄ = p = .35

and the standard deviation of X̄

σ_X̄ = √(p(1-p)/n) = √(.35(1 - .35)/800) = .016863
(We take up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of those who shop in a health food market.
The distribution of X̄ is, approximately, N(.35, .016863).

Now "the sample proportion of those who shop in a health food market will exceed 40 percent" means "X̄ will exceed 0.40". That means "X̄ > .40".

P(.40 < X̄)
= P([.40 - p]/σ_X̄ < [X̄ - p]/σ_X̄)
≈ P([.40 - p]/σ_X̄ < Z) [ The Standardization Step. ]
= P([.40 - .35]/.016863 < Z)
= P(2.9651 < Z) = normalcdf(2.9651, 5) = .0015

Exercise 6.3.5. It is known that a vaccine may cause fever as a side effect after one takes the shot. The producer of the vaccine claims that only 17 percent of those who take the shot experience such side effects. You sample 978 individuals who took the shot. What would be the approximate probability that more than 15 percent would experience the side effect? ( Equivalently, find the probability that more than .15*978 = 146.7 would experience the side effect.)

Solution:
Here the population proportion p = .17 and the sample size n = 978.
First step is to compute the mean
μ_X̄ = p = .17

and the standard deviation of X̄

σ_X̄ = √(p(1-p)/n) = √(.17(1 - .17)/978) = .012011
(We take up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of those who experienced the side effect.
The distribution of X̄ is, approximately, N(.17, .012011).

Now "more than 15 percent would experience the side effect" means "X̄ will be more than .15". That means "X̄ > .15".

P(.15 < X̄)
= P([.15 - p]/σ_X̄ < [X̄ - p]/σ_X̄)
≈ P([.15 - p]/σ_X̄ < Z) [ The Standardization Step. ]
= P([.15 - .17]/.012011 < Z)
= P(-1.6651 < Z) = normalcdf(-1.6651, 5) = .9521

Exercise 6.3.6. About 27 percent of the population take flu shots. You are in a class of 750 students. Compute the approximate probability that the sample proportion of those who took the shot would be less than 25 percent. ( Equivalently, find the probability that the number T of those who took the shot would be less than .25*750 = 187.5.)

Exercise 6.3.7. It is known that 78 percent of the microwave ovens last more than five years. A SQC inspector sampled 600 microwaves. What would be the approximate probability that more than 78 percent of this sample would last more than five years? ( Equivalently, find the probability that more than .78*600 = 468 of this sample would last more than five years.)
