Lesson 6 : Elements of Sampling Distribution
Satya Mandal
Due Date: Visit the homework site.
6.1 Sampling Distribution
The goal of this course has been to develop methods and to
use
sample statistics t (or T)
to estimate the
population parameters θ.
For example,
to estimate the mean weight μ (the parameter)
of the fish population in the nearest lake,
you may catch a sample of fish and compute the mean weight
x̄ (the statistic) of this sample
and declare it as an estimate for the population mean μ.
Since t would only be an estimate of θ,
there would be an error
ε = |t - θ |. We would like this error
ε to be small or within our tolerable
(specified) limit ρ (say).
Further,
we would like this error ε to be within
our tolerable limit ρ,
more often than not.
For example, we may require
that the error ε
be within our tolerance at least 90 percent of the time
(among all our trials).
In fact, our estimate t is a variable number and it varies each time we take a sample. We denote this variable by T. Whenever we have a sample,
T has a value T=t.
Indeed, T is a random variable
on the sample space of all the
possible samples.
Therefore, T has a probability distribution. Since T depends on samples,
its probability distribution is called a
sampling distribution.
When we say that "the error ε should be within
our tolerance ρ
at least 90 percent of the time",
we mean that
P(|T - θ| ≤ ρ) ≥ .90.
In particular,
the sample means x̄ of numerical data
that we computed in
Lesson 2 would be the observed values
of a random variable
X̄, corresponding
to the sample data we had.
Similarly, the sample variances s²
that we computed in Lesson 2
would be the observed values of a random variable
S². Each time you collect a sample (or data),
the computed sample
mean x̄
(respectively, the variance s²,
standard deviation s) would be
the value of the random variable
X̄ (respectively, S², S)
for that sample.
Example.
Suppose we want to study the height
distribution of the U.S. population.
Let X represent the height of a member of the
U.S. population.
We collect a sample of size n. The sample would be n numbers
x1, x2, …, xn
representing the heights of n individuals.
We shall consider the height xi of the
ith individual
as the observed value
of a random variable Xi.
Here Xi is the
notation for the height of the ith member of the sample, which
could be the height of anybody from the population.
Therefore, these n
measurements
x1, x2, …, xn
are, respectively, the observed values of n random variables
X1, X2, …, Xn.
We (re)define the sample mean X̄
as the random
variable
X̄ = (X1 + X2 + … + Xn)/n.
We also (re)define the sample variance S² as
the random variable
S² = [1/(n - 1)] ∑_{i=1}^{n} (Xi - X̄)².
So, the sample means that
we computed before in Lesson 2 are the values of the random variable
X̄.
We will consider the
sampling distribution of
the sample mean X̄.
(In Lesson 7, we will briefly mention the sampling
distribution of the sample variance S².)
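As a small illustration of these two definitions, the following Python sketch computes the observed sample mean and sample variance for one sample. The five heights are made-up numbers, not data from the lesson; the point is only that the variance divides by n - 1.

```python
import statistics

# Hypothetical heights (in inches) of n = 5 sampled individuals.
heights = [64.0, 70.0, 68.0, 61.0, 72.0]
n = len(heights)

# Observed sample mean: (x1 + x2 + ... + xn) / n
x_bar = sum(heights) / n

# Observed sample variance: squared deviations from x_bar, divided by n - 1
s2 = sum((x - x_bar) ** 2 for x in heights) / (n - 1)
s = s2 ** 0.5  # sample standard deviation

# The standard library follows the same n - 1 convention.
matches_library = abs(s2 - statistics.variance(heights)) < 1e-9

print(x_bar, s2, s, matches_library)
```

A different sample would give different values of x_bar and s2, which is exactly why they are treated as values of random variables.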
Sampling Types
There are many ways to do sampling. Most commonly discussed
among them are
- Sampling without replacement,
- Sampling with replacement.
The Sampling without replacement is the
type of sampling where,
whenever a sample member is selected, the member is
excluded from the subsequent selections. It is analogous to
selecting n balls from a box of N balls.
Balls are selected one by one,
without replacing them back in the box before subsequent selections.
This type of sampling is meant to rule out the possibility of selecting
a member more than once. For small populations, the possibility of selecting a member twice may be significant. For such small
populations, sampling without replacement
would be appropriate.
The Sampling with replacement is the
type of sampling where each selection is done without any regard to previous selections.
In other words, each time a sample member is drawn,
it is placed back to the whole population
before the next selection is made. This way,
each selection is done from the same whole population.
A member could, therefore, be selected more than once. This
may seem unnatural.
But when working with large populations, a repeated selection
is unlikely, and sampling with replacement is the most natural choice from
the statistical point of view. (How often would one receive
calls
twice for the same poll?)
We will only consider sampling with
replacement.
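The two sampling types can be tried out with Python's standard random module. This is only a sketch: the population of ten labelled members and the seed are arbitrary choices made for illustration.

```python
import random

rng = random.Random(0)            # fixed seed so the run is repeatable
population = list(range(1, 11))   # hypothetical population of N = 10 labelled members
n = 5

# Sampling without replacement: a selected member is excluded from later draws,
# so no member can appear twice.
without_repl = rng.sample(population, n)

# Sampling with replacement: every draw is made from the same whole population,
# so the same member may be selected more than once.
with_repl = rng.choices(population, k=n)

print(without_repl, with_repl)
```

With a population this small, repeats under sampling with replacement are quite likely; with a large population they become rare.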
Properties
Let X (like height) be a random variable
with mean μ and
variance σ².
Let X1, X2, …, Xn
be a sample from the X-population.
We assume that the
sampling was done with replacement.
Such a sample has the following properties.
- X is called the parent population
or the population random
variable. Also, μ and σ²
are called the population mean and variance.
- Each sample member Xi
has the same distribution
as X. So, the mean of Xi is μ
and the variance of Xi is σ².
- The sample members X1, X2, …, Xn
are all mutually independent.
(In fact, one has to ensure that they
are drawn independently.)
- The distribution of X̄ is called the sampling
distribution of X̄.
- Theorem. The mean of the sample mean
X̄ is the population mean μ,
that is
E(X̄) = E(X) = μ.
The variance of the sample mean X̄ is
given by
Var(X̄) = σ²/n.
So, the standard deviation of X̄, denoted
by σX̄,
is given by
σX̄ = σ/√n.
- Definition. The standard deviation σX̄
is also called the standard error.
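The theorem on the mean and variance of the sample mean can be checked by simulation. The sketch below assumes, purely for illustration, that the population is a fair die (so μ = 3.5 and σ² = 35/12); the sample size, number of repetitions and seed are arbitrary.

```python
import random
import statistics

rng = random.Random(1)                 # fixed seed so the run is repeatable
faces = [1, 2, 3, 4, 5, 6]             # hypothetical population: a fair die
mu = statistics.mean(faces)            # population mean, 3.5
sigma2 = statistics.pvariance(faces)   # population variance, 35/12
n = 25                                 # sample size
reps = 20000                           # number of samples drawn

# Draw many samples with replacement and record each sample mean.
sample_means = [sum(rng.choices(faces, k=n)) / n for _ in range(reps)]

mean_of_means = statistics.mean(sample_means)      # should be close to mu
var_of_means = statistics.pvariance(sample_means)  # should be close to sigma2 / n

print(mean_of_means, mu)
print(var_of_means, sigma2 / n)
```

The simulated values only approximate μ and σ²/n, since they are themselves computed from a finite number of samples.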
6.2 Central Limit Theorem
The following theorem describes the sampling distribution of the sample mean. It is called the Central Limit Theorem (CLT).
Theorem (CLT).
Suppose X1, X2, …, Xn is a sample
from a population X with mean μ and
variance σ².
Assume n is large.
Then the sample mean X̄ is, approximately,
distributed as
N(μ, σX̄)
where σX̄ = σ/√n.
Therefore, approximately,
P(a < X̄ < b) = P(L < Z < R)
where L = (a - μ)/σX̄ and R = (b - μ)/σX̄,
OR
P(a < X̄ < b) = P([a - μ]/[σ/√n] < Z < [b - μ]/[σ/√n]).
Further,
if the parent population X is normal, then the distribution statement and the probability formula above are both exact.
Standard Error and Precision
The standard error
σX̄ = σ/√n
decreases to zero as the sample size increases. Because of this,
while estimating the mean μ,
the sample mean X̄ can
simultaneously achieve precision and level of confidence
(i.e. probability of a given precision), by increasing the
sample size n.
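The effect of the sample size on the standard error can also be seen numerically. In the sketch below, the population standard deviation σ = 10 and the tolerance ρ = 1 are assumed values chosen for illustration, and the probability is the CLT approximation of P(|X̄ - μ| ≤ ρ).

```python
from statistics import NormalDist

sigma = 10.0       # hypothetical population standard deviation
rho = 1.0          # hypothetical tolerance for the error |X-bar - mu|
Z = NormalDist()   # standard normal distribution

def standard_error(n):
    """Standard error sigma / sqrt(n) of the sample mean."""
    return sigma / n ** 0.5

def prob_within(n):
    """Approximate P(|X-bar - mu| <= rho) for sample size n, using the CLT."""
    z = rho / standard_error(n)
    return Z.cdf(z) - Z.cdf(-z)

for n in (25, 100, 400, 1600):
    print(n, standard_error(n), round(prob_within(n), 4))
```

As n grows, the standard error shrinks and the probability of staying within the same tolerance rises toward 1.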
Problem Solving: The Central Limit Theorem (CLT) is used to compute approximate probabilities for the sample mean
X̄. This is similar to the normal approximation to the Binomial (Section 5.3). The steps we follow are:
- Note that the mean μX̄ = μ of X̄ is the same as the population mean μ.
- Compute the standard deviation σX̄ = σ/√n of X̄.
- Standardize.
- Use the normalcdf function of the TI-84.
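For readers without a TI-84, the same steps can be sketched in Python. The helper names normalcdf and prob_sample_mean_between are our own (the first one only mimics the calculator function for the standard normal curve), and the numbers in the usage line are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def normalcdf(lower, upper):
    """Area under the standard normal curve between lower and upper."""
    z = NormalDist()  # mean 0, standard deviation 1
    return z.cdf(upper) - z.cdf(lower)

def prob_sample_mean_between(a, b, mu, sigma, n):
    """Approximate P(a < X-bar < b) using the CLT."""
    se = sigma / sqrt(n)            # standard deviation (standard error) of X-bar
    left = (a - mu) / se            # standardize the lower limit
    right = (b - mu) / se           # standardize the upper limit
    return normalcdf(left, right)   # area between the two z-values

# Hypothetical numbers: mu = 100, sigma = 15, n = 225, so the standard error is 1.
p = prob_sample_mean_between(98.0, 102.0, mu=100.0, sigma=15.0, n=225)
print(round(p, 4))
```

For a one-sided probability, a far-out limit such as 5 or -5 can be passed to normalcdf, as is done with the calculator in the solutions.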
Problems on 6.2: Central Limit Theorem
Exercise 6.2.1. It is known that the tuition
paid per semester by students in a university has a distribution with
mean $2,050 and standard deviation $310. If 64 students are interviewed,
what is the approximate probability that the sample mean tuition paid
will be above $2,060?
Solution:
Here the population mean is μ = 2,050, the population standard deviation is σ = 310, and the sample size is n = 64.
The first step is to compute the mean of X̄,
μX̄ = μ = 2050,
and the standard deviation of X̄,
σX̄ = σ/√n = 310/√64 = 38.75.
Let X = tuition paid by a student.
Then, the distribution of the sample mean X̄
is, approximately, N(2050, 38.75).
Now "X̄ will be above 2060" means
"X̄ > 2060".
P(2060 < X̄)
= P([2060 - μ]/σX̄ < [X̄ - μ]/σX̄)
≈ P([2060 - μ]/σX̄ < Z)
= P([2060 - 2050]/38.75 < Z)
= P(.2581 < Z) = normalcdf(.2581, 5) = .3982
Exercise 6.2.2.
The monthly water consumption X per household in a subdivision in Kansas
City has normal distribution with mean 15000 gallons and standard deviation
3000 gallons. What is the probability that the mean consumption of the
44 households in the subdivision will exceed 16000 gallons?
Solution:
Here the population mean is μ = 15000, the population standard deviation is σ = 3000, and the sample size is n = 44.
The first step is to compute the mean of X̄,
μX̄ = μ = 15000,
and the standard deviation of X̄,
σX̄ = σ/√n = 3000/√44 = 452.2670.
Let X = monthly water consumption of a household.
Since the parent population is normal, the distribution of the sample mean X̄
is exactly N(15000, 452.2670).
Now "X̄ will exceed 16000" means
"X̄ > 16000".
P(16000 < X̄)
= P([16000 - μ]/σX̄ < [X̄ - μ]/σX̄)
= P([16000 - μ]/σX̄ < Z)
= P([16000 - 15000]/452.2670 < Z)
= P(2.2111 < Z) = normalcdf(2.2111, 5) = .0135
Exercise 6.2.3.
In a class of more than a thousand students, the instructor announced after a test that the mean score was μ = 77 points and the standard deviation was σ = 24 points. You took a sample of 81 students. What would be the approximate probability that the sample mean would be less than 80?
Solution:
Here the population mean is μ = 77, the population standard deviation is σ = 24, and the sample size is n = 81.
The first step is to compute the mean of X̄,
μX̄ = μ = 77,
and the standard deviation of X̄,
σX̄ = σ/√n = 24/√81 = 2.6667.
Let X = points scored by a student.
Then, the distribution of the sample mean X̄
is, approximately, N(77, 2.6667).
Now "the sample mean would be less than 80" means
"X̄ < 80".
P(X̄ < 80)
= P([X̄ - μ]/σX̄ < [80 - μ]/σX̄)
≈ P(Z < [80 - μ]/σX̄)
= P(Z < [80 - 77]/2.6667)
= P(Z < 1.1250) = normalcdf(-5, 1.1250) = .8697
Exercise 6.2.4.
The salary X of the university professors in a state has mean μ = $65,000 and standard deviation
σ = $14,000. You collect a sample of 75 professors. What is the probability that the sample mean salary of these 75 professors would be above $60,000?
Solution:
Here the population mean is μ = 65000, the population standard deviation is σ = 14000, and the sample size is n = 75.
The first step is to compute the mean of X̄,
μX̄ = μ = 65000,
and the standard deviation of X̄,
σX̄ = σ/√n = 14000/√75 = 1616.5808.
Let X = salary of a professor.
Then, the distribution of the sample mean X̄
is, approximately, N(65000, 1616.5808).
Now "X̄ would be above 60,000" means
"X̄ > 60000".
P(60000 < X̄)
= P([60000 - μ]/σX̄ < [X̄ - μ]/σX̄)
≈ P([60000 - μ]/σX̄ < Z)
= P([60000 - 65000]/1616.5808 < Z)
= P(-3.0929 < Z) = normalcdf(-3.0929, 5) = .9990
Exercise 6.2.5. The time X that a child spends watching TV on weekends has a normal distribution with mean μ = 330 minutes and standard deviation σ = 95 minutes. You sample 50 kids in a school. What is the probability that the sample mean time
X̄
that these kids watch TV on a weekend will be less than 300 minutes?
The Following Problems are Posed in terms of the Total
Exercise 6.2.6. The weight X of fish in a lake has mean μ = 12 pounds and standard deviation σ = 4.5 pounds. Suppose you catch 150 fish. What is the probability that total weight of fish will be less than 1900 pounds?
Solution:
Here the population mean is μ = 12, the population standard deviation is σ = 4.5, and the sample size is n = 150.
The first step is to compute the mean of X̄,
μX̄ = μ = 12,
and the standard deviation of X̄,
σX̄ = σ/√n = 4.5/√150 = .3674.
Let X = weight of a fish.
Then, the distribution of the sample mean X̄
is, approximately, N(12, .3674).
The problem is posed in terms of the Total weight of all the fish. The sample mean is
X̄ = Total/n.
Now "Total weight will be less than 1900 pounds" means that "Total < 1900".
This means
"X̄ = Total/n < 1900/n = 1900/150 = 12.6667".
P(X̄ < 12.6667)
= P([X̄ - μ]/σX̄ < [12.6667 - μ]/σX̄)
≈ P(Z < [12.6667 - μ]/σX̄)
= P(Z < [12.6667 - 12]/.3674)
= P(Z < 1.8146) = normalcdf(-5, 1.8146) = .9652
(Well, you are fairly sure (about 97 percent sure) that you did not
catch 1900 pounds.)
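The step of converting a Total into a sample mean can be checked numerically with the numbers of Exercise 6.2.6. The second route below, which works with the Total directly as approximately N(nμ, σ√n), is not used in this lesson; it is included only to show that it is the same calculation in disguise.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12.0, 4.5, 150   # values from Exercise 6.2.6
total_limit = 1900.0

# Route used in the solution: Total < 1900 is the same event as X-bar < 1900/n.
mean_limit = total_limit / n
se_mean = sigma / sqrt(n)
p_via_mean = NormalDist(mu, se_mean).cdf(mean_limit)

# Alternative route (our own check): Total = n * X-bar is approximately
# normal with mean n*mu and standard deviation sigma*sqrt(n).
p_via_total = NormalDist(n * mu, sigma * sqrt(n)).cdf(total_limit)

print(round(p_via_mean, 4), round(p_via_total, 4))
```

Both routes standardize to the same z-value, so they give the same probability.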
Exercise 6.2.7.
The amount X of water used when a person takes a shower has a
mean μ = 30 gallons and standard deviation σ = 16 gallons.
Suppose 36 people take a shower in a swimming pool facility.
What is the probability that total of more than 900 gallons of
water will be used by these 36 people.
Solution:
Here the population mean is μ = 30, the population standard deviation is σ = 16, and the sample size is n = 36.
The first step is to compute the mean of X̄,
μX̄ = μ = 30,
and the standard deviation of X̄,
σX̄ = σ/√n = 16/√36 = 2.6667.
Let X = water used when a person takes a shower.
Then, the distribution of the sample mean X̄
is, approximately, N(30, 2.6667).
The problem is posed in terms of the Total amount of water used. The sample mean is
X̄ = Total/n.
Now "total of more than 900 gallons of water will be used" means that
"Total > 900".
This means "X̄ = Total/n > 900/n = 900/36 = 25".
P(25 < X̄)
= P([25 - μ]/σX̄ < [X̄ - μ]/σX̄)
≈ P([25 - μ]/σX̄ < Z)
= P([25 - 30]/2.6667 < Z)
= P(-1.8750 < Z) = normalcdf(-1.8750, 5) = .9696
(Well, you are fairly sure (about 97 percent sure) that more
than 900 gallons will be used up.)
Exercise 6.2.8. The waiting time for the campus bus has a mean μ= 7 minutes and the standard deviation σ = 2 minutes. A student used the bus 120 times in a month. What is the probability that the student would have waited more than 900 minutes during the whole month?
Solution:
Here the population mean is μ = 7, the population standard deviation is σ = 2, and the sample size is n = 120.
The first step is to compute the mean of X̄,
μX̄ = μ = 7,
and the standard deviation of X̄,
σX̄ = σ/√n = 2/√120 = .1826.
Let X = waiting time for the bus.
Then, the distribution of the sample mean X̄
is, approximately, N(7, .1826).
The problem is posed in terms of the Total waiting time. The sample mean is
X̄ = Total/n.
Now "total of more than 900 minutes will be spent" means that
"Total > 900".
This means "X̄ = Total/n > 900/n = 900/120 = 7.5".
P(7.5 < X̄)
= P([7.5 - μ]/σX̄ < [X̄ - μ]/σX̄)
≈ P([7.5 - μ]/σX̄ < Z)
= P([7.5 - 7]/.1826 < Z)
= P(2.7382 < Z) = normalcdf(2.7382, 5) = .0031
(Well, the chances are fairly low that you will spend that kind of time waiting for the bus.)
Exercise 6.2.9. According to some data, the
annual Kansas wheat export X has a mean 733 million dollars and standard
deviation 163 million dollars. What is the probability that over the
next 10 years Kansas wheat exports will exceed 8040 million dollars?
Solution:
Here the population mean is μ = 733, the population standard deviation is σ = 163, and the sample size is n = 10.
The first step is to compute the mean of X̄,
μX̄ = μ = 733,
and the standard deviation of X̄,
σX̄ = σ/√n = 163/√10 = 51.5451.
Let X = annual Kansas wheat export.
Then, the distribution of the sample mean X̄
is, approximately, N(733, 51.5451).
The problem is posed in terms of the Total export in 10 years. Note that the sample mean is
X̄ = Total/n.
Now "Total export will exceed 8040" means that "Total > 8040".
This means
"X̄ = Total/n > 8040/n = 8040/10 = 804".
P(804 < X̄)
= P([804 - μ]/σX̄ < [X̄ - μ]/σX̄)
≈ P([804 - μ]/σX̄ < Z)
= P([804 - 733]/51.5451 < Z)
= P(1.3774 < Z) = normalcdf(1.3774, 5) = .0842
6.3 Sampling Distribution of the Sample Proportion
Suppose you are a statistical quality control (SQC) officer in a lamp
factory. Your job would include estimating the proportion p of defective lamps. When you test a lamp, it is a Bernoulli(p)-trial.
Correspondingly, a Bernoulli(p) random variable X is defined as follows:
X=1 if success (i.e. defective)
X=0 if failure (i.e. not defective)
The goal of this course
would also be to develop methods to estimate p, which is a parameter of this
Bernoulli(p) random variable X.
From Lesson 4, the mean μ and standard deviation σ of X are given by
μ = p
σ = √(p(1-p)).
As usual, we will use a sample mean to estimate the mean μ = p. So, we take a sample X1, X2, …, Xn of size n from this Bernoulli(p) population, where Xi represents the outcome of testing the ith lamp as follows:
Xi=1 if the ith sample is a
success (i.e. the ith sample is defective)
Xi=0 if the ith sample is a failure
(i.e. the ith sample is not defective).
An estimator of the mean μ = p would be the sample mean
X̄ = (X1+X2+… +Xn)/n = T/n
where we write
T = (X1+X2+… +Xn).
Since Xi is 1 or 0 according as the
ith trial
is a success or a failure (i.e. the ith sample lamp is defective or not),
T = X1+X2+… +Xn =
the total Number of Successes in these n trials
and the sample mean
X̄ = T/n =
the Sample Proportion of Success (i.e. the proportion of defective lamps) in these n trials.
To estimate p by the Sample Proportion of Success
X̄, knowledge of
its sampling distribution would be required.
By CLT, when n is large, the distribution of the
sample proportion of success X̄
is approximately
N(p, σX̄)
where σX̄ = √(p(1-p)/n).
There is obviously nothing special about testing lamps and
estimating the proportion p of defective lamps produced in the factory.
The above applies to
any situation of Bernoulli(p)-trials.
Other examples would be estimating
(1) proportion p of the voter population who favors a particular candidate,
(2) proportion p of the population who has asthma
(3) proportion p of the seeds of a variety that germinates in a particular situation,
(4) proportion p of the population who benefit from a particular vaccine,
(5) proportion p of the population who live beyond 70.
The following theorem summarizes the above discussion and describes the sampling distribution of the sample proportion of success.
Theorem.
Let p be the proportion of a population with a certain attribute.
Out of a sample of size n, suppose T members have
the attribute (i.e. T is the number of successes)
and X̄ = T/n is the proportion of success.
If n is large and p is not too close to 0 or 1, then
the distribution of the
sample proportion of success X̄
is approximately
N(p, σX̄)
where
σX̄ = √(p(1-p)/n).
(Obviously, the mean of X̄ is
μX̄ = p.)
Therefore, approximately,
P(a < X̄ < b) = P(L < Z < R)
where L = (a - p)/σX̄
and R = (b - p)/σX̄.
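The theorem translates into a short calculation. The following Python sketch uses a helper name of our own, prob_sample_proportion_between, and hypothetical values p = .5 and n = 400 (so the standard error is .025); it is meant as an illustration of the formula, not as part of the lesson's calculator workflow.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_proportion_between(a, b, p, n):
    """Approximate P(a < X-bar < b) for the sample proportion X-bar = T/n."""
    se = sqrt(p * (1 - p) / n)   # standard deviation of the sample proportion
    left = (a - p) / se          # L in the theorem
    right = (b - p) / se         # R in the theorem
    z = NormalDist()             # standard normal distribution
    return z.cdf(right) - z.cdf(left)

# Hypothetical values: p = 0.5 and n = 400 give a standard error of 0.025,
# so the limits 0.45 and 0.55 standardize to -2 and 2.
prob = prob_sample_proportion_between(0.45, 0.55, p=0.5, n=400)
print(round(prob, 4))
```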
Remark. As we discussed in Lesson 4, the number of successes T has a Binomial(n, p)-distribution.
We also used the normal approximation to the Binomial in Section 5.3.
The normal approximation of the sample
proportion of success X̄
given above is not really different from the normal approximation
of the binomial random variable.
The main difference is that we have an eye
to use the distribution
of X̄ to
estimate p.
For problem solving, another difference is that
we ignore the continuity correction.
For large n, the continuity correction has only a small effect
on the answer.
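To see how much the continuity correction matters in a concrete case, one can compare the exact binomial probability with the two normal approximations. The sketch below borrows n = 180 and p = .18 from Exercise 6.3.1 and computes P(T > 36); the exact binomial sum is our own addition for comparison and is not a method used in this lesson.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 180, 0.18   # values from Exercise 6.3.1
k = 36             # we want P(T > 36), i.e. P(X-bar > .20)

# Exact binomial probability: sum of P(T = j) for j = 37, ..., 180.
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1, n + 1))

mean_T = n * p                   # mean of T
sd_T = sqrt(n * p * (1 - p))     # standard deviation of T
Z = NormalDist()                 # standard normal distribution

# Normal approximation without continuity correction (as used in this section).
no_correction = 1 - Z.cdf((k - mean_T) / sd_T)

# Normal approximation with continuity correction (as in Section 5.3).
with_correction = 1 - Z.cdf((k + 0.5 - mean_T) / sd_T)

print(round(exact, 4), round(no_correction, 4), round(with_correction, 4))
```

The gap between the two approximations shrinks as n grows, because the half-unit shift becomes small relative to the standard deviation of T.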
Problems on 6.3: Sample Proportion
Exercise 6.3.1. According to a report entitled
"Pediatric Nutrition Surveillance" published by Centers for Disease
Control (CDC) 18 percent of the children younger than two had anemia
in 1997. On a particular day in that year, a pediatrician examined 180
children.
What is the probability that the proportion of these children with anemia will exceed 0.20?
(Equivalently, find the probability that the number T of children with anemia would exceed 180*.2 = 36.)
Solution:
Here the population proportion is p = .18 and the sample size is n = 180.
The first step is to compute the mean of X̄,
μX̄ = p = .18,
and the standard deviation of X̄,
σX̄ = √(p(1-p)/n) = √(.18(1 - .18)/180) = .028636
(We keep up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of children with anemia.
The distribution of X̄
is, approximately, N(.18, .028636).
Now "X̄ will exceed 0.20" means
"X̄ > .20".
P(.20 < X̄)
= P([.20 - p]/σX̄ < [X̄ - p]/σX̄)
≈ P([.20 - p]/σX̄ < Z)
[ The Standardization Step. ]
= P([.20 - .18]/.028636 < Z)
= P(.6984 < Z) = normalcdf(.6984, 5) = .2425
Exercise 6.3.2.
In 1998, the House of Representatives impeached President Clinton. As a
part of the political discourse, numerous polls were conducted and reported.
One poll claimed that
75 percent of eligible voters think the President should not be impeached. Suppose 700 voters were interviewed. Assuming the claim, what would be the probability that less than 72 percent (in this sample of 700) would have thought the President should not be impeached?
(Equivalently, find the probability that fewer than .72*700 = 504 voters would have thought the President should not be impeached.)
Solution:
Here the population proportion is p = .75 and the sample size is n = 700.
The first step is to compute the mean of X̄,
μX̄ = p = .75,
and the standard deviation of X̄,
σX̄ = √(p(1-p)/n) = √(.75(1 - .75)/700) = .016366
(We keep up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of voters who thought that the President should not be impeached.
The distribution of X̄
is, approximately, N(.75, .016366).
Now "less than
72 percent would have thought the President should not be impeached" means that "X̄ < .72".
P(X̄ < .72)
= P([X̄ - p]/σX̄ < [.72 - p]/σX̄)
≈ P(Z < [.72 - p]/σX̄)
[ The Standardization Step. ]
= P(Z < [.72 - .75]/.016366)
= P(Z < -1.8331) = normalcdf(-5, -1.8331) = .0334
Exercise 6.3.3.
It is believed that the proportion of voters (in a county)
who vote by absentee ballot is p=.22.
You sample 725 voters.
Compute the approximate probability that the sample proportion of absentee votes will exceed 25 percent.
(Equivalently, find the probability that the number of absentee votes will exceed 725*.25 = 181.25.)
Solution:
Here the population proportion is p = .22 and the sample size is n = 725.
The first step is to compute the mean of X̄,
μX̄ = p = .22,
and the standard deviation of X̄,
σX̄ = √(p(1-p)/n) = √(.22(1 - .22)/725) = .015385
(We keep up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of absentee votes.
The distribution of X̄
is, approximately, N(.22, .015385).
Now "sample proportion of absentee votes will exceed 25 percent" means
"X̄ will exceed 0.25".
That means
"X̄ > .25".
P(.25 < X̄)
= P([.25 - p]/σX̄ < [X̄ - p]/σX̄)
≈ P([.25 - p]/σX̄ < Z)
[ The Standardization Step. ]
= P([.25 - .22]/.015385 < Z)
= P(1.9500 < Z) = normalcdf(1.9500, 5) = .0256
Exercise 6.3.4.
It is believed that 35 percent of the population in a county shop
in health food markets. If you sample 800 individuals,
what would be the approximate probability that the sample proportion of
those who shop in health food markets exceeds 40 percent?
(Equivalently, find the probability that the number T
of those who shop in health food markets would exceed 800*.40 = 320.)
Solution:
Here the population proportion is p = .35 and the sample size is n = 800.
The first step is to compute the mean of X̄,
μX̄ = p = .35,
and the standard deviation of X̄,
σX̄ = √(p(1-p)/n) = √(.35(1 - .35)/800) = .016863
(We keep up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of those who shop in health food markets.
The distribution of X̄
is, approximately, N(.35, .016863).
Now "sample proportion of those who shop in health food markets will exceed
40 percent" means
"X̄ will exceed 0.40".
That means
"X̄ > .40".
P(.40 < X̄)
= P([.40 - p]/σX̄ < [X̄ - p]/σX̄)
≈ P([.40 - p]/σX̄ < Z)
[ The Standardization Step. ]
= P([.40 - .35]/.016863 < Z)
= P(2.9651 < Z) = normalcdf(2.9651, 5) = .0015
Exercise 6.3.5.
It is known that a vaccine may cause fever as side effect,
after one takes the shot.
The producer of the vaccine claims that only 17 percent
of those who take the shot experience such side effects.
You sample 978 individuals who took the shot.
What would be an approximate probability that more than 15 percent would experience side effect?
( Equivalently, find the probability that more than .15*978 =146.7 would experience side effect.)
Solution:
Here the population proportion is p = .17 and the sample size is n = 978.
The first step is to compute the mean of X̄,
μX̄ = p = .17,
and the standard deviation of X̄,
σX̄ = √(p(1-p)/n) = √(.17(1 - .17)/978) = .012011
(We keep up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of those who experienced the side effect.
The distribution of X̄
is, approximately, N(.17, .012011).
Now "more than 15 percent would experience side effect" means
"X̄ will be more than .15".
That means
"X̄ > .15".
P(.15 < X̄)
= P([.15 - p]/σX̄ < [X̄ - p]/σX̄)
≈ P([.15 - p]/σX̄ < Z)
[ The Standardization Step. ]
= P([.15 - .17]/.012011 < Z)
= P(-1.6651 < Z) = normalcdf(-1.6651, 5) = .9521
Exercise 6.3.6.
About 27 percent of the population take flu shots.
You are in a class of 750 students.
Compute the approximate probability that the sample proportion of
those who took the shot would be less than 25 percent.
(Equivalently, find the probability that the number T
of those who took the shot would be less than .25*750 = 187.5.)
Exercise 6.3.7.
It is known that 78 percent of the microwave ovens last more than
five years. A SQC inspector sampled 600 microwaves.
What would be the approximate probability that more than 78 percent of this sample would last more than
five years?
( Equivalently, find the probability that more than .78*600 = 468 of this sample would last more than five years.)