Math 365, Elementary Statistics

Lesson 6 : Elements of Sampling Distribution

Satya Mandal

Due Date: Visit the homework site.

6.1 Sampling Distribution

The goal of this course has been to develop methods and to use sample statistics t (or T) to estimate the population parameters θ. For example, to estimate the mean weight μ (the parameter) of the fish population in the nearest lake, you may catch a sample of fish, compute the mean weight x̄ (the statistic) of this sample, and declare it an estimate of the population mean μ.

Since t is only an estimate of θ, there is an error ε = |t - θ|. We would like this error ε to be small, that is, within a specified tolerance ρ. Further, we would like ε to stay within ρ most of the time. For example, we may require that ε be within our tolerance in at least 90 percent of all our trials.

In fact, our estimate t is a variable number and it varies each time we take a sample. We denote this variable by T. Whenever we have a sample, T has a value T=t.

Indeed, T is a random variable on the sample space of all the possible samples. Therefore, T has a probability distribution. Since T depends on samples, its probability distribution is called a sampling distribution. When we say that "the error ε should be within our tolerance ρ at least 90 percent of the time", we mean that

P(|T - θ| ≤ ρ) ≥ .90.

In particular, the sample means x̄ of numerical data that we computed in Lesson 2 are the observed values of a random variable X̄, corresponding to the sample data we had. Similarly, the sample variances s² that we computed in Lesson 2 are the observed values of a random variable S². Each time you collect a sample (or data), the computed sample mean x̄ (respectively, the variance s², standard deviation s) is the value of the random variable X̄ (respectively, S², S) for that sample.


Example. Suppose we want to study the height distribution of the U.S. population. Let X represent the height of a randomly selected member of the U.S. population.

We collect a sample of size n. The sample consists of n numbers

x1, x2, …, xn

representing the height of n individuals.

We shall consider the height xi of the ith individual as the observed value of a random variable Xi.
Here Xi denotes the height of the ith member of the sample, who could be anybody from the population.
Therefore, these n measurements

x1, x2, …, xn

are, respectively, the observed values of n random variables

X1, X2, …, Xn.

We (re)define the sample mean X̄ as the random variable

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$

We also (re)define the sample variance S² as the random variable

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$

So, the sample means that we computed before in Lesson 2 are the values of the random variable X̄.
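As a small illustration of these two definitions, the sketch below computes the observed values x̄ and s² for one sample. The five heights are made-up numbers, used only to show the arithmetic; Python's `statistics.variance` divides by n - 1, which matches the definition of S² above.

```python
from statistics import mean, variance

# Hypothetical heights (in inches) of n = 5 sampled individuals; made-up data.
heights = [64.0, 70.0, 68.0, 72.0, 66.0]

x_bar = mean(heights)          # observed value of the sample mean
s_squared = variance(heights)  # observed value of the sample variance (divides by n - 1)

print("sample mean:", x_bar)
print("sample variance:", s_squared)
```

A different sample would give different values of x̄ and s², which is exactly why they are treated as values of random variables.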


We will consider the sampling distribution of the sample mean X̄. (In Lesson 7, we will briefly mention the sampling distribution of the sample variance S².)


Sampling Types

There are many ways to do sampling. Most commonly discussed among them are

  1. Sampling without replacement,
  2. Sampling with replacement.

Sampling without replacement is the type of sampling where, once a member is selected, it is excluded from subsequent selections. It is analogous to selecting n balls from a box of N balls, one by one, without putting any ball back in the box before the next selection. This type of sampling rules out the possibility of selecting a member more than once. For small populations, the possibility of selecting a member twice may be significant, so sampling without replacement is appropriate.

Sampling with replacement is the type of sampling where each selection is made without any regard to previous selections. In other words, each time a sample member is drawn, it is placed back in the population before the next selection is made. This way, each selection is made from the same whole population. A member could, therefore, be selected more than once. This may seem unnatural, but with large populations a repeated selection is unlikely, and this scheme is the most natural from the statistical point of view. (How often would one receive calls twice for the same poll?) We will only consider sampling with replacement.
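The two sampling types can be tried out with Python's standard `random` module. This is only a sketch: the population of ten labeled members and the seed are arbitrary choices made for illustration.

```python
import random

rng = random.Random(0)             # fixed seed so that the run is repeatable
population = list(range(1, 11))    # a hypothetical population of N = 10 labeled members

# Sampling without replacement: a selected member is excluded from later selections.
without_replacement = rng.sample(population, k=5)

# Sampling with replacement: every draw is made from the whole population,
# so the same member may be selected more than once.
with_replacement = rng.choices(population, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

The first list can never contain a repeated label; the second one may.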


Properties

Let X (like height) be a random variable with mean μ and variance σ². Let X1, X2, …, Xn be a sample from the X-population. We assume that the sampling was done with replacement. Such a sample has the following properties.

  1. X is called the parent population or the population random variable. Also, μ and σ² are called the population mean and variance.
  2. Each sample member Xi has the same distribution as X. So, the mean of Xi is μ and the variance of Xi is σ².
  3. The sample members X1, X2, …, Xn are mutually independent. (In fact, one has to ensure that they are drawn independently.)
  4. The distribution of X̄ is called the sampling distribution of X̄.
  5. Theorem. The mean of the sample mean X̄ is the population mean μ, that is

    E(X̄) = E(X) = μ.

    The variance of the sample mean X̄ is given by

    Var(X̄) = σ²/n.

    So, the standard deviation of X̄, denoted by σ_X̄, is given by

    σ_X̄ = σ/√n.

  6. Definition. The standard deviation σ_X̄ is also called the standard error.
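The theorem in item 5 can be checked exactly on a tiny example. The sketch below assumes a hypothetical population consisting of the six faces of a die, each equally likely, and enumerates every ordered sample of size n = 2 drawn with replacement. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical population, each value equally likely
n = 2                             # sample size

N = len(population)
mu = Fraction(sum(population), N)                              # population mean
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / N  # population variance

# With replacement, all N**n ordered samples are equally likely.
sample_means = [Fraction(sum(s), n) for s in product(population, repeat=n)]

mean_of_xbar = sum(sample_means) / len(sample_means)
var_of_xbar = sum((m - mean_of_xbar) ** 2 for m in sample_means) / len(sample_means)

print("E(sample mean)   =", mean_of_xbar, "  population mean =", mu)
print("Var(sample mean) =", var_of_xbar, "  sigma^2 / n     =", sigma2 / n)
```

Both comparisons come out equal, as the theorem states for sampling with replacement.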

6.2 Central Limit Theorem

The following theorem describes the sampling distribution of the sample mean. It is called the Central Limit Theorem (CLT).

Theorem (CLT). Suppose X1, X2, …, Xn is a sample from a population X with mean μ and variance σ².

Assume n is large. Then the sample mean X̄ is, approximately, distributed as

N(μ, σ_X̄)    (normal with mean μ and standard deviation σ_X̄)

where σ_X̄ = σ/√n.

Therefore, approximately,

P(a < X̄ < b) = P(L < Z < R)

where L = (a - μ)/σ_X̄    and    R = (b - μ)/σ_X̄,

OR

$$P(a < \bar{X} < b) = P\left( \frac{a - \mu}{\sigma/\sqrt{n}} < Z < \frac{b - \mu}{\sigma/\sqrt{n}} \right).$$

Further, if the parent population X is normal, then both statements above are exact.

Standard Error and Precision

The standard error σ_X̄ = σ/√n decreases to zero as the sample size n increases. Because of this, when estimating the mean μ, the sample mean X̄ can simultaneously achieve precision and level of confidence (i.e., the probability of a given precision) by increasing the sample size n. The following animation is a demonstration of the same.

Animation 6.2.1
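The same effect can be tabulated numerically. The sketch below borrows σ = 310 from Exercise 6.2.1; the tolerance ρ = 50 and the list of sample sizes are arbitrary choices made for illustration. For each n it computes the standard error σ/√n and the CLT approximation of P(|X̄ - μ| ≤ ρ).

```python
from math import sqrt
from statistics import NormalDist

sigma = 310.0   # population standard deviation (borrowed from Exercise 6.2.1)
rho = 50.0      # hypothetical tolerance for the error |X̄ - μ|

z = NormalDist()  # standard normal distribution
rows = []
for n in (16, 64, 256, 1024):
    se = sigma / sqrt(n)                       # standard error σ/√n
    conf = z.cdf(rho / se) - z.cdf(-rho / se)  # P(|X̄ - μ| ≤ ρ), by the CLT
    rows.append((n, se, conf))
    print(f"n = {n:5d}   standard error = {se:8.4f}   P(error <= rho) = {conf:.4f}")
```

As n grows, the standard error shrinks and the probability of staying within the fixed tolerance rises toward 1.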

Problem Solving: The Central Limit Theorem (CLT) is used to compute approximate probabilities for the sample mean X̄. This is similar to the normal approximation to the Binomial (Section 5.3). The steps we follow are

  1. The mean μ_X̄ of X̄ remains the same as the population mean μ.
  2. Compute the standard deviation σ_X̄ = σ/√n of X̄.
  3. Standardize.
  4. Use the normalcdf function of the TI-84.
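For readers without a TI-84, the four steps can be sketched in Python. The function names `normalcdf` and `prob_sample_mean_between` are hypothetical helpers written for this illustration (the first only imitates the calculator function for a standard normal Z), and the numbers in the usage lines are made up.

```python
from math import sqrt
from statistics import NormalDist

def normalcdf(lower, upper):
    """P(lower < Z < upper) for a standard normal Z (imitates the TI-84 function)."""
    z = NormalDist()
    return z.cdf(upper) - z.cdf(lower)

def prob_sample_mean_between(a, b, mu, sigma, n):
    """Approximate P(a < X̄ < b) by the CLT."""
    # Step 1: the mean of X̄ is the population mean mu.
    se = sigma / sqrt(n)           # Step 2: standard error σ/√n
    left = (a - mu) / se           # Step 3: standardize both endpoints
    right = (b - mu) / se
    return normalcdf(left, right)  # Step 4: standard normal probability

# Hypothetical numbers: mu = 100, sigma = 15, n = 25, so the standard error is 3
# and the interval (97, 103) is one standard error on each side of the mean.
p = prob_sample_mean_between(97.0, 103.0, mu=100.0, sigma=15.0, n=25)
print(round(p, 4))
```

For a one-sided probability, a far-out endpoint such as -5 or 5 can stand in for the missing limit, as is done with the calculator in the exercises.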

Problems on 6.2: Central Limit Theorem

Exercise 6.2.1. It is known that the tuition paid per semester by students in a university has a distribution with mean $2,050 and standard deviation $310. If 64 students are interviewed, what is the approximate probability that the sample mean tuition paid will be above $2,060?

Solution:
Here the population mean μ = 2,050 and the population standard deviation σ = 310. The sample size n = 64.
The first step is to compute the mean
μ_X̄ = μ = 2050
and the standard deviation
σ_X̄ = σ/√n = 310/√64 = 38.75
Let X = tuition paid by a student. Then, the distribution of X̄ is, approximately, N(2050, 38.75).

Now "X̄ will be above 2060" means "X̄ > 2060".
P(2060 < X̄)
= P([2060 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
≈ P([2060 - μ]/σ_X̄ < Z)
= P([2060 - 2050]/38.75 < Z)
= P(.2581 < Z) = normalcdf(.2581, 5) = .3982
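The arithmetic of this solution can be rechecked with a few lines of Python. This is a sketch using the standard library's `NormalDist` in place of the calculator; small differences in the last decimal place can come from rounding intermediate values.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 2050.0, 310.0, 64     # values from Exercise 6.2.1

se = sigma / sqrt(n)                 # standard error of the sample mean
z_value = (2060.0 - mu) / se         # standardized cutoff
p = 1.0 - NormalDist().cdf(z_value)  # upper-tail probability P(Z > z_value)

print("standard error:", round(se, 4))
print("z:", round(z_value, 4))
print("P(sample mean > 2060):", round(p, 4))
```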

Exercise 6.2.2.
The monthly water consumption X per household in a subdivision in Kansas City has a normal distribution with mean 15000 gallons and standard deviation 3000 gallons. What is the probability that the mean consumption of the 44 households in the subdivision will exceed 16000 gallons?

Solution:
Here the population mean μ = 15000 and the population standard deviation σ = 3000. The sample size n = 44.
The first step is to compute the mean
μ_X̄ = μ = 15000
and the standard deviation
σ_X̄ = σ/√n = 3000/√44 = 452.2670
Let X = monthly water consumption of a household. Since the parent population is normal, the distribution of X̄ is N(15000, 452.2670), with no approximation needed.

Now "X̄ will exceed 16000" means "X̄ > 16000".
P(16000 < X̄)
= P([16000 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
= P([16000 - μ]/σ_X̄ < Z)
= P([16000 - 15000]/452.2670 < Z)
= P(2.2111 < Z) = normalcdf(2.2111, 5) = .0135

Exercise 6.2.3. In a class of more than a thousand students, the instructor announced after a test that the mean score was μ = 77 points and the standard deviation was σ = 24 points. You took a sample of 81 students. What would be the approximate probability that the sample mean would be less than 80?

Solution:
Here the population mean μ = 77 and the population standard deviation σ = 24. The sample size n = 81.
The first step is to compute the mean
μ_X̄ = μ = 77
and the standard deviation
σ_X̄ = σ/√n = 24/√81 = 2.6667
Let X = points scored by a student. Then, the distribution of X̄ is, approximately, N(77, 2.6667).

Now "the sample mean would be less than 80" means "X̄ < 80".

P(X̄ < 80)
= P([X̄ - μ]/σ_X̄ < [80 - μ]/σ_X̄)
≈ P(Z < [80 - μ]/σ_X̄)
= P(Z < [80 - 77]/2.6667)
= P(Z < 1.1250) = normalcdf(-5, 1.1250) = .8697

Exercise 6.2.4. The salary X of the university professors in a state has mean μ = $65,000 and standard deviation σ = $14,000. You collect a sample of 75 professors. What is the probability that the sample mean salary of these 75 professors would be above $60,000?

Solution:
Here the population mean μ = 65000 and the population standard deviation σ = 14000. The sample size n = 75.
The first step is to compute the mean
μ_X̄ = μ = 65000
and the standard deviation
σ_X̄ = σ/√n = 14000/√75 = 1616.5808
Let X = salary of a professor. Then, the distribution of X̄ is, approximately, N(65000, 1616.5808).

Now "X̄ would be above 60,000" means "X̄ > 60000".

P(60000 < X̄)
= P([60000 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
≈ P([60000 - μ]/σ_X̄ < Z)
= P([60000 - 65000]/1616.5808 < Z)
= P(-3.0929 < Z) = normalcdf(-3.0929, 5) = .9990

Exercise 6.2.5. The time X that a child spends watching TV on weekends has a normal distribution with mean μ = 330 minutes and standard deviation σ = 95 minutes. You sample 50 kids in a school. What is the probability that the sample mean time X̄ that these kids watch TV on a weekend will be less than 300 minutes?



The Following Problems are Posed in terms of the Total

Exercise 6.2.6. The weight X of fish in a lake has mean μ = 12 pounds and standard deviation σ = 4.5 pounds. Suppose you catch 150 fish. What is the probability that the total weight of the fish will be less than 1900 pounds?

Solution:
Here the population mean μ = 12 and the population standard deviation σ = 4.5. The sample size n = 150.
The first step is to compute the mean
μ_X̄ = μ = 12
and the standard deviation
σ_X̄ = σ/√n = 4.5/√150 = .3674
Let X = weight of a fish. Then, the distribution of X̄ is, approximately, N(12, .3674).


The problem is posed in terms of the total weight of all the fish. The sample mean X̄ = Total/n.

Now "the total weight will be less than 1900 pounds" means that "Total < 1900".
This means "X̄ = Total/n < 1900/n = 1900/150 = 12.6667".
P(X̄ < 12.6667)
= P([X̄ - μ]/σ_X̄ < [12.6667 - μ]/σ_X̄)
≈ P(Z < [12.6667 - μ]/σ_X̄)
= P(Z < [12.6667 - 12]/.3674)
= P(Z < 1.8146) = normalcdf(-5, 1.8146) = .9652

(Well, you are fairly sure (about 97 percent sure) that you caught less than 1900 pounds.)
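A problem about a total can also be handled without converting to the sample mean. Since Total = n·X̄, the total has mean nμ and standard deviation n·(σ/√n) = σ√n; this second route is not used in the lesson and is shown here only as a cross-check. The sketch below computes the fish probability both ways.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12.0, 4.5, 150   # values from Exercise 6.2.6
total_limit = 1900.0

# Route 1: convert the total to a sample mean, as in the solution above.
p_via_mean = NormalDist(mu, sigma / sqrt(n)).cdf(total_limit / n)

# Route 2: work with the total directly; Total = n * X̄ has
# mean n*mu and standard deviation sigma*sqrt(n).
p_via_total = NormalDist(n * mu, sigma * sqrt(n)).cdf(total_limit)

print(round(p_via_mean, 4), round(p_via_total, 4))
```

The two routes standardize to the same z value, so they agree up to floating-point rounding.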

Exercise 6.2.7. The amount X of water used when a person takes a shower has a mean μ = 30 gallons and standard deviation σ = 16 gallons. Suppose 36 people take a shower in a swimming pool facility. What is the probability that a total of more than 900 gallons of water will be used by these 36 people?

Solution:
Here the population mean μ = 30 and the population standard deviation σ = 16. The sample size n = 36.
The first step is to compute the mean
μ_X̄ = μ = 30
and the standard deviation
σ_X̄ = σ/√n = 16/√36 = 2.6667
Let X = water used when a person takes a shower.
Then, the distribution of X̄ is, approximately, N(30, 2.6667).

The problem is posed in terms of the total amount of water used. The sample mean X̄ = Total/n.

Now "a total of more than 900 gallons of water will be used" means that "Total > 900".
This means "X̄ = Total/n > 900/n = 900/36 = 25".

P(25 < X̄)
= P([25 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
≈ P([25 - μ]/σ_X̄ < Z)
= P([25 - 30]/2.6667 < Z)
= P(-1.8750 < Z) = normalcdf(-1.8750, 5) = .9696

(Well, you are fairly sure (about 97 percent sure) that more than 900 gallons will be used up.)

Exercise 6.2.8. The waiting time for the campus bus has a mean μ= 7 minutes and the standard deviation σ = 2 minutes. A student used the bus 120 times in a month. What is the probability that the student would have waited more than 900 minutes during the whole month?

Solution:
Here the population mean μ = 7 and the population standard deviation σ = 2. The sample size n = 120.
The first step is to compute the mean
μ_X̄ = μ = 7
and the standard deviation
σ_X̄ = σ/√n = 2/√120 = .1826
Let X = waiting time for the bus.
Then, the distribution of X̄ is, approximately, N(7, .1826).

The problem is posed in terms of the total waiting time. The sample mean X̄ = Total/n.

Now "a total of more than 900 minutes will be spent" means that "Total > 900".
This means "X̄ = Total/n > 900/n = 900/120 = 7.5".

P(7.5 < X̄)
= P([7.5 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
≈ P([7.5 - μ]/σ_X̄ < Z)
= P([7.5 - 7]/.1826 < Z)
= P(2.7382 < Z) = normalcdf(2.7382, 5) = .0031

(Well, the chances are fairly low that you will spend that kind of time waiting for the bus.)

Exercise 6.2.9. According to some data, the annual Kansas wheat export X has a mean 733 million dollars and standard deviation 163 million dollars. What is the probability that over the next 10 years Kansas wheat exports will exceed 8040 million dollars?

Solution:
Here the population mean μ = 733 and the population standard deviation σ = 163. The sample size n = 10.
The first step is to compute the mean
μ_X̄ = μ = 733
and the standard deviation
σ_X̄ = σ/√n = 163/√10 = 51.5451
Let X = annual Kansas wheat export. Then, the distribution of X̄ is, approximately, N(733, 51.5451). (With n = 10 the sample is not large, so this normal approximation should be read as a rough one.)


The problem is posed in terms of the total export in 10 years. Note that the sample mean X̄ = Total/n.

Now "the total export will exceed 8040" means that "Total > 8040".
This means "X̄ = Total/n > 8040/n = 8040/10 = 804".
P(804 < X̄)
= P([804 - μ]/σ_X̄ < [X̄ - μ]/σ_X̄)
≈ P([804 - μ]/σ_X̄ < Z)
= P([804 - 733]/51.5451 < Z)
= P(1.3774 < Z) = normalcdf(1.3774, 5) = .0842


6.3 Sampling Distribution of the Sample Proportion

Suppose you are a statistical quality control (SQC) officer in a lamp factory. Your job would include estimating the proportion p of defective lamps. Testing a lamp is a Bernoulli(p) trial. Correspondingly, a Bernoulli(p) random variable X is defined as follows:

X=1 if success (i. e. defective)
X=0 if failure (i. e. not defective)

The goal of this course is also to develop methods to estimate p, which is a parameter of this Bernoulli(p) random variable X. From Lesson 4, the mean μ and standard deviation σ of X are given by

μ = p           σ = √(p(1-p)).
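For reference, both formulas follow directly from the definition of X. The short derivation below is the standard one; it uses the identity Var(X) = E(X²) - μ², which this lesson does not state explicitly and is assumed here.

```latex
\begin{align*}
\mu      &= E(X) = 1 \cdot p + 0 \cdot (1-p) = p, \\
\sigma^2 &= E(X^2) - \mu^2 = \bigl(1^2 \cdot p + 0^2 \cdot (1-p)\bigr) - p^2 = p(1-p), \\
\sigma   &= \sqrt{p(1-p)}.
\end{align*}
```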

As usual, we will use a sample mean to estimate the mean μ = p. So, we take a sample X1, X2, …, Xn of size n from this Bernoulli(p) population, where Xi represents the outcome of testing the ith lamp as follows:

Xi=1 if the ith sample is a success (i.e. the ith sampled lamp is defective)
Xi=0 if the ith sample is a failure (i.e. the ith sampled lamp is not defective).

An estimator of the mean μ = p is the sample mean

X̄ = (X1 + X2 + … + Xn)/n = T/n      where we write     T = X1 + X2 + … + Xn.

Since Xi is 1 or 0 according to whether the ith trial is a success or a failure (i.e. the ith sample lamp is defective or not),

T = X1 + X2 + … + Xn = the total number of successes in these n trials

and the sample mean

X̄ = T/n = the sample proportion of successes (i.e. the proportion of defective lamps) in these n trials.

To estimate p by the sample proportion of successes X̄, knowledge of its sampling distribution is required. By the CLT, when n is large, the distribution of the sample proportion of successes X̄ is approximately

N(p, σ_X̄)         where    σ_X̄ = √(p(1-p)/n).

There is obviously nothing special about testing lamps and estimating the proportion p of defective lamps produced in the factory. The above applies to any situation of Bernoulli(p) trials. Other examples would be estimating (1) the proportion p of the voter population who favor a particular candidate, (2) the proportion p of the population who have asthma, (3) the proportion p of the seeds of a variety that germinate in a particular situation, (4) the proportion p of the population who benefit from a particular vaccine, (5) the proportion p of the population who live beyond 70.

The following theorem summarizes the above discussion and describes the sampling distribution of the sample proportion of successes.


Theorem. Let p be the proportion of a population with a certain attribute. In a sample of size n, suppose T members have the attribute (T is the number of successes) and X̄ = T/n is the proportion of successes. If n is large and p is not too close to 0 or 1, then the distribution of the sample proportion of successes X̄ is approximately

N(p, σ_X̄)         where     σ_X̄ = √(p(1-p)/n).         (Obviously, the mean of X̄ is μ_X̄ = p.)

Therefore,

P(a < X̄ < b) = P(L < Z < R)

where L = (a - p)/σ_X̄    and    R = (b - p)/σ_X̄.
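The theorem translates into a short computation. In the sketch below, `prob_sample_proportion_between` is a hypothetical helper name, and the values p = 0.5 and n = 100 are made up so that the standard error comes out to a round 0.05.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_proportion_between(a, b, p, n):
    """Approximate P(a < X̄ < b) for the sample proportion, by the normal approximation."""
    se = sqrt(p * (1 - p) / n)   # standard error √(p(1-p)/n)
    left = (a - p) / se          # L in the theorem
    right = (b - p) / se         # R in the theorem
    z = NormalDist()
    return z.cdf(right) - z.cdf(left)

# Hypothetical numbers: p = 0.5 and n = 100 give a standard error of 0.05,
# so the interval (0.45, 0.55) is one standard error on each side of p.
prob = prob_sample_proportion_between(0.45, 0.55, p=0.5, n=100)
print(round(prob, 4))
```

The theorem's conditions still apply: the approximation is meant for large n with p not too close to 0 or 1.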

Remark. As we discussed in Lesson 4, the number of successes T has a Binomial(n, p) distribution. We also used the normal approximation to the Binomial in Section 5.3. The normal approximation of the sample proportion of successes X̄ given above is not really different from the normal approximation of the binomial random variable. The main difference is that we have an eye to using the distribution of X̄ to estimate p. For problem solving, another difference is that we ignore the continuity correction, whose effect becomes small as n grows.
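How much the continuity correction matters for a particular n and p can be checked directly against the exact binomial probability. The sketch below uses the numbers of Exercise 6.3.1 (n = 180, p = .18, and the event T > 36); the exact sum and the corrected approximation are additions for comparison and are not part of the lesson's method.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 180, 0.18   # values from Exercise 6.3.1
cutoff = 36        # the event T > 36 is the same as X̄ > 0.20

# Exact binomial probability P(T > cutoff) = 1 - P(T <= cutoff).
exact = 1.0 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(cutoff + 1))

z = NormalDist()
sd_T = sqrt(n * p * (1 - p))   # standard deviation of the count T

# Normal approximation without continuity correction (the method of this section).
no_correction = 1.0 - z.cdf((cutoff - n * p) / sd_T)

# Normal approximation with continuity correction: P(T > 36) = P(T >= 36.5).
with_correction = 1.0 - z.cdf((cutoff + 0.5 - n * p) / sd_T)

print("exact binomial:      ", round(exact, 4))
print("normal, uncorrected: ", round(no_correction, 4))
print("normal, corrected:   ", round(with_correction, 4))
```

Rerunning with a larger n (and the cutoff scaled to match) shows the gap between the corrected and uncorrected values narrowing.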


Problems on 6.3: Sample Proportion


Exercise 6.3.1. According to a report entitled "Pediatric Nutrition Surveillance" published by Centers for Disease Control (CDC) 18 percent of the children younger than two had anemia in 1997. On a particular day in that year, a pediatrician examined 180 children. What is the probability that the proportion will exceed 0.20? ( Equivalently, find the probability that the number T of children with anemia would exceed 180*.2 = 36.)

Solution:
Here the population proportion p = .18 and the sample size n = 180.
The first step is to compute the mean
μ_X̄ = p = .18
and the standard deviation of X̄

σ_X̄ = √(p(1-p)/n) = √(.18(1 - .18)/180) = .028636
(We keep up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of patients with anemia.
The distribution of X̄ is, approximately, N(.18, .028636).


Now "X̄ will exceed 0.20" means "X̄ > .20".

P(.20 < X̄)
= P([.20 - p]/σ_X̄ < [X̄ - p]/σ_X̄)
≈ P([.20 - p]/σ_X̄ < Z)                  [ The Standardization Step. ]
= P([.20 - .18]/.028636 < Z)
= P(.6984 < Z) = normalcdf(.6984, 5) = .2425

Exercise 6.3.2. In 1998, the House of Representatives impeached President Clinton. As a part of the political discourse, numerous polls were conducted and reported. One poll claimed that 75 percent of eligible voters thought the President should not be impeached. Suppose 700 voters were interviewed. Assuming the claim, what would be the probability that less than 72 percent (in this sample of 700) would have thought the President should not be impeached? (Equivalently, find the probability that fewer than .72*700 = 504 voters would have thought the President should not be impeached.)

Solution:
Here the population proportion p = .75 and the sample size n = 700.
The first step is to compute the mean
μ_X̄ = p = .75
and the standard deviation of X̄
σ_X̄ = √(p(1-p)/n) = √(.75(1 - .75)/700) = .016366
(We keep up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of voters who thought that the President should not be impeached.
The distribution of X̄ is, approximately, N(.75, .016366).


Now "less than 72 percent would have thought the President should not be impeached" means that "X̄ < .72".
P(X̄ < .72)
= P([X̄ - p]/σ_X̄ < [.72 - p]/σ_X̄)
≈ P(Z < [.72 - p]/σ_X̄)                  [ The Standardization Step. ]
= P(Z < [.72 - .75]/.016366)
= P(Z < -1.8331) = normalcdf(-5, -1.8331) = .0334

Exercise 6.3.3. It is believed that the proportion of voters (in a county) who vote by absentee ballot is p = .22. You sample 725 voters. Compute an approximate probability that the sample proportion of absentee votes will exceed 25 percent. (Equivalently, find the probability that the number of absentee votes will exceed 725*.25 = 181.25.)

Solution:
Here the population proportion p = .22 and the sample size n = 725.
The first step is to compute the mean
μ_X̄ = p = .22
and the standard deviation of X̄

σ_X̄ = √(p(1-p)/n) = √(.22(1 - .22)/725) = .015385
(We keep up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of absentee votes.
The distribution of X̄ is, approximately, N(.22, .015385).

Now "the sample proportion of absentee votes will exceed 25 percent" means "X̄ will exceed 0.25". That means "X̄ > .25".

P(.25 < X̄)
= P([.25 - p]/σ_X̄ < [X̄ - p]/σ_X̄)
≈ P([.25 - p]/σ_X̄ < Z)                  [ The Standardization Step. ]
= P([.25 - .22]/.015385 < Z)
= P(1.9500 < Z) = normalcdf(1.9500, 5) = .0256

Exercise 6.3.4. It is believed that 35 percent of the population in a county shop in a health food market. If you sample 800 individuals, what would be an approximate probability that the sample proportion of those who shop in a health food market exceeds 40 percent? (Equivalently, find the probability that the number T of those who shop in a health food market would exceed 800*.40 = 320.)

Solution:
Here the population proportion p = .35 and the sample size n = 800.
The first step is to compute the mean
μ_X̄ = p = .35

and the standard deviation of X̄

σ_X̄ = √(p(1-p)/n) = √(.35(1 - .35)/800) = .016863
(We keep up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of those who shop in a health food market.
The distribution of X̄ is, approximately, N(.35, .016863).

Now "the sample proportion of those who shop in a health food market will exceed 40 percent" means "X̄ will exceed 0.40". That means "X̄ > .40".

P(.40 < X̄)
= P([.40 - p]/σ_X̄ < [X̄ - p]/σ_X̄)
≈ P([.40 - p]/σ_X̄ < Z)                  [ The Standardization Step. ]
= P([.40 - .35]/.016863 < Z)
= P(2.9651 < Z) = normalcdf(2.9651, 5) = .0015

Exercise 6.3.5. It is known that a vaccine may cause fever as a side effect after one takes the shot. The producer of the vaccine claims that only 17 percent of those who take the shot experience such side effects. You sample 978 individuals who took the shot. Assuming the claim, what would be an approximate probability that more than 15 percent would experience the side effect? (Equivalently, find the probability that more than .15*978 = 146.7 would experience the side effect.)

Solution:
Here the population proportion p = .17 and the sample size n = 978.
The first step is to compute the mean
μ_X̄ = p = .17

and the standard deviation of X̄

σ_X̄ = √(p(1-p)/n) = √(.17(1 - .17)/978) = .012011
(We keep up to six decimal places, because we are already working with small numbers.)
Here X̄ = the sample proportion of those who experienced the side effect.
The distribution of X̄ is, approximately, N(.17, .012011).

Now "more than 15 percent would experience the side effect" means "X̄ will be more than .15". That means "X̄ > .15".

P(.15 < X̄)
= P([.15 - p]/σ_X̄ < [X̄ - p]/σ_X̄)
≈ P([.15 - p]/σ_X̄ < Z)                  [ The Standardization Step. ]
= P([.15 - .17]/.012011 < Z)
= P(-1.6651 < Z) = normalcdf(-1.6651, 5) = .9521

Exercise 6.3.6. About 27 percent of the population take flu shots. You are in a class of 750 students. Compute an approximate probability that the sample proportion of those who took the shot would be less than 25 percent. (Equivalently, find the probability that the number T of those who took the shot would be less than .25*750 = 187.5.)

Exercise 6.3.7. It is known that 78 percent of the microwave ovens last more than five years. A SQC inspector sampled 600 microwaves. What would be the approximate probability that more than 78 percent of this sample would last more than five years? ( Equivalently, find the probability that more than .78*600 = 468 of this sample would last more than five years.)
