Lesson 1: Data and Statistical Studies
Introduction
We define statistics as the study of a large population on the basis
of a small data sample. We make inferences about the population based
on the sample data.
What is data?
The word "data" is used in a very general sense these days.
It may mean different things in different contexts or industries.
In the computer industry,
the word "data" is used to mean any computer file.
In the telephone and IT industries, "data" means the signals
sent over cables or wireless links, which may be just a sequence of 0s and 1s.
We are mainly interested in numerical
data. For us, data are numbers that describe a numerical characteristic
of a certain number of members of the population. We may talk about
data on height, weight, number of typos, and so on.
In this course we talk about data in the context of statistics. In
statistics, we try to understand a big population on the basis of a
small sample.
1.1 What Is a Statistical Population?
In statistics we try to understand or make inferences or projections
about a large collection of similar objects. Such a collection of individuals
or objects under study is called a population.
Example 1.1. The following are examples of populations.
- If we are studying the income distribution of the US population, then the
population is the US population.
- If we are studying the income distribution of the immigrant American
population, then the population is the immigrant
American population.
- If we are studying the growth of the fish population in Clinton Lake,
then the population is the fish population in Clinton
Lake.
- If we are studying African elephants, then the population is
the population of African elephants.
Often, we focus on a particular characteristic
(like height, weight, or annual income)
of the populations in these examples, and
consider the population as a collection of numbers.
For example, if we are studying the income distribution of the US
population, we treat the list of annual incomes of the whole US population
as the population.
Also note that this list of numbers, the population, is unknown to
the statistician; if it were known, there would be nothing
left for the statistician to study.
The N-VALUE: The total number of members in the population under study
is called the N-value of the population.
It is more commonly known as
the population size.
The N-value is often unknown because
an accurate head count of all the members in the population is either impossible
or too expensive. In that case, you
need to estimate the N-value by statistical methods.
The following is one method of estimating N-values.
1.2 The Capture-recapture Method: Estimating N-value
Suppose we want to estimate the number of fish in Clinton Lake. Let N be the number of fish in the lake. Using the capture-recapture method,
we do the following.
Step 1. (The capture) Capture a sample of m fish, tag them, and release them back into the water.
Step 2. (The recapture) After everything has settled down, capture a new sample of n fish. Count the number of tagged fish. Suppose that
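k of them are tagged. It is reasonable to assume that the tagged fraction of the second sample is close to the tagged fraction of the whole lake; that is,
approximately, m/N = k/n.
Solving for N, we have an estimate of N given by N ≈ mn/k.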
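The proportion m/N = k/n can be solved for N to give the estimate mn/k. The following is a minimal Python sketch of that arithmetic; the function name `capture_recapture_estimate` is an illustrative choice, not part of the lesson, and the input checks are assumptions about what counts as sensible data.

```python
def capture_recapture_estimate(m: int, n: int, k: int) -> float:
    """Estimate the population size N from m/N = k/n, that is, N = m*n/k.

    m: number tagged in the first capture
    n: size of the second capture
    k: number in the second capture that carry a tag
    """
    if k <= 0:
        # With no tagged members recaptured, the proportion gives no estimate.
        raise ValueError("need at least one tagged member in the recapture")
    if k > min(m, n):
        raise ValueError("k cannot exceed m or n")
    return m * n / k


# Numbers taken from Exercise 1.2.1: m = 325, n = 525, k = 125.
estimate = capture_recapture_estimate(m=325, n=525, k=125)
print(round(estimate))  # 1365
```

The estimate is only as good as the assumption behind it: tagged and untagged members must be equally likely to be caught the second time.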
Problems on 1.2: Estimating N-value
Exercise 1.2.1. As part of a project
we made two trips to a local lake. The first day we caught m=325 fish
and tagged them. On the second day we caught n=525 fish, and of those k=125 were tagged fish. Estimate the total number of fish in the lake.
Exercise 1.2.2. Last year you tagged m = 526 birds migrating through Lawrence. This year again you captured
n = 517 birds migrating through Lawrence, and of those k = 113 were
tagged last year. Estimate the total number of birds migrating through
Lawrence every year.
Exercise 1.2.3.
You want to estimate the number of homeless people in New York. One night you identify 376 homeless people in New York. After some time, on another night, you identify 497 homeless people. Of these 497, you find that 119 were also identified the first time. Estimate the number of homeless people in New York.
Exercise 1.2.4.
To estimate the number of tigers in Sunderban you capture 194 tigers and tag them. After some time you capture 212 tigers, and of those 87 were tagged. Estimate the number of tigers in Sunderban.
1.3 Statistical Studies: Census, Surveys, Public Opinion Polls, Clinical Studies
Census
Article 1, Section 2 of the Constitution of the United States mandates
that a national census be conducted every ten years. By census we mean
an official enumeration of the population. Not only the United States,
but many countries around the world, conduct a census about every ten years.
The following are some comments about the census:
- Originally the intent of the census was to count heads for taxes
and representation. That is why it may also become a political issue
as it did during the year 2000 census.
- Census is one major source of data about the population, and the
United Nations assumes a role in the worldwide census.
- Census has often failed to count all members of the population.
It is believed that a complete count is not really possible.
- For the 2000 census, it was proposed to count the U.S. population by using statistical
sampling techniques. The Congress and the administration fought over this plan,
and it was challenged in the courts.
Surveys
A more realistic and economical alternative to a census is to collect
data only from a small subgroup and then use this data to make inferences
about the whole population. This approach is called a survey,
and the subgroup of the population from which the data is collected
is called a sample.
The basic idea behind a survey is that if we can find a "representative"
sample of the whole population (that means it is not biased) then what
we need to know about the population can be estimated from that sample.
Public Opinion Polls
We all know about public opinion polls.
During any election season, about a dozen polling organizations
publish poll numbers.
You can look at any standard textbook for a general discussion
on polls. Some of them would explain how and why the predictions
made by various opinion polls in the presidential elections
in 1936 (Franklin
Roosevelt vs. Alfred Landon) and 1948 (Harry Truman vs. Thomas Dewey)
went wrong. It happened because the contemporary sampling methods were
not sophisticated enough, and the samples the pollsters drew failed to
represent the whole population.
These days, it is fairly common
for about a dozen polls to predict opposite outcomes during an
election season. This often happens
because of a failure to collect a representative sample.
Traditionally, such polls used the listings in telephone books.
In the recent past, the advent of cell phones has created confusion
in the polling industry, because cell phones are not listed and some people do
not own a landline.
Clinical Studies
When a vaccine or a new drug is tested, the statistical methods used
are interesting. The following are some of the main points
regarding the process:
- We pick two samples to be called the control group and the treatment
group. The two samples need not have the same size.
- The treatment group receives the treatment, and the control group
does not receive the treatment.
- Neither group knows who is receiving the treatment
and who is not.
- Finally, the two groups are compared. If the treatment group does
better than the control group, then it is accepted that the treatment
is working.
1.4 Sampling Methods
Random Sampling
Developing a "representative sample" is a real challenge for
a statistician (the rest is mathematics and is easier because it
has already been worked out).
If a statistician tries to hand-pick a sample, his or her human
bias is almost bound to result in a "biased sample." Whatever
method we use to select a sample, the selection of the sample members
must be random. That means that mathematics and methods of chance must
guide the selection of sample members. A sample picked in such a manner
is called a random sample, and the method is
called random sampling.
Another important concern regarding sampling is its cost.
We briefly describe two methods of random sampling here.
- In the method of simple random sampling
each member of the population has an equal chance of being selected
in the sample.
- The other method of sampling is called stratified
sampling: First, divide the population into categories, called
strata, and randomly select a sample from these strata. Then further
divide the chosen strata into categories, called substrata, and select
a random sample of substrata from each of those strata. The process
is repeated a number of times.
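The two methods can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the population, the "urban"/"rural" labels, and the function names are made up, and the stratified version uses only one level, drawing a simple random sample inside every stratum rather than repeating the subdivision described above.

```python
import random


def simple_random_sample(population, size, rng):
    """Draw `size` members so that every member has the same chance."""
    return rng.sample(population, size)


def stratified_sample(population, stratum_of, size_per_stratum, rng):
    """Split the population into strata, then sample randomly within each."""
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        # Never ask for more members than the stratum contains.
        sample.extend(rng.sample(members, min(size_per_stratum, len(members))))
    return sample


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical population: (id, stratum label) pairs.
population = [(i, "urban" if i % 3 else "rural") for i in range(3000)]

srs = simple_random_sample(population, 30, rng)
strat = stratified_sample(population, lambda person: person[1], 15, rng)
print(len(srs), len(strat))
```

In both functions the selection is left to the random number generator, so no human judgment enters the choice of sample members.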
Sample Size
The sample size required for statistical studies need not be very large,
even when the population is large. In practice, it is often less than
1500. If you follow CNN polls or others, they normally sample from 700
to 1200 people.
Sampling: Terminology and Key Concepts
The job of a statistician is to make inferences about a large population
on the basis of a (small) sample.
- Any numerical value computed from the sample data is called a statistic.
- Any numerical value computed from the whole population data is called
a parameter.
- Unless the population is small, the actual value of a parameter
is rarely known. However, because the samples are small, we can always
compute the actual values of the statistics. The game here is to estimate
the parameters by appropriate statistics.
Example. Suppose we want to understand the
income distribution of the U.S. population, and we want to know the
average income of the U.S. population.
- Here the average U.S. income is a parameter.
- Because it is almost impossible to compute the actual value of the
parameter average U.S. income, we take a sample (say of size 1500)
and compute the average income of the sample members.
This sample average is a statistic.
- It is reasonable to use this (statistic) sample average income as
an estimate for the (parameter) average U.S. income.
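The distinction between a parameter and a statistic can be illustrated with a small simulation. The sketch below uses a synthetic list of incomes, generated from made-up numbers and not real U.S. data, so that the parameter can actually be computed and compared with the statistic; in a real study the parameter would be unknown.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic "population" of incomes (an assumption for illustration only).
population_incomes = [rng.lognormvariate(10.5, 0.8) for _ in range(200_000)]

# Parameter: computed from the whole population.
parameter = statistics.fmean(population_incomes)

# Statistic: computed from a simple random sample of size 1500.
sample = rng.sample(population_incomes, 1500)
statistic = statistics.fmean(sample)

print(f"parameter (population average): {parameter:.0f}")
print(f"statistic (sample average):     {statistic:.0f}")
```

The two printed values will be close but not equal, which is exactly the situation the example describes: the sample average serves as an estimate of the population average.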
Sampling Error
A statistic used to estimate a parameter is only an estimate. We do
not expect that the value of the statistic is exactly equal to the value
of the parameter. In the above example, we would not expect the sample
average income to be exactly equal to the average U.S. income. The difference
between the actual value of the parameter and the computed/observed
value of the statistic used to estimate the parameter is called the
sampling error.
There are two types of sampling errors:
- Chance error: Because a sample is not
the whole population, a statistic cannot be the exact value of the
parameter. Given that other things are "perfect" and identical, two
different samples will produce two different values of the statistics.
You get different estimates (i.e., the value of the statistic) from
different samples, for the same parameter. The error in estimation
that arises this way is called the chance error.
This error arises from sampling variability: which sample we happen
to draw is a matter of chance. Statisticians are comfortable
with chance error for various reasons.
  - First, the very nature of statistical methods makes this error
  unavoidable.
  - Second, by increasing sample size this error can also be controlled.
  - Finally, the statistician can tell us how often this error exceeds
  a tolerable limit.
- Sampling bias: The error that arises from
poor sampling is called sampling bias. Although
many sophisticated methods of sampling are available, implementation
is not easy. In any case, sampling bias can be reduced, and in principle
eliminated, by strictly and properly implementing the sampling methods.
Of course, this comes at the price of a higher cost of sampling.
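The claim that chance error can be controlled by increasing the sample size can be checked with a simulation. The sketch below is an illustration under assumptions: the population is synthetic, the function name `typical_chance_error` is hypothetical, and "typical" error is measured here as the average absolute gap between the sample mean and the parameter over repeated samples.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Synthetic population (made-up values); its mean is the parameter.
population = [rng.uniform(0, 100) for _ in range(50_000)]
parameter = statistics.fmean(population)


def typical_chance_error(sample_size, repetitions=200):
    """Average absolute gap between the sample mean and the parameter."""
    errors = []
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)
        errors.append(abs(statistics.fmean(sample) - parameter))
    return statistics.fmean(errors)


small = typical_chance_error(100)
large = typical_chance_error(1600)
print(f"typical chance error with n = 100:  {small:.2f}")
print(f"typical chance error with n = 1600: {large:.2f}")
```

Because every sample here is a simple random sample, the simulation shows only chance error; sampling bias would not shrink in this way as the sample grows.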