Math 105, Topics in Mathematics
Lesson 1: Data and Statistical Studies
Introduction
We define statistics as the study of a large population on the basis of a small data sample. We make inferences about the population based on the sample data. What is data? The word "data" is used in a very general sense these days, and it may mean different things in different contexts or industries. In the computer industry, "data" can mean any computer file. In the telephone and IT industries, "data" means the signals sent by cable or wireless, which may be just a sequence of 0s and 1s. We are mainly interested in numerical data. For us, data are numbers that describe a numerical characteristic of a certain number of members of the population. We may talk about data on height, weight, number of typos, and so on. In this course we talk about data in the context of statistics, where we try to understand a big population on the basis of a small sample.
1.1 What Is a Statistical Population?
In statistics we try to understand, or make inferences or projections about, a large collection of similar objects. Such a collection of individuals or objects under study is called a population.
Example 1.1. The following are examples of populations.
Often, we focus on a particular characteristic (like height, weight, or annual income) of the populations in these examples, and consider the population as a collection of numbers. For example, if we are studying the income distribution of the U.S. population, we regard the list of annual incomes of the whole U.S. population as the population. Note also that this list of numbers, the population, is unknown to the statistician; if it were known, there would be nothing for the statistician to study.
The N-VALUE: The total number of members in the population under study is called the N-value of the population. It is more commonly known as the population size. The N-value is often unknown, because an accurate head count of all the members of the population is either impossible or too expensive. In that case, the N-value must be estimated by statistical methods. The following is one method of estimating N-values.
1.2 The Capture-recapture Method: Estimating N-value
Suppose we want to estimate the number of fish in Clinton Lake. Let N be the number of fish in the lake. Using the capture-recapture method, we do the following.
Step 1. (The capture) Capture a sample of m fish, tag them, and release them back into the water.
Step 2. (The recapture) After everything has settled down, capture a new sample of n fish. Count the number of tagged fish. Suppose that k of them are tagged.
It is reasonable to assume that the fraction of tagged fish in the lake is approximately the fraction of tagged fish in the second sample, that is, m/N = k/n. Solving for N gives the estimate N ≈ mn/k.
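The proportion m/N = k/n can be turned into a small calculation. The following is a minimal Python sketch of that arithmetic; the function name estimate_population and the fish counts are hypothetical, chosen only for illustration, and are not from the lesson.

```python
def estimate_population(m, n, k):
    """Capture-recapture estimate of N, obtained by solving m/N = k/n for N.

    m: number tagged in the first capture
    n: size of the second (recapture) sample
    k: number of tagged members found in the second sample
    """
    if k <= 0:
        # With no tagged fish recaptured, the proportion cannot be solved for N.
        raise ValueError("need at least one tagged member in the recapture sample")
    if k > m or k > n:
        raise ValueError("k cannot exceed m or n")
    return m * n / k


# Hypothetical numbers: tag 200 fish, recapture 150, find 12 of them tagged.
print(estimate_population(200, 150, 12))  # 2500.0
```

The estimate is only as good as the assumption behind it: the tagged fish must have mixed back into the lake well enough that the second sample reflects the true fraction of tagged fish.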
1.3 Statistical Studies: Census, Surveys, Public Opinion Polls, Clinical Studies
Census
Article 1, Section 2 of the Constitution of the United States mandates that a national census be conducted every ten years. By census we mean an official enumeration of the population. Many other countries also conduct a census at regular intervals, commonly every ten years. Following are some comments about the census:
Surveys
A more realistic and economical alternative to a census is to collect data only from a small subgroup and then use this data to make inferences about the whole population. This approach is called a survey, and the subgroup of the population from which the data is collected is called a sample. The basic idea behind a survey is that if we can find a "representative" sample of the whole population (that is, one that is not biased), then what we need to know about the population can be inferred from that sample.
Public Opinion Polls
We all know about public opinion polls. During any election season, about a dozen polling organizations publish poll numbers. You can look at any standard textbook for a general discussion of polls. Some of them explain how and why the predictions made by various opinion polls in the presidential elections of 1936 (Franklin Roosevelt vs. Alfred Landon) and 1948 (Harry Truman vs. Thomas Dewey) went wrong. It happened because the sampling methods of the time were not sophisticated enough, and the samples the pollsters drew failed to represent the whole population. These days, it is fairly common for about a dozen polls to predict opposite outcomes during an election season. This, too, happens because of a failure to collect a representative sample. Traditionally, such polls used the listings in telephone books. More recently, the advent of cell phones has created confusion in the polling industry, because cell phones are not listed and some people do not own a landline.
Clinical Studies
When a vaccine or a new drug is tested, the statistical methods used are interesting. Following are some of the main points regarding the process:
1.4 Sampling Methods
Random Sampling
Developing a "representative sample" is a real challenge for a statistician (the rest is mathematics and is easy, because it has already been worked out). If a statistician picks the sample members personally, human bias is almost bound to result in a "biased sample." Whatever method we use to select a sample, the selection of the sample members must be random. That means that mathematics and methods of chance must guide the selection of sample members. A sample picked in such a manner is called a random sample, and the method is called random sampling. Another important concern regarding sampling is its cost. We briefly describe two methods of random sampling here.
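As a general illustration of letting chance, rather than a person, choose the sample members, here is a minimal Python sketch. It draws members without replacement so that every member has the same chance of being picked; that particular scheme, the helper name simple_random_sample, and the population of 10,000 ID numbers are assumptions made for this example and may or may not match the methods the lesson lists.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Pick sample_size distinct members of population purely by chance."""
    rng = random.Random(seed)  # a seed makes the draw repeatable for checking
    return rng.sample(population, sample_size)


# Hypothetical population: 10,000 members identified by ID numbers 1..10000.
members = list(range(1, 10001))
sample = simple_random_sample(members, 1000, seed=1)

# The sample has 1000 members and no member appears twice.
print(len(sample), len(set(sample)))  # 1000 1000
```

Because the random number generator makes every choice, the statistician's preferences cannot influence which members end up in the sample.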
Sample Size
The sample size required for statistical studies need not be very large, even when the population is large. In practice, it is often less than 1500. If you follow CNN polls or others, they normally sample from 700 to 1200 people.
Sampling: Terminology and Key Concepts
The job of a statistician is to make inferences about a large population on the basis of a (small) sample.
Example. Suppose we want to understand the income distribution of the U.S. population, and we want to know the average income of the U.S. population.
Sampling Error
A statistic used to estimate a parameter is only an estimate. We do not expect the value of the statistic to be exactly equal to the value of the parameter. In the above example, we would not expect the sample average income to be exactly equal to the average U.S. income. The difference between the actual value of the parameter and the computed/observed value of the statistic used to estimate the parameter is called the sampling error. There are two types of sampling errors:
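The definition of sampling error as the difference between a parameter and the statistic that estimates it can be illustrated with a small simulation. The following Python sketch is hypothetical: the incomes are made-up numbers drawn uniformly between 10,000 and 150,000, not real U.S. data, and the population is fully known here only so that the parameter can be computed for comparison (in a real study it would be unknown).

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 made-up annual incomes.
population = [rng.uniform(10_000, 150_000) for _ in range(100_000)]

# Parameter: the average income of the whole population.
parameter = statistics.mean(population)

# Statistic: the average income of a random sample of 1000 members.
sample = rng.sample(population, 1000)
statistic = statistics.mean(sample)

# Sampling error: actual value of the parameter minus the observed statistic.
sampling_error = parameter - statistic

print(f"population average (parameter): {parameter:.2f}")
print(f"sample average (statistic):     {statistic:.2f}")
print(f"sampling error:                 {sampling_error:.2f}")
```

Drawing a different sample (for example, by changing the seed) gives a different statistic and therefore a different sampling error, while the parameter stays the same.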