Most people think of statistics as the study of the numerical features of a subject or population. Statisticians mean the same thing, but they also emphasize the methods of collecting data, summarizing and presenting data, and drawing inferences from data.
We all see on TV how political pundits justify opposing points of view by presenting statistics from respectable sources. How could something be a science when it justifies two opposing points of view? The answer is that statistics has a scientific basis but it can be misrepresented in use.
Example. During the saga of President Clinton's impeachment, we observed the following:
The implication here is that one of them was "wrong." But the science of statistics says that both were correct. Data was collected and analyzed, and it was found that the majority of Americans think that character matters and that the majority of Americans think the president is doing a good job. It does not matter to the science of statistics which one of the statistically established facts you or I want to believe.
Another point about the nature of statistics as a science is that it is not a deterministic science. It does not have laws like force is equal to mass times acceleration. Statements in statistics come with a probability (i.e., quantified chance) of being correct. When a weatherman says that it will rain today he means that there is, say, a ninety-five percent chance that it will rain today. Roughly, this means that if he makes the same prediction one hundred times he will be correct about 95 times, and it will not rain on the other 5 or so days. The problem is that sometimes a weatherman will hide the information that there is only a 95 percent chance. Such information hiding is sometimes done for simplicity.
Before I conclude this introduction, let me tell you an interesting anecdote about the development of this subject. When the proposal to establish the Indian Statistical Institute in Calcutta was considered by the government of India in the early part of the last century, some critics said, then why not an institute in astrology? At the inception of statistics as a science there was a lot of skepticism about its scientific validity. Those days are gone, and statistics is not likened to astrology any more! Statistics is a well-founded and precise science. It is a nondeterministic science in nature; it makes precise probabilistic statements only.
In this course we will be talking about two branches of statistics. The first one is called descriptive statistics and deals with methods of processing, summarizing, and presenting data. The other part deals with the scientific methods of drawing inferences and forecasting from the data, and is called inferential or inductive statistics.
In statistics we use a small representative "sample" to study a big "population." The reason for this is the cost or even the impossibility of studying the whole population.
Population and Sample
Definitions. A complete collection of data on the group under study is called the population or the universe.
A member of the population is called a sampling unit. Therefore, the population consists of all its sampling units.
A Sample is a collection of sampling units selected from the population.
Most often, we will work with numerical characteristics (like height, weight, and salary) of a group. So usually the population is a large collection of numbers and the sample is a small subset of the population.
Example. Suppose we are studying the daily rainfall in Lawrence. Since daily rainfall could be from 0 inches to anything above 0, the population here is all nonnegative numbers (i.e., the interval [0, ∞)). A sample from this population would be the observed amount of daily rainfall in Lawrence on some number of days. A sample of size 11 would be the observed daily rainfall in Lawrence on 11 days.
Many definitions of variables are available in standard textbooks. For our purpose the following definition will suffice.
Definition. A variable is a rule or a formula or a mechanism that associates a value with each member of the population. So, given a member w, a variable X assigns a value X(w) to w. For us X(w) will be a characteristic (like height, weight, time, salary) of the population.
Example. Suppose we are studying the KU student population. The population is the whole collection of KU students. A KU student is a sampling unit. If GPA is the "characteristic" that we are studying, then X = the GPA of a student is a variable. So, given a student, X has a value. For example:
On the other hand, if GENDER is the "characteristic" that we are studying, then Y = gender of a student is a variable. So, given a student, Y has a value. For example:
If HEIGHT is the characteristic that we are studying, then Z = height of a student is a variable.
To give another example, if credit hours completed is the characteristic studied, T = the number of course credit hours completed so far by a student is a variable.
Similarly, given any other characteristic like weight, annual income, annual expenditure, you can construct a variable for this population.
A variable that takes numerical values is called a quantitative variable. So, the variables X, Z, and T above are quantitative variables, while Y is not. A variable that takes non-numerical values is called a qualitative variable. So, the variable Y above is a qualitative variable. We will mostly be concerned with quantitative variables.
We discuss two types of quantitative variables: continuous and discrete variables. A quantitative variable that can assume any numerical value over an interval is called a continuous variable. Since Z above can (hypothetically) assume any value between 0 and 100 inches, Z is a continuous variable. T assumes only integer values and is therefore not a continuous variable.
A quantitative variable is called a discrete variable if there are breaks between its successive possible values. A different way to understand a discrete variable is that its possible values can be written down (or counted) in a finite or infinite list. We say that the values of a discrete variable are countable.
If a variable assumes only a finite number of values, then it is also called a finite variable. Otherwise the variable is called an infinite variable. A finite variable is definitely a discrete variable. The variable T above is a discrete variable.
Examples of Continuous and Discrete Variables
Parameters and Statistics
Definition 1. Given a set of data, any numerical value computed from the data using a formula or a rule is called a quantitative measure of the data.
Definition 2. A quantitative measure of population data is called a parameter. In other words, parameters belong to the whole population and are computed (if feasible) from the WHOLE population data. Examples: the average GPA of all KU students, the height of the tallest student at KU, the average income of the entire KU student population.
One way to study a population is to know some of the parameters of the population. Unfortunately, computing such parameters could be expensive or even impossible. Essentially, parameters are unknown and the main game of statistics is to try to estimate parameters on the basis of small samples collected from the population.
Definition 3. A quantitative measure of sample data is called a statistic. So, any number that we compute from a sample is a statistic. We use these statistics to estimate the parameters of the population. For example, the average height computed from a sample is a reasonable estimate for the (parameter) average height of the KU student population. Obviously, we do not expect the value of the statistic to be exactly equal to the parameter value. Hopefully, the error will be small, or will exceed our tolerable limit only very rarely (say once in 100 trials).
Why do we need a statistic?
Sometimes it will be impossible to know the actual value of a parameter. For example, let μ be the mean length of the life of light bulbs produced by a company. In this case, the company cannot test all the bulbs it produces to find a mean length. So, the best it can do is to test a few bulbs, compute the sample mean length (a statistic) of the life of these bulbs and use it as an estimate for the mean length (parameter μ) of the life for all the bulbs it produces.
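The light bulb situation can be imitated on a computer. The following is only a sketch: the population of 10,000 bulb lifetimes, its mean of 1000 hours, its spread, and the sample size of 50 are all made-up values chosen for illustration, not data from any company.

```python
import random
import statistics

# Hypothetical population: lifetimes (in hours) of 10,000 light bulbs.
# The numbers 1000 and 100 are assumptions made for this illustration.
rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.gauss(1000, 100) for _ in range(10_000)]

# Parameter: computed from the WHOLE population (rarely feasible in practice).
mu = statistics.mean(population)

# Statistic: computed from a small sample selected from the population.
sample = rng.sample(population, 50)
x_bar = statistics.mean(sample)

print(f"parameter mu     = {mu:.1f}")
print(f"statistic x_bar  = {x_bar:.1f}")
print(f"estimation error = {abs(x_bar - mu):.1f}")
```

In real life only the last two steps are available to the company: it sees the sample and x_bar, and uses x_bar as its estimate of the unknown mu.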
Definition 4. Data that has not been processed or organized in any form is called raw data. When the data is arranged in increasing or decreasing order, it is called an array. The range of the data is the difference between the largest and the smallest value of the data:
range = highest value - lowest value.
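As a small illustration of these three terms, here is a sketch in Python. The eight numbers are hypothetical values made up for this illustration.

```python
# Hypothetical raw data: values in the order they were recorded.
raw_data = [52, 47, 55, 46, 50, 49, 56, 48]

# Array: the same data arranged in increasing order.
array = sorted(raw_data)

# Range = highest value - lowest value.
# (Named data_range to avoid hiding Python's built-in range.)
data_range = max(raw_data) - min(raw_data)

print(array)       # [46, 47, 48, 49, 50, 52, 55, 56]
print(data_range)  # 10
```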
In this section we talk about representation of data organized in tabular form. Such a representation is called a frequency distribution. We are mostly concerned with numerical data (i.e., quantitative data), but also consider some non-numerical data (i.e., qualitative data).
Example. (from Khazanie, p. 18) The following is data on the blood group of 36 patients in a hospital:
We have four types of blood groups, namely, O, A, B, AB. Each of these blood groups may be referred to as a "class." The frequency of a class is defined as the number of data members that belong to that class. For example, the frequency of the class O is 16; the frequency of class A is 14. A table that lists the classes and the corresponding frequency is called the frequency distribution of this qualitative data. Following is the frequency distribution of this data:
relative frequency of x = (frequency of x) / (total # of data points)
percentage frequency of x = (frequency of x) / (total # of data points) × 100
The frequency table may also contain the relative and percentage frequency. Since we did not group the data into a few classes, we call this the frequency distribution of the ungrouped data.
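A frequency distribution with all three columns can be sketched in a few lines of Python. The ten blood groups below are hypothetical values made up for this illustration; they are not the 36 patients of the example above.

```python
from collections import Counter

# Hypothetical qualitative data (NOT the hospital data from the example).
blood_groups = ["O", "A", "O", "B", "A", "O", "AB", "O", "A", "O"]

n = len(blood_groups)              # total # of data points
frequency = Counter(blood_groups)  # class -> frequency

# For each class: (frequency, relative frequency, percentage frequency).
table = {cls: (f, f / n, 100 * f / n) for cls, f in frequency.items()}

print("class  freq  relative  percent")
for cls, (f, rel, pct) in table.items():
    print(f"{cls:<5}  {f:>4}  {rel:>8.2f}  {pct:>7.1f}")
```

Note that the frequencies add up to the total number of data points, so the relative frequencies add up to 1 and the percentage frequencies add up to 100.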
Example 1.2.1 To estimate the mean time taken to complete a three-mile drive by a race car, the race car did several time trials, and the following sample of times taken (in seconds) to complete the laps was collected:
Note that there are 35 observations here. So we say that the size of the sample (or data) is 35. Also the values present are 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56. Since there are only 11 distinct values present we can make a frequency table for the ungrouped data. The following is the frequency distribution of this ungrouped data:
When we are working with a large set of data that has too many distinct values, we group the whole set of data into a few class intervals and give the corresponding "frequency" of each class. When the data is presented in this way, the data is called grouped data. The number of data members that fall in a class interval is called the class frequency, and the relative and percentage frequencies are computed by the same formulas as above. A list that gives various class intervals and the corresponding class frequencies in a tabular form is called a class frequency table or class frequency distribution of the data. The frequency distribution may also include the relative and percentage frequencies.
Grouped Data and Loss of Information
Sometimes it is convenient or necessary to group data into class intervals and construct a class frequency distribution. This is the case when there are too many distinct numbers present in the data—too many even to fit into a simple table on a page for presentation. In such situations, we group the data in a few class intervals. While class frequency distribution is very good for presentation and convenient for other reasons, we lose a lot of information in this process. There is no way we can recover the original data from the class frequency distribution.
Given a set of data, a good question would be, How many class intervals should we have? The answer is that it should not be too few nor should it be too many. If we take too few (say one), then all the information will be lost. On the other hand, if we take too many, we will have the problem of having to work with ungrouped data. (In this course we will always tell you how many classes to take.) Although sometimes it may be necessary to take class intervals of varying width, in this course we only consider classes of equal class width.
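class width = w = R / (number of classes), where R = H - L is the length of the interval [L, H] chosen to cover the data. The class intervals are then: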
[L,L+w],[L+w,L+2w],[L+2w, L+3w], ...,[H-w,H]
Since this definition creates an ambiguous situation in which a data value may fall into two classes, we need a convention to address this.
A few more important definitions. The above intervals are called class intervals. The w above is called the class size or width. The lower end of the class is called lower limit and the upper end of the class is called upper limit. The class mark is the midpoint of the class, defined as follows:
class mark = (lower limit of class + upper limit of class) / 2
A class limit is also called a class boundary. I took a slightly different approach when I defined the classes, so that for us class limits and class boundaries are the same. Although all the approaches are essentially the same, many slightly different approaches are possible depending on the situation.
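The definitions of class width, class intervals, and class mark can be sketched as follows. The function names are hypothetical, and the values L = 60, H = 160 with 5 classes are chosen only for illustration.

```python
def class_intervals(low, high, num_classes):
    """Split [low, high] into num_classes intervals of equal width."""
    width = (high - low) / num_classes  # w = R / number of classes
    return [(low + i * width, low + (i + 1) * width)
            for i in range(num_classes)]


def class_mark(lower_limit, upper_limit):
    """Midpoint of a class."""
    return (lower_limit + upper_limit) / 2


intervals = class_intervals(60, 160, 5)
marks = [class_mark(lo, hi) for lo, hi in intervals]

print(intervals)  # [(60.0, 80.0), (80.0, 100.0), (100.0, 120.0), (120.0, 140.0), (140.0, 160.0)]
print(marks)      # [70.0, 90.0, 110.0, 130.0, 150.0]
```

Notice that each interior endpoint (80, 100, 120, 140) belongs to two neighboring intervals, which is exactly the ambiguity mentioned above.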
Example 1.2.2 The following is the weight (in ounces), at birth, of a certain number of babies.
We will construct a class frequency table of this data by dividing the whole range of data into class intervals.
Solution: Note that the lowest value is 62 and the highest value is 156. We take L = 60, H = 160, so R = H - L = 100. We made such a choice of L and H precisely so that R = 100 is a "nice" number. Now we decide to have 5 class intervals and so w = R/5 = 20. According to what I said above, our classes should be: [60, 80], [80, 100], [100, 120], [120, 140], [140, 160]. But if we do so then there is a risk that some data members (like 80, 100, 120, 140) will fall in two classes. One way to avoid this is to add 0.5 to all the class boundaries. So, our classes are [60.5, 80.5], [80.5, 100.5], [100.5, 120.5], [120.5, 140.5], [140.5, 160.5].
So the frequency distribution is as follows:
|Class interval|Frequency|Relative frequency|Percentage frequency|
|60.5 - 80.5|9|9/99|9.09|
|80.5 - 100.5|20|20/99|20.20|
|100.5 - 120.5|25|25/99|25.25|
|120.5 - 140.5|37|37/99|37.37|
|140.5 - 160.5|8|8/99|8.08|
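The relative and percentage columns can be recomputed from the class frequencies alone. The sketch below uses the five classes and frequencies of this example; the variable names are my own.

```python
# Class intervals and class frequencies from the birth-weight example.
classes = [(60.5, 80.5), (80.5, 100.5), (100.5, 120.5),
           (120.5, 140.5), (140.5, 160.5)]
frequencies = [9, 20, 25, 37, 8]

total = sum(frequencies)  # total # of data points

# Percentage frequency = (class frequency / total) * 100, to 2 decimals.
percentages = [round(100 * f / total, 2) for f in frequencies]

for (lo, hi), f, pct in zip(classes, frequencies, percentages):
    print(f"{lo} - {hi}  {f:>2}  {f}/{total}  {pct:.2f}")

print(total)        # 99
print(percentages)  # [9.09, 20.2, 25.25, 37.37, 8.08]
```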
Another way to represent data is to use pictures and graphs. We see such pictorial representation in newspapers and other sources every day. Pictorial representation is particularly important when you have to represent data to people with limited technical background, like newspaper readers or a governmental or congressional body.
The pie chart is a commonly used pictorial representation of data. When you do your tax return every year, you find a few pie charts in the instruction book for form 1040. These charts show what proportion/percentage of each tax dollar goes for particular expenses. I reproduced the following pie charts from the 1040 instruction book of 1999.
Among pictorial representations, the most useful in this course is the histogram. The histogram of data is the graphical representation of the frequency distribution of the data, where we plot the variable on the horizontal axis and above each class interval, we erect a bar of the height equal to the frequency of the class. Such a histogram is called a frequency histogram.
If, instead, we erect bars of height equal to the relative frequency, then the graph is called a relative frequency histogram. Similarly, we can construct a percentage frequency histogram.
The following is a histogram.
We have decided to avoid unequal class lengths, which makes our discussion of the histogram fairly simple.
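If no graphing tool is at hand, a rough frequency histogram can be printed as text. The sketch below uses the class frequencies of the birth-weight data in Example 1.2.2; the function name is hypothetical, and the bars run sideways (one row per class) instead of standing on the horizontal axis.

```python
# Classes and class frequencies from Example 1.2.2.
class_labels = ["60.5-80.5", "80.5-100.5", "100.5-120.5",
                "120.5-140.5", "140.5-160.5"]
frequencies = [9, 20, 25, 37, 8]


def text_histogram(labels, freqs, symbol="#"):
    """Return one line per class; the bar length equals the class frequency."""
    lines = []
    for label, f in zip(labels, freqs):
        lines.append(f"{label:>12} | {symbol * f} {f}")
    return "\n".join(lines)


print(text_histogram(class_labels, frequencies))
```

Replacing each frequency by the relative frequency (suitably scaled) would give the same shape, which is why the frequency and relative frequency histograms look alike.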
Remark. Take a look at the Stem and Leaf Diagram discussed in any textbook.
Example 1.3.1. Following is the frequency table of data on height (in inches) of some babies at birth. Sketch the histogram of the following data:
For a given value x of a variable, the cumulative frequency of the data, for x, is the number of data members that are less than or equal to x.
Definition. Given a frequency distribution of some data, for a class boundary x, the cumulative frequency is the sum of the frequencies of all the classes that lie at or below x. The cumulative frequency distribution is a table that gives the cumulative frequencies against some x values (for us the class boundaries). We also define cumulative relative frequency and cumulative percentage frequency as follows:
cumulative relative frequency of x = (cumulative frequency of x) / (total # of data points)
cumulative percentage frequency of x = (cumulative frequency of x) / (total # of data points) × 100
Example 1.3.2 Once again we consider the data on birth weight of babies in Example 1.2.2 that we discussed in the last section. A cumulative frequency distribution can be constructed from the frequency distribution.
Solution: We have seen the frequency distribution before. The following is the cumulative frequency distribution:
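The cumulative frequencies can be obtained from the class frequencies of Example 1.2.2 by a running sum. The following is a sketch in Python; the variable names are my own.

```python
from itertools import accumulate

# Upper class boundaries and class frequencies from Example 1.2.2.
upper_boundaries = [80.5, 100.5, 120.5, 140.5, 160.5]
frequencies = [9, 20, 25, 37, 8]
total = sum(frequencies)

# Cumulative frequency at a boundary x = sum of class frequencies up to x.
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / total for c in cumulative]
cumulative_percentage = [round(100 * c / total, 2) for c in cumulative]

for x, c, pct in zip(upper_boundaries, cumulative, cumulative_percentage):
    print(f"x = {x:>5}  cumulative = {c:>2}  percentage = {pct:.2f}")

print(cumulative)             # [9, 29, 54, 91, 99]
print(cumulative_percentage)  # [9.09, 29.29, 54.55, 91.92, 100.0]
```

The last cumulative frequency always equals the total number of data points, so the last cumulative percentage is 100.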
Definition. The ogive is a line graph, where we plot the variable on the horizontal axis and the cumulative frequency on the vertical axis. If we plot the cumulative relative frequency on the vertical axis, then the line graph is called the relative frequency ogive.
Because we will be using calculators (TI-83) extensively in this course, let me explain how you enter data in the TI-83.
Use of Calculators (TI-83):
Enter Your Data:
It is not easy to construct a frequency table of a data set unless you are systematic. Traditionally, we used "tally marks" to count the frequency. Now you can use some software programs (e.g., Excel). Let me show you a method, using a calculator (TI-83).
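For readers who prefer software, the same systematic idea (sort the data first, then count each run of equal values) can be sketched in Python. This is not the calculator procedure itself, and the eight data values are hypothetical.

```python
from itertools import groupby

# Hypothetical data values, made up for this illustration.
data = [48, 46, 50, 48, 47, 50, 48, 46]

# Step 1: sort, so that equal values sit next to each other.
sorted_data = sorted(data)

# Step 2: count the length of each run of equal values.
frequency_table = [(value, len(list(group)))
                   for value, group in groupby(sorted_data)]

for value, freq in frequency_table:
    print(f"{value}: {freq}")

print(frequency_table)  # [(46, 2), (47, 1), (48, 3), (50, 2)]
```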
Exercise 1.2.1 To estimate the mean time taken to complete a three-mile drive by a race car, the race car did several time trials, and the following sample of times taken (in seconds) to complete the laps was collected:
The following is the frequency distribution of this ungrouped data:
Construct a histogram.
Exercise 1.2.2. The following is the weight (in ounces), at birth, of 96 babies born in Lawrence Memorial Hospital in May 2000.
Construct a class frequency table of this data by dividing the whole range of data into class intervals:
[60.5-70.5], [70.5-80.5], [80.5-90.5], [90.5-100.5], [100.5-110.5], [110.5-120.5], [120.5-130.5], [130.5-140.5], [140.5-150.5]
Exercise 1.2.3. The following are the lengths (in inches), at birth, of 96 babies born in Lawrence Memorial Hospital in May 2000.
Construct a frequency table for this data by dividing the whole range into class intervals:
[16-17], [17-18], [18-19], [19-20], [20-21], [21-22].
Note: If a data member falls on the boundary, count it in the
Exercise 1.2.4. The following data represents the number of typos in a sample of 30 books published by some publisher.
Construct a frequency table (by sorting in your calculator). Also construct
Exercise 1.2.5. Following is data on the hourly wages (paid only in whole dollars) in an industry.
Construct a frequency table (by sorting in your calculator). Also construct
Exercise 1.2.6. Following is data on the hourly wages (paid only in whole dollars) of 99 employees in an industry.
Construct a frequency table (by sorting in your calculator).