Math 365, Elementary Statistics |
Chapter 1: The Language and Terminology
Satya Mandal
Introduction
Most people understand statistics as the study of the numerical features of a subject or population. A statistician's understanding is not much different; a statistician would only add an emphasis on how it is done. Some statisticians define statistics as the scientific and mathematical study of the methods of collecting data, summarizing and presenting data, and drawing inferences from data. The American Statistical Association defines statistics as follows:

Statistics is the scientific application of mathematical principles to the collection, analysis, and presentation of numerical data. Statisticians contribute to scientific enquiry by applying their mathematical and statistical knowledge to the design of surveys and experiments; the collection, processing, and analysis of data; and the interpretation of the results.

The goal of this course is to learn some commonly used methods of using sample data to draw inferences about a population, and the mathematical basis behind such methods. The following is an example.

Example. Suppose you want to estimate the mean (average) weight of the fish population in the nearest lake. The mean weight is a characteristic of the whole fish population in the lake. To estimate it, you catch a small sample of fish from the lake and compute the mean weight of the sample, called the sample mean. You then declare that this sample mean is an estimate of the mean weight of the whole population (also called the population mean).

Another point about the nature of statistics as a science is that it is not a deterministic science. It does not have laws like "force is equal to mass times acceleration." Statements in statistics come with a probability (i.e., a quantified chance) of being correct. When a weatherman says that it will rain today, he means that there is, say, a ninety-five percent chance that it will rain today. Roughly, this means that if he makes the same prediction one hundred times, he will be correct about 95 times, and it will not rain on the other 5 days. The problem is that sometimes a weatherman will hide the information that the chance is only 95 percent. Such information hiding is sometimes done for simplicity.

Skepticism
Skepticism about statistics is widespread, and often justifiably so. It may not be an overstatement to say that statistics is misused and abused on a regular basis. To put it sarcastically, the abuse of statistics to generate opinion may already be a branch of science or sociology based on scientific theory and models. The part that is based on scientific models may sometimes be ethically wrong, but its scientific validity cannot be denied. Unfortunately, such methods include misleading the public with false data and misinformation. On Sunday talk shows, pundits and political opinion makers try to justify opposing points of view, sometimes based on data from respectable sources. It is a fair question: how can something be a science when it justifies two opposing points of view? While there is no cure for misleading or incorrect information, sometimes both views may be statistically correct, each emphasizing a different aspect of the statistical inferences. The following is an example.

Example. In December 1998, the House of Representatives impeached President Bill Clinton. In February 1999, the President was acquitted by the Senate. (In an impeachment trial, the House works like the prosecutor and the Senate works like the jury. Search the internet for more information.) President Clinton was formally charged with perjury and obstruction of justice. Both charges stemmed from allegations of sexual liaison and harassment. During the impeachment process, there was a long political discourse about the morality and legality of the whole episode. The following would be a typical discussion on TV.
The implication here is that one of them was "wrong." But the science of statistics says that both were correct. Data was collected and analyzed, and it was found that the majority of Americans think that character matters and that the majority of Americans think the president is doing a good job. It does not matter to the science of statistics which of the statistically established facts one would have preferred.

Historically, during the early development of statistics, skepticism was of a different nature. The validity of its scientific foundation was in question. Statistics was compared with astrology, because both make predictions about the unknown. An anecdote follows. When the proposal to establish the Indian Statistical Institute in Calcutta was considered by the government of India in the early part of the last century, some critics asked: then why not an institute of astrology? At the inception of statistics as a science there was a lot of skepticism about its scientific validity. Those days are gone, and statistics is no longer likened to astrology! Statistics is a well-founded and precise science. It is nondeterministic in nature; it makes only precise probabilistic statements.

Descriptive and Inferential Statistics
In this course we will talk about two branches of statistics. The first is called descriptive statistics, which deals with methods of processing, summarizing, and presenting data. The other deals with the scientific methods of drawing inferences and forecasting from data, and is called inferential or inductive statistics.

Course Organization
The course has nine lessons that can be divided into three parts:
In the rest of this lesson and the next we deal with descriptive statistics, which includes the presentation of data in the form of tables and graphs, and the computation of various averages of data.

Basic Definitions and Concepts
In statistics, we use a small "sample" to make inferences about a "big population". Statistics serves a purpose only when we do not have a way to find full or accurate information about the whole population. Sometimes the population is such that it is intrinsically impossible to find full and accurate information. In other cases, the obstacle is the cost of full enumeration. The following are some examples:
Population and Sample
Definitions. A complete collection of data on the group under study is called the population or the universe. A member of the population is called a sampling unit. Therefore, the population consists of all its sampling units. A sample is a collection of sampling units selected from the population. Most often, we will work with numerical characteristics (like height, weight, and salary) of a group. So usually the population is a large collection of numbers and the sample is a small subset of the population.

Example. Suppose we are studying the daily rainfall in Lawrence. Since daily rainfall could be anything from 0 inches upward, the population here is all nonnegative numbers (i.e., the interval [0, ∞)). A sample from this population would be the observed amount of daily rainfall in Lawrence on some number of days. A sample of size 11 would be the observed daily rainfall in Lawrence on 11 days.

Variables
A variable is something that varies or changes value. Most often, we consider numerical variables. Numerical variables are also called quantitative variables. Examples of quantitative variables include height, length, weight, number of typos in books, number of credit hours completed by students, number of accidents (or number of anything), and time. Non-numerical variables are also considered. They are called qualitative variables. Examples of qualitative variables include blood group and gender. In fact, any genetic property (genotype or phenotype) is a qualitative variable, because it varies from human to human (or tree to tree). In Chapter 4, we will have an elaborate discussion of a specific type of variable called a random variable, which is more relevant for our purpose.

Parameters and Statistics
Definition 1. Given a set of data, any numerical value computed from the data using a formula or a rule is called a quantitative measure of the data.

Definition 2. A quantitative measure of population data is called a parameter. In other words, parameters belong to the whole population and are computed (if feasible) from the WHOLE population data. Examples: the average GPA of all KU students, the height of the tallest student at KU, the average income of the entire KU student population. One way to study a population is to know some of the parameters of the population. Unfortunately, computing such parameters could be expensive or even impossible. Essentially, parameters are unknown, and the main game of statistics is to try to estimate parameters on the basis of small samples collected from the population.

Definition 3. A quantitative measure of sample data is called a statistic. Any number that we compute from a sample is a statistic. We use these statistics to estimate the parameters of the population. For example, the average height computed from a sample is a reasonable estimate for the (parameter) average height of the KU student population. Obviously, we do not expect the value of the statistic to be exactly equal to the parameter value. Hopefully, the error will be small, or will exceed our tolerable limit very rarely (say, once in 100 trials).

Why do we need a statistic? Sometimes it is impossible to know the actual value of a parameter. For example, let μ be the mean length of life of the light bulbs produced by a company. The company cannot test all the bulbs it produces to find the mean length of life. So the best we can do is test a few bulbs (the sample), compute the sample mean length of life of these bulbs (a statistic), and use it as an estimate of the mean length of life (the parameter μ) of all the bulbs the company produces.

Frequency Distribution
In this section we talk about the representation of data organized in tabular form. Such a representation is called a frequency distribution. We are mostly concerned with numerical data (i.e., quantitative data), but also consider some non-numerical data (i.e., qualitative data).

Example. The following is data on the blood group of 72 patients in a hospital:
We have four types of blood groups, namely, O, A, B, AB. Each of these blood groups may be referred to as a "class." The frequency of a class is defined as the number of data members that belong to that class. For example, the frequency of the class O is 31; the frequency of class A is 31. A table that lists the classes and the corresponding frequency is called the frequency distribution of this qualitative data. Following is the frequency distribution of this data:
$$\text{relative frequency of } x = \frac{\text{frequency of } x}{\text{total number of data points}}$$

$$\text{percentage frequency of } x = \frac{\text{frequency of } x}{\text{total number of data points}} \cdot 100$$
The frequency table may also contain the relative and percentage frequency. Since the data was not grouped, this would be called the frequency distribution of the ungrouped data.
Example 1.1.1 To estimate the mean time taken to complete a three-mile drive by a race car, the race car did several time trials, and the following sample of times taken (in seconds) to complete the laps was collected:
50 | 48 | 49 | 46 | 54 | 53 | 52 | 51 | 47 | 56 | 52 | 51 |
51 | 53 | 50 | 49 | 48 | 54 | 53 | 51 | 52 | 54 | 54 | 53 |
55 | 48 | 51 | 50 | 52 | 49 | 51 | 53 | 55 | 54 | 50 |
Note that there are 35 observations here. So we say that the size of the sample (or data) is 35. Also the values present are 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56. Since there are only 11 distinct values present we can make a frequency table for the ungrouped data. The following is the frequency distribution of this ungrouped data:
Time (in seconds) | Frequency | Relative Frequency | Percentage Frequency |
---|---|---|---|
46 | 1 | 1/35 | 2.86 |
47 | 1 | 1/35 | 2.86 |
48 | 3 | 3/35 | 8.57 |
49 | 3 | 3/35 | 8.57 |
50 | 4 | 4/35 | 11.43 |
51 | 6 | 6/35 | 17.14 |
52 | 4 | 4/35 | 11.43 |
53 | 5 | 5/35 | 14.29 |
54 | 5 | 5/35 | 14.29 |
55 | 2 | 2/35 | 5.71 |
56 | 1 | 1/35 | 2.86 |
Total | 35 | 1 | 100 |
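The table above can be reproduced mechanically. The following is a minimal Python sketch, not part of the original notes (which use a TI-84); the function name `frequency_table` is a hypothetical helper chosen for illustration. It counts each distinct value and applies the relative and percentage frequency formulas given earlier.

```python
from collections import Counter

# Lap times (in seconds) from Example 1.1.1.
times = [
    50, 48, 49, 46, 54, 53, 52, 51, 47, 56, 52, 51,
    51, 53, 50, 49, 48, 54, 53, 51, 52, 54, 54, 53,
    55, 48, 51, 50, 52, 49, 51, 53, 55, 54, 50,
]


def frequency_table(data):
    """Return rows of (value, frequency, relative freq., percentage freq.)."""
    n = len(data)
    counts = Counter(data)  # value -> number of occurrences
    return [
        (value, counts[value], counts[value] / n,
         round(100 * counts[value] / n, 2))
        for value in sorted(counts)
    ]


table = frequency_table(times)
for value, freq, rel, pct in table:
    print(f"{value:>3} | {freq:>2} | {freq}/{len(times)} | {pct:.2f}")
```

Sorting the distinct values first keeps the rows in the same order as the hand-built table.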
When we are working with a large set of data that contains too many distinct values, then we group the whole set of data into a few class intervals and give the corresponding "frequency" of the class. When the data is presented in this way, the data is called grouped data. The number of data members that fall in a class interval is called the class frequency and the relative and percentage frequencies are computed by the same formula as above. A list that gives various class intervals and the corresponding class frequencies in a tabular form is called a class frequency table or class frequency distribution of the data. The frequency distribution may also include the relative and percentage frequencies.
Grouped Data and Loss of Information
Note that when we construct a frequency table of ungrouped data, there is no loss of information. The original data can be reconstructed from the frequency table of ungrouped data. The only thing lost is the order in which the original data appeared.
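As a small illustration of this point, the Python sketch below builds the frequency table of a tiny data set and then rebuilds the data from the table alone. The data values are made up for illustration and do not come from these notes.

```python
from collections import Counter

# A small hypothetical data set (not from the notes).
data = [51, 48, 50, 51, 48, 51]

# Frequency table of the ungrouped data: value -> frequency.
table = Counter(data)

# Rebuild the data by repeating each value as many times as its frequency.
rebuilt = [value for value in sorted(table) for _ in range(table[value])]

print(rebuilt)  # same values as `data`; only the original order is lost
```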
Sometimes it is necessary to group data into class intervals to construct a frequency distribution. This would be the case when there are too many distinct data values present in the data, too many even to fit into a table on a regular-size page for presentation. In such situations, we group the data into a few class intervals. While a class frequency distribution is very good for presentation and may be convenient for other reasons, we lose a lot of information in this process. There is no way to recover the original data from the class frequency distribution.
To construct a class frequency table of a data set, the first question is: how many class intervals should we have? The answer is that there should be neither too few nor too many. The fewer the class intervals, the greater the loss of information. In the extreme case, if we use only one class interval, all the information would be lost. On the other hand, if we take too many, we are back to the problem of working with ungrouped data. (In this course we will always tell you how many classes to take.) Although sometimes it may be necessary to take class intervals of varying width, in this course we only consider classes of equal width. We follow these steps to construct a class frequency distribution.
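$$\text{class width} = w = \frac{R}{\text{number of classes}}$$

Here $L$ and $H$ denote the chosen lower and upper ends of the range of the data, and $R = H - L$ (as in Example 1.1.2 below). The class intervals are

$$[L, L+w],\ [L+w, L+2w],\ [L+2w, L+3w],\ \ldots,\ [H-w, H]$$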
An ambiguous situation may arise, when a data-value falls on a class boundary. Depending on the nature of data or otherwise, we would have to follow a consistent convention whether to count such data members on the left or the right class interval.
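To make the convention concrete, here is a minimal Python sketch. The function `class_index` and its parameter `boundary_goes` are hypothetical names introduced for illustration only. The function assigns a value to one of several equal-width classes and sends a value that sits exactly on an interior boundary consistently to the right (upper) or the left (lower) class. It assumes the value lies between the lowest and highest boundary.

```python
def class_index(x, low, width, n_classes, boundary_goes="right"):
    """Return the 0-based index of the class interval that x is counted in.

    The classes are [low, low + width], [low + width, low + 2*width], ...
    A value on an interior boundary is counted in the right (upper) class
    or the left (lower) class, depending on `boundary_goes`.
    Assumes low <= x <= low + n_classes * width.
    """
    offset = (x - low) / width
    k = int(offset)  # floor of offset, since offset >= 0
    if offset == k and boundary_goes == "left" and k > 0:
        k -= 1  # move a boundary value down into the lower class
    return min(k, n_classes - 1)  # the top boundary stays in the last class


# Classes [60, 80], [80, 100], ..., [140, 160]: low = 60, width = 20.
for x in (80, 95, 100):
    right = class_index(x, 60, 20, 5, "right")
    left = class_index(x, 60, 20, 5, "left")
    print(x, "-> class", right, "(right convention), class", left, "(left convention)")
```

Whichever convention is chosen, the point is that every boundary value is counted exactly once.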
A few more important definitions. The above intervals are called class intervals. The w above is called the class size or width. The lower end of the class is called lower limit and the upper end of the class is called upper limit. The class mark is the midpoint of the class, defined as follows:
$$\text{class mark} = \frac{\text{lower limit of class} + \text{upper limit of class}}{2}.$$
A class limit is also called a class boundary.
Example 1.1.2 The following is the weight (in ounces), at birth, of a certain number of babies.
74 | 105 | 124 | 110 | 119 | 137 | 96 | 110 | 120 | 115 | 140 |
65 | 135 | 123 | 129 | 72 | 121 | 117 | 96 | 107 | 80 | 91 |
74 | 123 | 124 | 124 | 134 | 78 | 138 | 106 | 130 | 97 | 145 |
93 | 133 | 128 | 96 | 126 | 124 | 125 | 127 | 62 | 127 | 92 |
95 | 118 | 126 | 94 | 127 | 121 | 117 | 124 | 93 | 135 | 156 |
143 | 125 | 120 | 147 | 138 | 72 | 119 | 89 | 81 | 113 | 91 |
133 | 127 | 138 | 122 | 110 | 113 | 100 | 115 | 110 | 135 | 141 |
97 | 127 | 120 | 110 | 107 | 111 | 126 | 132 | 120 | 108 | 148 |
143 | 103 | 92 | 124 | 150 | 86 | 121 | 98 | 74 | 85 | 99 |
We will construct a class frequency table of this data by dividing the whole range of data into class intervals.
Solution: Note that the lowest value is 62 and the highest value is 156. We take L = 60 and H = 160, so R = H - L = 100. We chose L and H precisely so that R = 100 is a "nice" number. Now we decide to have 5 class intervals, so w = R/5 = 20. According to what was said above, our classes should be [60, 80], [80, 100], [100, 120], [120, 140], [140, 160]. But if we do so, there is a risk that some data members (like 80, 100, 120, 140) will fall in two classes. To avoid this we add 0.5 to all the class boundaries. So our classes are [60.5, 80.5], [80.5, 100.5], [100.5, 120.5], [120.5, 140.5], [140.5, 160.5].
So the frequency distribution is as follows:
Classes | Frequency | Relative Frequency | Percentage Frequency |
---|---|---|---|
60.5 - 80.5 | 9 | 9/99 | 9.09 |
80.5 - 100.5 | 20 | 20/99 | 20.20 |
100.5 - 120.5 | 25 | 25/99 | 25.25 |
120.5 - 140.5 | 37 | 37/99 | 37.37 |
140.5 - 160.5 | 8 | 8/99 | 8.08 |
Total | 99 | 1 | 100 |
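The class frequencies of Example 1.1.2 can be checked with a short program. The following Python sketch is an illustration rather than the method used in these notes; the function name `class_frequencies` is hypothetical. It counts the data members falling in each class and also prints the class mark, relative frequency, and percentage frequency. Because every boundary ends in .5 and every weight is a whole number, no data member can fall on a boundary.

```python
# Birth weights (in ounces) from Example 1.1.2.
weights = [
    74, 105, 124, 110, 119, 137, 96, 110, 120, 115, 140,
    65, 135, 123, 129, 72, 121, 117, 96, 107, 80, 91,
    74, 123, 124, 124, 134, 78, 138, 106, 130, 97, 145,
    93, 133, 128, 96, 126, 124, 125, 127, 62, 127, 92,
    95, 118, 126, 94, 127, 121, 117, 124, 93, 135, 156,
    143, 125, 120, 147, 138, 72, 119, 89, 81, 113, 91,
    133, 127, 138, 122, 110, 113, 100, 115, 110, 135, 141,
    97, 127, 120, 110, 107, 111, 126, 132, 120, 108, 148,
    143, 103, 92, 124, 150, 86, 121, 98, 74, 85, 99,
]

# Class boundaries chosen in the solution: L = 60, H = 160, w = 20, shifted by 0.5.
boundaries = [60.5, 80.5, 100.5, 120.5, 140.5, 160.5]


def class_frequencies(data, boundaries):
    """Count how many data members fall strictly between consecutive boundaries."""
    counts = [0] * (len(boundaries) - 1)
    for x in data:
        for i in range(len(counts)):
            if boundaries[i] < x < boundaries[i + 1]:
                counts[i] += 1
                break
    return counts


counts = class_frequencies(weights, boundaries)
n = len(weights)
for i, freq in enumerate(counts):
    lower, upper = boundaries[i], boundaries[i + 1]
    mark = (lower + upper) / 2  # class mark = midpoint of the class
    print(f"{lower} - {upper} | mark {mark} | {freq} | {freq}/{n} | {100 * freq / n:.2f}")
```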
Because we will be using calculators (TI-84) extensively in this course, let me explain how you enter data in the TI-84.
Use of Calculators (TI-84):
Enter Your Data:
It is not easy to construct a frequency table of a data set unless you are systematic. Traditionally, we used "tally marks" to count the frequency. Now you can use software programs (e.g., Excel). Let me show you a method using a calculator (TI-84).
Exercise 1.1.1 To estimate the mean time taken to complete a
three-mile drive by a race car, the race car did several time trials,
and the following sample of times taken (in seconds) to complete the
laps was collected:
50 | 48 | 49 | 46 | 54 | 53 | 52 | 51 | 47 | 56 | 52 | 51 |
51 | 53 | 50 | 49 | 48 | 54 | 53 | 51 | 52 | 54 | 54 | 53 |
55 | 48 | 51 | 50 | 52 | 49 | 51 | 53 | 55 | 54 | 50 |
The following is the frequency distribution of this ungrouped data:
Time (in seconds) | Frequency | Relative Frequency | Percentage Frequency |
---|---|---|---|
46 | 1 | 1/35 | 2.86 |
47 | 1 | 1/35 | 2.86 |
48 | 3 | 3/35 | 8.57 |
49 | 3 | 3/35 | 8.57 |
50 | 4 | 4/35 | 11.43 |
51 | 6 | 6/35 | 17.14 |
52 | 4 | 4/35 | 11.43 |
53 | 5 | 5/35 | 14.29 |
54 | 5 | 5/35 | 14.29 |
55 | 2 | 2/35 | 5.71 |
56 | 1 | 1/35 | 2.86 |
Total | 35 | 1 | 100 |
Exercise 1.1.2. The following is the weight (in ounces), at birth, of 96 babies born in Lawrence Memorial Hospital in May 2000.
94 | 105 | 124 | 110 | 119 | 137 | 96 | 110 | 120 | 115 | 119 |
104 | 135 | 123 | 129 | 72 | 121 | 117 | 96 | 107 | 80 | 80 |
96 | 123 | 124 | 124 | 134 | 78 | 138 | 106 | 130 | 97 | 134 |
111 | 133 | 128 | 96 | 126 | 124 | 125 | 127 | 62 | 127 | 96 |
116 | 118 | 126 | 94 | 127 | 121 | 117 | 124 | 93 | 135 | 112 |
120 | 125 | 120 | 147 | 138 | 72 | 119 | 89 | 81 | 113 | 100 |
109 | 127 | 138 | 122 | 110 | 113 | 100 | 115 | 110 | 135 | 120 |
97 | 127 | 120 | 110 | 107 | 111 | 126 | 132 | 120 | 108 | 148 |
133 | 103 | 92 | 124 | 150 | 86 | 121 | 98 |
Construct a class frequency table of this data by dividing the
whole range of data into class intervals:
[60.5-70.5], [70.5-80.5], [80.5-90.5], [90.5-100.5], [100.5-110.5], [110.5-120.5], [120.5-130.5], [130.5-140.5], [140.5-150.5]
Exercise 1.1.3. The following are the lengths (in inches), at birth, of 96 babies born in Lawrence Memorial Hospital in May 2000.
18 | 18.5 | 19 | 18.5 | 19 | 21 | 18 | 19 | 20 | 20.5 |
19 | 19 | 21.5 | 19.5 | 20 | 17 | 20 | 20 | 19 | 20.5 |
18 | 18.5 | 20 | 19.5 | 20.75 | 20 | 21 | 18 | 20.5 | 20 |
21 | 19 | 20.5 | 19 | 20 | 19.5 | 17.75 | 20 | 19.5 | 20 |
20.5 | 17 | 21 | 18.5 | 20 | 20 | 20 | 18.5 | 19.5 | 19 |
18 | 20.5 | 18 | 20 | 19 | 19 | 19.5 | 20 | 20.75 | 21 |
17.75 | 19 | 18 | 19 | 20 | 18.5 | 20 | 19 | 21 | 19 |
19.5 | 20 | 20 | 19 | 19.5 | 20 | 19.5 | 18.5 | 20.5 | 19.5 |
20.25 | 20 | 19.5 | 19.5 | 20 | 20 | 20 | 21 | 20 | 19 |
18.5 | 20.5 | 21.5 | 18 | 19.5 | 18 |
Construct a frequency table for this data by dividing the whole range into class intervals:
[16-17], [17-18], [18-19], [19-20], [20-21], [21-22].
Note: If a data member falls on the boundary, count it in the
right/upper class-interval.
Exercise 1.1.4. The following data represents the number of typos in a sample of 30 books published by some publisher.
156 | 159 | 162 | 160 | 156 | 162 |
159 | 160 | 156 | 156 | 160 | 162 |
156 | 159 | 162 | 156 | 162 | 158 |
160 | 158 | 159 | 162 | 158 | 158 |
162 | 160 | 159 | 162 | 162 | 160 |
Construct a frequency table (by sorting in your calculator).
Exercise 1.1.5. Following is data on the hourly wages (paid only in whole dollars) in an industry.
9 | 11 | 8 | 9 | 10 | 11 | 7 | 10 | 12 | 13 |
7 | 11 | 8 | 11 | 14 | 9 | 10 | 9 | 11 | 7 |
13 | 13 | 14 | 12 | 9 | 8 | 12 | 14 | 15 | 9 |
9 | 7 | 12 | 7 | 12 | 7 | 7 | 11 | 13 | 9 |
11 | 9 | 9 | 9 | 10 | 14 | 11 | 12 | 14 | 7 |
Construct a frequency table (by sorting in your calculator).
Exercise 1.1.6. Following is data on the hourly wages (paid only in whole dollars) of 99 employees in an industry.
7 | 11 | 7 | 11 | 10 | 9 | 10 | 10 | 12 | 13 |
7 | 8 | 11 | 11 | 14 | 9 | 7 | 9 | 11 | 7 |
9 | 13 | 12 | 14 | 7 | 8 | 7 | 14 | 15 | 9 |
9 | 7 | 11 | 9 | 12 | 9 | 12 | 11 | 14 | 9 |
12 | 13 | 7 | 9 | 10 | 14 | 11 | 12 | 13 | 7 |
15 | 15 | 16 | 16 | 15 | 16 | 11 | 7 | 18 | 19 |
15 | 16 | 15 | 15 | 16 | 16 | 17 | 16 | 16 | 13 |
15 | 15 | 16 | 15 | 16 | 15 | 15 | 17 | 16 | 12 |
16 | 15 | 15 | 16 | 15 | 15 | 19 | 8 | 16 | 17 |
16 | 16 | 15 | 16 | 16 | 16 | 13 | 12 | 8 |
Construct a frequency table (by sorting in your calculator).
Another way to represent data is to use pictures and graphs. Such pictorial representations are commonly used in newspapers and other media outlets. Pictorial representation is particularly helpful when you have to represent data to people with limited technical background, like newspaper readers or a governmental or congressional body.
The pie chart is a commonly used pictorial representation of data.
When you do your tax return every year, you find a few pie charts in
the instruction book for form 1040. These charts show what proportion/percentage
of each tax dollar goes for particular expenses. I reproduced the following
pie charts from the 1040 instruction book of 1999.
Among pictorial representations, the most useful in this course is the histogram. The histogram of data is the graphical representation of the frequency distribution of the data, where we plot the variable on the horizontal axis and above each class interval, we erect a bar of the height equal to the frequency of the class. Such a histogram is called a frequency histogram.
If, instead, we erect bars of height equal to the relative frequency, then the graph is called a relative frequency histogram. Similarly, we can construct a percentage frequency histogram.
The following is a histogram.
We have decided to avoid unequal class lengths, which makes our discussion
of the histogram fairly simple.
Remark. Take a look at the Stem and Leaf Diagram discussed in any textbook.
Example 1.1.3. Following is the frequency table of data on height (in inches) of some babies at birth. Sketch the histogram of the following data:
Height | Frequency |
---|---|
16-17 | 3 |
17-18 | 8 |
18-19 | 34 |
19-20 | 60 |
20-21 | 72 |
21-22 | 18 |
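Before sketching by hand, it can help to see the rough shape of this histogram. The Python sketch below, added here as an illustration with a hypothetical helper `text_histogram`, prints a text version turned on its side: each class gets a horizontal bar whose length is proportional to its frequency, with one `#` standing for up to 2 observations. In a hand-drawn histogram the bars would instead stand upright over the class intervals on the horizontal axis.

```python
# Frequency table from Example 1.1.3 (height in inches, at birth).
classes = ["16-17", "17-18", "18-19", "19-20", "20-21", "21-22"]
frequencies = [3, 8, 34, 60, 72, 18]


def text_histogram(labels, freqs, scale=2):
    """Build one line per class; each '#' stands for up to `scale` observations."""
    lines = []
    for label, f in zip(labels, freqs):
        bar = "#" * ((f + scale - 1) // scale)  # round the bar length up
        lines.append(f"{label:>5} | {bar} {f}")
    return lines


lines = text_histogram(classes, frequencies)
for line in lines:
    print(line)
```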
For a given value x of a variable, the cumulative frequency of the data, for x, is the number of data members that are less than or equal to x.
Definition. Given a frequency distribution of some data, for a class boundary x, the cumulative frequency is the sum of the frequencies of all the classes that lie at or below x. The cumulative frequency distribution is a table that gives the cumulative frequencies against some x values (for us, the class boundaries). We also define the cumulative relative frequency and cumulative percentage frequency as follows:
$$\text{cumulative relative frequency of } x = \frac{\text{cumulative frequency of } x}{\text{total number of data points}}$$

$$\text{cumulative percentage frequency of } x = \frac{\text{cumulative frequency of } x}{\text{total number of data points}} \times 100$$
Example 1.1.4 Once again we consider the data on birth weight of babies in Example 1.1.2 that we discussed in the last section. A cumulative frequency distribution can be constructed from the frequency distribution.
Solution: We have seen the frequency distribution before. The following is the cumulative frequency distribution:
Weight | Cumulative Frequency | Cumulative Relative Frequency | Cumulative Percentage Frequency |
---|---|---|---|
60.5 | 0 | 0 | 0 |
80.5 | 9 | 9/99 | 9.09 |
100.5 | 29 | 29/99 | 29.29 |
120.5 | 54 | 54/99 | 54.55 |
140.5 | 91 | 91/99 | 91.92 |
160.5 | 99 | 1 | 100 |
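The cumulative frequencies are running totals of the class frequencies. The following Python sketch, an illustration not taken from the notes, computes them with `itertools.accumulate` from the class frequencies of Example 1.1.2 and derives the cumulative relative and percentage frequencies from the formulas above.

```python
from itertools import accumulate

# Class boundaries and class frequencies from Example 1.1.2.
boundaries = [60.5, 80.5, 100.5, 120.5, 140.5, 160.5]
class_freqs = [9, 20, 25, 37, 8]

# No data lies at or below the lowest boundary, so the running total starts at 0.
cumulative = [0] + list(accumulate(class_freqs))
total = cumulative[-1]

rows = [
    (b, c, c / total, round(100 * c / total, 2))
    for b, c in zip(boundaries, cumulative)
]
for boundary, cum, rel, pct in rows:
    print(f"{boundary} | {cum} | {cum}/{total} | {pct:.2f}")
```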
Definition. The ogive
is a line graph, where we plot the variable on the horizontal axis and
the cumulative frequency on the vertical axis. If we plot the cumulative
relative frequency on the vertical axis, then the line graph is called
the relative frequency ogive.