Math 365, Elementary Statistics

 

Chapter 1: The Language and Terminology

Satya Mandal

Introduction

Most people understand that statistics is the study of the numerical features of a subject or population. A statistician's understanding is not much different; a statistician would only emphasize, in addition, how it is done. Some statisticians define statistics as the scientific and mathematical study of the methods of collecting data, summarizing and presenting data, and drawing inferences from data.

The American Statistical Association defines statistics as follows: Statistics is the scientific application of mathematical principles to the collection, analysis, and presentation of numerical data. Statisticians contribute to scientific enquiry by applying their mathematical and statistical knowledge to the design of surveys and experiments; the collection, processing, and analysis of data; and the interpretation of the results.

The goal of this course is to learn some commonly used methods to use collected sample data to draw inferences about a population and the mathematical basis behind such methods. The following is an example.

Example. Suppose you want to estimate the mean (average) weight of the fish population in the nearest lake. The mean weight is a characteristic of the whole fish population in the lake. To estimate it, you catch a small sample of fish from the lake. Then you compute the mean weight of the sample, to be called the sample mean. Then you declare that this sample mean is an estimate of the mean weight of the whole population (also called the population mean).

Another point about the nature of statistics as a science is that it is not a deterministic science. It does not have laws like force equals mass times acceleration. Statements in statistics come with a probability (i.e., a quantified chance) of being correct. When a weatherman says that it will rain today, he means that there is, say, a ninety-five percent chance that it will rain today. Roughly, this means that if he makes the same prediction one hundred times, he will be correct 95 times, and it will not rain the other 5 days. The problem is that sometimes a weatherman will hide the qualification that the chance is only 95 percent. Such information hiding is sometimes done for simplicity.

Skepticism

Skepticism about statistics is widespread, and often justifiably so. It may not be an overstatement to say that statistics is misused and abused on a regular basis. To put it sarcastically, the abuse of statistics to generate opinion may already be a branch of science, or a sociology based on scientific theory and models. While the part that is based on scientific models may sometimes be ethically wrong, its scientific validity cannot be denied. Unfortunately, such methods include misleading the public with false data and misinformation.

On Sunday talk shows, pundits and political opinion makers try to justify opposing points of view, sometimes based on data from respectable sources. It would be a fair question: how could something be a science when it justifies two opposing points of view? While there is no cure for misleading or incorrect information, sometimes both sides may be statistically correct, with emphasis on different aspects of the statistical inferences. The following is an example.

Example.

In December 1998, the House of Representatives impeached President Bill Clinton. In February 1999, the President was acquitted by the Senate. (In an impeachment trial, the House acts as the prosecutor and the Senate acts as the jury.)

President Clinton was formally charged with perjury and obstruction of justice. In any case, both charges stemmed from allegations of a sexual liaison and harassment. During the impeachment process, there was a long political discourse on the morality and legalities of the whole episode. The following would be a typical discussion on TV.

  1. Clinton critics would cite data and point out that, according to statistics, the majority of Americans think that character matters.
  2. Clinton sympathizers would cite data and point out, according to statistics, that the majority of Americans think the president is doing a good job.

The implication here is that one of them was "wrong." But the science of statistics says that both were correct. Data was collected and analyzed, and it was found that the majority of Americans think that character matters and that the majority of Americans think the president is doing a good job. It does not matter to the science of statistics which one of the statistically established facts one would have desired.

Historically, during the early development of statistics, the skepticism was of a different nature: the validity of its scientific foundation was in question. Statistics was compared with astrology, because both make predictions about the unknown. An anecdote follows. When the proposal to establish the Indian Statistical Institute in Calcutta was considered by the government of India in the early part of the last century, some critics asked, then why not an institute of astrology?

At the inception of statistics as a science there was a lot of skepticism about its scientific validity. Those days are gone, and statistics is not likened to astrology any more! Statistics is a well-founded and precise science. It is a nondeterministic science in nature; it makes precise probabilistic statements only.

Descriptive and Inferential Statistics

In this course we will be talking about two branches of statistics. The first one is called descriptive statistics which deals with methods of processing, summarizing, and presenting data. The other part deals with the scientific methods of drawing inferences and forecasting from the data, and is called inferential or inductive statistics.

Course Organization

The course has nine lessons that can be divided into three parts:
  1. Chapters 1 and 2: Descriptive Statistics. The TI-84 (Silver Edition) will be used to solve problems.
  2. Chapters 3, 4, 5, 6: Probability and Mathematical Basis. There is no direct TI-84 method for these lessons. However, after explaining the mathematics involved, the DISTR key (menu) of the TI-84 (Silver Edition) will be used to compute probabilities.
  3. Chapters 7, 8, 9: Inferential Statistics or Estimation. The goal of this course is to develop methods of estimation, which will be accomplished in these lessons. Again, the DISTR key (menu) of the TI-84 (Silver Edition) will be used heavily.

In the rest of this lesson and the next we deal with descriptive statistics, which include the presentation of data in the form of tables, graphs, and computations of various averages of data.

Basic Definitions and Concepts

In statistics, we use a small "sample" to make inferences about a "big population." Statistics serves a purpose only when we do not have a way to find full or accurate information about the whole population. Sometimes the population is such that it is intrinsically impossible to find full and accurate information. The same may be true because of the cost associated with full enumeration. The following are some examples:

  1. The mean weight of the fish population in the nearest lake. Realistically, it would be impossible to catch all the fish in the lake, measure them and find the mean weight.
  2. You are a quality control inspector in a lamp factory. To give an idea to the consumers, you want to know the mean lifetime (in hours) of the lamps produced. There is no way you can measure the mean before you sell.
  3. The mean annual expenditure (in year 2011) of the KU student population.
  4. Remark. In some cases, in spite of the existence of full information regarding the whole population, samples are used for cost effectiveness. Suppose you want to know the mean GPA of the KU population. Although KU has the full information, you may not be able to access the full data and do the computations, due to the associated cost. So you may be content with a sample and the sample mean GPA as an estimate.
  5. Remark. Interestingly, the widespread advent of computers has made certain uses of statistics obsolete. Thirty years ago, KU kept its records on paper. In those days, one would have used sample data to avoid dealing with the huge amount of data on paper.

Population and Sample

Definitions. A complete collection of data on the group under study is called the population or the universe.

A member of the population is called a sampling unit. Therefore, the population consists of all its sampling units.

A sample is a collection of sampling units selected from the population.

Most often, we will work with numerical characteristics (like height, weight, and salary) of a group. So usually the population is a large collection of numbers and the sample is a small subset of the population.

Example. Suppose we are studying the daily rainfall in Lawrence. Since daily rainfall could be from 0 inches to anything above 0, the population here is all nonnegative numbers (i.e., the interval [0, ∞)). A sample from this population would be the observed amount of daily rainfall in Lawrence on some number of days. A sample of size 11 would be the observed daily rainfall in Lawrence on 11 days.

Variables

A variable is something that varies or changes value. Most often, we consider numerical variables. Numerical variables are also called quantitative variables. Examples of quantitative variables include height, length, weight, number of typos in books, number of credit hours completed by students, number of accidents (or number of anything), and time. Non-numerical variables are also considered. They are called qualitative variables. Examples of qualitative variables include blood group and gender. In fact, any genetic property (genotype or phenotype) is a qualitative variable, because they vary from human to human (or tree to tree).

In Chapter 4, we will have an elaborate discussion of a specific type of variable called a random variable, which will be more relevant for our purpose.

Parameters and Statistics

Definition 1. Given a set of data, any numerical value computed from the data using a formula or a rule is called a quantitative measure of the data.

Definition 2. A quantitative measure of a population data is called a parameter. In other words, parameters belong to the whole population and are computed (if feasible) from the WHOLE population data. Examples: the average GPA of all KU students, the height of the tallest student in KU, the average income of the entire KU student population.

One way to study a population is to know some of the parameters of the population. Unfortunately, computing such parameters could be expensive or even impossible. Essentially, parameters are unknown and the main game of statistics is to try to estimate parameters on the basis of small samples collected from the population.

Definition 3. A quantitative measure of sample data is called a statistic. Any constant that we compute from a sample is a statistic. We use these statistics to estimate the parameters of the population. For example, the average height computed from a sample is a reasonable estimate for the (parameter) average height of the KU student population. Obviously, we do not expect the value of the statistic to be exactly equal to the parameter value. Hopefully, the error will be small, or will exceed our tolerable limit only rarely (say, once in 100 trials).

Why do we need a statistic?

Sometimes it will be impossible to know the actual value of a parameter. For example, let μ be the mean lifetime of the light bulbs produced by a company. The company cannot test all the bulbs it produces to find the mean lifetime. So the best we can do is test a few bulbs (the sample), compute the sample mean lifetime of these bulbs (a statistic), and use it as an estimate of the mean lifetime (the parameter μ) of all the bulbs the company produces.
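The light bulb scenario can be sketched as a small simulation. In the sketch below, all numbers are made up for illustration: we pretend to know the whole population of bulb lifetimes (which a real company never would), so that we can compare the sample mean against the true parameter μ.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: lifetimes (in hours) of 10,000 bulbs,
# roughly 1000 hours on average. In reality this is unknowable.
population = [random.gauss(1000, 50) for _ in range(10_000)]
mu = sum(population) / len(population)  # the parameter (normally unknown)

# All we can afford: test a small sample of 30 bulbs.
sample = random.sample(population, 30)
x_bar = sum(sample) / len(sample)       # the statistic: sample mean

print(f"population mean mu  = {mu:.1f}")
print(f"sample mean x_bar   = {x_bar:.1f}")
```

Run it a few times with different seeds: the sample mean is not exactly μ, but it tends to land close, which is exactly the sense in which a statistic estimates a parameter.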

Frequency Distribution

In this section we talk about the representation of data in tabular form. Such a representation is called a frequency distribution. We are mostly concerned with numerical data (i.e., quantitative data), but we also consider some non-numerical data (i.e., qualitative data).

Example. The following is data on the blood group of 72 patients in a hospital:

O A O B O A O A A O A O B A O A A O
O A O A B B O O AB O A O O A O A O A
A A A O A A O AB B O A O B A A A O O
O A A B A O O O O A A B O O O A A A

We have four blood groups, namely O, A, B, AB. Each of these blood groups may be referred to as a "class." The frequency of a class is defined as the number of data members that belong to that class. For example, the frequency of the class O is 31, and the frequency of the class A is 31. A table that lists the classes and the corresponding frequencies is called the frequency distribution of this qualitative data. The following is the frequency distribution of this data:

Blood Group Frequency
O 31
A 31
B 8
AB 2
Total 72
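The tally above can be reproduced in a few lines of Python; the string below is the blood-group data from the example, and `Counter` does the counting.

```python
from collections import Counter

# The 72 blood groups from the example, row by row.
blood = ("O A O B O A O A A O A O B A O A A O "
         "O A O A B B O O AB O A O O A O A O A "
         "A A A O A A O AB B O A O B A A A O O "
         "O A A B A O O O O A A B O O O A A A").split()

freq = Counter(blood)           # class -> frequency
for group in ("O", "A", "B", "AB"):
    print(group, freq[group])
print("Total", sum(freq.values()))
```

This prints the same frequency distribution as the table: O 31, A 31, B 8, AB 2, total 72.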


Ungrouped Data

For quantitative data, we consider two types of frequency tables. When we are working with a large set of data, we group the data into a few classes and construct a "frequency table," which we will discuss later.

If the data set contains only a few distinct values, then the data need not be grouped. We make a list of all the data-values present and give the corresponding frequency for each data-value in a table. The number of times a data-value appears in the data set is called the frequency of that data member. A list that presents the data members and the corresponding frequencies in tabular form is called a frequency table or frequency distribution. The relative frequency and percentage frequency of a data member x are defined as follows:


relative frequency of x = (frequency of x) / (total # of data points)

and

percentage frequency of x = (frequency of x) / (total # of data points) · 100.
The frequency table may also contain the relative and percentage frequency. Since the data was not grouped, this would be called the frequency distribution of the ungrouped data.
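The two formulas translate directly into Python. The sketch below uses helper names of my own choosing (they are not standard terminology); the numbers come from the blood-group example above, where class O has frequency 31 out of 72 data points.

```python
def relative_frequency(freq, n):
    # relative frequency = frequency / total number of data points
    return freq / n

def percentage_frequency(freq, n):
    # percentage frequency = relative frequency * 100
    return 100 * freq / n

# Class O in the blood-group example: frequency 31 among 72 patients.
print(round(relative_frequency(31, 72), 4))    # 0.4306
print(round(percentage_frequency(31, 72), 2))  # 43.06
```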

Example 1.1.1 To estimate the mean time taken to complete a three-mile drive by a race car, the race car did several time trials, and the following sample of times taken (in seconds) to complete the laps was collected:

50 48 49 46 54 53 52 51 47 56 52 51
51 53 50 49 48 54 53 51 52 54 54 53
55 48 51 50 52 49 51 53 55 54 50  

Note that there are 35 observations here. So we say that the size of the sample (or data) is 35. Also the values present are 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56. Since there are only 11 distinct values present we can make a frequency table for the ungrouped data. The following is the frequency distribution of this ungrouped data:

Time (in seconds)  Frequency  Relative Frequency  Percentage Frequency
46 1 1/35 2.86
47 1 1/35 2.86
48 3 3/35 8.57
49 3 3/35 8.57
50 4 4/35 11.43
51 6 6/35 17.14
52 4 4/35 11.43
53 5 5/35 14.29
54 5 5/35 14.29
55 2 2/35 5.71
56 1 1/35 2.86
Total 35 1 100
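The whole table can be checked in Python: the list below is the lap-time data of Example 1.1.1, and `Counter` produces the frequency of each distinct value.

```python
from collections import Counter

# The 35 lap times (in seconds) from Example 1.1.1.
times = [50, 48, 49, 46, 54, 53, 52, 51, 47, 56, 52, 51,
         51, 53, 50, 49, 48, 54, 53, 51, 52, 54, 54, 53,
         55, 48, 51, 50, 52, 49, 51, 53, 55, 54, 50]

n = len(times)       # sample size: 35
c = Counter(times)   # value -> frequency

# One row per distinct value: frequency, relative and percentage frequency.
for value in sorted(c):
    f = c[value]
    print(f"{value}  {f}  {f}/{n}  {100 * f / n:.2f}")
```

For instance, the row for 51 comes out as `51  6  6/35  17.14`, matching the table.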

Grouped Data

When we are working with a large set of data that contains too many distinct values, then we group the whole set of data into a few class intervals and give the corresponding "frequency" of the class. When the data is presented in this way, the data is called grouped data. The number of data members that fall in a class interval is called the class frequency and the relative and percentage frequencies are computed by the same formula as above. A list that gives various class intervals and the corresponding class frequencies in a tabular form is called a class frequency table or class frequency distribution of the data. The frequency distribution may also include the relative and percentage frequencies.

Grouped Data and Loss of Information

Note that when we construct a frequency table of ungrouped data, there is no loss of information: the original data can be reconstructed from the table. The only loss is the order in which the original data appeared.

Sometimes it is necessary to group data into class intervals to construct a frequency distribution. This would be the case when there are too many distinct data-values present in the data—too many even to fit into a table on a regular-sized page for presentation. In such situations, we group the data into a few class intervals. While a class frequency distribution is very good for presentation and may be convenient for other reasons, we lose a lot of information in the process. There is no way to recover the original data from a class frequency distribution.

Steps to Construct Frequency Distribution

To construct a class frequency table of a data set, the first question would be, how many class intervals should we have? The answer is that there should be neither too few nor too many. The fewer the class intervals, the greater the loss of information; in the extreme case, if we use only one class interval, all the information is lost. On the other hand, if we take too many, we are back to the problems of working with ungrouped data. (In this course we will always tell you how many classes to take.) Although sometimes it may be necessary to take class intervals of varying width, in this course we only consider classes of equal class width. We follow the following steps to construct a class frequency distribution.

  1. Range: Pick a suitable number L less than or equal to the smallest value present in the data. Pick a suitable number H greater than or equal to the highest value present in the data. The range R that we consider is R = H - L.
  2. Number of Classes: Decide on a suitable number of classes. (In this course we will tell you the number of classes.)
  3. Class Width: We have

    class width = w = R / (number of classes)

    We will pick L, H, and the number of classes so that class width is a "round number."
  4. Classes: We divide our interval [L,H] into subintervals, to be called classes, as

    [L,L+w],[L+w,L+2w],[L+2w, L+3w], ...,[H-w,H]

  5. Frequency: Find the frequency for each of the classes. You can use an advanced calculator or some software (like Excel) to count frequencies.

    An ambiguous situation may arise, when a data-value falls on a class boundary. Depending on the nature of data or otherwise, we would have to follow a consistent convention whether to count such data members on the left or the right class interval.

A few more important definitions. The above intervals are called class intervals. The w above is called the class size or width. The lower end of the class is called lower limit and the upper end of the class is called upper limit. The class mark is the midpoint of the class, defined as follows:


class mark = (lower limit of class + upper limit of class) / 2.

A class limit is also called a class boundary.

Example 1.1.2 The following is the weight (in ounces), at birth, of a certain number of babies.

74 105 124 110 119 137 96 110 120 115 140
65 135 123 129 72 121 117 96 107 80 91
74 123 124 124 134 78 138 106 130 97 145
93 133 128 96 126 124 125 127 62 127 92
95 118 126 94 127 121 117 124 93 135 156
143 125 120 147 138 72 119 89 81 113 91
133 127 138 122 110 113 100 115 110 135 141
97 127 120 110 107 111 126 132 120 108 148
143 103 92 124 150 86 121 98 74 85 99

We will construct a class frequency table of this data by dividing the whole range of data into class intervals.

Solution: Note that the lowest value is 62 and the highest value is 156. We take L = 60, H = 160, so R = H − L = 100. We made this choice of L and H precisely so that R = 100 is a "nice" number. We decide to have 5 class intervals, so w = R/5 = 20. According to the steps above, our classes should be [60, 80], [80, 100], [100, 120], [120, 140], [140, 160]. But then there is a risk that some data members (like 80, 100, 120, 140) will fall in two classes. To avoid this we add .5 to all the class boundaries. So our classes are [60.5, 80.5], [80.5, 100.5], [100.5, 120.5], [120.5, 140.5], [140.5, 160.5].

So the frequency distribution is as follows:

Classes  Frequency  Relative Frequency  Percentage Frequency
60.5 - 80.5 9 9/99 9.09
80.5 - 100.5 20 20/99 20.20
100.5 - 120.5 25 25/99 25.25
120.5 - 140.5 37 37/99 37.37
140.5 - 160.5 8 8/99 8.08
Total 99 1 100
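The class frequencies above can be verified in Python; the list below is the birth-weight data of Example 1.1.2, tallied against the shifted class boundaries.

```python
# The 99 birth weights (in ounces) from Example 1.1.2, row by row.
weights = [74, 105, 124, 110, 119, 137, 96, 110, 120, 115, 140,
           65, 135, 123, 129, 72, 121, 117, 96, 107, 80, 91,
           74, 123, 124, 124, 134, 78, 138, 106, 130, 97, 145,
           93, 133, 128, 96, 126, 124, 125, 127, 62, 127, 92,
           95, 118, 126, 94, 127, 121, 117, 124, 93, 135, 156,
           143, 125, 120, 147, 138, 72, 119, 89, 81, 113, 91,
           133, 127, 138, 122, 110, 113, 100, 115, 110, 135, 141,
           97, 127, 120, 110, 107, 111, 126, 132, 120, 108, 148,
           143, 103, 92, 124, 150, 86, 121, 98, 74, 85, 99]

boundaries = [60.5, 80.5, 100.5, 120.5, 140.5, 160.5]

# Count the data members in each class (lo, hi]; since the boundaries
# end in .5, no integer weight can fall on a boundary.
freqs = [sum(1 for x in weights if lo < x <= hi)
         for lo, hi in zip(boundaries, boundaries[1:])]
print(freqs)          # [9, 20, 25, 37, 8]
print(len(weights))   # 99
```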


Use of Calculators

Because we will be using calculators (TI-84) extensively in this course, let me explain how you enter data in the TI-84.

Use of Calculators (TI-84):
Enter Your Data:
  1. Press the button "stat."
  2. Select "Edit" in the Edit menu and enter.
  3. You will find 6 lists named L1, L2, L3, L4, L5, L6.
  4. Let's say you want to enter your data in L1. If L1 already has some data, clear it by pressing the stat button and selecting ClrList in the Edit menu. When ClrList appears, type L1 and hit enter. (To type "L1" on your TI-84, simply press 2nd and then 1.)
  5. Once L1 is cleared, you select Edit in the Edit menu and enter.
  6. Now type in your data; enter one by one.

It is not easy to construct a frequency table of a data set unless you are systematic. Traditionally, we used "tally marks" to count the frequency. Now you can use some software programs (e.g., Excel). Let me show you a method, using a calculator (TI-84).

  1. Press "stat."
  2. To input data, enter "edit."
  3. Enter your data (say in L1).
  4. Press "stat."
  5. Enter "sortA" L1.
  6. Press "stat" and then enter "edit." On L1 you will see that the data is sorted in an increasing order.
  7. Now you can count the frequencies.
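The sort-then-tally idea behind these TI-84 steps is the same in any language. Here is a Python sketch on a small made-up data set: after sorting, equal values sit next to each other, so counting frequencies is just counting runs.

```python
# A small hypothetical data set.
data = [156, 159, 162, 160, 156, 162, 159, 160, 156, 156]
data.sort()  # the analogue of the TI-84 SortA( step

# Walk the sorted list and count runs of equal values.
table = []
for v in data:
    if table and table[-1][0] == v:
        table[-1][1] += 1      # same value as the previous one
    else:
        table.append([v, 1])   # a new distinct value begins
print(table)  # [[156, 4], [159, 2], [160, 2], [162, 2]]
```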

Problems on 1.2: Frequency Distribution


Exercise 1.1.1
To estimate the mean time taken to complete a three-mile drive by a race car, the race car did several time trials, and the following sample of times taken (in seconds) to complete the laps was collected:

50 48 49 46 54 53 52 51 47 56 52 51
51 53 50 49 48 54 53 51 52 54 54 53
55 48 51 50 52 49 51 53 55 54 50  

The following is the frequency distribution of this ungrouped data:

Time (in seconds)  Frequency  Relative Frequency  Percentage Frequency
46 1 1/35 2.86
47 1 1/35 2.86
48 3 3/35 8.57
49 3 3/35 8.57
50 4 4/35 11.43
51 6 6/35 17.14
52 4 4/35 11.43
53 5 5/35 14.29
54 5 5/35 14.29
55 2 2/35 5.71
56 1 1/35 2.86
Total 35 1 100

Exercise 1.1.2. The following is the weight (in ounces), at birth, of 96 babies born in Lawrence Memorial Hospital in May 2000.

94 105 124 110 119 137 96 110 120 115 119
104 135 123 129 72 121 117 96 107 80 80
96 123 124 124 134 78 138 106 130 97 134
111 133 128 96 126 124 125 127 62 127 96
116 118 126 94 127 121 117 124 93 135 112
120 125 120 147 138 72 119 89 81 113 100
109 127 138 122 110 113 100 115 110 135 120
97 127 120 110 107 111 126 132 120 108 148
133 103 92 124 150 86 121 98

Construct a class frequency table of this data by dividing the whole range of data into class intervals:

[60.5-70.5], [70.5-80.5], [80.5-90.5], [90.5-100.5], [100.5-110.5], [110.5-120.5], [120.5-130.5], [130.5-140.5], [140.5-150.5]

Exercise 1.1.3. The following are the lengths (in inches), at birth, of 96 babies born in Lawrence Memorial Hospital in May 2000.

18 18.5 19 18.5 19 21 18 19 20 20.5
19 19 21.5 19.5 20 17 20 20 19 20.5
18 18.5 20 19.5 20.75 20 21 18 20.5 20
21 19 20.5 19 20 19.5 17.75 20 19.5 20
20.5 17 21 18.5 20 20 20 18.5 19.5 19
18 20.5 18 20 19 19 19.5 20 20.75 21
17.75 19 18 19 20 18.5 20 19 21 19
19.5 20 20 19 19.5 20 19.5 18.5 20.5 19.5
20.25 20 19.5 19.5 20 20 20 21 20 19
18.5 20.5 21.5 18 19.5 18

Construct a frequency table for this data by dividing the whole range into class intervals:

[16-17], [17-18], [18-19], [19-20], [20-21], [21-22].

Note: If a data member falls on the boundary, count it in the right/upper class-interval.

Exercise 1.1.4. The following data represents the number of typos in a sample of 30 books published by some publisher.

156 159 162 160 156 162
159 160 156 156 160 162
156 159 162 156 162 158
160 158 159 162 158 158
162 160 159 162 162 160

Construct a frequency table (by sorting in your calculator).

Exercise 1.1.5. The following is data on the hourly wages (paid only in whole dollars) in an industry.

9 11 8 9 10 11 7 10 12 13
7 11 8 11 14 9 10 9 11 7
13 13 14 12 9 8 12 14 15 9
9 7 12 7 12 7 7 11 13 9
11 9 9 9 10 14 11 12 14 7

Construct a frequency table (by sorting in your calculator).

Exercise 1.1.6. The following is data on the hourly wages (paid only in whole dollars) of 99 employees in an industry.

7 11 7 11 10 9 10 10 12 13
7 8 11 11 14 9 7 9 11 7
9 13 12 14 7 8 7 14 15 9
9 7 11 9 12 9 12 11 14 9
12 13 7 9 10 14 11 12 13 7
15 15 16 16 15 16 11 7 18 19
15 16 15 15 16 16 17 16 16 13
15 15 16 15 16 15 15 17 16 12
16 15 15 16 15 15 19 8 16 17
16 16 15 16 16 16 13 12 8  

Construct a frequency table (by sorting in your calculator).



Pictorial Representation of Data

Another way to represent data is to use pictures and graphs. Such pictorial representations are commonly used in newspapers and other media outlets. Pictorial representation is particularly helpful when you have to represent data to people with limited technical background, like newspaper readers or a governmental or congressional body.


The Pie Chart

The pie chart is a commonly used pictorial representation of data. When you do your tax return every year, you find a few pie charts in the instruction book for Form 1040. These charts show what proportion/percentage of each tax dollar goes to particular expenses. The following pie charts are reproduced from the 1999 instruction book for Form 1040.

[Figures: two pie charts, showing how tax dollar outlays and tax dollar income are distributed]

Pie charts are self-explanatory; we will not discuss them any further.

The Histogram

Among pictorial representations, the most useful in this course is the histogram. The histogram of data is a graphical representation of the frequency distribution of the data: we plot the variable on the horizontal axis and, above each class interval, erect a bar of height equal to the frequency of the class. Such a histogram is called a frequency histogram.

If, instead, we erect bars of height equal to the relative frequency, then the graph is called a relative frequency histogram. Similarly, we can construct a percentage frequency histogram.

The following is a histogram.

[Figure: a histogram]

We have decided to avoid unequal class lengths, which makes our discussion of the histogram fairly simple.

Remark. Take a look at the Stem and Leaf Diagram discussed in any textbook.

Example 1.1.3. The following is a frequency table of data on the height (in inches) of some babies at birth. Sketch the histogram of this data:

Height Frequency
16-17 3
17-18 8
18-19 34
19-20 60
20-21 72
21-22 18
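Without graphics software, the shape of this histogram can be sketched in a terminal with one row of `*` characters per class — a rough character-based sketch, not a real plot.

```python
# The frequency table from Example 1.1.3.
table = [("16-17", 3), ("17-18", 8), ("18-19", 34),
         ("19-20", 60), ("20-21", 72), ("21-22", 18)]

# One '*' per 4 babies keeps the bars a reasonable width on screen.
for interval, freq in table:
    print(f"{interval}  {'*' * (freq // 4):<18}  {freq}")
```

The tallest bar (class 20-21, frequency 72) immediately stands out, which is the whole point of a histogram.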

The Cumulative Frequency Distributions

For a given value x of a variable, the cumulative frequency of the data, for x, is the number of data members that are less than or equal to x.

Definition. Given a frequency distribution of some data, for a class boundary x, the cumulative frequency is the sum of the frequencies of all classes with upper boundary less than or equal to x. The cumulative frequency distribution is a table that gives the cumulative frequencies against some x values (for us, the class boundaries). We also define cumulative relative frequency and cumulative percentage frequency as follows:



cumulative relative frequency of x = (cumulative frequency of x) / (total # of data points)

cumulative percentage frequency of x = (cumulative frequency of x) / (total # of data points) × 100.

Example 1.1.4 Once again we consider the data on birth weight of babies in Example 1.1.2 that we discussed in the last section. A cumulative frequency distribution can be constructed from the frequency distribution.

Solution: We have seen the frequency distribution before. The following is the cumulative distribution:

Weight  Cumulative Frequency  Cumulative Relative Frequency  Cumulative Percentage Frequency
60.5 0 0 0
80.5 9 9/99 9.09
100.5 29 29/99 29.29
120.5 54 54/99 54.55
140.5 91 91/99 91.92
160.5 99 1 100
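The cumulative table is just a running sum over the class frequencies. A Python sketch, using the class frequencies of Example 1.1.2:

```python
# Class frequencies and boundaries from Example 1.1.2.
freqs = [9, 20, 25, 37, 8]
boundaries = [60.5, 80.5, 100.5, 120.5, 140.5, 160.5]
n = sum(freqs)  # 99 data points

# Running (cumulative) sum: nothing lies at or below the first boundary.
rows, cum = [(boundaries[0], 0)], 0
for b, f in zip(boundaries[1:], freqs):
    cum += f
    rows.append((b, cum))

for b, c in rows:
    print(f"{b}  {c}  {c}/{n}  {100 * c / n:.2f}")
```

The last row is (160.5, 99): by the top boundary, all 99 data members have been accumulated, so the cumulative relative frequency reaches 1.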


The Ogive

Definition. The ogive is a line graph, where we plot the variable on the horizontal axis and the cumulative frequency on the vertical axis. If we plot the cumulative relative frequency on the vertical axis, then the line graph is called the relative frequency ogive.

[Figure: a relative frequency ogive]