
Chapter 3 Notes

Chapter 3 is your introduction to descriptive statistics. These are numbers that describe our data. Basically there are just two ideas in this chapter. They are measures of central tendency and measures of variability. We'll tackle each of these separately.

MEASURES OF CENTRAL TENDENCY

Measures of central tendency seek to describe the typical score in a distribution, that is, where the scores tend to cluster. There are three measures of central tendency used by statisticians.

Mean

The mean is the most commonly used measure of central tendency. Mean is simply the arithmetic average of scores.

The usual symbol for the mean is X̄ (an X with a line over the top), which we usually pronounce "bar X." When you write it, just make an X with a line over the top. This is a sample mean, the mean of a sample drawn from a population. Since we usually don't know the population mean (that's why we're studying a sample, to get an idea about the population), the sample mean is what we usually use.

There are times when we know the population mean; and it has a different symbol, μ. This silly looking thing is the Greek letter mu (pronounced "mew"). When you write it, think of it as a u with the tail on the wrong side.

Now, you've probably been computing means for years; when you're trying to figure out how your history grade looks, you may have added up all your test scores and divided by the number of tests to get your average. This is the procedure for computing a mean; and it is a very useful tool. There are different formulas for computing mean for different situations, but they all amount to adding up the scores and dividing by the number of scores (N). Let's start.

Raw Scores. Once more, let's look at the statistics test scores you've been beating to death ever since Chapter 1. Here they are again, in an ordered array. (Remember, this is scores arranged in order from highest to lowest.)

96   84   76   72
96   84   76   64
92   80   76   64
92   80   72   60
88   80   72   56
88   76   72   44

(Read down the columns.)

Well, you already know what to do to get the mean: add up all these numbers and divide by 24 (which is the number of scores here, or N). There is a formula for this process; it is as follows:

X̄ = ΣX / N

So what does this formula mean--especially that thing that sort of looks like an E? That thing, Σ, is sigma, another Greek letter, this time an upper case one. It means "add up everything that follows." So ΣX means "add up all the Xs (scores)." The rest of the formula tells you to divide by N--just what you already knew.

It's very handy to realize that mathematical formulas are not mysterious secret code. They're nothing more than sets of instructions for finding things out with numbers. Instead of trying to memorize steps for all the computations you'll perform during this course, learn to let the formulas tell you what to do and what they're for. If you do this, you'll have much less to memorize (and forget right after the test--if it lasts that long) and a whole lot more understanding of what's going on. Another word on formulas: don't even try to memorize these. The relevant ones will be provided to you with each test on a tear-off sheet you can use for reference throughout the test.

Now, if we use this formula to compute the mean of our test scores, we end up like this:

X̄ = ΣX / N = 1840 / 24 = 76.7

So the mean, or arithmetic average, of the scores is 76.7. This number represents the arithmetic balancing point of the array of scores--one kind of center. If you were to figure out the difference from the mean (called deviation) for each score in the array, then total these deviations, the total would be 0. So the sum of all the deviations of scores above the mean equals the sum of all the deviations of scores below it. That's the mathematical center of the array.
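By the way, if you'd like to let a computer check this arithmetic, here's a minimal Python sketch (nothing here but the 24 test scores typed in from the array above):

```python
# The 24 statistics test scores from the ordered array above.
scores = [96, 96, 92, 92, 88, 88, 84, 84, 80, 80, 80, 76,
          76, 76, 76, 72, 72, 72, 72, 64, 64, 60, 56, 44]

mean = sum(scores) / len(scores)       # sum of X divided by N
print(round(mean, 1))                  # 76.7

# The deviations from the mean really do total zero.
deviations = [x - mean for x in scores]
print(abs(round(sum(deviations), 6)))  # 0.0
```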

Ungrouped Frequency Distribution. Now if we look at the statistics test scores in a frequency distribution, it's clear we can't just follow our general method of adding up all the scores and dividing by N. Let's see why and what we can do about it. Here's the frequency distribution we set up in Chapter 2 for these scores.

X    f
96   2
92   2
88   2
84   2
80   3
76   4
72   4
64   2
60   1
56   1
44   1
     24

Now, if we add up all the scores in the X column, we get 812. Dividing that by 24, we get 33.8. Just eyeballing things, we know this can't be right: the average score just can't be lower than the lowest one! What went wrong? Well, we only added up 11 numbers, then divided by 24. That won't work. How about if we divide by 11, the number of scores we added up? Then we get 73.8--closer, but we know it isn't right according to the average we did on the original array from which this frequency distribution came. What's going wrong?

Look carefully at these scores and the frequency column. With what we just tried, we haven't allowed for the fact that some of these scores appear more than once in our distribution, and more often than others; we did allow for this in the original calculation of mean because we added them up one by one. We need a way to account for the differing frequency of each score.

One way to accomplish this is to do just what we did when we computed mean from the original array--add 96 twice, 92 twice, etc. But that takes a long time. An alternative way to accomplish the same thing with much less work is to note that 96 appears twice and multiply 96 by 2 (its frequency), note that 92 appears twice and multiply 92 by 2 (its frequency), and so on, then add all these multiplied numbers together. This gives the same outcome as adding each number in as many times as it occurs, but takes less time and trouble.

Look at the result:

X    f    fX
96   2    192
92   2    184
88   2    176
84   2    168
80   3    240
76   4    304
72   4    288
64   2    128
60   1    60
56   1    56
44   1    44
     24   1840

fX at the head of the third column simply tells us that these numbers are the result of multiplying each score (X) times its frequency (f). Now look at our total in this column; it's 1840, the same total we got when we just added up the scores in the array. Now we know we're on the right track. And, if you think about it, this makes sense because multiplying 96 by 2 before adding it into the total is just the same as adding in 96 twice. This method gives rise to a modification of our original formula for computing mean, intended especially for use with a frequency distribution:

X̄ = ΣfX / N

Now don't try to memorize when to use this formula and when to use the original one; it's a matter of letting the formula tell you what it's for. The original formula has no f, so can't be any use when working with a frequency distribution. This one, on the other hand, has f, so it wouldn't be much help with ordered arrays. The formula gives you a complete set of instructions for finding mean, this time from a frequency distribution. It tells you to multiply each score (X) by its frequency (f), then add up all the resulting numbers and divide the total by the number of cases (N).

Here are our computations for the mean of the frequency distribution:

X̄ = ΣfX / N = 1840 / 24 = 76.7
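Here's the same idea in Python, if you care to check it--each score is simply paired with its frequency, just as in the table:

```python
# Ungrouped frequency distribution as (score, frequency) pairs.
freq_dist = [(96, 2), (92, 2), (88, 2), (84, 2), (80, 3), (76, 4),
             (72, 4), (64, 2), (60, 1), (56, 1), (44, 1)]

n = sum(f for x, f in freq_dist)             # N = 24
mean = sum(f * x for x, f in freq_dist) / n  # sum of fX, divided by N
print(round(mean, 1))                        # 76.7
```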

Grouped Frequency Distribution. Now, let's take a look at the same scores in the grouped frequency distribution from Chapter 2. Here they are:

X (i=10)   f
90-99      4
80-89      7
70-79      8
60-69      3
50-59      1
40-49      1
           24

Once again, we're still trying to find a way to, in effect, add up all the scores and divide by N. We do have a new problem now; what on earth do we use for a score in each class interval? We can't add 90-99 plus 80-89; no one ever taught us the mathematical technique for doing that. That's because there isn't one. What we need is a single number to represent each class interval, so we can add these up instead of adding up 90-99 and 80-89, etc.

Do you remember in Chapter 2, we talked about something that would serve our purpose? Think--a number that represents the class interval. Right! Here's a use for the midpoint. If we find the midpoint for each class interval, we can treat these as representatives for each interval and use them to compute mean. Here they are:

X (i=10)   f   m
90-99      4   94.5
80-89      7   84.5
70-79      8   74.5
60-69      3   64.5
50-59      1   54.5
40-49      1   44.5
           24

And now, having the midpoints identified, we can proceed to operate just as we did for the ungrouped frequency distribution, accounting for the fact that each class interval (represented by its midpoint) occurs with a different frequency. So let's multiply midpoints by frequencies, add up the resulting numbers, and divide by N.

X (i=10)   f   m      fm
90-99      4   94.5   378
80-89      7   84.5   591.5
70-79      8   74.5   596
60-69      3   64.5   193.5
50-59      1   54.5   54.5
40-49      1   44.5   44.5
           24         1858.0

Note that this total isn't exactly the same as the totals we've been getting from this set of data. Using a midpoint carries the same assumption as other estimations from grouped data--that the cases are evenly distributed among the possible scores in each class interval--so we are now estimating the mean, not precisely computing it. Because the assumption may be (and usually is, at least a little) wrong, the answers we now get are not exactly right; they're estimates only. Estimates are the best we can do with grouped data--one of those trade-offs we've been talking about. Here we pay some precision for simplicity.

Now we can modify the formula for mean once more, this time to accommodate the use of a midpoint instead of actual scores in our computation by replacing X with m:

X̄ = Σfm / N

This formula tells us to do just what we've done: multiply midpoint by frequency for each interval, add up all the resulting numbers, then divide by N. Here are the computations for the grouped frequency distribution:

X̄ = Σfm / N = 1858 / 24 = 77.4

This estimated mean isn't too different from the one we computed using exact numbers; that was 76.7--so this is a pretty good estimate.
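And the grouped version in Python, for the curious--the intervals are typed in from the table, and the midpoints are computed along the way:

```python
# Grouped distribution as (lower limit, upper limit, frequency).
intervals = [(90, 99, 4), (80, 89, 7), (70, 79, 8),
             (60, 69, 3), (50, 59, 1), (40, 49, 1)]

n = sum(f for lo, hi, f in intervals)                      # N = 24
# Midpoint of each interval, e.g. (90 + 99) / 2 = 94.5.
total = sum(f * (lo + hi) / 2 for lo, hi, f in intervals)  # sum of fm
print(round(total / n, 1))                                 # 77.4 (an estimate)
```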

It is important to keep in mind that all three of these methods for computing mean are really just variations on the same theme; they all involve adding up the scores and dividing by N. We introduced modifications simply to make it easy to accommodate frequency distributions--both ungrouped and grouped. But the basic procedure never really changes. Now let's look at other ways of viewing central tendency.

Why use a mean? The mean has a number of advantages as a measure of central tendency. First, it is well known and well understood by many people. This means we don't have to spend too much time explaining what it is when we use it to describe our data. It is also relatively easy to compute. In addition, the mean is the basis for many other statistical procedures. Two of these are discussed later in this chapter; they are variance and standard deviation.

When is the mean not such a good idea? There is one big drawback to the mean as a measure of central tendency; that is its sensitivity to extreme values. By this we mean that just one or two scores that are far away from the others (called outliers) can easily pull the mean off-kilter, making it no longer such a great indicator of a typical case.

Let's look at an example. Suppose you're interested in incomes of families in a South Dakota town. You study a sample of 5. Of course, you probably realize a sample of 5 is a pretty small one, but I want you to be able to do this math quickly, so I don't want you to have a huge string of numbers to work with. Here are the incomes of the 5 families:

$1,658,000

$18,000

$16,000

$12,000

$11,000

Now, what would you say is a typical value here? Clearly, what seems typical is an income in the mid teens. So what is the mean of these incomes? Figure it out.

mean = $1,715,000 / 5 = $343,000

Wow! This mean isn't typical of anyone! It's way above--in fact, more than 20 times as much as--what you or I would have called typical for these families; and it's way too low for the other family. Here's the problem with this: if we were looking at this neighborhood to decide whether services for lower income families were needed, we could conclude, based on mean income, that no such services were needed. That's not a very accurate picture of the neighborhood at all. What caused this? It's the nature of means that a single outlier (the 1.6 million) knocks the whole thing out of whack.

Here's another example, using scores on a second-grade English test:

88

86

84

83

2

Here, a typical score looks like something in the 80s. Looking at these scores, I'm guessing overall the class really understood the lesson tested, although I'm pretty worried about that one student at the bottom. But it would be accurate to say that the class did pretty well as a whole. The mean score, however, is 68.6, in the D range--not so great. The outlier misleads us about how the class, as a whole, did.

How do we solve this problem built into the mean? By using a different measure of central tendency when we have outliers. We need a measure that isn't so influenced by them.

Median

Median is just such a measure. The median represents another kind of center in a set of data; it is the score above and below which lie equal numbers of scores. So to find the median, you need to identify the score which has exactly the same number of scores above it and below it.

In order to find the median, the first step must always be figuring out how many scores away from one end (top or bottom) of the data the center lies. Then you count up from the bottom (or down from the top) to find the center. I call this magic number the count-up number. You find this count-up number based on the total number of scores you have, N. Here's the formula:

count-up number = (N + 1) / 2

Here's how you find the count-up number for the statistics test scores we've been working with:

count-up number = (24 + 1) / 2 = 12.5

Raw Data. Once you have this count-up number, you simply count up from the bottom of the ordered array (or down from the top--gives you the same answer) and mark your stopping place. Here I've marked it with a check mark; note that it falls between two scores:

96

96

92

92

88

88

84

84

80

80

80

76

✓

76

76

76

72

72

72

72

64

64

60

56

44

Since these two scores are both 76, the median is also 76.

If the check mark were to fall between two different scores, say 58 and 60, then the median is the midpoint between the two. Remember how to find midpoint:

midpoint = (lower limit + upper limit) / 2

Since here our limits are the scores which surround the check mark, we use 58 and 60 in the formula:

midpoint = (58 + 60) / 2 = 118 / 2 = 59

So the median of this set of data is 59.

When you have an odd number of scores, then the count-up number will be a whole number. In that case, all you have to do is count up to the middle score; your check mark will land right on a score. That score is your median; you don't need to fool around with midpoints at all.
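Here's the count-up procedure in Python, if you want to see it spelled out--it handles both cases, the even N (midpoint of the two middle scores) and the odd N (a single middle score):

```python
# Median via the count-up number, (N + 1) / 2.
scores = sorted([96, 96, 92, 92, 88, 88, 84, 84, 80, 80, 80, 76,
                 76, 76, 76, 72, 72, 72, 72, 64, 64, 60, 56, 44])

n = len(scores)
count_up = (n + 1) / 2                 # 12.5 here, since N = 24
if count_up == int(count_up):          # odd N: lands right on a score
    median = scores[int(count_up) - 1]
else:                                  # even N: midpoint of the two middle scores
    lower = scores[int(count_up) - 1]  # 12th score from the bottom
    upper = scores[int(count_up)]      # 13th score from the bottom
    median = (lower + upper) / 2
print(median)                          # 76.0
```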

Ungrouped Frequency Distribution. The process is really no different with a frequency distribution; you simply count up through the frequency column, adding up the frequencies until you identify the center score. Here's how it is done.

X    f
96   2
92   2
88   2
84   2
80   3
76   4
72   4
64   2
60   1
56   1
44   1

You add frequencies from the bottom until you get to or past 12.5: 1+1+1+2+4=9, and 1+1+1+2+4+4=13. Since the frequencies of the bottom five score values total only 9, which is smaller than 12.5, the median lies at the sixth score value from the bottom; its frequency is the one that puts you "over the top." So we know the score on this line, 76, is the median.

If you count up the frequency distribution to a point exactly between two different scores, find the midpoint between them, just as you do for raw data.

Grouped Frequency Distributions. It should come as no surprise to you that finding the median in a grouped distribution is a little more complicated. Everything seems to be more complicated with a grouped distribution.

You start out just as you do for an ungrouped distribution; count up to the class interval that contains the median, just as you counted up to the score that is the median in an ungrouped distribution. 1+1+3=5 and 1+1+3+8=13. Adding in the frequency of the 70-79 interval takes us over 12.5. This is how we know the 70-79 interval contains the median; we call this interval the critical interval. It is marked with a check mark below:

 

X (i=10)   f   fc
90-99      4   24
80-89      7   20
70-79      8   13   ✓
60-69      3   5
50-59      1   2
40-49      1   1
           24

Once you've identified the critical interval, you once again need to apply a formula to identify just where within the class interval the median is estimated to be. The formula follows:

median = LRL + ((N/2 - fcb) / f) × i

All the usual stuff is true of using this formula; all the numbers you plug in here apply to the critical interval except fcb, which is the cumulative frequency of the intervals below the critical interval. Here we go. LRL=69.5; N=24; fcb=5; f=8; i=10, so:

median = 69.5 + ((24/2 - 5) / 8) × 10 = 69.5 + (7/8) × 10 = 69.5 + 8.75 = 78.25
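In Python, the interpolation looks like this--a sketch based on the formula as written above, with argument names mirroring the symbols:

```python
# Interpolated median for a grouped distribution.
# lrl = lower real limit of the critical interval, n = number of cases,
# fcb = cumulative frequency below the critical interval,
# f = frequency of the critical interval, i = interval width.
def grouped_median(lrl, n, fcb, f, i):
    return lrl + ((n / 2 - fcb) / f) * i

print(grouped_median(lrl=69.5, n=24, fcb=5, f=8, i=10))  # 78.25
```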

 

Why use Median? Well, the primary reason is that median isn't so sensitive to outliers. Find the median income in the examples used above:

 

$1,658,000
$18,000
$16,000   ✓
$12,000
$11,000

The median, $16,000, gives a much better idea of a typical income in the neighborhood. Try the same thing with the English test example above. You'll see that median is an excellent choice in circumstances where outliers can be expected to pull the mean away from center. This should explain why, when news reports are talking about incomes, they nearly always refer to the median income of a group, hardly ever the mean income. Especially with incomes, median is a fairer reflection of the sample studied.

Mode

The mode is the easiest of the measures of central tendency to find. It is the most frequent score. It is determined simply by inspecting the data to find the most frequently occurring score. In the statistics scores we've been looking at, scan the frequency column to find the highest number:

X    f
96   2
92   2
88   2
84   2
80   3
76   4
72   4
64   2
60   1
56   1
44   1
     24

The highest frequency seen is 4, so the mode will be the score which occurred 4 times. Since there are 2 such scores in this distribution, the distribution is said to be bimodal, and both scores are reported as modes. So the modes for the distribution are 72 and 76.

You should know that sometimes statisticians use the term mode loosely. When they graph a data distribution and see two or three definite frequency peaks, they'll often refer to all of them as modes, even if one is actually the highest (therefore the most frequent) and the other(s) lower (so not true modes). So if you see talk of a bimodal or trimodal distribution when this isn't strictly true, this is probably what's going on. It's good to know this, but relax about knowing what to do on a test: if I ask you to write the mode of a distribution, I'm referring to the true mode, the most frequent score(s), not these loose definitions. So only list two or three modes if two or three scores tie for the highest frequency.

When do we use Mode? Mode is especially useful when the distribution of scores reveals more than one mode. This situation may point out to us that our sample is unusual in some way that merits attention before we analyze data further. It also can cause both mean and median to be less accurate descriptions of central tendency for a sample. It's good to be aware of this when using mean and median.

We also use mode when working with nominal scale data. Here, since a frequency and percentage distribution are about as far as we can go in descriptive statistics, the modal category is a helpful descriptive statistic. Mean and median cannot be computed for nominal scale data.

Well, that does it for measures of central tendency. Remember that these are indicators of a typical score in a distribution, one around which the others cluster.

MEASURES OF VARIABILITY

Measures of variability do just the opposite of measures of central tendency. Instead of telling us where scores cluster, they give us an indication of the spread of scores out from a center point. Measures of variability tell us how much and how far the scores spread out from this typical score. There are several measures of variability used by statisticians.

Range

Range is one of these, and probably the least useful. Range is simple to compute; it is simply the distance from the highest score to the lowest score:

range = highest score - lowest score

So for our statistics test scores:

X    f
96   2
92   2
88   2
84   2
80   3
76   4
72   4
64   2
60   1
56   1
44   1
     24

the highest score is 96 and the lowest score is 44. Computing range is fairly straightforward:

range = 96 - 44 = 52

So the range is 52.

This looks like an ideal measurement of variability--easy to compute, clear what it means. There is a problem with range, however; like the mean, it is sensitive to outliers. In fact, it is entirely determined by outliers. I think your textbook authors said it best on page 75 when they tell us that the range describes a distribution "in terms of its two weirdest scores." Look at this example of test scores:

98

76

75

73

72

70

69

67

32

Now, clearly a typical score is in the low to mid-70s, and the range of almost everyone's scores is from 67 to 76--only 9 points. But when we compute range, look what we get:

range = 98 - 32 = 66

Makes it look like the class was all over the map with their grades. Not true at all; except for one very good score and one very low one, everyone else did about the same. See how the range does indeed describe a distribution in terms of its two weirdest scores?

Interquartile Range

Now we can eliminate almost all of this problem with a different kind of range. It is called the interquartile range. You may remember reading in Chapter 2 about quartiles. Quartiles divide a set of data into four groups, a top fourth, a top-middle fourth, a bottom-middle fourth, and a bottom fourth. Each of these is called a quartile, and they are identified by finding the 25th, 50th, and 75th percentiles. These points divide the group into equal quarters, or quartiles. Then the interquartile range is computed by finding the range between the third quartile (or 75th percentile) and the first quartile (or 25th percentile). This lops off the top fourth and the bottom fourth, and along with these, the outliers. Now we're defining range in terms of the middle half of the data set--a much more reliable indicator of variability.

Too bad almost no one uses interquartile range for anything. You need not know how to compute this.

Variance

Variance is very commonly used. There are a few reasons for this. One is that it is based on the mean, which is also very commonly used. Another is that it is the basis for the also common standard deviation. Another is that it is used in performing a bunch of other statistical procedures, many of which we'll run across later in the course. And probably the best reason is that it offers a very useful measure of the variability of scores in a data set.

The symbol you'll see most often for variance is the one for sample variance, s². Because we generally don't know the population variance, sample variance is the one most commonly used. However, when we do know the population variance, we designate it with the symbol σ². Here is another Greek letter, the lower case sigma. (Remember the summation sign, which is the upper case sigma? They don't look too much alike, do they? Sort of like some English letters.) When you write sigma, think of it as an o with a horizontal tail on top.

Variance is itself a mean. It is the average squared deviation of scores from the mean. What does this mean? Well, what we do is find out how far each score in a data set differs from the mean. You may remember that this is called deviation; it is designated in symbols as (X − X̄), score minus mean. Once we know the deviation of each score, we square each one of these deviations.

Now think about the definition of variance you just read. It is the average squared deviation of scores from the mean. This means that once you know the squared deviation from the mean for every score (This is what you just calculated.), you average these. How do we do this? Same old, same old--add them up and divide by N.

Raw Data. So, finding variance involves first finding the mean of a data set. Then you need to find the deviation from the mean for each score. This is easy; simply subtract the mean from each score. Then square each deviation. Then add up these numbers and divide by N. Here's a formula:

s² = Σ(X − X̄)² / N

That's just what the formula tells us to do. Once again, I encourage you to view formulas as sets of instructions. If you remember that you must always work from the parentheses outward, the steps are clear too. First would be finding deviation, then squaring it, then summing all of these squares, then last, dividing by N.

Now to return one last tired time to the raw data on the statistics test scores:

X
96
96
92
92
88
88
84
84
80
80
80
76
76
76
76
72
72
72
72
64
64
60
56
44

We already have a mean; it is 76.7. So we're ready to compute deviations from the mean for each score. Set up a column for these, and go to work filling them in as you compute them.

X    X − X̄
96   19.3
96   19.3
92   15.3
92   15.3
88   11.3
88   11.3
84   7.3
84   7.3
80   3.3
80   3.3
80   3.3
76   -0.7
76   -0.7
76   -0.7
76   -0.7
72   -4.7
72   -4.7
72   -4.7
72   -4.7
64   -12.7
64   -12.7
60   -16.7
56   -20.7
44   -32.7

And then we square these deviations; note that squaring removes all the negative signs because multiplying a negative number by a negative number (what you're doing when you square them) results in a positive number.

X    X − X̄    (X − X̄)²
96   19.3     372.5
96   19.3     372.5
92   15.3     234.1
92   15.3     234.1
88   11.3     127.7
88   11.3     127.7
84   7.3      53.3
84   7.3      53.3
80   3.3      10.9
80   3.3      10.9
80   3.3      10.9
76   -0.7     .5
76   -0.7     .5
76   -0.7     .5
76   -0.7     .5
72   -4.7     22.1
72   -4.7     22.1
72   -4.7     22.1
72   -4.7     22.1
64   -12.7    161.3
64   -12.7    161.3
60   -16.7    278.9
56   -20.7    428.5
44   -32.7    1069.3
              3797.6

I've also gone ahead and totaled the squared deviations column, since this is the number we need in our formula, the sum of squared deviations. This sum is called the sum of squares (SS). Now we can compute variance for these data:

s² = Σ(X − X̄)² / N = 3797.6 / 24 = 158.2
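Here's the whole computation in Python, if you'd like a check. (Note that working from the unrounded mean gives a sum of squares of 3797.3; the table's 3797.6 comes from rounding each deviation to one decimal place first.)

```python
scores = [96, 96, 92, 92, 88, 88, 84, 84, 80, 80, 80, 76,
          76, 76, 76, 72, 72, 72, 72, 64, 64, 60, 56, 44]

mean = sum(scores) / len(scores)
ss = sum((x - mean) ** 2 for x in scores)  # sum of squares (SS)
variance = ss / len(scores)                # average squared deviation
print(round(ss, 1), round(variance, 1))    # 3797.3 158.2
```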

Ungrouped Frequency Distribution. Here the task is basically the same, but here again we must account for the fact that some scores occur more than once. We could, as we just did, simply add the squared deviation for a score of 96 in twice, since 96 occurred twice; then add in the squared deviation for a score of 92 twice; etc. This is a great deal of work, though, as you know, having just done it.

A much better plan is to account for frequency before we add things up. To do this, our formula needs modification. The way we modify it should remind you of the way we altered the formula for mean when we needed to account for frequencies. Here it is:

s² = Σf(X − X̄)² / N

All we need to do differently is, after squaring the deviations, multiply each by the frequency for that score. Then we proceed as usual from there. Let's look at the steps for our ungrouped frequency distribution:

X    f    X − X̄    (X − X̄)²   f(X − X̄)²
96   2    19.3     372.5      745.0
92   2    15.3     234.1      468.2
88   2    11.3     127.7      255.4
84   2    7.3      53.3       106.6
80   3    3.3      10.9       32.7
76   4    -0.7     .5         2.0
72   4    -4.7     22.1       88.4
64   2    -12.7    161.3      322.6
60   1    -16.7    278.9      278.9
56   1    -20.7    428.5      428.5
44   1    -32.7    1069.3     1069.3
     24                       3797.6

The sum of squares is shown at the bottom of the last column, so we're ready to use the formula to find the variance:

s² = Σf(X − X̄)² / N = 3797.6 / 24 = 158.2

Since the sum of squares is the same as it was for the raw data, it should come as no surprise that the variance is the same too. We're doing something right.
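The frequency-weighted version in Python--same answer, less typing:

```python
freq_dist = [(96, 2), (92, 2), (88, 2), (84, 2), (80, 3), (76, 4),
             (72, 4), (64, 2), (60, 1), (56, 1), (44, 1)]

n = sum(f for x, f in freq_dist)
mean = sum(f * x for x, f in freq_dist) / n
ss = sum(f * (x - mean) ** 2 for x, f in freq_dist)  # f times squared deviation
print(round(ss / n, 1))                              # 158.2
```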

Grouped Frequency Distribution. For the grouped distribution, we must make the same modification in our procedure as we did when we computed mean for a grouped distribution. That is, we need a representative number for each class interval; and we'll get that number by finding a midpoint for each interval. Once we are using midpoint in place of a score, the procedure is identical to the one we just finished. Here's the modified formula:

s² = Σf(m − X̄)² / N

Here are the steps for the grouped distribution (You will want to remember that we are using a slightly different mean--77.4--because we estimated the mean from the grouped distribution, and it's a bit different here.):

X (i=10)   f   m      m − X̄    (m − X̄)²   f(m − X̄)²
90-99      4   94.5   17.1     292.4      1169.6
80-89      7   84.5   7.1      50.4       352.8
70-79      8   74.5   -2.9     8.4        67.2
60-69      3   64.5   -12.9    166.4      499.2
50-59      1   54.5   -22.9    524.4      524.4
40-49      1   44.5   -32.9    1082.4     1082.4
           24                             3695.6

And, using the formula:

s² = Σf(m − X̄)² / N = 3695.6 / 24 = 154.0

Sampling Error. If you remember back to Chapter 1, samples from a population are often studied rather than the entire population, because of the expense or difficulty of studying the whole thing. If samples are drawn properly, sample data may be used with some confidence to draw conclusions about the population from which they're drawn; but sampling causes some problems too. These problems are called sampling error. Sampling error does not come from mistakes in sampling or from doing things wrong; it is inaccuracy introduced by the very act of sampling, even when you do everything right. You'll be hearing more about sampling error throughout the course.

One place where sampling error introduces particular problems is with variance. That's because a population will always be more variable than a sample, simply because more cases are included in a population than in a sample. Let's look at an example. Here are the ages in months of a very small population of high school students:

228   211   197   187   177   169
226   206   195   182   175   166
223   201   193   180   173   163
218   200   191   178   172   162

Now, what are the odds that a sample of N=5 from this population will have as much variability as the population? Slim to none, right? If you only pick out 5 scores, chances are they won't show the spread out from center shown by the population. A sample of N=10 will certainly show much greater variability--closer to that of the population, but still too small. So sample variances are artificially small; and the smaller the sample, the bigger the problem.

Sample variances don't provide a good reflection of the population variance because samples can't be as variable as the population; and small samples are particularly problematic. Sampling itself causes this.

What can we do about this? Whatever fix we apply should account for the fact that small samples need more fixing than big ones. Really huge samples hardly need any fixing at all.

Estimating Population Variance. There is a fix available, and it meets our criterion that it should make a bigger difference to small samples than to large ones. It is called corrected variance; and corrected variance is used to estimate population variance. This means that estimating population variance IS correcting sample variance--two ways of saying the same thing. Look at the formula for corrected variance (symbol ŝ²), and see if you can figure out what the fix is:

ŝ² = Σ(X − X̄)² / (N − 1)

The only thing that is different from sample variance in this computation is that, once you have the sum of squares, you divide by N-1 instead of dividing by N. What does this accomplish? Well, dividing by a smaller number will make the resulting variance somewhat larger. Since sample variance tends to run too low, this seems like a step in the right direction.

In addition to that, correcting will increase the variance more when the N is small than when N is large. We said we wanted a fix that would make a bigger difference with small samples than large ones, didn't we? This is because small samples are more wrong than large ones. Look at this example. Suppose the sum of squares is 1000:

N = 10:   s² = 1000 / 10 = 100.0    ŝ² = 1000 / 9 = 111.1
N = 100:  s² = 1000 / 100 = 10.0    ŝ² = 1000 / 99 = 10.1
The effect of correcting variance from a small sample is pretty big--variance becomes more than 10% greater. But when sample size is large, the effect is quite small--variance becomes only about 1% greater. This looks like just the ticket; a way to estimate population variance by correcting for sampling error--and it corrects according to the size of the original sample.
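As it happens, Python's standard statistics module draws exactly this distinction: pvariance divides the sum of squares by N (our sample variance), while variance divides by N - 1 (our corrected variance):

```python
import statistics

scores = [96, 96, 92, 92, 88, 88, 84, 84, 80, 80, 80, 76,
          76, 76, 76, 72, 72, 72, 72, 64, 64, 60, 56, 44]

print(round(statistics.pvariance(scores), 1))  # SS / N     -> 158.2
print(round(statistics.variance(scores), 1))   # SS / (N-1) -> 165.1
```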

Our estimate of population variance from the raw data or ungrouped frequency distribution of those statistics test scores can be found by using the same sum of squares as before, only dividing it this time by N-1:

ŝ² = 3797.6 / 23 = 165.1

Working from the grouped frequency distribution, we do the same sort of thing, dividing the old sum of squares by N-1:

ŝ² = 3695.6 / 23 = 160.7

People tend to mix up variance and corrected variance. It is important to remember that the symbol for corrected variance, ŝ², wears a little hat, called a caret.

Standard Deviation. Standard deviation puts variability into easier to understand terms. We compute variance because we need it in some other statistical procedures you'll learn about later. But average squared deviation from the mean is not a very useful concept to think about. Standard deviation is more so. It is (roughly) the average deviation from the mean of all the scores. This tells us that, when the standard deviation is 4, scores differed from the mean by an average of 4 points. This is easier to wrap your brain around.

Standard deviation is derived directly from variance, so it's a pretty straightforward process, once you know how to find variance. Standard deviation is the square root of variance, and the symbols show that clearly. Sample standard deviation is s; and population standard deviation is σ. To find standard deviation, you must know variance. The formula follows:

s = √s²

Simple, isn't it? So, to find the standard deviation from raw data, from an ungrouped frequency distribution, or from a grouped frequency distribution, simply find the variance and take its square root.

When we worked from the raw data and from the ungrouped frequency distribution, we found variance to be the same for both. It was 158.2. To find standard deviation:

s = √158.2 = 12.6

Working from the grouped frequency distribution, we found a slightly different variance. (You should remember that this is because we can only estimate from a grouped distribution.) It was 154.0. To find standard deviation:

s = √154.0 = 12.4

Corrected Standard Deviation. Because standard deviation is derived from variance, it has the same problem with sampling error as variance. We fix the problem in the same way too. Simply compute corrected standard deviation by taking the square root of corrected variance:

ŝ = √ŝ² = √165.1 = 12.8

One thing that is important to know: when you have a standard deviation and wish to correct it, you cannot directly correct it. You must first convert it back to variance, then correct the variance, then find the corrected standard deviation. There are no shortcuts.
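Here's that round trip in Python--a sketch that starts from a standard deviation already rounded to 12.6, which is why it lands on 12.9 rather than the 12.8 you get from the unrounded variance:

```python
import math

def corrected_sd(s, n):
    """Square s back to variance, apply the N-1 correction, re-root."""
    variance = s ** 2
    corrected_variance = variance * n / (n - 1)  # same as SS / (N - 1)
    return math.sqrt(corrected_variance)

print(round(corrected_sd(12.6, 24), 1))  # 12.9
```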

SKEW

Skew, you may remember, is the amount of lean away from normal in a distribution. We can compute the degree and direction of skew for a distribution. The best method for doing this is the Pearson formula for skew:

skew = 3(X̄ − median) / s

Note that if the mean is larger than the median, then the result will be positive, indicating a lean to the left. If the mean is smaller than the median, then the result will be negative, indicating a lean to the right. Let's compute skew for our statistics grades:

From the raw data and ungrouped distribution: skew = 3(76.7 − 76) / 12.6 = 2.1 / 12.6 = 0.17

From the grouped distribution: skew = 3(77.4 − 78.25) / 12.4 = −2.55 / 12.4 = −0.21

Interestingly, the difference made by estimating mean and median for the grouped distribution changed the direction of skew. Fortunately, this amount of skew is pretty slight, indicating the distribution is pretty close to a normal one. Therefore, direction isn't such a big deal after all.
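One last Python sketch, plugging in the means, medians, and standard deviations computed just above (remember that the grouped median, 78.25, was itself an estimate):

```python
# Pearson's skew: 3 times (mean - median), divided by standard deviation.
def pearson_skew(mean, median, sd):
    return 3 * (mean - median) / sd

print(round(pearson_skew(76.7, 76, 12.6), 2))     # 0.17  (raw / ungrouped)
print(round(pearson_skew(77.4, 78.25, 12.4), 2))  # -0.21 (grouped estimate)
```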

CONCLUSION

You've finished Chapter 3. If you haven't gone through the Comprehension Checks in the chapter, I encourage you to do so. Once you've worked your way through these and checked your answers, try the Review Exercises at the end of the chapter. Remember, the answers for these are provided in your book too. This gives you many opportunities to practice and to check your understanding.

When you've finished all of the practice problems in the textbook, request a Chapter 3 worksheet via e-mail. It will be sent along with answers, so that you may check your work. When you feel ready, request the test.

 
