1. The U.S. National Center for Health Statistics compiles data on the length of stay by patients in short-term hospitals and publishes its findings in Vital and Health Statistics. A random sample of nine patients yielded the following data on length of stay (in days). Use this data to answer the following questions:
Length of Stay
a. The mean length of stay in the hospital was 14 days. The median length of stay in the hospital was 9 days. Why is the mean so much larger than the median for this example?
There is an outlier (55), which inflates the value of the mean.
The median is more resistant to outliers, so it stays close to the typical length of stay.
b. Would you use the standard deviation or the 1st and 3rd quartiles as the measure of spread for this example? Why?
You would use the 1st and 3rd quartiles. The standard deviation is strongly affected by outliers, but the quartiles are resistant to outliers.
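The answers to parts a and b can be checked with a small sketch. The two samples below are hypothetical illustration values, not the nine observations from the study; they differ only in the largest value, so any change in a summary statistic is due to the outlier alone.

```python
import statistics

# Hypothetical lengths of stay (days); NOT the data from the question.
typical = [2, 3, 4, 5, 6, 7, 8, 9, 10]
with_outlier = [2, 3, 4, 5, 6, 7, 8, 9, 40]  # same sample, largest value replaced


def summarize(data):
    """Return the center and spread measures discussed in parts a and b."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # 1st, 2nd, 3rd quartiles
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),
        "q1": q1,
        "q3": q3,
    }


summary_typical = summarize(typical)
summary_outlier = summarize(with_outlier)

for name, summary in [("typical", summary_typical), ("with outlier", summary_outlier)]:
    print(name, summary)
```

In this sketch the mean and standard deviation both jump when the outlier is introduced, while the median and the two quartiles do not move at all.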
[Figure: side-by-side boxplots of final grades (vertical axis from 20 to 90) for Teachers A, B, and C]
2. Three sections of the same basic math course are taught by three different teachers: A, B, and C. The final grades were recorded, and the graph above provides a comparison of the final grades from each of the three classes. Based on this graph, which teacher’s class would you prefer to take and why? Label any relevant information and use the graph to justify your answer.
The scores in Teacher B's class are less variable, and the median is higher than the 3rd quartile for either of the other two teachers. This suggests that Teacher B consistently gives higher scores than the other two teachers, so that section might be more desirable.
3. Based on the shape of the following graphs, indicate whether the correlation is positive, negative, or zero. If the correlation is either positive or negative, indicate whether it is strong or weak. If the correlation is zero, explain why.
The correlation is negative and fairly weak.
The correlation is approximately zero because the relationship between x and y is not linear.
The correlation is positive and fairly strong.
4. In an article in USA Today (December 28, 1984), sociologists
N. Glenn and B. A. Shelton are quoted as showing a strong link between
residential mobility (e.g., how often an individual moves) and divorce
rates. The following table shows the annual divorce rate (number
of divorces per 1000 population) and the mobility rate (percent of people
who moved within the last 5 years) for different regions of the United
States. We will use this data to try to predict the divorce rate
from the mobility rate.
Region                Mobility Rate (%)    Divorce Rate (per 1000)
New England                  41                    4.0
Middle Atlantic              37                    3.4
East North Central           44                    5.1
West North Central           46                    4.6
South Atlantic               47                    5.6
East South Central           44                    6.0
West South Central           50                    6.5
Mountain                     57                    7.6
Pacific                      56                    5.9
a. Based on the following scatterplot and the value of the correlation (r=0.85516) between divorce rate and mobility rate, how would you describe the relationship between these two variables (e.g., shape, strength, direction)?
It appears that there is a fairly strong, positive, linear relationship
between the 2 variables.
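As a check on the quoted correlation, the sketch below computes Pearson's r directly from the nine regional values in the table. The function name `pearson_r` is chosen for this illustration only; the result should agree with the reported r = 0.85516 up to rounding.

```python
import math

# Mobility rate (%) and divorce rate (per 1000 population), in table order.
mobility = [41, 37, 44, 46, 47, 44, 50, 57, 56]
divorce = [4.0, 3.4, 5.1, 4.6, 5.6, 6.0, 6.5, 7.6, 5.9]


def pearson_r(xs, ys):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(mobility, divorce)
print(f"r = {r:.5f}")
```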
b. For this data set, we have used regression to calculate the following:
a = -2.49, b = 0.17, R-squared = 0.7313
Write down the equation for the regression line used to predict divorce rate from mobility.
y = -2.49 + 0.17x
How do you interpret the slope for this particular data set?
For every 1 percentage point increase in the mobility rate, the predicted divorce rate increases by about 0.17 divorces per 1000 population.
What percent of the variation in divorce rate can be explained by changes in the mobility rate?
About 73% (R-squared = 0.7313).
Use the regression equation to predict the divorce rate for a region whose mobility rate is 40.
y = -2.49 + (0.17)(40) = -2.49 + 6.8 = 4.31
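The regression values in part b can be reproduced from the table with the usual least-squares formulas. This is a minimal sketch; the helper names `least_squares` and `predict` are assumptions made for the illustration. It also shows that the hand calculation above uses the rounded coefficients, so a prediction from the unrounded fit differs slightly.

```python
# Mobility rate (%) and divorce rate (per 1000 population), in table order.
mobility = [41, 37, 44, 46, 47, 44, 50, 57, 56]
divorce = [4.0, 3.4, 5.1, 4.6, 5.6, 6.0, 6.5, 7.6, 5.9]


def least_squares(xs, ys):
    """Return (intercept a, slope b, R-squared) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


def predict(x, intercept, slope):
    """Predicted divorce rate for mobility rate x."""
    return intercept + slope * x


a, b, r_squared = least_squares(mobility, divorce)
prediction_unrounded = predict(40, a, b)
prediction_rounded = predict(40, -2.49, 0.17)  # coefficients as given in part b

print(f"a = {a:.2f}, b = {b:.2f}, R-squared = {r_squared:.4f}")
print(f"prediction at 40 (unrounded fit): {prediction_unrounded:.2f}")
print(f"prediction at 40 (rounded fit):   {prediction_rounded:.2f}")
```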
c. Should we use this regression equation to predict the divorce rate for a region whose mobility rate is 60? Why or why not?
No. Prediction for a mobility rate of 60 would be extrapolation since 60 is outside of the range of x-values used to generate the regression line.
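The extrapolation argument amounts to a range check on the observed mobility rates. A minimal sketch, with `is_extrapolation` as a hypothetical helper name:

```python
# Mobility rates (%) used to fit the regression line, in table order.
mobility = [41, 37, 44, 46, 47, 44, 50, 57, 56]


def is_extrapolation(x, observed):
    """True if x lies outside the range of x-values used to fit the line."""
    return x < min(observed) or x > max(observed)


print(min(mobility), max(mobility))    # observed range of mobility rates
print(is_extrapolation(40, mobility))  # inside the range
print(is_extrapolation(60, mobility))  # outside the range
```

The observed mobility rates run from 37 to 57, so a prediction at 40 is interpolation while a prediction at 60 is not supported by the data.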
d. Based on the results of our regression, can we make the following statement? Why or why not?
“Increasing the mobility rate in a region CAUSES the divorce rate to increase.”
No. Association does not imply causation.
BONUS : Can you think of a second variable that might be affecting both mobility rate and divorce rate? Explain briefly.
5. Each of the following situations is a 2-factor factorial experimental design. For each case, identify the response variable and both factors, and state the number of levels for each factor and the total number of observations.
a. A study of the productivity of tomato plants compares five varieties of tomatoes and two types of fertilizer. Four plants of each variety are grown with each type of fertilizer. The yield in pounds of tomatoes is recorded for each plant.
response = yield
factor 1 = variety with 5 levels
factor 2 = type of fertilizer with 2 levels
total number of observations = 5 x 2 x 4 = 40
b. A marketing experiment compares six different types of packaging for a laundry detergent. A survey is conducted to determine the attractiveness of the packaging in six US cities. Each type of packaging is shown to 50 different consumers in each city, who rate the attractiveness of the product on a 1 to 10 scale.
response = attractiveness (on a scale of 1 to 10)
factor 1 = type of packaging with 6 levels
factor 2 = city with 6 levels
total number of observations = 6 x 6 x 50 = 1800
c. To compare the effectiveness of four different weight-loss programs, 10 men and 10 women are randomly assigned to each. At the end of the program, the weight loss for each of the participants is recorded.
response = weight loss
factor 1 = weight loss program with 4 levels
factor 2 = gender with 2 levels
total number of observations = 4 x 2 x 10 = 80
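The three totals follow the same rule for a balanced two-factor design: (levels of factor 1) x (levels of factor 2) x (observations per combination). The sketch below enumerates the factor-level combinations explicitly; the function name `total_observations` is an assumption made for this illustration.

```python
from itertools import product


def total_observations(levels_factor_1, levels_factor_2, replicates_per_cell):
    """Count observations in a balanced two-factor design by enumerating its cells."""
    cells = list(product(range(levels_factor_1), range(levels_factor_2)))
    return len(cells) * replicates_per_cell


tomato = total_observations(5, 2, 4)        # varieties x fertilizers x plants
packaging = total_observations(6, 6, 50)    # packagings x cities x consumers
weight_loss = total_observations(4, 2, 10)  # programs x genders x participants

print(tomato, packaging, weight_loss)  # prints: 40 1800 80
```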
In the last example above, gender was considered as one of the factors. Could you think of another way of handling the gender variable? If so, what is it, and what advantage might it have over the method described above?
Gender could be treated as a blocking variable, which would allow
us to control for differences between the genders.