Examining Relationships between Variables:
Correlation and Regression Analysis
Do you think some of the variables in your data set might be used to explain changes in other variables?
Dependent variable (also called the response variable): Measures an outcome of a study; the thing you’re interested in explaining.
Independent variable (also called the explanatory variable): Attempts to explain changes in the dependent variable.
Just because the response and explanatory variables are related doesn’t mean that there is a causal relationship.
To analyze the relationship between two variables:
1. Start with a graph illustrating the relationship between the variables.
2. Look for patterns in the data.
3. Calculate numerical descriptions of the data (correlation).
4. Briefly describe the overall pattern (regression line).
Plot the dependent variable (y) on the vertical axis. Plot the independent variable (x) on the horizontal axis. Each point on the plot represents one individual in the data set.
From the graph:
· Look for an overall pattern
· Look for the direction of the relationship (positive, negative)
· Look for the form of the relationship (linear, clusters, cyclic)
· Look for the strength of the relationship (tight, scattered)
Positive association: As the values of one variable increase, the values of the 2nd variable also increase; AND, as the values of one variable decrease, the values of the other variable also decrease.
Negative association: As the values of one variable decrease, the values of the 2nd variable increase; AND, as the values of one variable increase, the values of the other variable decrease.
You can use correlation to measure the strength and direction of the relationship between two variables.
Correlation: measures the STRENGTH and DIRECTION of a LINEAR relationship between two QUANTITATIVE variables.
Properties of the correlation (r):
· r > 0 => positive association
· r < 0 => negative association
· r is always between –1 and 1, inclusive
· r = 1 means that there is a perfect positive linear relationship between two variables.
· r = -1 means that there is a perfect negative linear relationship between two variables.
· r = 0 means that there is no linear relationship between two variables.
· The larger the value of |r| (i.e., the closer r is to 1 or –1), the stronger the linear relationship between the two variables.
· The correlation, r, only measures the strength of the LINEAR relationship between two variables. So, be sure to look at the data. The two variables may be strongly related, but not linearly related.
· The correlation is strongly affected by outliers.
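The properties above can be checked numerically. The following is a minimal sketch, assuming the usual formula for r (the sum of products of deviations divided by the square root of the product of the two sums of squared deviations). The function name pearson_r and the data are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation r of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, not taken from these notes
x = [-2, -1, 0, 1, 2]
perfect_line = [3 + 2 * v for v in x]  # points lie exactly on a rising line
parabola = [v ** 2 for v in x]         # y depends on x exactly, but not linearly

r_line = pearson_r(x, perfect_line)    # perfect positive linear relationship
r_curve = pearson_r(x, parabola)       # no LINEAR relationship at all
print(r_line, r_curve)
```

The second case shows why the data should be plotted first: y is completely determined by x, yet the correlation gives no hint of it because the relationship is not linear.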
Least Squares Regression
If the scatter plot and correlation indicate a linear relationship, then we would like to be able to draw a line to summarize the relationship between the variables.
Regression Line: summarizes the linear relationship between the response variable and an explanatory variable and can be used to predict a value of the response variable for any value of the explanatory variable (within an acceptable range).
We want to draw the line so that, taken over all the data points, it lies as close to the points as possible.
Least Squares Regression Line: minimizes the sum of squares of the vertical distances from the data points to the regression line.
Any straight line can be characterized by the following equation:
y = a + b x
a = intercept – the point where the line intersects the y-axis; the value of y when x=0.
b = slope – the change in y for a 1-unit increase in x.
b > 0 => positive association (increase x by 1, then y increases by b)
b < 0 => negative association (increase x by 1, then y decreases by b)
b = 0 => no association (knowing about x tells you nothing about y)
So, in order to characterize the best regression line for our data, we need to come up with estimates of the slope (b) and intercept (a). We derive the values of a and b using the method of least squares (see Lapin p. 91-92 for details and formulae).
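As a rough illustration of what the method of least squares produces, the sketch below uses the standard closed-form estimates for a single explanatory variable: b is the sum of products of deviations divided by the sum of squared deviations of x, and a is the mean of y minus b times the mean of x. The presentation in Lapin may differ in notation. The function name least_squares_fit and the data are hypothetical.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the sum of
    squared vertical distances from the points to the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through (mean_x, mean_y)
    return a, b

# Hypothetical data, not taken from these notes
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_fit(x, y)
print(f"y = {a:.2f} + {b:.2f} x")
```

For these made-up points the fitted line is y = 2.20 + 0.60x, so each 1-unit increase in x goes with a predicted increase of 0.6 in y.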
How do we determine the strength of the linear relationship between y and x? Is there any measure of how good the regression model is? That is, do we have any indication of how much predictive power or ability the model has?
R² is called the coefficient of determination. It tells you the proportion of the variation in the values of y that is explained by the regression of y on x. That is, it tells you what percent of the variability in y can be explained simply by knowing x.
When you report the results of a regression analysis, you should always give R² as a measure of how successful the regression was in explaining the response.
Properties of R²:
· R² is always between 0 and 1
· If R² = 1, then ALL of the variation in y can be explained by the linear relationship between y and x.
· If R² = 0, then NONE of the variation in y can be explained by the linear relationship between y and x.
· The closer R² is to 1, the stronger the linear relationship between y and x and the more confident we are in our regression model.
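One common way to compute the coefficient of determination is as 1 minus the ratio of the unexplained variation (the sum of squared differences between observed and predicted y) to the total variation in y. The sketch below assumes that definition and a least squares line with one explanatory variable; in that setting the result equals the square of the correlation r. The function name and data are hypothetical.

```python
def coefficient_of_determination(xs, ys):
    """Return (r2, r_squared): r2 computed from the explained variation,
    and the square of the correlation r, for comparison."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)  # total variation in y

    # Least squares line y = a + b*x
    b = sxy / sxx
    a = mean_y - b * mean_x

    # Variation in y left unexplained by the line
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    r2 = 1 - unexplained / syy
    r_squared = sxy ** 2 / (sxx * syy)
    return r2, r_squared

# Hypothetical data, not taken from these notes
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r2, r_squared = coefficient_of_determination(x, y)
print(round(r2, 3), round(r_squared, 3))
```

For these made-up points both values come out to 0.6, meaning that 60% of the variation in y is explained by the linear relationship with x.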
The regression line provides a good approximation to the shape of the data, but it doesn’t fit it exactly. There is some error in the predictions. For any point, we can measure that error. The error in the prediction is called the residual.
Residual = the difference between an observed value of y and the value predicted by the regression line
Residual = observed y – predicted y
You can use the residuals to help determine how well the regression line fits the data.
A residual plot plots the values of the x (explanatory) variable on the horizontal axis and the values of the residual on the vertical axis.
FACT: Residuals from a least squares regression always sum to zero.
If the regression line fits the data well, the residuals should be scattered randomly about 0, with no systematic pattern.
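The fact that least squares residuals sum to zero can be verified on a small example. The sketch below fits a line with the standard closed-form estimates and then computes observed y minus predicted y for each point. The function name and data are hypothetical.

```python
def least_squares_residuals(xs, ys):
    """Fit y = a + b*x by least squares and return the residuals
    (observed y minus predicted y), in the order of the input points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data, not taken from these notes
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

res = least_squares_residuals(x, y)
print([round(e, 2) for e in res])  # residual for each point
print(abs(sum(res)) < 1e-9)        # the sum is zero up to rounding error
```

Plotting these residuals against x gives the residual plot described above. Positive and negative residuals balance out, so the sum alone says nothing about fit; it is the pattern of the residuals that matters.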
Summary: Interpreting Regression and Correlation
· Always be aware of the limitations of these methods.
· They only describe linear relationships.
· Both correlation and regression can be strongly influenced by extreme values or outliers.
· Always plot the data before interpreting a regression or correlation analysis.
· Be careful of extrapolation.
Extrapolation: using the regression line for prediction outside of the range of values of x that were used to generate the line.
· Association does not imply causation. We often want to say that changes in x CAUSE changes in y. This may NOT be the case.
There may be other variables called lurking variables that are related to both y and x that are affecting the relationship between y and x.
· Age is a lurking variable in the relationship seen between height and math test scores.
· Turbulence does not CAUSE the fasten seat belt sign to come on.
· Fiber does not cause cholesterol levels to go down. Fiber usually replaces more fatty foods and the reduction in fat intake causes the cholesterol to go down.
In order to be able to prove causation, you need to do an experiment where you hold all other factors constant.
An economist is interested in the relationship between the disposable income of a family and the amount of money spent annually on food. For a preliminary study, the economist takes a random sample of eight middle-income families of the same size (father, mother, and two children). The goal is to be able to predict the amount spent on food based on the annual income. The data are listed below:
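[Data table: amount spent on food (in hundreds of dollars) and annual disposable income (in thousands of dollars) for the eight families; values not available.]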
1. Identify the response and explanatory variables.
2. Make a scatterplot illustrating the relationship between the amount spent on food and a family’s disposable income.
3. Based on this graph, describe the relationship between amount spent on food and income (e.g., shape, strength, direction). Are there any potential outliers?
The relationship between food and income appears to be basically linear. There is a positive association between the two variables since the amount spent on food increases as income increases. The relationship seems to be fairly strong. The family whose income is $24,000 appears to be a potential outlier.
4. The correlation between the amount spent on food and the family’s income is r = 0.74. What does this tell us about the relationship between food and income? Does it support our findings from the scatter plot?
This value tells us that there is a positive association between the two variables. It also tells us that the relationship is fairly strong since 0.74 is relatively close to 1. This statistic serves to confirm what we saw in the graph.
5. Is it appropriate to fit a linear regression model to these data?
Yes. There appears to be a fairly strong linear relationship between the two variables, so a linear regression model is appropriate for describing the relationship.
6. If we fit a linear regression model to the data, our estimate of the intercept is a = 12.86 and our estimate of the slope is b = 1.21. Write down the prediction equation for this model. Does the intercept make sense in this case (i.e., is there a logical explanation for the intercept)? What does the slope tell us about the relationship between the amount spent on food and a family’s income? How much money would we expect a family whose income was $35 thousand per year to spend on food?
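Prediction Equation: y = 12.86 + 1.21x, where y is the amount spent on food (in hundreds of dollars) and x is the family’s income (in thousands of dollars).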
The intercept makes sense. It tells us that, even if a family has no income, it would still need to spend about $1,286 per year on food. So, the family has to eat, even if it has no income.
The slope tells us that as a family’s income increases by $1,000 per year (one unit of x), the amount that they spend on food is predicted to increase by 1.21 hundred dollars, or $121, per year. Equivalently, each additional $1 of income is associated with about $0.12 more spent on food per year.
If a family’s income was $35 thousand per year, the regression model predicts that they would spend 55.21 hundred dollars, or $5,521, per year on food.
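The prediction above can be reproduced with a few lines of code. This is a minimal sketch that uses the fitted values a = 12.86 and b = 1.21 from the example and makes the units explicit (food in hundreds of dollars, income in thousands of dollars). The function name predicted_food_dollars is hypothetical.

```python
INTERCEPT = 12.86  # a: predicted food spending at x = 0, in hundreds of dollars
SLOPE = 1.21       # b: hundreds of dollars of food per thousand dollars of income

def predicted_food_dollars(income_thousands):
    """Predicted annual food spending in dollars for an income given
    in thousands of dollars, using y = 12.86 + 1.21x."""
    food_hundreds = INTERCEPT + SLOPE * income_thousands
    return food_hundreds * 100  # convert hundreds of dollars to dollars

prediction = predicted_food_dollars(35)
# Change in predicted food spending for one extra thousand dollars of income
per_thousand = predicted_food_dollars(36) - predicted_food_dollars(35)

print(round(prediction, 2))    # dollars per year for a $35 thousand income
print(round(per_thousand, 2))  # dollars of food per extra $1,000 of income
```

Keeping the unit conversion in one place avoids the easy mistake of reading the slope as dollars of food per dollar of income. The prediction is only trustworthy for incomes within the range observed in the sample.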