Examining Relationships between Variables: Correlation and Regression Analysis
Do you think some of the variables in your data set might be used to explain changes in other variables?
Dependent (response) variable: Measures an outcome of a study; the thing you're interested in explaining.
Independent (explanatory) variable: Attempts to explain changes in the response variable.
Just because the response and explanatory variables are related doesn't mean that there is a causal relationship.
To analyze the relationship between two or more variables:
1. Start with a graph illustrating the relationship between the variables.
2. Look for patterns in the data.
3. Calculate numerical descriptions of the data (correlation).
4. Briefly describe the overall pattern (regression line).
Scatterplots:
Plot the dependent variable (y) on the vertical axis. Plot the independent variable (x) on the horizontal axis. Each point on the plot represents one individual in the data set.
From the graph:
· Look for an overall pattern
· Look for the direction of the relationship (positive, negative)
· Look for the form of the relationship (linear, clusters, cyclic)
· Look for the strength of the relationship (tight, scattered)
Positive association: As the values of one variable increase, the values of the second variable also increase; AND, as the values of one variable decrease, the values of the other variable also decrease.
Negative association: As the values of one variable decrease, the values of the second variable increase; AND, as the values of one variable increase, the values of the other variable decrease.
You can use correlation to measure the strength and direction of the relationship between two variables.
Correlation: measures the STRENGTH and DIRECTION of a LINEAR relationship between two QUANTITATIVE variables.
Properties of the correlation (r):
· r > 0 => positive association
· r < 0 => negative association
· r is always between -1 and 1
· r = 1 means that there is a perfect positive linear relationship between two variables.
· r = -1 means that there is a perfect negative linear relationship between two variables.
· r = 0 means that there is no linear relationship between two variables.
· The larger the value of |r| (i.e., the closer r is to 1 or -1), the stronger the linear relationship between the two variables.
· The correlation, r, only measures the strength of the LINEAR relationship between two variables. So, be sure to look at the data. The two variables may be strongly related, but not linearly related.
· The correlation is strongly affected by outliers.
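The last two properties are easy to see numerically. The sketch below is illustrative only: the data are invented, and `pearson_r` is a hypothetical helper written out from the standard definition of r rather than taken from any textbook or library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
x = [-3, -2, -1, 0, 1, 2, 3]
y_line = [2 * v + 1 for v in x]   # points exactly on an increasing line
y_curve = [v ** 2 for v in x]     # points exactly on a curve, not a line

r_line = pearson_r(x, y_line)     # perfect positive linear relationship
r_curve = pearson_r(x, y_curve)   # strong relationship, but not linear

# Add a single extreme point to the straight-line data.
r_outlier = pearson_r(x + [10], y_line + [-20])

print(f"straight line:     r = {r_line:.2f}")
print(f"curve y = x^2:     r = {r_curve:.2f}")
print(f"line plus outlier: r = {r_outlier:.2f}")
```

In this made-up example the curve gives a correlation of zero even though y is completely determined by x, and one outlier is enough to turn a perfect positive correlation into a negative one. Both are reasons to look at the scatterplot before trusting r.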
Least Squares Regression
If the scatter plot and correlation indicate a linear relationship, then we would like to be able to draw a line to summarize the relationship between the variables.
Regression Line: summarizes the linear relationship between the response variable and an explanatory variable and can be used to predict a value of the response variable for any value of the explanatory variable (within an acceptable range).
We want to draw the line so that we minimize the distance from the line to the points.
Least Squares Regression Line: minimizes the sum of squares of the vertical distances from the data points to the regression line.
Any straight line can be characterized by the following equation:
y = a + b x
a = intercept: the point where the line intersects the y-axis; the value of y when x = 0.
b = slope: the change in y for a 1-unit increase in x.
b > 0 => positive association (increase x by 1, then y increases by b)
b < 0 => negative association (increase x by 1, then y decreases by |b|)
b = 0 => no linear association (knowing about x tells you nothing about y)
So, in order to characterize the best regression line for our data, we need to come up with estimates of the slope (b) and intercept (a). We derive the values of a and b using the method of least squares (see Lapin p. 91-92 for details and formulae).
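For reference, the standard least squares estimates can be written as follows. This is the usual textbook form, with x̄ and ȳ denoting the sample means of x and y; the notation is an assumption and may differ from the one used in Lapin.

```latex
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}},
\qquad
a = \bar{y} - b\,\bar{x}
```

One consequence of the second formula is that the least squares line always passes through the point (x̄, ȳ).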
How do we determine the strength of the linear relationship between y and x? Is there any measure of how good the regression model is? That is, do we have any indication of how much predictive power or ability the model has?
R^{2} is called the coefficient of determination. It tells you the proportion of the variation in the values of y that is explained by the regression of y on x. That is, it tells you what percent of the variability in y can be explained simply by knowing x.
When you report the results of a regression analysis, you should always give R^{2} as a measure of how successful the regression was in explaining the response.
Properties of R^{2}:
· R^{2} is always between 0 and 1
· If R^{2} = 1, then ALL of the variation in y can be explained by the linear relationship between y and x.
· If R^{2} = 0, then NONE of the variation in y can be explained by the linear relationship between y and x.
· The closer R^{2} is to 1, the stronger the linear relationship between y and x and the more confident we are in our regression model.
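One common way to write the coefficient of determination is shown below. The symbols are assumptions introduced here for illustration: ŷ_i = a + b x_i is the value predicted by the regression line for the i-th observation, and ȳ is the mean of the observed y values.

```latex
R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}
```

With a single explanatory variable, R^{2} is simply the square of the correlation r. For example, a correlation of r = 0.74 corresponds to R^{2} of about 0.55, meaning roughly 55% of the variation in y is explained by the linear relationship with x.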
The regression line provides a good approximation to the shape of the data, but it doesn't fit it exactly. There is some error in the predictions. For any point, we can measure that error. The error in the prediction is called the residual.
Residual = the difference between an observed value of y and the value predicted by the regression line
Residual = observed y - predicted y
You can use the residuals to help determine how well the regression line fits the data.
A residual plot plots the values of the x (explanatory) variable on the horizontal axis and the values of the residual on the vertical axis.
FACT: Residuals from a least squares regression always sum to zero.
If the regression line fits the data well, the residuals should be scattered randomly about 0, with no obvious pattern.
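The sum-to-zero fact can be checked on any data set. The sketch below uses invented data, and `fit_line` is a hypothetical helper that applies the standard least squares formulas; neither comes from the original notes.

```python
def fit_line(xs, ys):
    """Return least squares intercept a and slope b for y = a + b x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

a, b = fit_line(x, y)

# Residual = observed y - predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
residual_sum = sum(residuals)

for xi, res in zip(x, residuals):
    print(f"x = {xi}: residual = {res:+.3f}")
print(f"sum of residuals = {residual_sum:.10f}")
```

The printed residuals are the values that would go on the vertical axis of a residual plot. Their sum is zero up to floating-point rounding.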
Summary: Interpreting Regression and Correlation
· Always be aware of the limitations of these methods.
· They only describe linear relationships.
· Both correlation and regression can be strongly influenced by extreme values or outliers.
· Always plot the data before interpreting a regression or correlation analysis.
· Be careful of extrapolation. Extrapolation: using the regression line for prediction outside of the range of values of x that were used to generate the line.
· Association does not imply causation. We often want to say that changes in x CAUSE changes in y. This may NOT be the case.
There may be other variables, called lurking variables, that are related to both y and x and that affect the relationship between y and x.
Examples:
· Age is a lurking variable in the relationship seen between height and math test scores.
· Turbulence does not CAUSE the fasten seat belt sign to come on.
· Fiber does not cause cholesterol levels to go down. Fiber usually replaces more fatty foods and the reduction in fat intake causes the cholesterol to go down.
In order to be able to prove causation, you need to do an experiment where you hold all other factors constant.
Regression Example:
An economist is interested in the relationship between the disposable income of a family and the amount of money spent annually on food. For a preliminary study, the economist takes a random sample of eight middle-income families of the same size (father, mother, and two children). The goal is to be able to predict the amount spent on food based on the annual income. The data are listed below:
Income (in thousands of dollars) | Food (in hundreds of dollars)
30 | 55
36 | 60
27 | 42
20 | 40
16 | 37
24 | 26
19 | 39
25 | 43
1. Identify the response and explanatory variables.
2. Make a scatterplot illustrating the relationship between the amount spent on food and a family's disposable income.
3. Based on this graph, describe the relationship between amount spent on food and income (e.g., shape, strength, direction). Are there any potential outliers?
The relationship between food and income appears to be basically linear. There is a positive association between the two variables since the amount spent on food increases as income increases. The relationship seems to be fairly strong. The family whose income is $24,000 appears to be a potential outlier.
4. The correlation between the amount spent on food and the family's income is r = 0.74. What does this tell us about the relationship between food and income? Does it support our findings from the scatter plot?
This value tells us that there is a positive association between the two variables. It also tells us that the relationship is fairly strong since 0.74 is relatively close to 1. This statistic serves to confirm what we saw in the graph.
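As a check on the reported value, the correlation can be recomputed from the eight data pairs. This sketch assumes that the column containing 30, 36, 27, ... holds income (in thousands of dollars) and the column containing 55, 60, 42, ... holds food spending (in hundreds of dollars); that assignment is the one consistent with the reported r = 0.74 and with the potential outlier at an income of $24,000. The helper `pearson_r` and the variable names are illustrative, not part of the original example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Assumed column assignment (see the note above).
income = [30, 36, 27, 20, 16, 24, 19, 25]  # thousands of dollars
food = [55, 60, 42, 40, 37, 26, 39, 43]    # hundreds of dollars

r_all = pearson_r(income, food)
print(f"all eight families: r = {r_all:.2f}")  # r = 0.74

# Drop the potential outlier (income 24, food 26) to see its influence.
kept = [(x, y) for x, y in zip(income, food) if (x, y) != (24, 26)]
r_without = pearson_r([x for x, _ in kept], [y for _, y in kept])
print(f"without the potential outlier: r = {r_without:.2f}")
```

Under these assumptions, removing the one unusual family raises the correlation to roughly 0.93, which illustrates how strongly a single point can affect r in a sample of eight.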
5. Is it appropriate to fit a linear regression model to these data?
Yes. There appears to be a fairly strong linear relationship between the two variables, so a linear regression model is appropriate for describing the relationship.
6. If we fit a linear regression model to the data, our estimate of the intercept is a = 12.86 and our estimate of the slope is b = 1.21. Write down the prediction equation for this model. Does the intercept make sense in this case (i.e., is there a logical explanation for the intercept)? What does the slope tell us about the relationship between the amount spent on food and a family's income? How much money would we expect a family whose income was $35,000 per year to spend on food?
Prediction Equation: y = 12.86 + 1.21x, where y is the amount spent on food (in hundreds of dollars) and x is income (in thousands of dollars).
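The intercept makes sense. It tells us that, even if a family has no income, it would still need to spend about $1,286 per year on food. So, the family has to eat, even if it has no income. Keep in mind, though, that an income of $0 lies well outside the range of incomes in the sample, so this reading of the intercept is an extrapolation and should be treated with caution.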
The slope tells us that as a family's income increases by $1,000 per year (one unit of x), the amount that they spend on food is predicted to increase by 1.21 hundred dollars, or $121 per year. Equivalently, each additional $1 of income is associated with about $0.12 more spent on food per year.
If a family's income was $35,000 per year (x = 35), the regression model predicts y = 12.86 + 1.21(35) = 55.21 hundred dollars, so they would be expected to spend about $5,521 per year on food.
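The reported estimates can be reproduced from the data with the standard least squares formulas. As before, this is a sketch under an assumption: the column containing 30, 36, 27, ... is taken to be income (in thousands of dollars) and the column containing 55, 60, 42, ... to be food spending (in hundreds of dollars), which is the assignment that yields a = 12.86 and b = 1.21. Variable names are illustrative.

```python
# Assumed column assignment (see the note above).
income = [30, 36, 27, 20, 16, 24, 19, 25]  # x, thousands of dollars
food = [55, 60, 42, 40, 37, 26, 39, 43]    # y, hundreds of dollars

n = len(income)
mean_x = sum(income) / n
mean_y = sum(food) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, food))
sxx = sum((x - mean_x) ** 2 for x in income)
syy = sum((y - mean_y) ** 2 for y in food)

b = sxy / sxx            # slope: hundreds of food dollars per $1,000 of income
a = mean_y - b * mean_x  # intercept: hundreds of dollars

# Proportion of variation in food spending explained by income
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(income, food))
r_squared = 1 - sse / syy

# Prediction at an income of $35,000, with rounded and unrounded coefficients
pred_rounded = 12.86 + 1.21 * 35
pred_exact = a + b * 35

print(f"a = {a:.2f}, b = {b:.2f}")                # a = 12.86, b = 1.21
print(f"extra food spending per $1,000 of income: ${b * 100:.0f}")
print(f"R^2 = {r_squared:.2f}")
print(f"prediction (rounded coefficients): ${pred_rounded * 100:,.0f}")
print(f"prediction (unrounded coefficients): ${pred_exact * 100:,.0f}")
```

The two predictions differ by a few dollars only because the hand calculation uses coefficients rounded to two decimal places.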