Homework 4
The SAS data set SASUSER.AIR
contains information regarding CO levels in the air. For this example, we will use a modified
version of that data set called NEWAIR.
The dependent (y) variable in this data set is CO – the carbon monoxide
level in the air. In this lab, we will
build a predictive model for CO. We have
4 potential predictors:
NO – the nitrogen oxide level
in the air
SO2 – the sulfur dioxide
level in the air
DUST – the amount of dust in
the air
WIND – the wind speed
1. Create plots of
the dependent variable versus each of the independent variables. Do the relationships appear to be
linear? What do you notice about the
relationship between CO and WIND? Based
on the SAS/Insight demo from lab, what polynomial term(s) should also be included
in the model?
The relationship with NO is highly linear. The relationships with DUST and
SO2 are slightly linear, but there is a lot of noise (i.e., scatter) in the
data. The relationship with WIND appears to be nonlinear. Based on the
SAS/Insight demo, it appears that a quadratic term in WIND (WIND squared)
should be included in the model.
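One way to produce these scatter plots outside of SAS/Insight is sketched below. It assumes that NEWAIR is available in the WORK library with variables named CO, NO, SO2, DUST, and WIND, and that SAS/GRAPH is installed; the lab demo itself used SAS/Insight, so treat this only as an alternative.

```sas
/* Assumption: WORK.NEWAIR exists with variables co, no, so2, dust, wind. */
symbol1 value=dot interpol=none;   /* plain scatter, no connecting line */

proc gplot data=newair;
  /* One plot per predictor, with CO on the vertical axis. */
  plot (co)*(no so2 dust wind);
run;
quit;
```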
2. Use PROC CORR to
examine the correlations between the variables.
Which variables are the most strongly related to CO? Is there strong correlation between any of
the independent variables?
NO has the strongest correlation with CO (r = 0.97). DUST also has a strong
correlation with CO (r = 0.74). The relationship between CO and SO2 is fairly
weak (r = 0.075). The correlation between CO and WIND may be misleading, since
the correlation coefficient measures only the strength of the linear
relationship between two variables, and we have observed that the relationship
between CO and WIND is nonlinear. There is also a strong correlation between
DUST and NO (r = 0.705), so we should be aware that multicollinearity may
exist in the data set.
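The correlation matrix discussed above can be requested with a call along these lines. The data set location (WORK.NEWAIR) is an assumption; adjust the DATA= option to wherever your copy is stored.

```sas
/* Assumption: WORK.NEWAIR exists with variables co, no, so2, dust, wind. */
proc corr data=newair;
  /* Pearson correlations among the response and all four predictors. */
  var co no so2 dust wind;
run;
```

The CO row (or column) of the output gives the correlation of each predictor with the response; the remaining entries show the correlations among the predictors themselves.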
3. Use PROC REG to
generate partial regression plots for each of the independent variables. Why is it necessary to create partial
regression plots? For this data set,
what do these plots tell us?
The scatter plots that we created in question 1 only show us the relationship
between CO and each independent variable while IGNORING THE EFFECT of any
other variables. However, our interpretation of the coefficients in the
multiple regression model requires us to look at the relationship between CO
and each independent variable IN THE PRESENCE OF (or, AFTER ADJUSTING FOR) the
other independent variables in the model. The partial regression plots allow
us to look at the relationship between CO and each independent variable after
adjusting for the effect of the other independent variables, so they give us a
better understanding of whether the relationships will still be linear when
all of the independent variables are in the model.
For this data set, the relationship between CO and NO still appears to be
strongly linear after adjusting for the effect of the other variables. The
relationships between CO and the other variables appear to be fairly weak (but
not nonlinear) in the presence of the other variables in the model.
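A minimal sketch of the PROC REG call that requests partial regression plots follows. The data set name is an assumption; the PARTIAL option on the MODEL statement is what produces the plots.

```sas
/* Assumption: WORK.NEWAIR exists with variables co, no, so2, dust, wind. */
proc reg data=newair;
  /* PARTIAL requests a partial regression leverage plot for each regressor. */
  model co = no so2 dust wind / partial;
run;
quit;
```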
4. We may also need
to include interaction terms in our multiple regression model
for CO. Using the output from the lab
demo, record the value of the slope coefficient for each dust category. Do this for the SO2 model and the NO
model. Is there evidence of potential
interaction between DUST and SO2?
Between DUST and NO? Explain.
DUST CATEGORY | Slope for SO2 | Slope for NO
1 | -0.035 | 0.699
2 | -0.145 | 0.618
3 | 0.876 | 0.631
4 | 4.151 | 0.763
5 | -0.690 | 0.658
If there is an interaction between 2
variables, x1 and x2, then the relationship between x1 and y changes for
different levels of the variable x2. In
this example, we have considered the relationship between CO and SO2 for
different levels of DUST. We can see
that the slope changes dramatically for the different levels. Therefore, it appears that there is evidence
of an interaction between SO2 and DUST.
For NO, there are also changes in the value of the slope for the different
levels of DUST. Although the changes are
less dramatic than for SO2, there is still evidence of a potential interaction
between DUST and NO. Therefore, we
should include the interaction terms SO2*DUST and NO*DUST in our model and test
to determine whether or not those interaction terms are significant in our
model.
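To see why differing slopes point to an interaction, it can help to write out a generic model with a single interaction term. The coefficients \beta_0, ..., \beta_3 below are generic symbols used for illustration and are not taken from the lab output.

```latex
\begin{align*}
E(y) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \\
     &= (\beta_0 + \beta_2 x_2) + (\beta_1 + \beta_3 x_2)\, x_1
\end{align*}
```

For a fixed value of x_2, the slope on x_1 is \beta_1 + \beta_3 x_2. That slope is the same at every level of x_2 only when \beta_3 = 0, which is why testing the interaction amounts to testing whether the coefficient on the product term is zero.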
5. In order to
include interaction and polynomial terms in your regression model, you will
need to create those terms using a data step.
PROC REG does not allow you to directly specify higher order terms in
the MODEL statement. Run the SAS code
provided in lab to create the higher order terms. Then, fit a regression model to the data
using the following MODEL statement in PROC REG:
MODEL co = so2 no dust wind so2no so2dust so2wind nodust nowind dustwind wind2 ;
Are there any terms which appear to be
insignificant in this regression model?
Which ones? Give evidence from
the SAS output to support your answer.
Consider using a 5% significance level. The p-values in the Pr > |t| column
then tell us which terms are insignificant (i.e., any term with a p-value
greater than 0.05 is considered insignificant). Based on this criterion,
SO2WIND (p-value = 0.2248), NODUST (p-value = 0.1066), and DUSTWIND
(p-value = 0.5023) are insignificant.
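The lab code is not reproduced in this key, but a data step along the following lines would create the higher-order terms named in the MODEL statement. The output data set name NEWAIR2 is a hypothetical choice, and we are assuming each term is the plain product (or square) of the variables its name suggests.

```sas
/* Assumption: WORK.NEWAIR exists; NEWAIR2 is a hypothetical output name. */
data newair2;
  set newair;
  /* Two-way interaction terms (products of the main effects). */
  so2no    = so2 * no;
  so2dust  = so2 * dust;
  so2wind  = so2 * wind;
  nodust   = no * dust;
  nowind   = no * wind;
  dustwind = dust * wind;
  /* Quadratic term for wind. */
  wind2    = wind ** 2;
run;

proc reg data=newair2;
  model co = so2 no dust wind so2no so2dust
             so2wind nodust nowind dustwind wind2;
run;
quit;
```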
6. Using the stepwise model selection technique, wind and wind2
do not enter into the model. So, we’ll
drop wind, wind2, and the interactions with wind from the model. Now consider the following model:
MODEL CO = SO2 NO DUST SO2NO SO2DUST NODUST ;
Run PROC REG for this model using SELECTION =
FORWARD, SELECTION = BACKWARD, and SELECTION = STEPWISE. Report the final model for each of these
selection methods. Are they the same?
For FORWARD selection, all independent variables (SO2, NO, DUST, SO2NO, SO2DUST, and NODUST) are included in the final model.
For BACKWARD selection, all independent variables (SO2, NO, DUST, SO2NO, SO2DUST, and NODUST) are included in the final model.
For STEPWISE selection, all independent variables (SO2, NO, DUST, SO2NO, SO2DUST, and NODUST) are included in the final model.
For this
data set, FORWARD, BACKWARD, and STEPWISE all give the same final model. However, note that this is not always the
case. That is, it is possible for the
methods to come up with different final models.
Now, try changing the SLENTRY and SLSTAY criteria to 0.01 for the STEPWISE
selection method. Does your final model change? Why?
Yes. SO2 is no longer included in the final model. Variables enter and leave
the model based on the SLENTRY and SLSTAY criteria. We have made the criteria
more stringent (i.e., to enter the model, a variable must have a p-value less
than 0.01, and to leave the model, a variable must have a p-value greater than
0.01).
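The four selection runs described in this question could be set up as sketched below. The data set name NEWAIR2 (a copy of NEWAIR with the product terms already added) and the MODEL labels are hypothetical; SELECTION=, SLENTRY=, and SLSTAY= are standard MODEL statement options in PROC REG.

```sas
/* Assumption: NEWAIR2 contains so2no, so2dust, and nodust from a data step. */
proc reg data=newair2;
  fwd:    model co = so2 no dust so2no so2dust nodust / selection=forward;
  bwd:    model co = so2 no dust so2no so2dust nodust / selection=backward;
  step:   model co = so2 no dust so2no so2dust nodust / selection=stepwise;
  /* Stricter entry and stay thresholds for the stepwise method. */
  strict: model co = so2 no dust so2no so2dust nodust
                / selection=stepwise slentry=0.01 slstay=0.01;
run;
quit;
```

Comparing the summary table at the end of each run shows which variables each method retained.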
7. Using the MODEL statement in question 6, run PROC REG again
and include the INFLUENCE, R, VIF, SS1, and SS2 options in the model
statement. Does it appear that there are
any outliers in the data set? Give
evidence from the output to support your answer. Do the variance inflation factors indicate
that there is evidence of multicollinearity? Verify that the sequential sums of squares
partition the model sums of squares into components associated with
sequentially adding each variable to the model.
To identify
potential outliers, compare the influence statistics (studentized
residual, hat value, DFFITS) to the cutoffs given in class. Generally, I would say that any observation
having 2 or more of these statistics outside of the acceptable range would be
considered a potential outlier.
If the
maximum VIF is greater than 10, then that suggests that there may be multicollinearity present in the model. In this case, the maximum VIF is 26.969
(which is greater than 10), so we do have evidence of multicollinearity.
This should not be surprising since we have already noticed that the
correlation between DUST and NO is 0.7.
The model
SS for this model is 267.193. If you add up the Type I SS (excluding the SS for the intercept),
they should sum to this value.
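A sketch of the PROC REG call for this question is given below, again assuming a data set named NEWAIR2 (a hypothetical name) that already contains the product terms.

```sas
/* Assumption: NEWAIR2 contains so2no, so2dust, and nodust from a data step. */
proc reg data=newair2;
  model co = so2 no dust so2no so2dust nodust
        / influence   /* hat values, DFFITS, and related diagnostics */
          r           /* residual analysis, including studentized residuals */
          vif         /* variance inflation factors */
          ss1         /* Type I (sequential) sums of squares */
          ss2;        /* Type II (partial) sums of squares */
run;
quit;
```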
8. Use PROC GLM to compute the F-statistics and p-values for the
full vs. reduced model tests based on the partial sums of squares. State the null and alternative hypotheses
associated with the F-test for NO. Is NO
a significant predictor of CO in the presence of the other variables in this model? Now, report the t-statistic and associated
p-value that could also be used to test this hypothesis. Verify that the t-test and F-test are
equivalent by showing that F = t^{2}.
H0: B_{NO} = 0
Ha: B_{NO} ≠ 0
The value
of the F-statistic is 98.72 and the associated p-value for the F-test is
<0.0001. Therefore, at a 5%
significance level, the p-value is less than 0.05, so we reject the null
hypothesis and conclude that NO is a significant predictor of CO in the presence
of the other variables in the model.
The value of the t-statistic is 9.94 and the associated p-value is <0.0001.
Squaring the t-statistic gives (9.94*9.94) = 98.80, which agrees with
F = 98.72 up to rounding in the reported t-statistic, thus verifying the
equivalence of the two tests.
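The F-tests and t-tests compared above can be obtained from a single PROC GLM run, sketched here under the assumption that the product terms live in a data set named NEWAIR2 (a hypothetical name).

```sas
/* Assumption: NEWAIR2 contains so2no, so2dust, and nodust from a data step. */
proc glm data=newair2;
  model co = so2 no dust so2no so2dust nodust
        / ss3         /* Type III (partial) sums of squares and F-tests */
          solution;   /* parameter estimates with t-statistics */
run;
quit;
```

Each F-statistic in the Type III table has one numerator degree of freedom, so it should equal the square of the corresponding t-statistic in the parameter estimates table, apart from rounding in the printed values.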
9. In question 4, we saw
evidence of potential interactions between SO2 and DUST and between NO and
DUST. Using either a t-test or an
F-test, determine whether the interaction between SO2 and DUST is significant
in this model. Do the same thing for the
interaction between NO and DUST.
To test whether or not an interaction is
significant in the model, simply perform the test to determine whether or not
the coefficient on the interaction term is equal to 0. That is, conduct the following hypothesis
test
H0: B_{interaction} = 0
Ha: B_{interaction} ≠ 0
Using the
t-tests from the parameter estimates section of the PROC REG output, the
t-statistic for testing the interaction between NO and DUST is -4.91 and the
p-value for the associated hypothesis test is <0.0001. For the interaction between SO2 and DUST, the
t-statistic is -4.44 and the p-value is <0.0001.
Using the
F-statistics from the Type III SS section in the PROC GLM output, the
F-statistic for testing the interaction between NO and DUST is 24.09 and the
p-value for the associated hypothesis test is <0.0001. For the interaction between SO2 and DUST, the
F-statistic is 19.75 and the p-value is <0.0001.
So, in each
case we would reject the null hypothesis and conclude that the interaction term
is significant in the model.
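As a further check, the same hypotheses can be tested explicitly with TEST statements in PROC REG. This is a sketch, assuming the product terms are stored in a data set named NEWAIR2; the data set name and the TEST labels are hypothetical.

```sas
/* Assumption: NEWAIR2 contains so2no, so2dust, and nodust from a data step. */
proc reg data=newair2;
  model co = so2 no dust so2no so2dust nodust;
  /* Each TEST statement gives an F-test of H0: coefficient = 0. */
  so2_by_dust: test so2dust = 0;
  no_by_dust:  test nodust = 0;
run;
quit;
```

The F-statistics from these TEST statements should match the Type III F-statistics from PROC GLM and the squares of the t-statistics in the parameter estimates table.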