Homework 4

The SAS data set SASUSER.AIR contains information regarding CO levels in the air.  For this example, we will use a modified version of that data set called NEWAIR.  The dependent (y) variable in this data set is CO – the carbon monoxide level in the air.  In this lab, we will build a predictive model for CO.  We have 4 potential predictors:

NO – the nitrogen oxide level in the air

SO2 – the sulfur dioxide level in the air

DUST – the amount of dust in the air

WIND – the wind speed

1.     Create plots of the dependent variable versus each of the independent variables.  Do the relationships appear to be linear?  What do you notice about the relationship between CO and WIND?  Based on the SAS/Insight demo from lab, what polynomial term(s) should also be included in the model?

The relationship with NO is highly linear.  The relationships with DUST and SO2 are slightly linear, but there is a lot of noise (i.e., scatter) in the data.  The relationship with WIND appears to be nonlinear.  Based on the SAS/Insight demo, it appears that a quadratic polynomial term for WIND should be included in the model.
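The plots for this question can be produced outside SAS/Insight as well.  The following is a minimal sketch, not the lab code; it assumes the modified data set is available as WORK.NEWAIR with variables named CO, NO, SO2, DUST, and WIND.

```sas
/* Assumption: the modified data set is stored as WORK.NEWAIR. */
proc sgscatter data=newair;
   /* One panel per predictor, with CO on the vertical axis. */
   plot co*(no so2 dust wind);
run;

/* Overlay a quadratic fit on the CO versus WIND scatter plot
   to judge whether a squared WIND term is plausible. */
proc sgplot data=newair;
   reg x=wind y=co / degree=2;
run;
```

If the quadratic curve tracks the points noticeably better than a straight line would, that supports adding a squared WIND term.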

2.     Use PROC CORR to examine the correlations between the variables.  Which variables are the most strongly related to CO?  Is there strong correlation between any of the independent variables?

NO has the strongest correlation with CO (r = 0.97).  DUST also has a strong correlation with CO (r = 0.74).  The correlation between CO and SO2 is very weak (r = 0.075).  The correlation between CO and WIND may be misleading, since the correlation coefficient measures only the strength of the linear relationship between 2 variables, and we have observed that the relationship between CO and WIND is nonlinear.  There is also a strong correlation between DUST and NO (r = 0.705).  Therefore, we should be aware of the possibility that multicollinearity may exist in the data set.
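A minimal sketch of the PROC CORR call that produces this correlation matrix, again assuming the data set is available as WORK.NEWAIR:

```sas
/* Assumption: the modified data set is stored as WORK.NEWAIR. */
proc corr data=newair;
   /* Pearson correlations among the response and all 4 predictors. */
   var co no so2 dust wind;
run;
```

The CO row of the matrix answers the first part of the question; the off-diagonal entries among NO, SO2, DUST, and WIND answer the second.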

3.     Use PROC REG to generate partial regression plots for each of the independent variables.  Why is it necessary to create partial regression plots?  For this data set, what do these plots tell us?

The scatter plots that we created in question 1 only show us the relationship between CO and each independent variable while IGNORING THE EFFECT of any other variables.  However, our interpretation of the coefficients in the multiple regression model requires us to look at the relationship between CO and each independent variable IN THE PRESENCE OF (or, AFTER ADJUSTING FOR) the other independent variables in the model.  The partial regression plots allow us to look at the relationship between CO and each independent variable after adjusting for the effect of the other independent variables, so they give us a better understanding of whether the relationships will still be linear when all of the independent variables are in the model.  For this data set, the relationship between CO and NO still appears to be strongly linear after adjusting for the effect of the other variables.  However, the relationships between CO and the other variables appear to be fairly weak (but not nonlinear) in the presence of the other variables in the model.
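One way to request these plots is the PARTIAL option on the MODEL statement.  This is a sketch under the assumption that the data set is WORK.NEWAIR and that the model contains the 4 main effects only; the exact model used in lab may differ.

```sas
/* Assumption: the modified data set is stored as WORK.NEWAIR. */
proc reg data=newair;
   /* PARTIAL requests a partial regression leverage plot
      for each regressor in the model. */
   model co = no so2 dust wind / partial;
run;
quit;
```

Each plot shows the residuals of CO against the residuals of one predictor, where both have been regressed on the remaining predictors.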

4.     We may also need to include interaction terms in our multiple regression model for CO.  Using the output from the lab demo, record the value of the slope coefficient for each dust category.  Do this for the SO2 model and the NO model.  Is there evidence of potential interaction between DUST and SO2?  Between DUST and NO?  Explain.

DUST CATEGORY    Slope for SO2    Slope for NO
1                -0.035           0.699
2                -0.145           0.618
3                 0.876           0.631
4                 4.151           0.763
5                -0.690           0.658

If there is an interaction between 2 variables, x1 and x2, then the relationship between x1 and y changes for different levels of the variable x2.  In this example, we have considered the relationship between CO and SO2 for different levels of DUST.  We can see that the slope changes dramatically across the levels (from -0.690 to 4.151, including a change of sign).  Therefore, it appears that there is evidence of an interaction between SO2 and DUST.  For NO, there are also changes in the value of the slope for the different levels of DUST (from 0.618 to 0.763).  Although the changes are less dramatic than for SO2, there is still evidence of a potential interaction between DUST and NO.  Therefore, we should include the interaction terms SO2*DUST and NO*DUST in our model and test whether those interaction terms are significant.
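The idea that the slope for x1 depends on the level of x2 can be written out for a generic two-predictor model.  The symbols below are generic notation rather than anything taken from the lab output.

```latex
E[y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2
\qquad\Longrightarrow\qquad
\frac{\partial E[y]}{\partial x_1} = \beta_1 + \beta_3 x_2
```

When \beta_3 = 0 the slope for x1 is the same at every level of x2, which is why testing the coefficient on the product term is a test for interaction.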

5.     In order to include interaction and polynomial terms in your regression model, you will need to create those terms using a data step.  PROC REG does not allow you to directly specify higher order terms in the MODEL statement.  Run the SAS code provided in lab to create the higher order terms.  Then, fit a regression model to the data using the following MODEL statement in PROC REG:

MODEL co = so2 no dust wind so2no so2dust so2wind nodust nowind dustwind wind2 ;

Are there any terms which appear to be insignificant in this regression model?  Which ones?  Give evidence from the SAS output to support your answer.

Consider using a 5% significance level.  The p-values in the Pr > |t| column then tell us which terms are insignificant (i.e., any term with a p-value greater than 0.05 is considered insignificant).  Based on this criterion, SO2WIND (p-value = 0.2248), NODUST (p-value = 0.1066), and DUSTWIND (p-value = 0.5023) are insignificant.
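The lab code is not reproduced in this document.  The following sketch shows what such a data step might look like; the definitions of the higher order terms are inferred from their names, and the input name WORK.NEWAIR and output name NEWAIR2 are assumptions.

```sas
/* Assumptions: input is WORK.NEWAIR; NEWAIR2 is a hypothetical
   output name; each term is the product its name suggests. */
data newair2;
   set newair;
   so2no    = so2*no;
   so2dust  = so2*dust;
   so2wind  = so2*wind;
   nodust   = no*dust;
   nowind   = no*wind;
   dustwind = dust*wind;
   wind2    = wind**2;    /* quadratic term for WIND */
run;

proc reg data=newair2;
   model co = so2 no dust wind so2no so2dust so2wind
              nodust nowind dustwind wind2;
run;
quit;
```

The Pr > |t| column of the resulting parameter estimates table is the one used in the answer above.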

6.     Using the stepwise model selection technique, wind and wind2 do not enter into the model.  So, we’ll drop wind, wind2, and the interactions with wind from the model.  Now consider the following model:

MODEL CO = SO2 NO DUST SO2NO SO2DUST NODUST ;

Run PROC REG for this model using SELECTION = FORWARD, SELECTION = BACKWARD, and SELECTION = STEPWISE.  Report the final model for each of these selection methods.  Are they the same?

For FORWARD selection, all independent variables (SO2, NO, DUST, SO2NO, SO2DUST, and NODUST) are included in the final model.

For BACKWARD selection, all independent variables (SO2, NO, DUST, SO2NO, SO2DUST, and NODUST) are included in the final model.

For STEPWISE selection, all independent variables (SO2, NO, DUST, SO2NO, SO2DUST, and NODUST) are included in the final model.

For this data set, FORWARD, BACKWARD, and STEPWISE all give the same final model.  However, note that this is not always the case.  That is, it is possible for the methods to come up with different final models.

Now, try changing the SLENTRY and SLSTAY criteria to 0.01 for the STEPWISE selection method.  Does your final model change?  Why?

Yes.  SO2 is no longer included in the final model.  Variables enter and leave the model based on the SLENTRY and SLSTAY criteria.  We have made the criteria more stringent (i.e., to enter the model, a variable must have a p-value less than 0.01, and a variable already in the model is removed if its p-value is greater than 0.01).
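The four selection runs in this question can be sketched as follows.  The data set name NEWAIR2 is hypothetical (it stands for whatever data set holds the interaction terms created in question 5), and the statement labels are arbitrary.

```sas
/* Assumption: NEWAIR2 is a hypothetical data set containing
   SO2NO, SO2DUST, and NODUST in addition to the original variables. */
proc reg data=newair2;
   fwd:    model co = so2 no dust so2no so2dust nodust
                 / selection=forward;
   bwd:    model co = so2 no dust so2no so2dust nodust
                 / selection=backward;
   step:   model co = so2 no dust so2no so2dust nodust
                 / selection=stepwise;
   /* Stricter entry and stay levels for the last part of the question. */
   step01: model co = so2 no dust so2no so2dust nodust
                 / selection=stepwise slentry=0.01 slstay=0.01;
run;
quit;
```

When SLENTRY and SLSTAY are not specified, PROC REG uses 0.50 for entry in FORWARD, 0.10 for staying in BACKWARD, and 0.15 for both in STEPWISE, so the last MODEL statement is considerably stricter than the default stepwise run.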

7.     Using the MODEL statement in question 6, run PROC REG again and include the INFLUENCE, R, VIF, SS1, and SS2 options in the model statement.  Does it appear that there are any outliers in the data set?  Give evidence from the output to support your answer.  Do the variance inflation factors indicate that there is evidence of multicollinearity?  Verify that the sequential sums of squares partition the model sums of squares into components associated with sequentially adding each variable to the model.

To identify potential outliers, compare the influence statistics (studentized residual, hat value, DFFITS) to the cutoffs given in class.  Generally, I would say that any observation having 2 or more of these statistics outside of the acceptable range would be considered a potential outlier.

If the maximum VIF is greater than 10, then that suggests that there may be multicollinearity present in the model.  In this case, the maximum VIF is 26.969 (which is greater than 10), so we do have evidence of multicollinearity.  This should not be surprising, since we have already noticed that the correlation between DUST and NO is about 0.7, and the interaction terms are products of the main-effect variables and so tend to be correlated with them.
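The cutoff of 10 can be related to the usual definition of the variance inflation factor, where R_j^2 denotes the R-square from regressing the j-th independent variable on all of the other independent variables (standard notation, not taken from the lab output).

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 10 \iff R_j^2 > 0.9,
\qquad
\mathrm{VIF}_j = 26.969 \implies R_j^2 = 1 - \frac{1}{26.969} \approx 0.963
```

In other words, roughly 96% of the variation in the term with the largest VIF is explained by the other terms in the model.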

The model SS for this model is 267.193.  If you add up the Type I SS (excluding the SS for the intercept), they should sum to this value.
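A sketch of the PROC REG call for this question, assuming the interaction terms are stored in a data set hypothetically named NEWAIR2:

```sas
/* Assumption: NEWAIR2 is a hypothetical data set containing
   SO2NO, SO2DUST, and NODUST in addition to the original variables. */
proc reg data=newair2;
   /* INFLUENCE and R print the case diagnostics, VIF prints the
      variance inflation factors, and SS1 and SS2 print the
      sequential (Type I) and partial (Type II) sums of squares. */
   model co = so2 no dust so2no so2dust nodust
         / influence r vif ss1 ss2;
run;
quit;
```

The Type I SS column depends on the order in which the variables are listed in the MODEL statement, while the Type II SS column does not.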

8.     Use PROC GLM to compute the F-statistics and p-values for the full vs. reduced model tests based on the partial sums of squares.  State the null and alternative hypotheses associated with the F-test for NO.  Is NO a significant predictor of CO in the presence of the other variables in this model?  Now, report the t-statistic and associated p-value that could also be used to test this hypothesis.  Verify that the t-test and F-test are equivalent by showing that F = t^2.

H0: β_NO = 0

Ha: β_NO ≠ 0

The value of the F-statistic is 98.72 and the associated p-value for the F-test is <0.0001.  Therefore, at a 5% significance level, the p-value is less than 0.05, so we reject the null hypothesis and conclude that NO is a significant predictor of CO in the presence of the other variables in the model.

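The value of the t-statistic is 9.94 and the associated p-value is 0.0001.  Squaring the t-statistic gives (9.94)^2 = 98.80, which agrees with F = 98.72 up to rounding in the reported t-statistic (the square root of 98.72 is about 9.936), thus verifying the equivalence of the 2 tests.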
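A sketch of the PROC GLM call that gives the partial (Type III) F-tests, assuming the interaction terms are stored in a data set hypothetically named NEWAIR2:

```sas
/* Assumption: NEWAIR2 is a hypothetical data set containing
   SO2NO, SO2DUST, and NODUST in addition to the original variables. */
proc glm data=newair2;
   /* SS3 restricts the sums of squares output to Type III,
      i.e., each term adjusted for all other terms in the model. */
   model co = so2 no dust so2no so2dust nodust / ss3;
run;
quit;
```

Because the model has no CLASS variables, PROC GLM also prints the parameter estimates with their t-statistics, so the F and t values for NO can be compared within the same output.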

9.     In question 4, we saw evidence of potential interactions between SO2 and DUST and between NO and DUST.  Using either a t-test or an F-test, determine whether the interaction between SO2 and DUST is significant in this model.  Do the same thing for the interaction between NO and DUST.

To test whether or not an interaction is significant in the model, simply perform the test to determine whether or not the coefficient on the interaction term is equal to 0.  That is, conduct the following hypothesis test:

H0: β_interaction = 0

Ha: β_interaction ≠ 0

Using the t-tests from the parameter estimates section of the PROC REG output, the t-statistic for testing the interaction between NO and DUST is -4.91 and the p-value for the associated hypothesis test is <0.0001.  For the interaction between SO2 and DUST, the t-statistic is -4.44 and the p-value is <0.0001.

Using the F-statistics from the Type III SS section in the PROC GLM output, the F-statistic for testing the interaction between NO and DUST is 24.09 and the p-value for the associated hypothesis test is <0.0001.  For the interaction between SO2 and DUST, the F-statistic is 19.75 and the p-value is <0.0001.

So, in each case we would reject the null hypothesis and conclude that the interaction term is significant in the model.
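The same single-coefficient F-tests can also be requested directly in PROC REG with TEST statements.  This is an alternative sketch rather than the method used in lab; the data set name NEWAIR2 is hypothetical and the statement labels are arbitrary.

```sas
/* Assumption: NEWAIR2 is a hypothetical data set containing
   SO2NO, SO2DUST, and NODUST in addition to the original variables. */
proc reg data=newair2;
   model co = so2 no dust so2no so2dust nodust;
   /* Each TEST statement gives an F-test of H0: coefficient = 0. */
   so2dust_test: test so2dust = 0;
   nodust_test:  test nodust = 0;
run;
quit;
```

Each F-statistic from a TEST statement should equal the square of the corresponding t-statistic in the parameter estimates table, up to rounding.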