The SAS data set SASUSER.AIR contains information regarding CO levels in the air. For this example, we will use a modified version of that data set called NEWAIR. The dependent (y) variable in this data set is CO, the carbon monoxide level in the air. In this lab, we will build a predictive model for CO. We have four potential predictors:
NO: the nitrogen oxide level in the air
SO2: the sulfur dioxide level in the air
DUST: the amount of dust in the air
WIND: the wind speed
1. Create plots of the dependent variable versus each of the independent variables. Do the relationships appear to be linear? What do you notice about the relationship between CO and WIND? Based on the SAS/Insight demo from lab, what polynomial term(s) should also be included in the model?
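As a starting point for question 1, the four scatter plots can be requested in a single step. This is a minimal sketch, assuming that NEWAIR sits in the WORK library and that PROC SGSCATTER is available in your SAS installation; PROC GPLOT with the same PLOT request would serve equally well.

```sas
/* Assumption: NEWAIR is in the WORK library; add a libref if yours differs. */
proc sgscatter data=newair;
  /* One panel per predictor, with CO on the vertical axis in each. */
  plot co*(no so2 dust wind);
run;
```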
2. Use PROC CORR to examine the correlations between the variables. Which variables are the most strongly related to CO? Is there strong correlation between any of the independent variables?
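A minimal PROC CORR call for question 2 might look like the following sketch, again assuming that NEWAIR is in the WORK library. Listing CO first on the VAR statement puts its correlations with the predictors in the first row and column of the output.

```sas
/* Assumption: NEWAIR is in the WORK library. */
proc corr data=newair;
  /* Pearson correlations (with p-values) among CO and the four predictors. */
  var co no so2 dust wind;
run;
```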
3. Use PROC REG to generate partial regression plots for each of the independent variables. Why is it necessary to create partial regression plots? For this data set, what do these plots tell us?
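For question 3, the PARTIAL option on the MODEL statement asks PROC REG for partial regression leverage plots. The sketch below assumes a model with the four main effects only, which may not be the exact model intended in lab, and assumes that NEWAIR is in the WORK library.

```sas
/* Assumption: main-effects model; NEWAIR is in the WORK library. */
proc reg data=newair;
  /* PARTIAL requests one partial regression leverage plot per regressor. */
  model co = so2 no dust wind / partial;
run;
quit;
```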
4. We may also need to include interaction terms in our multiple regression model for CO. Using the output from the lab demo, record the value of the slope coefficient for each dust category. Do this for the SO2 model and the NO model. Is there evidence of potential interaction between DUST and SO2? Between DUST and NO? Explain.
5. In order to include interaction and polynomial terms in your regression model, you will need to create those terms using a data step. PROC REG does not allow you to directly specify higher order terms in the MODEL statement. Run the SAS code provided in lab to create the higher order terms. Then, fit a regression model to the data using the following MODEL statement in PROC REG:
MODEL co = so2 no dust wind so2no so2dust so2wind nodust nowind dustwind wind2 ;
Are there any terms which appear to be insignificant in this regression model? Which ones? Give evidence from the SAS output to support your answer.
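The code provided in lab is the version to run for question 5. For reference, a data step of this kind might look like the sketch below. The variable names are taken from the MODEL statement above; the output data set name NEWAIR2 is hypothetical, and the sketch assumes that all four predictors are numeric.

```sas
/* Hypothetical output name NEWAIR2; assumes NEWAIR is in WORK. */
data newair2;
  set newair;
  /* Two-way interaction terms: products of pairs of predictors. */
  so2no    = so2  * no;
  so2dust  = so2  * dust;
  so2wind  = so2  * wind;
  nodust   = no   * dust;
  nowind   = no   * wind;
  dustwind = dust * wind;
  /* Quadratic term for wind speed. */
  wind2    = wind**2;
run;

proc reg data=newair2;
  model co = so2 no dust wind so2no so2dust so2wind nodust nowind dustwind wind2;
run;
quit;
```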
6. Under the stepwise model selection technique, WIND and WIND2 do not enter the model. So, we'll drop WIND, WIND2, and the interactions with WIND from the model. Now consider the following model:
MODEL CO = SO2 NO DUST SO2NO SO2DUST NODUST ;
Run PROC REG for this model using SELECTION = FORWARD, SELECTION = BACKWARD, and SELECTION = STEPWISE. Report the final model for each of these selection methods. Are they the same?
Now, try changing the SLENTRY and SLSTAY criteria to 0.01 for the STEPWISE selection method. Does your final model change? Why?
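The selection runs in question 6 can all be placed in one PROC REG step by labeling each MODEL statement. This is a sketch only: it assumes that the higher order terms from question 5 are stored in a data set here called NEWAIR2 (a hypothetical name), and the statement labels are arbitrary.

```sas
/* Assumption: NEWAIR2 (hypothetical name) holds the higher order terms. */
proc reg data=newair2;
  forward:  model co = so2 no dust so2no so2dust nodust / selection=forward;
  backward: model co = so2 no dust so2no so2dust nodust / selection=backward;
  stepwise: model co = so2 no dust so2no so2dust nodust / selection=stepwise;
  /* Stricter entry and stay significance levels for the last part. */
  strict:   model co = so2 no dust so2no so2dust nodust
                       / selection=stepwise slentry=0.01 slstay=0.01;
run;
quit;
```

When SLENTRY= and SLSTAY= are omitted, PROC REG falls back on its documented defaults for the chosen method, which are considerably more lenient than 0.01. Comparing the "strict" output with the "stepwise" output shows the effect of tightening them.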
7. Using the MODEL statement in question 6, run PROC REG again and include the INFLUENCE, R, VIF, SS1, and SS2 options in the MODEL statement. Does it appear that there are any outliers in the data set? Give evidence from the output to support your answer. Do the variance inflation factors indicate that there is evidence of multicollinearity? Verify that the sequential sums of squares partition the model sum of squares into components associated with sequentially adding each variable to the model.
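The options named in question 7 all go after the slash on the MODEL statement. A minimal sketch, assuming that the higher order terms are stored in a data set here called NEWAIR2 (a hypothetical name):

```sas
/* Assumption: NEWAIR2 (hypothetical name) holds the higher order terms. */
proc reg data=newair2;
  /* INFLUENCE and R: influence and residual diagnostics per observation. */
  /* VIF: variance inflation factors.                                     */
  /* SS1 and SS2: sequential (Type I) and partial (Type II) sums of       */
  /* squares for each term.                                               */
  model co = so2 no dust so2no so2dust nodust / influence r vif ss1 ss2;
run;
quit;
```

To check the partition, add up the Type I SS column for the six terms and compare the total with the model sum of squares in the analysis of variance table.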
8. Use PROC GLM to compute the F-statistics and p-values for the full vs. reduced model tests based on the partial sums of squares. State the null and alternative hypotheses associated with the F-test for NO. Is NO a significant predictor of CO in the presence of the other variables in this model? Now, report the t-statistic and associated p-value that could also be used to test this hypothesis. Verify that the t-test and F-test are equivalent by showing that F = t^2.
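One way to obtain both sets of statistics for question 8 from a single run is sketched below. It assumes the model from question 6 and that the higher order terms are stored in a data set here called NEWAIR2 (a hypothetical name).

```sas
/* Assumption: NEWAIR2 (hypothetical name) holds the higher order terms. */
proc glm data=newair2;
  /* SS3 prints the partial (Type III) sums of squares with F tests.  */
  /* SOLUTION prints the parameter estimates with their t statistics. */
  model co = so2 no dust so2no so2dust nodust / ss3 solution;
run;
quit;
```

Each term in this model has one degree of freedom, so squaring the t value for NO in the parameter estimates table should reproduce the F value for NO in the Type III table, up to rounding.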
9. In question 4, we saw evidence of potential interactions between SO2 and DUST and between NO and DUST. Using either a t-test or an F-test, determine whether the interaction between SO2 and DUST is significant in this model. Do the same thing for the interaction between NO and DUST.
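If you choose the F-test route for question 9, the TEST statement in PROC REG is one option. The sketch below assumes the model from question 6 and that the higher order terms are stored in a data set here called NEWAIR2 (a hypothetical name); the labels in front of each TEST statement are arbitrary.

```sas
/* Assumption: NEWAIR2 (hypothetical name) holds the higher order terms. */
proc reg data=newair2;
  model co = so2 no dust so2no so2dust nodust;
  /* Each TEST statement gives an F test that the named coefficient is zero. */
  so2_by_dust: test so2dust = 0;
  no_by_dust:  test nodust = 0;
run;
quit;
```

The t statistics for SO2DUST and NODUST in the parameter estimates table give the same conclusions, since each test involves a single coefficient.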