* =======================================================================
* File:    linreg_using_subset.SPS .
* Date:    13-Nov-2003 .
* Author:  Bruce Weaver, weaverb@mcmaster.ca .
* Notes:   Perform linear regression on a subset of cases, but
*          save predicted scores for ALL cases in the file .
* ======================================================================= .

* This file shows how to perform linear regression using only a subset
* of the cases in the working data file (by using the /SELECT subcommand
* of the REGRESSION procedure).

* But note that even if the regression equation is computed using only
* a subset of cases, predicted scores are saved for ALL cases in the
* working data file when you elect to save predicted scores.

* The data is from Exercise 9.32 in Dave Howell's book "Statistical
* Methods for Psychology" (4th Ed), ISBN: 0-534-51993-8.

DATA LIST LIST /height (f5.1) weight (f5.0).
BEGIN DATA.
70.00 150
67.00 140
72.00 180
75.00 190
68.00 145
69.00 150
71.50 164
71.00 140
72.00 142
69.00 136
67.00 123
68.00 155
66.00 140
72.00 145
73.50 160
73.00 190
69.00 155
73.00 165
72.00 150
74.00 190
72.00 195
71.00 138
74.00 160
72.00 155
70.00 153
67.00 145
71.00 170
72.00 175
69.00 175
73.00 170
74.00 180
66.00 135
71.00 170
70.00 157
70.00 130
75.00 185
74.00 190
71.00 155
69.00 170
70.00 155
72.00 215
67.00 150
69.00 145
73.00 155
73.00 155
71.00 150
68.00 155
69.50 150
73.00 180
75.00 160
66.00 135
69.00 160
66.00 130
73.00 155
68.00 150
74.00 148
73.50 155
END DATA.

compute case = $casenum.
exe.
format case (f3.0).

var lab
 weight 'Body Weight (lbs)'
 height 'Height (inches)'
 case 'Case number'.

*=== Regression of Weight on Height: All Cases ====== .

* Compute regression equation using all cases in the file.
* Save predicted scores (i.e., predicted weights) as PRED1.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT weight
  /METHOD=ENTER height
  /RESIDUALS HIST(ZRESID) NORM(ZRESID)
  /SAVE PRED (PRED1) .

*=== Regression of Weight on Height: Use subset of cases (1) ====== .

* This time, use a subset of the cases when computing the regression equation.
* First, compute a binary variable to divide the cases into two groups.
* Arbitrarily, I will set variable FLAG = 1 for case numbers 6 and greater,
* and FLAG = 0 for cases 1-5.

compute FLAG = (case GE 6).
exe.

* One way to compute a regression equation using only the cases with FLAG=1
* is to set a filter. Let's try this method first.

use all.
filter by FLAG.   /* That is, use only cases with FLAG = 1.
exe.

* Save predicted Y-score as PRED2.

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT weight
  /METHOD=ENTER height
  /RESIDUALS HIST(ZRESID) NORM(ZRESID)
  /SAVE PRED (PRED2) .

* Notice that no predicted Y-scores were saved for cases 1-5,
* because they were filtered out.

*=== Regression of Weight on Height: Use subset of cases (2) ====== .

* There is another, perhaps better way to use only a subset of cases
* when computing a regression equation. The REGRESSION procedure
* has a /SELECT subcommand, which is used to specify a subset of
* cases you wish to use when computing the regression equation.
* To use it via the pull-down menus, click on Analyze-->Regression-->
* Linear to open the main dialog for REGRESSION, then move your
* FILTER variable (FLAG in our example) into the "Selection Variable"
* box, and click on the Rule button. Set the Rule as you wish
* (e.g., FLAG = 1 for our example), and then proceed as usual.
* The syntax version is shown below.

* NOTE: We have a filter set, and need to turn it off before going on.

use all.
filter off.

* Save predicted Y-scores to variable PRED3 .
REGRESSION
  /SELECT= FLAG EQ 1    /* use only cases with FLAG=1 */
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT weight
  /METHOD=ENTER height
  /RESIDUALS HIST(ZRESID) NORM(ZRESID)
  /SAVE PRED (PRED3) .

* Notice that THIS time, predicted scores have been saved for cases 1-5,
* despite the fact that they were not included when the regression
* equation was computed.

compute diff1 = pred1 - pred2.
compute diff2 = pred2 - pred3.
exe.

var lab
 pred1 'PRED1 (all cases)'
 pred2 'PRED2 (FILTER method)'
 pred3 'PRED3 (/SELECT method)'
 diff1 'PRED1 - PRED2'
 diff2 'PRED2 - PRED3'.

format diff1 diff2 (f8.5).
descrip pred1 pred2 pred3 diff1 diff2.

* The predicted scores from the first regression (using all cases)
* are not equal to the predicted scores from the 2nd regression
* (using only cases 6-57 to develop the equation). The difference
* between these two predicted scores is DIFF1 (PRED1 - PRED2).

* It should reassure you that for cases 6-57, the predicted scores in
* variables PRED2 and PRED3 are identical (i.e., DIFF2 = 0). In other
* words, only cases 6-57 were included both when we used the FILTER
* method, and also when we used the /SELECT method.

*====== How/when might you use /SELECT? =========.

* The /SELECT subcommand is very useful if you want to crossvalidate
* a regression model. Crossvalidation refers to using one set of data
* to develop a regression equation, and then using that regression
* equation to generate predicted scores for another data set. The
* correlation between the actual scores and predicted scores in the
* 2nd data set is an index of validity of the regression equation.
* That is, if the regression equation is valid, the correlation between
* observed and predicted scores in the 2nd data set ought to be nearly
* as good as in the original data set. (It will never be AS good,
* because, as Dave Howell puts it in his book, regression does its best
* to fit "every bump and wiggle" in the data set used to develop
* the equation.)
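
*====== Sketch: checking the crossvalidation correlation =========.

* A minimal sketch of the crossvalidation check described above, assuming
* the variables FLAG and PRED3 created earlier in this file still exist.
* It is not part of the original example, and with only 5 holdout cases
* (FLAG = 0) the holdout correlation is far too unstable to interpret.
* It is shown only to illustrate the mechanics.
* TEMPORARY limits each SELECT IF to the next procedure only, so no cases
* are permanently dropped from the working data file.

* Correlation between observed and predicted weight in the holdout cases.

temporary.
select if (FLAG EQ 0).
correlations /variables = weight pred3.

* Same correlation in the cases used to develop the equation, for comparison.

temporary.
select if (FLAG EQ 1).
correlations /variables = weight pred3.
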
* End of file.
* ======================================================================= .