* ======================================================================= .
*  File:  	linreg_using_subset.SPS .
*  Date:  	13-Nov-2003 .
*  Author:      Bruce Weaver, weaverb@mcmaster.ca .
*  Notes:	Perform linear regression on a subset of cases, but
*		save predicted scores for ALL cases in the file .
* ======================================================================= .

* This file shows how to perform linear regression using only a subset
* of the cases in the working data file (by using the /SELECT subcommand
* of the REGRESSION procedure).  

* But note that even if the regression equation is computed using only
* a subset of cases, predicted scores are saved for ALL cases in the
* working data file when you elect to save predicted scores.

* The data is from Exercise 9.32 in Dave Howell's book "Statistical
* Methods for Psychology" (4th Ed), ISBN: 0-534-51993-8.

DATA LIST LIST /height (f5.1) weight (f5.0).
BEGIN DATA.
70.00	150
67.00	140
72.00	180
75.00	190
68.00	145
69.00	150
71.50	164
71.00	140
72.00	142
69.00	136
67.00	123
68.00	155
66.00	140
72.00	145
73.50	160
73.00	190
69.00	155
73.00	165
72.00	150
74.00	190
72.00	195
71.00	138
74.00	160
72.00	155
70.00	153
67.00	145
71.00	170
72.00	175
69.00	175
73.00	170
74.00	180
66.00	135
71.00	170
70.00	157
70.00	130
75.00	185
74.00	190
71.00	155
69.00	170
70.00	155
72.00	215
67.00	150
69.00	145
73.00	155
73.00	155
71.00	150
68.00	155
69.50	150
73.00	180
75.00	160
66.00	135
69.00	160
66.00	130
73.00	155
68.00	150
74.00	148
73.50	155
END DATA.

compute case = $casenum.
exe.
format case (f3.0).

var lab
 weight	'Body Weight (lbs)'
 height	'Height (inches)'
 case		'Case number'.


*=== Regression of Weight on Height:  All Cases ====== .

* Compute regression equation using all cases in the file.
* Save predicted scores (i.e., predicted weights) as PRED1.

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT weight
  /METHOD=ENTER height
  /RESIDUALS HIST(ZRESID) NORM(ZRESID)
  /SAVE PRED (PRED1) .



*=== Regression of Weight on Height:  Use subset of cases (1) ====== .

* This time, use a subset of the cases when computing the regression equation.
* First, compute a binary variable to divide the cases into two groups.
* Arbitrarily, I will set variable FLAG = 1 for case numbers 6 and greater,
* and FLAG = 0 for cases 1-5.

compute FLAG = (case GE 6).
exe.
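
* A quick check (not part of the original example) that FLAG was
* computed as intended: with the 57 cases read above, the table should
* show 5 cases with FLAG = 0 and 52 cases with FLAG = 1.

frequencies variables = FLAG.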

* One way to compute a regression equation using only the cases with FLAG=1
* is to set a filter.  Let's try this method first.

use all.
* Use only cases with FLAG = 1 (cases with FLAG = 0 are filtered out).
filter by FLAG.
exe.

* Save predicted Y-score as PRED2.

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT weight
  /METHOD=ENTER height
  /RESIDUALS HIST(ZRESID) NORM(ZRESID)
  /SAVE PRED (PRED2) .

* Notice that no predicted Y-scores were saved for cases 1-5,
* because they were filtered out.



*=== Regression of Weight on Height:  Use subset of cases (2) ====== .

* There is another, perhaps better way to use only a subset of cases
* when computing a regression equation.  The REGRESSION procedure 
* has a /SELECT subcommand, which is used to specify a subset of
* cases you wish to use when computing the regression equation.
* To use it via the pull-down menus, click on Analyze-->Regression-->
* Linear to open the main dialog for REGRESSION, then move your
* selection variable (FLAG in our example) into the "Selection Variable"
* box, and click on the Rule button.  Set the Rule as you wish
* (e.g., FLAG = 1 for our example), and then proceed as usual.
* The syntax version is shown below.

* NOTE:  We have a filter set, and need to turn it off before going on.

use all.
filter off.

* Save predicted Y-scores to variable PRED3 .

REGRESSION
  /SELECT= FLAG EQ 1			/* use only cases with FLAG=1  */
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT weight
  /METHOD=ENTER height
  /RESIDUALS HIST(ZRESID) NORM(ZRESID)
  /SAVE PRED (PRED3) .


* Notice that THIS time, predicted scores have been saved for cases 1-5,
* despite the fact that they were not included when the regression
* equation was computed.  

compute diff1 = pred1 - pred2.
compute diff2 = pred2 - pred3.
exe.
var lab 
 pred1 'PRED1 (all cases)'
 pred2 'PRED2 (FILTER method)'
 pred3 'PRED3 (/SELECT method)'
 diff1 'PRED1 - PRED2'
 diff2 'PRED2 - PRED3'.
format diff1 diff2 (f8.5).
descrip pred1 pred2 pred3 diff1 diff2.
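
* A sketch of how you might inspect the saved scores directly (this
* assumes PRED1, PRED2 and PRED3 were saved by the three REGRESSION
* runs above).  For cases 1-5, PRED2 should be missing while PRED3
* holds a value.

list variables = case FLAG pred1 pred2 pred3
  /cases = from 1 to 10.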

* The predicted scores from the first regression (using all cases)
* are not equal to the predicted scores from the 2nd regression 
* (using only cases 6-57 to develop the equation).  The difference
* between these two predicted scores is PRED1-PRED2.

* It should reassure you that for cases 6-57, the predicted scores in
* variables PRED2 and PRED3 are identical (DIFF2 = 0).  In other words,
* the same cases (6-57) were used to compute the equation with both the
* FILTER method and the /SELECT method.  DIFF2 is missing for cases 1-5,
* because PRED2 was not saved for them.


*====== How/when might you use /SELECT? =========.

* The /SELECT subcommand is very useful if you want to cross-validate
* a regression model.  Cross-validation refers to using one set of data
* to develop a regression equation, and then using that regression
* equation to generate predicted scores for another data set.  The
* correlation between the actual scores and predicted scores in the
* 2nd data set is an index of the validity of the regression equation.
* That is, if the regression equation is valid, the correlation between
* observed and predicted scores in the 2nd data set ought to be nearly
* as good as in the original data set.  It will usually not be AS good,
* because as Dave Howell puts it in his book, regression does its best
* to fit "every bump and wiggle" in the data set used to develop
* the equation.
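
* A minimal sketch of that idea using the variables created above (it
* assumes FLAG and PRED3 exist; it is an illustration, not part of
* Howell's exercise).  The correlation between observed and predicted
* weight is computed separately for the holdout cases (FLAG = 0) and
* for the cases used to fit the equation (FLAG = 1).  With only 5
* holdout cases the first correlation is very unstable, so in practice
* you would hold out a much larger share of the file.

temporary.
select if (FLAG EQ 0).
correlations /variables = weight pred3.

temporary.
select if (FLAG EQ 1).
correlations /variables = weight pred3.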

* End of file.

* ======================================================================= .