* Display syntax commands in the Output Viewer .
SET Printback=On Length=59 Width=80.

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .
* File:   linreg.SPS .
* Date:   13-March-2001 .
* Author: Bruce Weaver, weaverb@mcmaster.ca .
* Notes:  Demonstration of simple linear regression and correlation
          using the dataset described in my notes .
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .

* NOTE: I am deliberately using a very small dataset so that it is
  easier for us to see what is happening in the analysis .

* First, read in the 6 X-Y pairs .
DATA LIST LIST /x(f2.0) y(f2.0) .
BEGIN DATA.
20 20
30 50
45 35
60 60
78 45
88 90
END DATA.

var lab x 'Spelling score' y 'Writing score' .

* List the data, and show descriptive stats on X and Y .
* To produce the following syntax:  ANALYZE-->REPORTS-->Case Summaries .
SUMMARIZE
  /TABLES=x y
  /FORMAT=VALIDLIST NOCASENUM TOTAL LIMIT=100
  /TITLE='Case Summaries'
  /MISSING=VARIABLE
  /CELLS=COUNT MEAN STDDEV VAR .

* To get the following syntax:  ANALYZE-->CORRELATE-->BIVARIATE .
CORRELATIONS
  /VARIABLES=x y
  /PRINT=TWOTAIL SIG
  /MISSING=PAIRWISE .

* Note that the mean of the Y-scores = 50 .
* Create a variable YBAR, and set it = 50 .
* To produce the following syntax:  TRANSFORM-->COMPUTE .
compute ybar = 50.
exe.

* Compute the regression equation for predicting Writing score from
  Spelling score; ask SPSS to SAVE the predicted Y-score for each
  person, as well as the "residual" score; also generate a scatterplot .
* ANALYZE-->REGRESSION-->LINEAR .
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF CI ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT y
  /METHOD=ENTER x
  /scatterplot=(y,x)
  /SAVE PRED RESID .

* Note that the p-value for the F-test in the ANOVA table of the
  REGRESSION output is identical to the 2-tailed p-value for the
  correlation between X and Y we saw earlier .
* SS(Total)      = 2850.000 .
* SS(Regression) = 1683.762 .
* SS(Residual)   = 1166.238 .
* The regression equation is:  Y-prime = 0.686(X) + 13.307 .

* The /SAVE PRED RESID you see in the REGRESSION syntax shown above
  asked SPSS to save each subject's predicted score, as well as their
  residual score (i.e., the error in prediction) .
* Let's compute our own predicted Y-scores using this equation, and
  compare them to the predicted scores SPSS saved as variable pre_1 .
* TRANSFORM-->COMPUTE .
compute my_pred = 0.686*x + 13.307 .
compute my_res  = y - my_pred.
exe.
var lab my_pred 'My predicted Y-score' my_res 'My residual score' .

* Now compute the differences between the predicted and residual scores
  generated by SPSS and those we computed ourselves; the differences
  should be 0, or very close to it (there may be some rounding error
  in our predictions) .
compute diff1 = pre_1 - my_pred.
compute diff2 = res_1 - my_res.
exe.
var lab diff1 '(SPSS Y-prime) - (My Y-prime)'
        diff2 '(SPSS residual) - (My residual)'.

* ANALYZE-->DESCRIPTIVE STATISTICS-->DESCRIPTIVES .
descrip pre_1 my_pred diff1 res_1 my_res diff2.

* As you can see, there is a bit of rounding error in our estimates .
* The funny notation you see in some cells is scientific notation;
  e.g., 8.000E-03 means 8.000 times 10 to the minus 3, or 0.008 .

* Another way to look at the agreement of our computations with those
  of SPSS is to crosstabulate the variables as follows .
* ANALYZE-->DESCRIPTIVE STATISTICS-->CROSSTABS .
crosstabs /tables pre_1 by my_pred /tables res_1 by my_res.

* Show that SS(Y) = SS(regression) + SS(residual) .
compute sqdevy   = (y - ybar)**2 .     /* **2 means 'squared' */
compute sqdevreg = (pre_1 - ybar)**2 .
compute sqdevres = (y - pre_1)**2 .
exe.
var lab sqdevy   '(Y - Ybar)**2'
        sqdevreg '(Yprime - Ybar)**2'
        sqdevres '(Y - Yprime)**2' .

* ANALYZE-->COMPARE MEANS-->MEANS .
means sqdevy sqdevreg sqdevres /cells = count sum mean.

* Compare the SUMS shown above to the Sums of Squares in the ANOVA
  table generated by the REGRESSION command; they are identical .
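* (A minimal sketch, not part of the original handout:  the slope and
  intercept can also be verified by hand, because b = SP(XY)/SS(X) and
  a = Ybar - b(Xbar).  The variables XBAR, SPXY, and SSX below are my
  own additions; the value 53.5 is the mean of X taken from the Case
  Summaries output above.) .
compute xbar = 53.5.
compute spxy = (x - xbar)*(y - ybar).  /* cross-products */
compute ssx  = (x - xbar)**2 .         /* squared deviations of X */
exe.

* The SUMs below should give SP(XY) = 2455 and SS(X) = 3579.5, so
  b = 2455/3579.5 = 0.686 and a = 50 - 0.686(53.5) = 13.3, matching
  the coefficients in the REGRESSION output (within rounding) .
means spxy ssx /cells = count sum.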
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .
* REGRESSION OF X ON Y .
* If you wished to predict X from Y, you would have to compute a
  different equation, because the errors in prediction would be
  measured differently (i.e., errors would be measured on the X-axis
  rather than the Y-axis) .

* Regression of X on Y (i.e., predicting X from Y) .
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF CI ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT x
  /METHOD=ENTER y .

* SS(Total)      = 3579.500 .
* SS(Regression) = 2114.746 .
* SS(Residual)   = 1464.754 .
* NOTE that the sum of the squared errors in prediction for the
  regression of X on Y, or SS(residual), is not the same as for the
  regression of Y on X; this is because we are now partitioning SS(X)
  rather than SS(Y), and so we have a completely different set of
  error scores (or residuals) .
* The regression equation is:  X-prime = 0.861(Y) + 10.430 .

* Finished.

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .
* ADDENDUM:  USING INTERACTIVE GRAPHICS TO MAKE SCATTERPLOTS .
* You can use Interactive Graphics to produce a scatterplot WITH the
  regression line included; here's some syntax I generated by going
  to Graphs-->Interactive .
* GRAPHS-->INTERACTIVE-->SCATTERPLOT .
IGRAPH
  /VIEWNAME='Scatterplot'
  /X1 = VAR(x) TYPE = SCALE
  /Y = VAR(y) TYPE = SCALE
  /COORDINATE = VERTICAL
  /FITLINE METHOD = REGRESSION LINEAR INTERVAL(95.0) = MEAN INDIVIDUAL
   LINE = TOTAL SPIKE=OFF
  /TITLE='Writing Ability' + ' as a Function of Spelling Competence'
  /X1LENGTH = 3.0
  /YLENGTH = 3.0
  /X2LENGTH = 3.0
  /CHARTLOOK = 'Grayscale.clo'
  /SCATTER COINCIDENT = NONE.
EXE.

* The preceding figure includes 2 intervals around the regression line .
* The WIDER interval is the 95% confidence interval for making a
  prediction about a SINGLE CASE with a given value of X .
* The NARROWER interval is the 95% confidence interval for the MEAN
  of all cases with a given value of X .
* Which interval is appropriate depends on what you are using it for.
  You don't need to know about this for this particular course.  But
  if anyone needs to know more about it at a later date, here's a
  website with some more information:
  http://courses.washington.edu/qsci483/lab3/ .
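* (A further sketch, again my addition rather than part of the original
  handout:  if Interactive Graphics is not available, a plain
  scatterplot, without the fitted line, can be drawn with the standard
  GRAPH command.) .
* GRAPHS-->SCATTER .
GRAPH
  /SCATTERPLOT(BIVAR)=x WITH y
  /TITLE='Writing Ability as a Function of Spelling Competence'.

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .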