Multiple Regression Part 1
Estimating the Parameters of a Multiple Linear Regression Model Using SAS
The following data is an excerpt from a public data set containing employee, sales, and profit figures for several major companies in 1993. This data set is contained in the SAS/Insight sample data library (SASUSER.BUSINESS) and was used to produce this example. A partial listing of the data follows:
OBS COMPANY NATION INDUSTRY EMPLOYS SALES PROFITS
1 Lucas Industries
2 GKN
3 GEC
4 Grand Metropolitan
5 Unilever
6 Allied-
7 Guinness
8 Hillsdown Holdings
9 Assoc. British Foods
10 Tate & Lyle
11 Cadbury Schweppes
12 United Biscuits
13 Harrisons Crossfield
14 Unigate
15 Royal Dutch / Shell
16 British Petroleum
17 Renault
18 Peugeot
19 Valeo
20 Alcatel Alsthom
21 Thomson
22 Schneider
23 Danone Group
24 Besnier
25 ELF Aquitaine
26 Total
27
28 Volkswagen
29 Robert Bosch
30 BMW
31 MAN
32 ZF Friedrichshafen
33 Siemens
34 Veba Oel
35 Nissan Motor
36
37 Honda Motor
38 Mitsubishi Motors
39 Mazda Motor
40 Isuzu Motors
We will use this data to predict profits (in millions of dollars) from a company's sales (in millions of dollars) and the number of employees of the company (in thousands).
1. Create a scatterplot matrix illustrating the relationships among the variables:
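A scatterplot matrix like the one described here could be produced in code. The sketch below assumes a SAS release that includes the SGSCATTER procedure (9.2 or later); the original example used the interactive SAS/Insight tool instead, so this is an alternative route rather than the method used to produce the handout.

```sas
/* Scatterplot matrix of the response and the two explanatory variables. */
/* Assumption: PROC SGSCATTER is available (ODS Statistical Graphics).   */
proc sgscatter data=sasuser.business;
  matrix profits sales employs / diagonal=(histogram);
run;
```

Each off-diagonal panel plots one pair of variables, so the PROFITS row shows how profits relate to each predictor taken one at a time.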
2. Create partial regression plots for each of the variables.
DEFINITION: A partial regression plot (sometimes called an added-variable plot) displays the relationship between the response variable, $y$, and an explanatory variable, $x_i$, after removing the effect of the other explanatory variables.
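To make the phrase "after removing the effect of the other explanatory variables" concrete, here is a minimal sketch of building the partial regression plot for EMPLOYS by hand. The work data set names and residual variable names are hypothetical, and PROC SGPLOT is assumed to be available.

```sas
/* Residuals of PROFITS after removing the linear effect of SALES */
proc reg data=sasuser.business noprint;
  model profits = sales;
  output out=work.res_y r=res_profits;   /* hypothetical names */
run;
quit;

/* Residuals of EMPLOYS after removing the linear effect of SALES */
proc reg data=sasuser.business noprint;
  model employs = sales;
  output out=work.res_x r=res_employs;   /* hypothetical names */
run;
quit;

/* One-to-one merge: both output data sets keep the input row order */
data work.partial;
  merge work.res_y(keep=res_profits) work.res_x(keep=res_employs);
run;

/* Plot one set of residuals against the other, with a fitted line */
proc sgplot data=work.partial;
  reg x=res_employs y=res_profits;
run;
```

The slope of the fitted line in this plot equals the coefficient of EMPLOYS in the full two-variable model, which is why the plot is useful for judging what a variable adds once the others are accounted for.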
You can create partial regression plots in SAS automatically by using the REG procedure with the PARTIAL option in the MODEL statement. The code and output for this data set are given below:
proc reg data=sasuser.business corr ;
   model profits = employs sales / partial ;
run ;
[Figure: Partial Regression Residual Plot from the REG Procedure, Model: MODEL1. Vertical axis: PROFITS (Profits in $ Millions); horizontal axis: EMPLOYS (Employees in Thousands).]
The SAS System
[Figure: Partial Regression Residual Plot from the REG Procedure, Model: MODEL1. Vertical axis: PROFITS (Profits in $ Millions); horizontal axis: SALES (Sales in $ Millions).]
Look at the two simple regression models:
Regression of PROFITS on SALES:
Regression of PROFITS on EMPLOYS:
Note: The output shown above is from SAS/Insight. You could also fit these two regression models using the following SAS code:
proc reg data=sasuser.business ;
   model profits = sales ;
   plot profits*sales ;
run ;
proc reg data=sasuser.business ;
   model profits = employs ;
   plot profits*employs ;
run ;
Now look at the correlation matrix (produced by the CORR option) and the output from regressing PROFITS on EMPLOYS and SALES.
The SAS System
Correlation
CORR EMPLOYS SALES PROFITS
EMPLOYS 1.0000 0.7298 0.3619
SALES 0.7298 1.0000 0.5969
PROFITS 0.3619 0.5969 1.0000
Model: MODEL1
Dependent Variable: PROFITS   Profits in $ Millions

Analysis of Variance

                        Sum of           Mean
Source      DF         Squares         Square   F Value   Prob>F
Model        2     42711485.23   21355742.615    35.503   0.0001
Error      122     73385789.57   601522.86532
C Total    124     116097274.8

Root MSE   775.57905     R-square   0.3679
Dep Mean   442.16000     Adj R-sq   0.3575
C.V.       175.40688

Parameter Estimates

                  Parameter      Standard     T for H0:
Variable   DF      Estimate         Error   Parameter=0   Prob > |T|
INTERCEP    1    -13.325360   90.80843729        -0.147       0.8836
EMPLOYS     1     -1.553555    1.03671710        -1.499       0.1366
SALES       1      0.030346    0.00448738         6.762       0.0001
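3. Based on the computer printout, we can write the prediction equation as follows:
$\hat{Y} = -13.3254 - 1.5536\,X_1 + 0.0303\,X_2$
where $X_1$ is the number of employees (in thousands) and $X_2$ is sales (in millions of dollars).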
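As a quick illustration of how the prediction equation is used, consider a hypothetical company (not one from the data set) with 100 thousand employees and sales of 20,000 million dollars. Plugging these assumed values into the fitted equation gives:

```latex
\begin{align*}
\hat{Y} &= -13.3254 - 1.5536\,(100) + 0.0303\,(20000) \\
        &= -13.3254 - 155.36 + 606.00 \\
        &\approx 437.3 \quad \text{(millions of dollars)}
\end{align*}
```

The predicted profit is only an estimate of the average profit for companies with these characteristics, and it uses the rounded coefficients shown in the equation.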
4. Inference for the multiple regression model:
Overall test for the model, which basically asks: "Is this model worthwhile?"
Hypothesis test:
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_a$: At least one $\beta_i$ is not equal to 0
Test Statistic: F = MSR / MSE
So, if F is large, then MSR (the variation explained per model degree of freedom) is large relative to MSE (the error variation in the model).
A large F (small p-value) says to reject $H_0$.
Rule: If the p-value (Prob > F) is small, then we reject the null hypothesis and conclude that at least one of the explanatory variables has some effect on the response variable.
For this example, the value of F is 35.503 and the p-value is 0.0001. Therefore, we reject the null hypothesis and conclude that either the amount of sales or the number of employees (or both) has some effect on profit.
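The F value can be traced back to the Analysis of Variance table. The short derivation below uses the sums of squares and degrees of freedom from the printout, with n = 125 observations and k = 2 explanatory variables (both inferred from the DF column):

```latex
\begin{align*}
MSR &= \frac{SSR}{k} = \frac{42711485.23}{2} = 21355742.615 \\
MSE &= \frac{SSE}{n-k-1} = \frac{73385789.57}{122} \approx 601522.865 \\
F   &= \frac{MSR}{MSE} = \frac{21355742.615}{601522.865} \approx 35.503
\end{align*}
```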
Can we be more specific and determine which individual variables are significant?
Tests of individual partial regression coefficients: "Does $x_i$ have a significant effect on the response variable in the presence of the other variables?"
Hypothesis test:
$H_0: \beta_i = 0$
$H_a: \beta_i \neq 0$
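Test Statistic: $t = b_i / s_{b_i}$, with $n - k - 1$ degrees of freedom
Example from SAS output: for SALES, t = 0.030346 / 0.00448738 = 6.762 with Prob > |T| = 0.0001; for EMPLOYS, t = -1.553555 / 1.03671710 = -1.499 with Prob > |T| = 0.1366.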
Rule: If the p-value (Prob > |T|) is small, then we reject the null hypothesis and conclude that there is a significant relationship between $x_i$ and $y$ after controlling for the other explanatory variables in the model.
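The two-sided p-values in the Prob > |T| column can be checked with a short DATA step. This is a sketch that copies the estimates and standard errors from the Parameter Estimates table and uses the PROBT function with 122 error degrees of freedom; the variable names are arbitrary.

```sas
data _null_;
  df = 122;                                  /* error DF: n - k - 1 */

  /* SALES: estimate divided by its standard error */
  t_sales = 0.030346 / 0.00448738;
  p_sales = 2 * (1 - probt(abs(t_sales), df));

  /* EMPLOYS: estimate divided by its standard error */
  t_employs = -1.553555 / 1.03671710;
  p_employs = 2 * (1 - probt(abs(t_employs), df));

  /* Write the results to the SAS log */
  put t_sales= p_sales= / t_employs= p_employs=;
run;
```

Applying the rule to the printout, SALES (p = 0.0001) has a significant relationship with profits after controlling for the number of employees, while EMPLOYS (p = 0.1366) is not significant at the 0.05 level once sales are in the model.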
Constructing confidence intervals for individual partial regression coefficients: $b_i \pm t_{\alpha/2,\,n-k-1}\, s_{b_i}$
Example from SAS output:
To construct a 95% confidence interval for the partial regression parameter for sales:
$t_{0.025,\,122} \approx 1.98$
$b_i = 0.0303$
$s_{b_i} = 0.0045$
CI: $0.0303 \pm 0.0089$
So, after controlling for the number of employees, we are 95% confident that the true value of the partial regression parameter for sales is between 0.021 and 0.039.
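SAS can also report these limits directly. The sketch below requests confidence limits for the parameter estimates with the CLB option on the MODEL statement, and then repeats the hand calculation in a DATA step using the TINV function to obtain the critical value for 122 degrees of freedom. The DATA step variable names are arbitrary.

```sas
/* Confidence limits for each parameter estimate (95% by default) */
proc reg data=sasuser.business;
  model profits = employs sales / clb alpha=0.05;
run;
quit;

/* Hand calculation for SALES using values from the printout */
data _null_;
  t_crit     = tinv(0.975, 122);            /* upper 2.5% point of t(122) */
  half_width = t_crit * 0.00448738;         /* t times the standard error */
  lower      = 0.030346 - half_width;
  upper      = 0.030346 + half_width;
  put t_crit= half_width= lower= upper=;    /* results go to the log */
run;
```

Using the unrounded estimate and standard error may change the last digit of the limits slightly compared with a calculation based on rounded values.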
Coefficient of Multiple Determination: $R^2$
Definition: $R^2$ is the proportion of the total variation in y explained by the simultaneous predictive power of all of the explanatory variables through the multiple regression model.
$R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$
Properties of $R^2$:
· $0 \le R^2 \le 1$
· $R^2$ represents the proportional reduction of the total variation in y associated with the use of the set of predictor variables
· $R^2$ never decreases as more variables are added to the model
Example: In our example, $R^2$ = 0.3679, so we can say that 36.79% of the variability that we see in profits is explained by differences in the amount of sales and the number of employees.
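The reported value can be verified from the Analysis of Variance table, where the Model row gives SSR, the Error row gives SSE, and the C Total row gives SST:

```latex
\begin{align*}
R^2 &= \frac{SSR}{SST} = \frac{42711485.23}{116097274.8} \approx 0.3679 \\
    &= 1 - \frac{SSE}{SST} = 1 - \frac{73385789.57}{116097274.8} \approx 1 - 0.6321 = 0.3679
\end{align*}
```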
Adjusted $R^2$: SAS also includes an adjusted $R^2$ value as part of its output. This statistic also looks at the proportion of variation explained by the model. However, it then adjusts that value based on the number of variables included in the model.
Adj. R-square $= 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$
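For the model in this example, with n = 125 and k = 2 as implied by the degrees of freedom in the printout, the adjusted value works out as follows:

```latex
\begin{align*}
R^2_{adj} &= 1 - \frac{73385789.57 / 122}{116097274.8 / 124} \\
          &\approx 1 - \frac{601522.87}{936268.35} \\
          &\approx 1 - 0.6425 = 0.3575
\end{align*}
```

This agrees with the Adj R-sq value of 0.3575 reported by SAS.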
Use the adjusted $R^2$ value to compare multiple regression models.
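One possible way to make this comparison in SAS is the SELECTION=ADJRSQ option of the MODEL statement in PROC REG, which fits the candidate subsets of the listed variables and ranks them by adjusted R-square. This is a sketch of one approach, not something shown in the original example.

```sas
/* Fit all subsets of EMPLOYS and SALES and rank them by adjusted R-square */
proc reg data=sasuser.business;
  model profits = employs sales / selection=adjrsq;
run;
quit;
```

With only two candidate variables, the listing contains three models (SALES alone, EMPLOYS alone, and both together), so it shows directly whether adding EMPLOYS raises or lowers the adjusted value relative to the SALES-only model.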