Chapter 11: Multiple Regression

Multiple Regression – Part 1

Estimating the parameters of a Multiple Linear Regression Model using SAS

The following data are an excerpt from a public data set containing employee, sales, and profit figures for several major companies in 1993. The data set is part of the SAS/Insight sample data library (SASUSER.BUSINESS) and was used to produce this example. A partial listing of the data follows:

OBS  COMPANY               NATION   INDUSTRY     EMPLOYS     SALES   PROFITS
  1  Lucas Industries      Britain  Automobiles       46    $3,864       $39
  2  GKN                   Britain  Automobiles       27    $3,037       $58
  3  GEC                   Britain  Electronics       93    $9,491      $907
  4  Grand Metropolitan    Britain  Food              87   $11,164      $629
  5  Unilever              Britain  Food             303   $41,843    $1,945
  6  Allied-Lyons          Britain  Food              71    $7,231      $488
  7  Guinness              Britain  Food              23    $7,006      $650
  8  Hillsdown Holdings    Britain  Food              43    $6,900      $142
  9  Assoc. British Foods  Britain  Food              50    $6,798      $353
 10  Tate & Lyle           Britain  Food              16    $5,633      $227
 11  Cadbury Schweppes     Britain  Food              39    $5,594      $365
 12  United Biscuits       Britain  Food              39    $5,174      $101
 13  Harrisons Crossfield  Britain  Food              31    $3,319       $91
 14  Unigate               Britain  Food              26    $2,978      $110
 15  Royal Dutch / Shell   Britain  Oil              116   $95,136    $4,504
 16  British Petroleum     Britain  Oil               73   $52,485      $924
 17  Renault               France   Automobiles      140   $29,977      $189
 18  Peugeot               France   Automobiles      144   $25,670     $-258
 19  Valeo                 France   Automobiles       25    $3,572      $125
 20  Alcatel Alsthom       France   Electronics      197   $27,600    $1,248
 21  Thomson               France   Electronics      100   $11,917         .
 22  Schneider             France   Electronics       91    $9,953       $52
 23  Danone Group          France   Food              56   $12,377      $604
 24  Besnier               France   Food              12    $4,103       $78
 25  ELF Aquitaine         France   Oil               94   $37,016      $189
 26  Total                 France   Oil               50   $23,917      $523
 27  Daimler-Benz          Germany  Automobiles      366   $59,102      $364
 28  Volkswagen            Germany  Automobiles      252   $46,312   $-1,232
 29  Robert Bosch          Germany  Automobiles      157   $19,634      $258
 30  BMW                   Germany  Automobiles       71   $17,546      $317
 31  MAN                   Germany  Automobiles       61   $12,106      $142
 32  ZF Friedrichshafen    Germany  Automobiles       27    $3,167       $34
 33  Siemens               Germany  Electronics      391   $50,381    $1,113
 34  Veba Oel              Germany  Oil                7    $6,246       $-1
 35  Nissan Motor          Japan    Automobiles      143   $53,760     $-805
 36  Toyota Motor          Japan    Automobiles      109   $85,283    $1,474
 37  Honda Motor           Japan    Automobiles       91   $35,798      $220
 38  Mitsubishi Motors     Japan    Automobiles       46   $27,311       $52
 39  Mazda Motor           Japan    Automobiles       33   $20,279     $-454
 40  Isuzu Motors          Japan    Automobiles       13   $13,731      $-38

(A period denotes a missing value.)

We will use this data to predict profits (in millions of dollars) from a company’s sales (in millions of dollars) and the number of employees of the company (in thousands).

1. Create a scatterplot matrix illustrating the relationships among the variables:
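
One way to draw this scatterplot matrix in SAS code, rather than through the SAS/Insight interface, is PROC SGSCATTER. This is a minimal sketch, assuming a SAS release that includes the ODS Statistical Graphics procedures. The DIAGONAL= option is an optional extra, not something the original example is known to have used.

```sas
/* Scatterplot matrix of the response and both explanatory variables. */
/* Assumption: the ODS Statistical Graphics procedures are available. */
proc sgscatter data=sasuser.business ;
  matrix profits employs sales / diagonal=(histogram) ;
run ;
```

Each off-diagonal panel plots one pair of variables, so the PROFITS row shows the two marginal relationships that the simple regressions describe.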

2. Create Partial Regression Plots for each of the variables.

DEFINITION: A partial regression plot (sometimes called an added-variable plot) displays the relationship between the response variable, y, and an explanatory variable, x_i, after removing the effect of the other explanatory variables.
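
To make the definition concrete, the plot for EMPLOYS can be sketched by hand: regress PROFITS on SALES, regress EMPLOYS on SALES, and plot the first set of residuals against the second. The data set names `complete`, `step1`, and `step2` and the residual variable names are illustrative assumptions, and PROC SGPLOT is assumed to be available.

```sas
/* Keep only observations with all three variables present, so that */
/* both auxiliary regressions use the same rows as the full model.  */
data complete ;
  set sasuser.business ;
  if nmiss(profits, employs, sales) = 0 ;
run ;

/* Residuals of PROFITS after removing the linear effect of SALES. */
proc reg data=complete noprint ;
  model profits = sales ;
  output out=step1 r=profits_resid ;
run ;
quit ;

/* Residuals of EMPLOYS after removing the linear effect of SALES. */
proc reg data=step1 noprint ;
  model employs = sales ;
  output out=step2 r=employs_resid ;
run ;
quit ;

/* Partial regression plot for EMPLOYS, with a fitted line. */
proc sgplot data=step2 ;
  scatter x=employs_resid y=profits_resid ;
  reg     x=employs_resid y=profits_resid ;
run ;
```

The slope of the fitted line in this plot equals the coefficient of EMPLOYS in the multiple regression of PROFITS on EMPLOYS and SALES, which is why the plot is read as the effect of EMPLOYS after adjusting for SALES.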

You can create partial regression plots in SAS automatically by using the REG procedure with the PARTIAL option in the MODEL statement. The code and output for this data set are given below:

proc reg data=sasuser.business corr ;          /* CORR prints the correlation matrix */

model profits = employs sales / partial ;      /* PARTIAL requests the partial regression plots */

run ;

[Figure: Partial Regression Residual Plot from The REG Procedure (Model: MODEL1). Vertical axis: PROFITS (Profits in $ Millions). Horizontal axis: EMPLOYS (Employees in Thousands).]

[Figure: Partial Regression Residual Plot from The REG Procedure (Model: MODEL1), SAS output page dated Sunday, February 9, 2003. Vertical axis: PROFITS (Profits in $ Millions). Horizontal axis: SALES (Sales in $ Millions).]

Look at the two simple regression models:

Regression of PROFITS on SALES:

Regression of PROFITS on EMPLOYS:

Note: The output shown above is from SAS/Insight. You could also fit these two regression models using the following SAS code:

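proc reg data=sasuser.business ;

model profits = sales ;          /* simple regression of PROFITS on SALES */

plot profits*sales ;             /* response plotted against the predictor */

run ;

proc reg data=sasuser.business ;

model profits = employs ;        /* simple regression of PROFITS on EMPLOYS */

plot profits*employs ;

run ;
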
Now look at the correlation matrix (produced by the CORR option) and the output from regressing PROFITS on EMPLOYS and SALES.

The SAS System

Correlation

CORR        EMPLOYS     SALES   PROFITS
EMPLOYS      1.0000    0.7298    0.3619
SALES        0.7298    1.0000    0.5969
PROFITS      0.3619    0.5969    1.0000

Model: MODEL1
Dependent Variable: PROFITS   Profits in $ Millions

Analysis of Variance

                      Sum of           Mean
Source      DF       Squares         Square    F Value    Prob>F
Model        2   42711485.23   21355742.615     35.503    0.0001
Error      122   73385789.57   601522.86532
C Total    124   116097274.8

Root MSE    775.57905    R-square    0.3679
Dep Mean    442.16000    Adj R-sq    0.3575
C.V.        175.40688

Parameter Estimates

                  Parameter      Standard    T for H0:
Variable   DF      Estimate         Error  Parameter=0   Prob > |T|
INTERCEP    1    -13.325360   90.80843729       -0.147       0.8836
EMPLOYS     1     -1.553555    1.03671710       -1.499       0.1366
SALES       1      0.030346    0.00448738        6.762       0.0001

3. Based on the computer printout, we can write the prediction equation as follows:

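Ŷ = -13.3254 - 1.5536 X₁ + 0.0303 X₂

where Ŷ is the predicted profit (in $ millions), X₁ is EMPLOYS (number of employees in thousands), and X₂ is SALES (in $ millions).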
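
As an illustration of how the prediction equation is used, the fitted value for the first company in the listing (Lucas Industries, EMPLOYS = 46, SALES = 3,864) can be worked out by hand. The choice of this company is only for illustration; the coefficients are taken at the full precision shown in the Parameter Estimates table.

```latex
\[
\begin{aligned}
\hat{Y} &= -13.325360 - 1.553555\,(46) + 0.030346\,(3864) \\
        &= -13.325360 - 71.463530 + 117.256944 \\
        &\approx 32.47
\end{aligned}
\]
```

The model predicts a profit of about $32.47 million for this company. The observed profit is $39 million, so the residual is 39 - 32.47 = 6.53 ($ millions).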

4. Inference for the multiple regression model:

Overall test for the model – basically asking “Is this model worthwhile?”

Hypothesis test:

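H_0: β₁ = β₂ = … = β_k = 0

H_a: At least one β_i is not equal to 0

where β_i is the population partial regression coefficient for x_i and k is the number of explanatory variables.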

Test Statistic:

F = MSR / MSE

If F is large, the variation explained by the model (MSR) is large relative to the unexplained error variation (MSE).

A large F (small p-value) is evidence against H_0.

Rule: If the p-value (Prob > F) is small, then we reject the null hypothesis and conclude that at least one of the explanatory variables has some effect on the response variable.

For this example, the value of F is 35.503 and the p-value is 0.0001. Therefore, we reject the null hypothesis and conclude that either the amount of sales or the number of employees (or both) has some effect on profit.
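
The F statistic can be rebuilt from the sums of squares in the Analysis of Variance table. The DATA step below is a minimal sketch that hard-codes the printed values rather than reading them from PROC REG. The variable names are illustrative.

```sas
/* Rebuild F = MSR / MSE from the printed ANOVA table. */
data _null_ ;
  k   = 2 ;                      /* number of explanatory variables    */
  dfe = 122 ;                    /* error degrees of freedom, n - k - 1 */
  msr = 42711485.23 / k ;        /* mean square for the model          */
  mse = 73385789.57 / dfe ;      /* mean square error                  */
  f   = msr / mse ;
  p   = sdf('F', f, k, dfe) ;    /* upper-tail probability P(F > f)    */
  put f= 8.3 p= best12. ;
run ;
```

The log should show F agreeing with the printed 35.503. The computed p-value is far smaller than 0.0001, which appears to be the smallest value this output format displays.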

Can we be more specific and determine which individual variables are significant?

Tests of Independence for Individual Explanatory Variables

Does x_i have a significant effect on the response variable in the presence of the other variables?

Hypothesis test:

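H_0: β_i = 0

H_a: β_i ≠ 0

where β_i is the population partial regression coefficient for x_i, estimated by b_i.

Test Statistic:

t = b_i / s_bi, with n – k – 1 degrees of freedom, where s_bi is the standard error of b_i

Example from SAS output:

For SALES, t = 0.030346 / 0.00448738 = 6.762 with Prob > |T| = 0.0001. For EMPLOYS, t = -1.553555 / 1.03671710 = -1.499 with Prob > |T| = 0.1366.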

Rule: If the p-value (Prob>|T|) is small, then we reject the null hypothesis and conclude that there is a significant relationship between x_i and y after controlling for the other explanatory variables in the model.
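
The t statistics and two-sided p-values in the Parameter Estimates table can be checked from the estimates and standard errors. This is a sketch that hard-codes the printed values. The variable names are illustrative.

```sas
/* t = estimate / standard error, on n - k - 1 = 122 degrees of freedom */
data _null_ ;
  df        = 122 ;
  t_employs = -1.553555 / 1.03671710 ;
  t_sales   =  0.030346 / 0.00448738 ;
  /* two-sided p-value: twice the upper-tail area beyond |t| */
  p_employs = 2 * sdf('T', abs(t_employs), df) ;
  p_sales   = 2 * sdf('T', abs(t_sales), df) ;
  put t_employs= 8.3 p_employs= 8.4 ;
  put t_sales= 8.3 p_sales= best12. ;
run ;
```

The values should agree with the printout up to rounding. Applying the rule at the 0.05 level, SALES is significant after controlling for EMPLOYS, while EMPLOYS is not significant after controlling for SALES.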

Constructing confidence intervals for individual partial regression coefficients:

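The interval is b_i ± t_α/2,n-k-1 × s_bi, where s_bi is the standard error of b_i.

Example from SAS output:

To construct a 95% confidence interval for the partial regression parameter for sales:

t_0.025,122 ≈ 1.98

b_i = 0.0303

s_bi = 0.0045

CI: 0.0303 ± 1.98 × 0.0045 = 0.0303 ± 0.0089

So, after controlling for the number of employees, we are 95% confident that the true value of the partial regression parameter for sales is between 0.021 and 0.039. In other words, among companies with the same number of employees, each additional $1 million in sales is associated with an increase of roughly $0.021 million to $0.039 million in expected profit.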
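
SAS can report these limits directly. The sketch below adds the CLB option to the MODEL statement and then repeats the hand calculation with the exact t quantile. The second step hard-codes values from the printout, and its variable names are illustrative.

```sas
/* CLB prints confidence limits for each parameter estimate; */
/* ALPHA=0.05 requests 95% limits.                            */
proc reg data=sasuser.business ;
  model profits = employs sales / clb alpha=0.05 ;
run ;
quit ;

/* The same interval for SALES, computed from the printed values. */
data _null_ ;
  t_crit = tinv(0.975, 122) ;        /* upper 2.5% point of t with 122 df */
  half   = t_crit * 0.00448738 ;     /* margin of error                   */
  lower  = 0.030346 - half ;
  upper  = 0.030346 + half ;
  put t_crit= 8.4 lower= 8.4 upper= 8.4 ;
run ;
```

Because the degrees of freedom are large, the critical value is only slightly above the standard normal value of 1.96.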

Coefficient of Multiple Determination: R²

Definition: R² is the proportion of the total variation in y explained by the simultaneous predictive power of all of the explanatory variables through the multiple regression model.

R² = SSR / SST = (SST – SSE) / SST = 1 – SSE / SST

Properties of R²:

· 0 ≤ R² ≤ 1

· R² represents the proportional reduction of the total variation in y associated with the use of the set of predictor variables

· R² never decreases when another variable is added to the model, even if that variable has no real relationship with y

Example:

In our example, R² = 0.3679, so 36.79% of the variability in profits is explained by the regression on sales and the number of employees.
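
The value can be verified from the sums of squares in the Analysis of Variance table, where the Model row gives SSR, the Error row gives SSE, and the C Total row gives SST:

```latex
\[
R^2 = \frac{SSR}{SST} = \frac{42711485.23}{116097274.8} \approx 0.3679,
\qquad
1 - \frac{SSE}{SST} = 1 - \frac{73385789.57}{116097274.8} \approx 1 - 0.6321 = 0.3679
\]
```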

Adjusted R²: SAS also includes an adjusted R² value as part of its output. This statistic also looks at the proportion of variation explained by the model. However, it then adjusts that value based on the number of variables included in the model.

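Adj. R-square = 1 – [SSE/(n – k – 1)] / [SST/(n – 1)]

where n is the number of observations and k is the number of explanatory variables.

Use the adjusted R² value to compare multiple regression models that contain different numbers of explanatory variables.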
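
For this example, the Analysis of Variance table gives n – k – 1 = 122 and n – 1 = 124, so the adjusted value reported by SAS can be reproduced as follows:

```latex
\[
R^2_{\text{adj}}
  = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}
  = 1 - \frac{73385789.57/122}{116097274.8/124}
  = 1 - \frac{601522.865}{936268.345}
  \approx 0.3575
\]
```

The adjusted value (0.3575) is slightly below R² (0.3679) because the error sum of squares is divided by fewer degrees of freedom than the total sum of squares.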