# Analysis of Regression using S-plus / R with Diagnostics

Institute of Statistical Research and Training

### Contents
- Simple Linear Regression
- Weighted Least Squares
- Identifying Outliers
- Transformation
- Reference

#####################################################################
################ Simple Linear Regression in R 1.3.1 ################

> x <- c(1.15, 1.9, 3, 3, 3, 3, 3, 5.34, 5.38, 5.4, 5.4, 5.45, 7.7, 7.8, 7.81, 7.85, 7.87, 7.91, 7.94, 9.03, 9.07, 9.11, 9.14, 9.16, 9.37, 10.17, 10.18, 10.22, 10.22, 10.22, 10.18, 10.5, 10.23, 10.03, 10.23)

> y <- c(0.99, 0.98, 2.6, 2.67, 2.66, 2.78, 2.8, 5.92, 5.35, 4.33, 4.89, 5.21, 7.68, 9.81, 6.52, 9.71, 9.82, 9.81, 8.5, 9.47, 11.45, 12.14, 11.5, 10.65, 10.64, 9.78, 12.39, 11.03, 8, 11.9, 8.68, 7.25, 13.46, 10.19, 9.93)

> w <- c(1.24028, 2.18224, 7.8493, 7.8493, 7.8493, 7.8493, 7.8493, 7.43652, 6.99309, 6.78574, 6.78574, 6.30514, 0.89204, 0.8442, 0.83963, 0.82171, 0.81296, 0.79588, 0.78342, 0.47385, 0.46621, 0.45878, 0.45327, 0.44968, 0.41435, 0.31182, 0.31079, 0.30672, 0.30672, 0.30672, 0.31079, 0.20833, 0.30571, 0.3268, 0.30571)

> g.lm <- lm(y~x)

> plot(x,y,pch="*", main="Scatter Plot of \nIndependent Variable and Dependent Variable", sub="DATA:Applied Regression Analysis:2nd ed -N.Draper,H. Smith; Pp 113", xlab="Independent variable", ylab="Dependent variable", xlim=c(0,12), ylim=c(0,15), type="p", axes =T, col=1)

> plot(fitted(g.lm), resid(g.lm),pch="*", main="Residual Plot of \nResidual against Fitted value", sub="DATA:Applied Regression Analysis:2nd ed -N.Draper, H.Smith; Pp 113", xlab="Fitted value", ylab="Residuals", xlim=c(0,12), ylim=c(-5,3), type="p", axes =T, col=1)

> plot(x, resid(g.lm),pch="*", main="Residual Plot of \nResidual against Independent Variable", sub="DATA:Applied Regression Analysis:2nd ed -N.Draper, H.Smith; Pp 113", xlab="Independent Variable", ylab="Residuals", xlim=c(0,12), ylim=c(-5,3), type="p", axes =T, col=1)

> # The residual plot against the independent variable x shows that the error variance tends to increase with x: the residuals lie in a band that diverges as one moves along the x axis. In general, if the band within which the residuals lie diverges (widens) as x increases, the error variance is increasing with x; if the band converges (narrows), the error variance is decreasing with x. The plot just produced {> plot(x, resid(g.lm))} therefore points to heteroscedastic errors, since the spread of the residuals grows with x.
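As a rough numerical cross-check of this visual diagnosis (a sketch in plain Python, not part of the original S-plus/R session), one can fit the same line with the closed-form least squares formulas and compare the residual spread in the lower and upper halves of the x range:

```python
# Cross-check (plain Python) of the OLS fit and the fanning residual pattern.
# Data: the x and y vectors from the transcript above.
x = [1.15, 1.9, 3, 3, 3, 3, 3, 5.34, 5.38, 5.4, 5.4, 5.45, 7.7, 7.8, 7.81,
     7.85, 7.87, 7.91, 7.94, 9.03, 9.07, 9.11, 9.14, 9.16, 9.37, 10.17, 10.18,
     10.22, 10.22, 10.22, 10.18, 10.5, 10.23, 10.03, 10.23]
y = [0.99, 0.98, 2.6, 2.67, 2.66, 2.78, 2.8, 5.92, 5.35, 4.33, 4.89, 5.21,
     7.68, 9.81, 6.52, 9.71, 9.82, 9.81, 8.5, 9.47, 11.45, 12.14, 11.5, 10.65,
     10.64, 9.78, 12.39, 11.03, 8, 11.9, 8.68, 7.25, 13.46, 10.19, 9.93]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
slope = sxy / sxx                # ~1.1354, as in summary(g.lm)
intercept = ybar - slope * xbar  # ~-0.5789
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Residual spread in the lower vs upper half of the x range:
lo = [e for xi, e in zip(x, resid) if xi <= 7]
hi = [e for xi, e in zip(x, resid) if xi > 7]
sd = lambda v: (sum(e * e for e in v) / len(v)) ** 0.5
print(sd(lo), sd(hi))
```

The upper-half spread comes out several times larger than the lower-half spread, which is exactly the diverging band seen in the plot.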

> g <- cbind(x, y, w)  # first assemble the data matrix

> g <- cbind(g, fitted(g.lm))

> g <- cbind(g, resid(g.lm))

> i <- c("* 01", "* 02", "* 03", "* 04", "* 05", "* 06", "* 07", "* 08", "* 09", "* 10", "* 11", "* 12", "* 13", "* 14", "* 15", "* 16", "* 17", "* 18", "* 19", "* 20", "* 21", "* 22", "* 23", "* 24", "* 25", "* 26", "* 27", "* 28", "* 29", "* 30", "* 31", "* 32", "* 33", "* 34", "* 35")

> dimnames(g)<-list(i,c("Indep.X","Dep.Y","Weight.W","Pred.Yhat","Res.e"))

> g
Indep.X Dep.Y Weight.W Pred.Yhat Res.e
* 01 1.15 0.99 1.24028 0.7267608 0.263239162
* 02 1.90 0.98 2.18224 1.5783137 -0.598313733
* 03 3.00 2.60 7.84930 2.8272580 -0.227257980
* 04 3.00 2.67 7.84930 2.8272580 -0.157257980
* 05 3.00 2.66 7.84930 2.8272580 -0.167257980
* 06 3.00 2.78 7.84930 2.8272580 -0.047257980
* 07 3.00 2.80 7.84930 2.8272580 -0.027257980
* 08 5.34 5.92 7.43652 5.4841030 0.435896985
* 09 5.38 5.35 6.99309 5.5295192 -0.179519169
* 10 5.40 4.33 6.78574 5.5522272 -1.222227246
* 11 5.40 4.89 6.78574 5.5522272 -0.662227246
* 12 5.45 5.21 6.30514 5.6089974 -0.398997439
* 13 7.70 7.68 0.89204 8.1636561 -0.483656126
* 14 7.80 9.81 0.84420 8.2771965 1.532803488
* 15 7.81 6.52 0.83963 8.2885506 -1.768550551
* 16 7.85 9.71 0.82171 8.3339667 1.376033295
* 17 7.87 9.82 0.81296 8.3566748 1.463325218
* 18 7.91 9.81 0.79588 8.4020909 1.407909063
* 19 7.94 8.50 0.78342 8.4361531 0.063846947
* 20 9.03 9.47 0.47385 9.6737433 -0.203743261
* 21 9.07 11.45 0.46621 9.7191594 1.730840585
* 22 9.11 12.14 0.45878 9.7645756 2.375424430
* 23 9.14 11.50 0.45327 9.7986377 1.701362314
* 24 9.16 10.65 0.44968 9.8213458 0.828654237
* 25 9.37 10.64 0.41435 10.0597806 0.580219426
* 26 10.17 9.78 0.31182 10.9681037 -1.188103662
* 27 10.18 12.39 0.31079 10.9794577 1.410542299
* 28 10.22 11.03 0.30672 11.0248739 0.005126145
* 29 10.22 8.00 0.30672 11.0248739 -3.024873855
* 30 10.22 11.90 0.30672 11.0248739 0.875126145
* 31 10.18 8.68 0.31079 10.9794577 -2.299457701
* 32 10.50 7.25 0.20833 11.3427869 -4.092786936
* 33 10.23 13.46 0.30571 11.0362279 2.423772106
* 34 10.03 10.19 0.32680 10.8091471 -0.619147122
* 35 10.23 9.93 0.30571 11.0362279 -1.106227894

> qqnorm(resid(g.lm)) # normal Q-Q plot: the residuals should fall near a straight line if the errors are normal

> qqline(resid(g.lm)) # adds the reference line through the quartiles

> summary(g.lm)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-4.09279 -0.60873 -0.04726 1.12558 2.42377

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.57895 0.67919 -0.852 0.4
x 1.13540 0.08622 13.169 1.09e-14 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.457 on 33 degrees of freedom
Multiple R-Squared: 0.8401, Adjusted R-squared: 0.8353
F-statistic: 173.4 on 1 and 33 DF, p-value: 1.088e-14

> anova(g.lm)
Analysis of Variance Table

Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 367.95 367.95 173.42 1.088e-14 ***
Residuals 33 70.02 2.12
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

> sqrt(sum((y - fitted(g.lm))^2)/(length(y) - 2)) # ~1.457, the residual standard error reported by summary(g.lm)

###################################################################
################ Weighted Least Squares in R 1.3.1 ################

> # Since the plots shown above give a clear indication of heteroscedasticity, ordinary least squares is no longer appropriate - it is sensible to turn to generalized least squares techniques. Weighting is one of them. Let us see what we can do by weighting.
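For a straight line, weighted least squares has a simple closed form. As a cross-check outside the R session (a plain-Python sketch, not the original code), the following should reproduce the coefficients that glm(Dep.Y ~ Indep.X, weights = Weight.W) reports below:

```python
# Closed-form weighted least squares for a straight line (plain Python sketch).
# x, y, w are the data and weight vectors from the transcript.
x = [1.15, 1.9, 3, 3, 3, 3, 3, 5.34, 5.38, 5.4, 5.4, 5.45, 7.7, 7.8, 7.81,
     7.85, 7.87, 7.91, 7.94, 9.03, 9.07, 9.11, 9.14, 9.16, 9.37, 10.17, 10.18,
     10.22, 10.22, 10.22, 10.18, 10.5, 10.23, 10.03, 10.23]
y = [0.99, 0.98, 2.6, 2.67, 2.66, 2.78, 2.8, 5.92, 5.35, 4.33, 4.89, 5.21,
     7.68, 9.81, 6.52, 9.71, 9.82, 9.81, 8.5, 9.47, 11.45, 12.14, 11.5, 10.65,
     10.64, 9.78, 12.39, 11.03, 8, 11.9, 8.68, 7.25, 13.46, 10.19, 9.93]
w = [1.24028, 2.18224, 7.8493, 7.8493, 7.8493, 7.8493, 7.8493, 7.43652,
     6.99309, 6.78574, 6.78574, 6.30514, 0.89204, 0.8442, 0.83963, 0.82171,
     0.81296, 0.79588, 0.78342, 0.47385, 0.46621, 0.45878, 0.45327, 0.44968,
     0.41435, 0.31182, 0.31079, 0.30672, 0.30672, 0.30672, 0.31079, 0.20833,
     0.30571, 0.3268, 0.30571]

sw = sum(w)
xw = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
yw = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
slope = sxy / sxx            # ~1.16958, as glm() reports
intercept = yw - slope * xw  # ~-0.90780
print(intercept, slope)
```

This is the same minimization the gaussian glm() with a weights argument performs, just written out explicitly.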

> g.fr <- data.frame(g)

> attach(g.fr)

> g.glm <- glm(Dep.Y~Indep.X, weights = Weight.W)

> summary(g.glm)

Call:
glm(formula = Dep.Y ~ Indep.X, weights = Weight.W)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.8079 -0.5585 0.1655 0.9662 1.6208

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.90780 0.29666 -3.06 0.00437 **
Indep.X 1.16958 0.05876 19.91 < 2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for gaussian family taken to be 1.255957)

Null deviance: 539.072 on 34 degrees of freedom
Residual deviance: 41.447 on 33 degrees of freedom
AIC: 190.00

Number of Fisher Scoring iterations: 2

> g1 <-cbind(g,fitted(g.glm))

> g1 <-cbind(g1,resid(g.glm))

> g1
Indep.X Dep.Y Weight.W Pred.Yhat Res.e Pred.What Res.We
* 01 1.15 0.99 1.24028 0.7267608 0.263239162 0.4372125 0.615627624
* 02 1.90 0.98 2.18224 1.5783137 -0.598313733 1.3143949 -0.493981735
* 03 3.00 2.60 7.84930 2.8272580 -0.227257980 2.6009291 -0.002603012
* 04 3.00 2.67 7.84930 2.8272580 -0.157257980 2.6009291 0.193513204
* 05 3.00 2.66 7.84930 2.8272580 -0.167257980 2.6009291 0.165496602
* 06 3.00 2.78 7.84930 2.8272580 -0.047257980 2.6009291 0.501695828
* 07 3.00 2.80 7.84930 2.8272580 -0.027257980 2.6009291 0.557729033
* 08 5.34 5.92 7.43652 5.4841030 0.435896985 5.3377382 1.587826912
* 09 5.38 5.35 6.99309 5.5295192 -0.179519169 5.3845213 -0.091289636
* 10 5.40 4.33 6.78574 5.5522272 -1.222227246 5.4079128 -2.807903913
* 11 5.40 4.89 6.78574 5.5522272 -0.662227246 5.4079128 -1.349134547
* 12 5.45 5.21 6.30514 5.6089974 -0.398997439 5.4663916 -0.643800380
* 13 7.70 7.68 0.89204 8.1636561 -0.483656126 8.0979389 -0.394734362
* 14 7.80 9.81 0.84420 8.2771965 1.532803488 8.2148965 1.465586762
* 15 7.81 6.52 0.83963 8.2885506 -1.768550551 8.2265923 -1.563773159
* 16 7.85 9.71 0.82171 8.3339667 1.376033295 8.2733754 1.302274683
* 17 7.87 9.82 0.81296 8.3566748 1.463325218 8.2967669 1.373412388
* 18 7.91 9.81 0.79588 8.4020909 1.407909063 8.3435500 1.308250984
* 19 7.94 8.50 0.78342 8.4361531 0.063846947 8.3786372 0.107419406
* 20 9.03 9.47 0.47385 9.6737433 -0.203743261 9.6534757 -0.126298723
* 21 9.07 11.45 0.46621 9.7191594 1.730840585 9.7002587 1.194715845
* 22 9.11 12.14 0.45878 9.7645756 2.375424430 9.7470418 1.620829563
* 23 9.14 11.50 0.45327 9.7986377 1.701362314 9.7821291 1.156562241
* 24 9.16 10.65 0.44968 9.8213458 0.828654237 9.8055206 0.566292523
* 25 9.37 10.64 0.41435 10.0597806 0.580219426 10.0511317 0.379054659
* 26 10.17 9.78 0.31182 10.9681037 -1.188103662 10.9867930 -0.673883384
* 27 10.18 12.39 0.31079 10.9794577 1.410542299 10.9984887 0.775747260
* 28 10.22 11.03 0.30672 11.0248739 0.005126145 11.0452718 -0.008457863
* 29 10.22 8.00 0.30672 11.0248739 -3.024873855 11.0452718 -1.686541785
* 30 10.22 11.90 0.30672 11.0248739 0.875126145 11.0452718 0.473368213
* 31 10.18 8.68 0.31079 10.9794577 -2.299457701 10.9984887 -1.292523669
* 32 10.50 7.25 0.20833 11.3427869 -4.092786936 11.3727532 -1.881755723
* 33 10.23 13.46 0.30571 11.0362279 2.423772106 11.0569675 1.328661831
* 34 10.03 10.19 0.32680 10.8091471 -0.619147122 10.8230522 -0.361893322
* 35 10.23 9.93 0.30571 11.0362279 -1.106227894 11.0569675 -0.623112167
> g1[,"Pred.What"]
[1] 0.4372125 1.3143949 2.6009291 2.6009291 2.6009291 2.6009291
[7] 2.6009291 5.3377382 5.3845213 5.4079128 5.4079128 5.4663916
[13] 8.0979389 8.2148965 8.2265923 8.2733754 8.2967669 8.3435500
[19] 8.3786372 9.6534757 9.7002587 9.7470418 9.7821291 9.8055206
[25] 10.0511317 10.9867930 10.9984887 11.0452718 11.0452718 11.0452718
[31] 10.9984887 11.3727532 11.0569675 10.8230522 11.0569675

> g1[,"Res.We"]
[1] 0.615627624 -0.493981735 -0.002603012 0.193513204 0.165496602
[6] 0.501695828 0.557729033 1.587826912 -0.091289636 -2.807903913
[11] -1.349134547 -0.643800380 -0.394734362 1.465586762 -1.563773159
[16] 1.302274683 1.373412388 1.308250984 0.107419406 -0.126298723
[21] 1.194715845 1.620829563 1.156562241 0.566292523 0.379054659
[26] -0.673883384 0.775747260 -0.008457863 -1.686541785 0.473368213
[31] -1.292523669 -1.881755723 1.328661831 -0.361893322 -0.623112167

> plot(fitted(g.glm), resid(g.glm)) # after weighting, the residuals should show a random scatter

> plot(Indep.X, resid(g.glm)) # likewise against the independent variable

> tss <- sum(w*y^2) # weighted (uncorrected) total sum of squares

> tss
[1] 2330.935

> # (The commands defining ess, the weighted error sum of squares, were lost from this transcript; rsq below equals 1 - ess/tss.)

> ess
[1] 53.20724

> rsq
[1] 0.9771734

> # Since the R-squared value is very high, the employment of weighted least squares appears to be justified.

> sqrt(sum((Dep.Y - fitted(g.glm))^2)/(length(Dep.Y) - 2)) # unweighted residual standard deviation of the weighted fit

> detach(g.fr)

#################################################################
################ Identifying Outliers in R 1.3.1 ################

> library(MASS) # studres() comes from the MASS package

> plot(fitted(g.lm), studres(g.lm)) # studentized residuals against fitted values

> identify(fitted(g.lm), studres(g.lm), row.names(g)) # click on the plot to label suspect points
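What studres() computes can be sketched as follows (a plain-Python sketch of externally studentized residuals for the unweighted simple linear fit, not the original code): scale each residual by an error estimate computed with that observation deleted. Observation 32 (x = 10.5, y = 7.25), which had the largest raw residual (-4.09), stands out.

```python
# Externally studentized residuals for a simple linear fit (plain Python sketch).
# Data: the x and y vectors from the transcript.
x = [1.15, 1.9, 3, 3, 3, 3, 3, 5.34, 5.38, 5.4, 5.4, 5.45, 7.7, 7.8, 7.81,
     7.85, 7.87, 7.91, 7.94, 9.03, 9.07, 9.11, 9.14, 9.16, 9.37, 10.17, 10.18,
     10.22, 10.22, 10.22, 10.18, 10.5, 10.23, 10.03, 10.23]
y = [0.99, 0.98, 2.6, 2.67, 2.66, 2.78, 2.8, 5.92, 5.35, 4.33, 4.89, 5.21,
     7.68, 9.81, 6.52, 9.71, 9.82, 9.81, 8.5, 9.47, 11.45, 12.14, 11.5, 10.65,
     10.64, 9.78, 12.39, 11.03, 8, 11.9, 8.68, 7.25, 13.46, 10.19, 9.93]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
e = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(ei * ei for ei in e)
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverages

# Externally studentized residual: leave-one-out error variance, then scale e_i.
t = [ei / ((((sse - ei * ei / (1 - hi)) / (n - 3)) * (1 - hi)) ** 0.5)
     for ei, hi in zip(e, h)]
worst = max(range(n), key=lambda i: abs(t[i]))
print(worst + 1, t[worst])  # observation 32 has the most extreme value
```

A conventional rule of thumb flags |t| greater than about 2 or 3 for closer inspection.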

###########################################################
################ Transformation in R 1.3.1 ################

> # Our plot of residuals against the independent variable {> plot(x, resid(g.lm))} points to heteroscedastic errors, since the spread of the residuals increases with x. Based on this observation, we hypothesize that the standard deviation of the errors is proportional to x. Therefore we consider the new model:
> # Y/X = [intercept]/X + [slope] + [error term]/X
> # obtained by dividing both sides of the usual linear model by X; under our hypothesis the variance of [error term]/X is constant. If our assumption about the error term holds, to fit the model properly we must work with the transformed variables Y/X and 1/X as dependent and independent variables, respectively. If the fitted model for the transformed data is
> # Fitted(Y/X) = [estimated intercept, appearing as slope]/X + [estimated slope, appearing as intercept]
> # then the fitted model in terms of the original variables is
> # Fitted(Y) = [estimated intercept] + [estimated slope] * X
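The transformed fit can be cross-checked with the ordinary closed-form formulas (a plain-Python sketch, not part of the R session); its intercept and slope should match the summary(new.lm) output shown further below (about 1.130 and -0.580):

```python
# Regress Y/X on 1/X by ordinary least squares (plain Python sketch).
# Data: the x and y vectors from the transcript.
x = [1.15, 1.9, 3, 3, 3, 3, 3, 5.34, 5.38, 5.4, 5.4, 5.45, 7.7, 7.8, 7.81,
     7.85, 7.87, 7.91, 7.94, 9.03, 9.07, 9.11, 9.14, 9.16, 9.37, 10.17, 10.18,
     10.22, 10.22, 10.22, 10.18, 10.5, 10.23, 10.03, 10.23]
y = [0.99, 0.98, 2.6, 2.67, 2.66, 2.78, 2.8, 5.92, 5.35, 4.33, 4.89, 5.21,
     7.68, 9.81, 6.52, 9.71, 9.82, 9.81, 8.5, 9.47, 11.45, 12.14, 11.5, 10.65,
     10.64, 9.78, 12.39, 11.03, 8, 11.9, 8.68, 7.25, 13.46, 10.19, 9.93]

u = [1 / xi for xi in x]               # newx = 1/X
v = [yi / xi for xi, yi in zip(x, y)]  # newy = Y/X
n = len(u)
ubar, vbar = sum(u) / n, sum(v) / n
suu = sum((ui - ubar) ** 2 for ui in u)
suv = sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v))
slope = suv / suu                # ~-0.58001: the original intercept
intercept = vbar - slope * ubar  # ~1.13036: the original slope
print(intercept, slope)
```

Note how the roles swap back: the slope of the transformed fit estimates the original intercept, and vice versa.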

> g.fr <- data.frame(g)

> attach(g.fr)

> g.fr$newy <- g.fr$Dep.Y/g.fr$Indep.X

> g.fr$newx <- 1/g.fr$Indep.X

> new.lm <- lm(g.fr$newy ~ g.fr$newx)

> summary(new.lm)

Call:
lm(formula = g.fr$newy ~ g.fr$newx)

Residuals:
Min 1Q Median 3Q Max
-0.38465 -0.08668 -0.01036 0.16209 0.26591

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.13036 0.04589 24.632 < 2e-16 ***
g.fr$newx -0.58001 0.18998 -3.053 0.00446 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.1729 on 33 degrees of freedom
Multiple R-Squared: 0.2202, Adjusted R-squared: 0.1966
F-statistic: 9.32 on 1 and 33 DF, p-value: 0.004455

> # Therefore, we have
> # Fitted(Y/X) = [-0.580]/X + [1.130]
> # and in terms of the original variables -
> # Fitted(Y) = [-0.580] + [1.130] * X

> fittedy<-(-0.580 + 1.130 * x)

> residuals <- y - fittedy

> # Plot residuals against fittedy and see what happens - I leave this for the reader! Hint: the plot is still not satisfactory; it too shows heteroscedasticity.

> g.fr$newx # newx was created after attach(g.fr), so it must be qualified with g.fr$
[1] 0.86956522 0.52631579 0.33333333 0.33333333 0.33333333 0.33333333
[7] 0.33333333 0.18726592 0.18587361 0.18518519 0.18518519 0.18348624
[13] 0.12987013 0.12820513 0.12804097 0.12738854 0.12706480 0.12642225
[19] 0.12594458 0.11074197 0.11025358 0.10976948 0.10940919 0.10917031
[25] 0.10672359 0.09832842 0.09823183 0.09784736 0.09784736 0.09784736
[31] 0.09823183 0.09523810 0.09775171 0.09970090 0.09775171

> g.fr$newy # likewise, qualify with g.fr$
[1] 0.8608696 0.5157895 0.8666667 0.8900000 0.8866667 0.9266667 0.9333333
[8] 1.1086142 0.9944238 0.8018519 0.9055556 0.9559633 0.9974026 1.2576923
[15] 0.8348271 1.2369427 1.2477764 1.2402023 1.0705290 1.0487265 1.2624035
[22] 1.3326015 1.2582057 1.1626638 1.1355390 0.9616519 1.2170923 1.0792564
[29] 0.7827789 1.1643836 0.8526523 0.6904762 1.3157380 1.0159521 0.9706745

> plot(g.fr$newy ~ g.fr$newx)

> detach(g.fr)

> # For any comment / suggestion / correction - feel free to contact the author (!) of this document at <mailto:wildscop@yahoo.com>

################################################################################
###### REFERENCE ###############################################################
################################################################################
DATA / ANALYSIS SOURCE : "Applied Regression Analysis", 2nd ed., by Norman R. Draper and Harry Smith; Chapter 2, Section 2.11, p. 113; John Wiley & Sons, 1981.
STATISTICAL PROCEDURES : "Regression Analysis by Example", 2nd ed., by Chatterjee and Price; Chapter 2, Section 2.11, p. 49; John Wiley, 1991.
S-plus 4 / R CODE HELP : "Modern Applied Statistics with S-Plus", 2nd ed., by W.N. Venables and B.D. Ripley; Chapter 6, Section 6.3, p. 204; Springer, 1997.