The following series of posts appeared in the newsgroups sci.stat.consult
and sci.stat.edu in May 2000.
---------------------------------------------------------------------------
From: Mike
Date: 2000/05/11
Subject: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

I would like to obtain a prediction equation using linear regression for
some data that I have collected. I have read in some stats books that
linear regression has four assumptions, two of them being that 1) the data
are normally distributed and 2) the variance is constant. In SAS, I have
run univariate analyses testing for normality on both my dependent and
independent variables (n=147). Both variables have distributions that are
skewed.

For the dependent variable: skewness = 0.69 and kurtosis = 0.25.
For the independent variable: skewness = 0.52 and kurtosis = -0.47.

The normality test (Shapiro-Wilk statistic) states that both the dependent
and independent variables are not normally distributed.

I have also transformed the data (both dependent and independent
variables) using log, arcsine, and square root transformations. When I run
the normality tests on the transformed data, the tests show that even the
transformed data are not normally distributed.

I realize that I can use nonparametric tests for correlation (I will use
Spearman), but is there a nonparametric linear regression? If not, is it
acceptable to use linear regression analysis on data that are not normally
distributed as a way to show there is a linear relationship?

Thanks in advance... Mike
---------------------------------------------------------------------------
From: Jon Cryer
Date: 2000/05/11
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

Mike:

It's really the error terms in the regression model that are required to
have normal distributions with constant variance. We check this by looking
at the properties of the residuals from the regression. You shouldn't
expect the response (dependent) variable to have a normal distribution
with a fixed mean; if it did, you wouldn't be doing regression.

By the way, you have a fine Statistics Department at VPI. I am sure they
do excellent consulting.

Jon Cryer
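To make Jon's suggestion concrete, here is a minimal sketch in Python with
statsmodels of testing the residuals rather than the raw variables. The
simulated x and y are placeholders, not Mike's data or his SAS workflow:

    # Fit the regression first; then test the residuals, not x or y.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 147)                  # predictor need not be normal
    y = 2.0 + 0.5 * x + rng.normal(0, 1, 147)    # placeholder linear model

    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Shapiro-Wilk on the residuals addresses the actual model assumption.
    w, p = stats.shapiro(fit.resid)
    print(f"Shapiro-Wilk on residuals: W = {w:.3f}, p = {p:.3f}")

    # Also plot fit.resid against fit.fittedvalues: a funnel shape would
    # suggest non-constant variance, which matters more than non-normality.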
---------------------------------------------------------------------------
From: Alan Miller
Date: 2000/05/12
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

The importance of normality is grossly over-emphasized. If you have
grossly long tails - in the residuals - or if there are outliers, then you
may have to look at more robust forms of regression than least squares.
Auto-correlation amongst consecutive observations, or lack of homogeneity
of variance, is often more important than non-normality.

There is absolutely no requirement that the predictors (or independent
variables) should have a normal distribution; in fact, the opposite.
Ideally, the predictors should be from a designed experiment and hence
will not even be random. Most of them should be towards the outer
boundaries of the predictor space. I guess that most designed experiments
would give negative kurtoses if you go through the mechanics of
calculating coefficients of kurtosis. If the predictors are from a
multivariate normal distribution, then there is far too much clustering in
the centre of the design space. Quite often, some of the predictors are
discrete - perhaps (0,1) variables - and hence cannot have normal
distributions.

--
Alan Miller, Retired Scientist (Statistician)
CSIRO Mathematical & Information Sciences
Alan.Miller -at- vic.cmis.csiro.au
http://www.ozemail.com.au/~milleraj
http://users.bigpond.net.au/amiller/
---------------------------------------------------------------------------
From: Gary McClelland
Date: 2000/05/12
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

In reply to Mike's question, Alan makes the important point:

> There is absolutely no requirement that the predictors (or independent
> variables) should have a normal distribution; in fact, the opposite.
> Ideally, the predictors should be from a designed experiment and hence
> will not even be random. Most of them should be towards the outer
> boundaries of the predictor space. I guess that most designed
> experiments would give negative kurtoses if you go through the mechanics
> of calculating coefficients of kurtosis. If the predictors are from a
> multivariate normal distribution, then there is far too much clustering
> in the centre of the design space. Quite often, some of the predictors
> are discrete - perhaps (0,1) variables - and hence cannot have normal
> distributions.

Few researchers using multiple regression realize that normal
distributions for their predictors are, in terms of statistical power,
about the worst-case scenario. And the power problems are exacerbated when
testing interactions and polynomial terms. In McClelland & Judd (1993,
Psychological Bulletin, 114:376-390) we show that to achieve equivalent
statistical power for detecting the linear-by-linear interaction, a study
with normally distributed predictors requires over 16 times as many
observations as a designed experiment. And contrary to most researchers'
expectations, skewness helps, and even correlation between predictors
helps, because it avoids the "too much clustering in the centre of the
design space" that Alan refers to.

gary
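The size of the effect Gary describes can be seen in a rough Python
simulation. This is only a sketch with an invented sample size and effect
size; the 16:1 figure comes from the McClelland & Judd analysis, not from
this code. Both predictors are kept on the same [-1, 1] range, and only
how they are distributed over that range changes:

    # Power for detecting an x1*x2 interaction: bell-shaped predictors vs.
    # a two-level endpoint design, holding the predictor range fixed.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n, beta_int, reps = 100, 0.4, 1000

    def power(draw_x):
        hits = 0
        for _ in range(reps):
            x1, x2 = draw_x(), draw_x()
            y = beta_int * x1 * x2 + rng.normal(0, 1, n)
            X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
            hits += sm.OLS(y, X).fit().pvalues[3] < 0.05  # interaction term
        return hits / reps

    def bell():     return np.clip(rng.normal(0, 1/3, n), -1, 1)  # mid-range
    def endpoint(): return rng.choice([-1.0, 1.0], n)             # extremes

    print("power, bell-shaped predictors:", power(bell))
    print("power, endpoint design:       ", power(endpoint))

Under these settings the standard error of the interaction coefficient is
roughly nine times smaller for the endpoint design, so its simulated power
should come out near 1 while the bell-shaped version stays barely above
the 5% significance level.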
---------------------------------------------------------------------------
From: Dan Bonnick
Date: 2000/05/12
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

Hi Mike,

For the most popular form of linear regression, ordinary least squares
(OLS), you also need your X variable (i.e. the independent variable) to
have a relatively small error. Your initial work suggests large-ish error
in both variables, with a non-normal error structure. This makes things a
little more complicated. Do you suspect that the errors are correlated
(i.e. if the error in X is +d, is the error in Y usually +f)? Are they of
similar size? Are they changing with the size of X?

A formal way forward may be the use of structural equations (SAS PROC
MIXED may help). If you just need something quite reliable, then there are
regression methods that allow for error in both variables (Deming,
Passing-Bablok) which give you quite robust estimates. For a hint, try
forming the set of all slopes from joining any two points in your data
set, then look at the median of the set of slopes (and then the 95th and
5th percentile slope estimates). Then try the same for intercepts. This
should give you a workable figure.

HTH
Dan
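Dan's recipe is essentially the Theil-Sen estimator. A short Python sketch
follows; x and y are placeholders for Mike's variables, and
scipy.stats.theilslopes implements the same idea with a proper confidence
interval:

    # Median-of-pairwise-slopes regression, as Dan describes it.
    import numpy as np
    from itertools import combinations

    def pairwise_slope_fit(x, y):
        """Slope = median over all point pairs; crude 5th/95th pct band."""
        slopes = np.array([(y[j] - y[i]) / (x[j] - x[i])
                           for i, j in combinations(range(len(x)), 2)
                           if x[i] != x[j]])
        slope = np.median(slopes)
        intercept = np.median(y - slope * x)   # median-residual intercept
        lo, hi = np.percentile(slopes, [5, 95])
        return slope, intercept, (lo, hi)

With n = 147 this forms about 10,700 pairwise slopes, which is entirely
manageable.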
---------------------------------------------------------------------------
From: Herman Rubin
Date: 2000/05/12
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

In article <8ffek1$1q2$1@solaris.cc.vt.edu>, Mike wrote:

> I would like to obtain a prediction equation using linear regression for
> some data that I have collected. I have read in some stats books that
> linear regression has four assumptions, two of them being that 1) the
> data are normally distributed and 2) the variance is constant. In SAS, I
> have run univariate analyses testing for normality on both my dependent
> and independent variables (n=147). Both variables have distributions
> that are skewed.

There is no reason to assume that the data are normal. For linear
regression to be exactly the MLE procedure, it is the residuals from the
true regression which need to have certain properties. In well-designed
experiments, the independent variables are never normal. Rarely will the
dependent variables be close to normal, either.

The key properties for the residuals are lack of correlation with the
independent variables, independence, and homoscedasticity. Normality is
well down the list. Remember that this is for the residuals, not the data.
Linearity of the model is a consequence of these.

> For the dependent variable: skewness = 0.69 and kurtosis = 0.25.
> For the independent variable: skewness = 0.52 and kurtosis = -0.47.
>
> The normality test (Shapiro-Wilk statistic) states that both the
> dependent and independent variables are not normally distributed.
>
> I have also transformed the data (both dependent and independent
> variables) using log, arcsine, and square root transformations. When I
> run the normality tests on the transformed data, the tests show that
> even the transformed data are not normally distributed.

If you have a linear model, transforming it will generally make it
non-linear. Linearity in the relationship remains the most important
property; normality is one of the least.

> I realize that I can use nonparametric tests for correlation (I will use
> Spearman), but is there a nonparametric linear regression? If not, is it
> acceptable to use linear regression analysis on data that are not
> normally distributed as a way to show there is a linear relationship?

Consider your probability model: what can you assume is a linear function
of what, with additive errors? YOU, the user, must answer that.

--
This address is for information only. I do not claim that these views are
those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907-1399
hrubin@stat.purdue.edu   Phone: (765) 494-6054   FAX: (765) 494-0558
---------------------------------------------------------------------------
From: Dale Glaser
Date: 2000/05/12
Subject: RE: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

Mike........... regression assumptions are more concerned with the
distributional characteristics of the errors than with the actual raw
scores: if the residuals are normally distributed, the variation of the
errors is constant across the x axis (i.e., homoscedasticity), etc., then
non-normality of the criterion variable is not problematic. However, there
may be occasions when violations of the error assumptions ensue and
transformation of the criterion variable serves to coax normality.

...........dale glaser
---------------------------------------------------------------------------
From: Steve Gregorich
Date: 2000/05/12
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

Mike,

As a demonstration to myself, I once fit OLS regression models to data
with (1) a non-uniformly distributed binary outcome and (2) a continuous
outcome with a U-shaped distribution. I then used the same models to
estimate the parameter standard errors using a naive bootstrap. The
distribution of the bootstrap parameter estimates in both cases was normal
(judging from the Q-Q plots and standard tests of normality). Normality of
parameter estimates isn't everything - so I am not suggesting that you use
OLS regression indiscriminately. But some people apparently believe there
is no way for parameter estimates to be normally distributed when the data
are not. That simply is not the case.

BTW, do you really have a book stating that the data need to be normally
distributed in order to satisfy the assumptions of OLS regression? I
wouldn't be happy with that book.
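A reconstruction of that demonstration in Python - not Steve's original
code; the binary outcome is simulated and all names are placeholders:

    # Naive (case-resampling) bootstrap of an OLS slope, binary outcome.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 147
    x = rng.uniform(0, 1, n)
    y = (rng.uniform(0, 1, n) < 0.2 + 0.6 * x).astype(float)  # non-normal y

    boot = []
    for _ in range(2000):
        idx = rng.integers(0, n, n)        # resample cases with replacement
        fit = sm.OLS(y[idx], sm.add_constant(x[idx])).fit()
        boot.append(fit.params[1])

    # As the standard asymptotic theory in Herman Rubin's reply below
    # predicts, the slope estimates look normal even though y is binary.
    print("Shapiro-Wilk on bootstrap slopes:", stats.shapiro(boot))
    print("bootstrap SE of slope:", np.std(boot, ddof=1))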
--
----------------------------------------------
Steve Gregorich
University of California, San Francisco
Center for AIDS Prevention Studies
Medical Effectiveness Research Center
Center for Aging in Diverse Communities
74 New Montgomery Street
San Francisco CA 94105
gregorich@psg.ucsf.edu
http://sites.netscape.net/gregorich/index.html
----------------------------------------------
---------------------------------------------------------------------------
From: Herman Rubin
Date: 2000/05/13
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

In article <8fhfuf$rce@itssrv1.ucsf.edu>, Steve Gregorich wrote:

> Mike,
> As a demonstration to myself, I once fit OLS regression models to data
> with (1) a non-uniformly distributed binary outcome and (2) a continuous
> outcome with a U-shaped distribution. I then used the same models to
> estimate the parameter standard errors using a naive bootstrap. The
> distribution of the bootstrap parameter estimates in both cases was
> normal (judging from the Q-Q plots and standard tests of normality).
> Normality of parameter estimates isn't everything - so I am not
> suggesting that you use OLS regression indiscriminately. But some people
> apparently believe there is no way for parameter estimates to be
> normally distributed when the data are not. That simply is not the case.

It is standard asymptotic theory that the regression coefficients are
asymptotically normal with the calculated variance (covariance matrix in
multiple regression) if the true residuals (error terms) are merely
uncorrelated with the predictor variables, and the predictor variables
have reasonable variances. One can even get by with slightly less. One
does lose the precision of the p-values without normality, but this is
very definitely unimportant, and the bootstrap does a first-order
correction.

These are the real robustness considerations, not the cute ones offered by
the peddlers of methods which rely on much stronger assumptions. Lack of
correlation between residuals and predictors is much less of an assumption
than symmetry, etc. It is also preserved under such operations as
aggregation of dependent variables.

> BTW, do you really have a book stating that the data need to be normally
> distributed in order to satisfy the assumptions of OLS regression? I
> wouldn't be happy with that book.

I will make a much stronger statement: those who assume that normality
SHOULD hold in a model of "real" data rarely understand probability, which
is at the foundation. It may be true that normality is approximately
correct, but adjusting the data to try to make it exact is still likely to
mess up those relations which do hold. Least squares was heavily used in
the 19th century in physics, astronomy, and surveying, even non-linear
least squares. The data were not adjusted to normality, as this would have
destroyed the model.

--
This address is for information only. I do not claim that these views are
those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907-1399
hrubin@stat.purdue.edu   Phone: (765) 494-6054   FAX: (765) 494-0558
===========================================================================