The following series of posts appeared in the newsgroups sci.stat.consult
and sci.stat.edu in May 2000.
---------------------------------------------------------------------------
From: Mike
Date: 2000/05/11
Subject: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

I would like to obtain a prediction equation using linear regression for
some data that I have collected. I have read in some stats books that
linear regression has four assumptions, two of them being that 1) the data
are normally distributed and 2) the variance is constant. In SAS, I have
run univariate analyses testing for normality on both my dependent and
independent variables (n=147). Both variables have distributions that are
skewed.

For the dependent variable: skewness = 0.69 and kurtosis = 0.25.
For the independent variable: skewness = 0.52 and kurtosis = -0.47.

The normality test (Shapiro-Wilk statistic) states that both the dependent
and independent variables are not normally distributed.

I have also transformed the data (both dependent and independent
variables) using log, arcsine, and square root transformations. When I run
the normality tests on the transformed data, the tests show that even the
transformed data are not normally distributed.

I realize that I can use nonparametric tests for correlation (I will use
Spearman), but is there a nonparametric linear regression? If not, is it
acceptable to use linear regression analysis on data that are not normally
distributed as a way to show there is a linear relationship?

Thanks in advance... Mike
---------------------------------------------------------------------------
From: Jon Cryer
Date: 2000/05/11
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

Mike:

It's really the error terms in the regression model that are required to
have normal distributions with constant variance. We check this by looking
at the properties of the residuals from the regression. You shouldn't
expect the response (dependent) variable to have a normal distribution
with a fixed mean; if it did, you wouldn't be doing regression.

By the way, you have a fine Statistics Department at VPI. I am sure they
do excellent consulting.

Jon Cryer
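To make Jon's suggestion concrete, here is a minimal sketch in Python with
statsmodels of testing the residuals rather than the raw variables. The
simulated x and y are placeholders, not Mike's data or his SAS workflow:

    # Fit the regression first; then test the residuals, not x or y.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 147)                  # predictor need not be normal
    y = 2.0 + 0.5 * x + rng.normal(0, 1, 147)    # placeholder linear model

    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Shapiro-Wilk on the residuals addresses the actual model assumption.
    w, p = stats.shapiro(fit.resid)
    print(f"Shapiro-Wilk on residuals: W = {w:.3f}, p = {p:.3f}")

    # Also plot fit.resid against fit.fittedvalues: a funnel shape would
    # suggest non-constant variance, which matters more than non-normality.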
---------------------------------------------------------------------------
From: Alan Miller
Date: 2000/05/12
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

The importance of normality is grossly over-emphasized. If you have
grossly long tails - in the residuals - or if there are outliers, then you
may have to look at more robust forms of regression than least squares.
Auto-correlation amongst consecutive observations, or lack of homogeneity
of variance, is often more important than non-normality.

There is absolutely no requirement that the predictors (or independent
variables) should have a normal distribution; in fact, the opposite.
Ideally, the predictors should be from a designed experiment and hence
will not even be random. Most of them should be towards the outer
boundaries of the predictor space. I guess that most designed experiments
would give negative kurtoses if you go through the mechanics of
calculating coefficients of kurtosis. If the predictors are from a
multivariate normal distribution, then there is far too much clustering in
the centre of the design space. Quite often, some of the predictors are
discrete - perhaps (0,1) variables - and hence cannot have normal
distributions.

--
Alan Miller, Retired Scientist (Statistician)
CSIRO Mathematical & Information Sciences
Alan.Miller -at- vic.cmis.csiro.au
http://www.ozemail.com.au/~milleraj
http://users.bigpond.net.au/amiller/
---------------------------------------------------------------------------
From: Gary McClelland
Date: 2000/05/12
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

In reply to Mike's question, Alan makes the important point:

> There is absolutely no requirement that the predictors (or independent
> variables) should have a normal distribution; in fact, the opposite.
> Ideally, the predictors should be from a designed experiment and hence
> will not even be random. Most of them should be towards the outer
> boundaries of the predictor space. I guess that most designed
> experiments would give negative kurtoses if you go through the mechanics
> of calculating coefficients of kurtosis. If the predictors are from a
> multivariate normal distribution, then there is far too much clustering
> in the centre of the design space. Quite often, some of the predictors
> are discrete - perhaps (0,1) variables - and hence cannot have normal
> distributions.

Few researchers using multiple regression realize that normal
distributions for their predictors are, in terms of statistical power,
about the worst-case scenario. And the power problems are exacerbated when
testing interactions and polynomial terms. In McClelland & Judd (1993,
Psychological Bulletin, 114:376-390) we show that to achieve equivalent
statistical power for detecting the linear-by-linear interaction, a study
with normally distributed predictors requires over 16 times as many
observations as a designed experiment. And contrary to most researchers'
expectations, skewness helps, and even correlation between predictors
helps, because it avoids the "too much clustering in the centre of the
design space" that Alan refers to.

gary
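The size of the effect Gary describes can be seen in a rough Python
simulation. This is only a sketch with an invented sample size and effect
size; the 16:1 figure comes from the McClelland & Judd analysis, not from
this code. Both predictors are kept on the same [-1, 1] range, and only
how they are distributed over that range changes:

    # Power for detecting an x1*x2 interaction: bell-shaped predictors vs.
    # a two-level endpoint design, holding the predictor range fixed.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n, beta_int, reps = 100, 0.4, 1000

    def power(draw_x):
        hits = 0
        for _ in range(reps):
            x1, x2 = draw_x(), draw_x()
            y = beta_int * x1 * x2 + rng.normal(0, 1, n)
            X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
            hits += sm.OLS(y, X).fit().pvalues[3] < 0.05  # interaction term
        return hits / reps

    def bell():     return np.clip(rng.normal(0, 1/3, n), -1, 1)  # mid-range
    def endpoint(): return rng.choice([-1.0, 1.0], n)             # extremes

    print("power, bell-shaped predictors:", power(bell))
    print("power, endpoint design:       ", power(endpoint))

Under these settings the standard error of the interaction coefficient is
roughly nine times smaller for the endpoint design, so its simulated power
should come out near 1 while the bell-shaped version stays barely above
the 5% significance level.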
---------------------------------------------------------------------------
From: Dan Bonnick
Date: 2000/05/12
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

Hi Mike,

For the most popular form of linear regression, ordinary least squares
(OLS), you also need your X variable (i.e. the independent variable) to
have a relatively small error. Your initial work suggests large-ish error
in both variables, with a non-normal error structure. This makes things a
little more complicated. Do you suspect that the errors are correlated
(i.e. if the error in X is +d, is the error in Y usually +f)? Are they of
similar size? Are they changing with the size of X?

A formal way forward may be the use of structural equations (SAS PROC
MIXED may help). If you just need something quite reliable, then there are
regression methods that allow for error in both variables (Deming,
Passing-Bablok) which give you quite robust estimates. For a hint, try
forming the set of all slopes from joining any two points in your data
set, then look at the median of the set of slopes (and then the 95th and
5th percentile slope estimates). Then try the same for intercepts. This
should give you a workable figure.

HTH
Dan
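Dan's recipe is essentially the Theil-Sen estimator. A short Python sketch
follows; x and y are placeholders for Mike's variables, and
scipy.stats.theilslopes implements the same idea with a proper confidence
interval:

    # Median-of-pairwise-slopes regression, as Dan describes it.
    import numpy as np
    from itertools import combinations

    def pairwise_slope_fit(x, y):
        """Slope = median over all point pairs; crude 5th/95th pct band."""
        slopes = np.array([(y[j] - y[i]) / (x[j] - x[i])
                           for i, j in combinations(range(len(x)), 2)
                           if x[i] != x[j]])
        slope = np.median(slopes)
        intercept = np.median(y - slope * x)   # median-residual intercept
        lo, hi = np.percentile(slopes, [5, 95])
        return slope, intercept, (lo, hi)

With n = 147 this forms about 10,700 pairwise slopes, which is entirely
manageable.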
---------------------------------------------------------------------------
From: Herman Rubin
Date: 2000/05/12
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

In article <8ffek1$1q2$1@solaris.cc.vt.edu>, Mike wrote:

> I would like to obtain a prediction equation using linear regression for
> some data that I have collected. I have read in some stats books that
> linear regression has four assumptions, two of them being that 1) the
> data are normally distributed and 2) the variance is constant. In SAS, I
> have run univariate analyses testing for normality on both my dependent
> and independent variables (n=147). Both variables have distributions
> that are skewed.

There is no reason to assume that the data are normal. For linear
regression to be exactly the MLE procedure, it is the residuals from the
true regression which need to have certain properties. In well-designed
experiments, the independent variables are never normal. Rarely will the
dependent variables be close to normal, either.

The key properties for the residuals are lack of correlation with the
independent variables, independence, and homoscedasticity. Normality is
well down the list. Remember that this is for the residuals, not the data.
Linearity of the model is a consequence of these.

> For the dependent variable: skewness = 0.69 and kurtosis = 0.25.
> For the independent variable: skewness = 0.52 and kurtosis = -0.47.
>
> The normality test (Shapiro-Wilk statistic) states that both the
> dependent and independent variables are not normally distributed.
>
> I have also transformed the data (both dependent and independent
> variables) using log, arcsine, and square root transformations. When I
> run the normality tests on the transformed data, the tests show that
> even the transformed data are not normally distributed.

If you have a linear model, transforming it will generally make it
non-linear. Linearity in the relationship remains the most important
property; normality is one of the least.

> I realize that I can use nonparametric tests for correlation (I will use
> Spearman), but is there a nonparametric linear regression? If not, is it
> acceptable to use linear regression analysis on data that are not
> normally distributed as a way to show there is a linear relationship?

Consider your probability model: what can you assume is a linear function
of what, with additive errors? YOU, the user, must answer that.

--
This address is for information only. I do not claim that these views are
those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907-1399
hrubin@stat.purdue.edu   Phone: (765) 494-6054   FAX: (765) 494-0558
---------------------------------------------------------------------------
From: Dale Glaser
Date: 2000/05/12
Subject: RE: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

Mike........... regression assumptions are more concerned with the
distributional characteristics of the errors than with the actual raw
scores: if the residuals are normally distributed, the variation of the
errors is constant across the x axis (i.e., homoscedasticity), etc., then
non-normality of the criterion variable is not problematic. However, there
may be occasions when violations of the error assumptions ensue and
transformation of the criterion variable serves to coax normality.

...........dale glaser
---------------------------------------------------------------------------
From: Steve Gregorich
Date: 2000/05/12
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

Mike,

As a demonstration to myself, I once fit OLS regression models to data
with (1) a non-uniformly distributed binary outcome and (2) a continuous
outcome with a U-shaped distribution. I then used the same models to
estimate the parameter standard errors using a naive bootstrap. The
distribution of the bootstrap parameter estimates in both cases was normal
(judging from the Q-Q plots and standard tests of normality). Normality of
parameter estimates isn't everything - so I am not suggesting that you use
OLS regression indiscriminately. But some people apparently believe there
is no way for parameter estimates to be normally distributed when the data
are not. That simply is not the case.

BTW, do you really have a book stating that the data need to be normally
distributed in order to satisfy the assumptions of OLS regression? I
wouldn't be happy with that book.
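A reconstruction of that demonstration in Python - not Steve's original
code; the binary outcome is simulated and all names are placeholders:

    # Naive (case-resampling) bootstrap of an OLS slope, binary outcome.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 147
    x = rng.uniform(0, 1, n)
    y = (rng.uniform(0, 1, n) < 0.2 + 0.6 * x).astype(float)  # non-normal y

    boot = []
    for _ in range(2000):
        idx = rng.integers(0, n, n)        # resample cases with replacement
        fit = sm.OLS(y[idx], sm.add_constant(x[idx])).fit()
        boot.append(fit.params[1])

    # As the standard asymptotic theory in Herman Rubin's reply below
    # predicts, the slope estimates look normal even though y is binary.
    print("Shapiro-Wilk on bootstrap slopes:", stats.shapiro(boot))
    print("bootstrap SE of slope:", np.std(boot, ddof=1))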
--
----------------------------------------------
Steve Gregorich
University of California, San Francisco
Center for AIDS Prevention Studies
Medical Effectiveness Research Center
Center for Aging in Diverse Communities
74 New Montgomery Street
San Francisco CA 94105
gregorich@psg.ucsf.edu
http://sites.netscape.net/gregorich/index.html
----------------------------------------------
---------------------------------------------------------------------------
From: Herman Rubin
Date: 2000/05/13
Subject: Re: normality and regression analysis
Newsgroups: sci.stat.consult, sci.stat.edu

In article <8fhfuf$rce@itssrv1.ucsf.edu>, Steve Gregorich wrote:

> Mike,
> As a demonstration to myself, I once fit OLS regression models to data
> with (1) a non-uniformly distributed binary outcome and (2) a continuous
> outcome with a U-shaped distribution. I then used the same models to
> estimate the parameter standard errors using a naive bootstrap. The
> distribution of the bootstrap parameter estimates in both cases was
> normal (judging from the Q-Q plots and standard tests of normality).
> Normality of parameter estimates isn't everything - so I am not
> suggesting that you use OLS regression indiscriminately. But some people
> apparently believe there is no way for parameter estimates to be
> normally distributed when the data are not. That simply is not the case.

It is standard asymptotic theory that the regression coefficients are
asymptotically normal with the calculated variance (covariance matrix in
multiple regression) if the true residuals (error terms) are merely
uncorrelated with the predictor variables, and the predictor variables
have reasonable variances. One can even get by with slightly less. One
does lose the precision of the p-values without normality, but this is
very definitely unimportant, and the bootstrap does a first-order
correction.

These are the real robustness considerations, not the cute ones offered by
the peddlers of methods which rely on much stronger assumptions. Lack of
correlation between residuals and predictors is much less of an assumption
than symmetry, etc. It is also preserved under such operations as
aggregation of dependent variables.

> BTW, do you really have a book stating that the data need to be normally
> distributed in order to satisfy the assumptions of OLS regression? I
> wouldn't be happy with that book.

I will make a much stronger statement: those who assume that normality
SHOULD hold in a model of "real" data rarely understand probability, which
is at the foundation. It may be true that normality is approximately
correct, but adjusting the data to try to make it exact is still likely to
mess up those relations which do hold. Least squares was heavily used in
the 19th century in physics, astronomy, and surveying, even non-linear
least squares. The data were not adjusted to normality, as this would have
destroyed the model.

--
This address is for information only. I do not claim that these views are
those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907-1399
hrubin@stat.purdue.edu   Phone: (765) 494-6054   FAX: (765) 494-0558
===========================================================================