From: Alan McLean <alan.mclean@buseco.monash.edu.au>

Subject: Hypothesis testing and magic

Date: Wednesday, April 12, 2000 8:18 AM

I have been reading all the back and forth about hypothesis testing with some degree of fascination. It's a topic of particular interest to me - I presented a paper called 'Hypothesis testing and the Westminster System' at the ISI conference in Helsinki last year.

What I find fascinating is the way that hypothesis testing is regarded as a technique for finding out 'truth'. Just wave a magic wand, and truth will appear out of a set of data (and mutter the magic number 0.05 while you are waving it....) Hypothesis testing does nothing of the sort - of course.

First, hypothesis testing is not restricted to statistics or 'research'. If you are told some piece of news or gossip, you automatically check it out for plausibility against your knowledge and experience. (This is known colloquially as a 'shit filter'.) If you are at a seminar, you listen to the presenter in the same way. If what you hear is consistent with your knowledge and experience, you accept that it is probably true. If it is very consistent, you may accept that it IS true. If it is not consistent, you will question it and conclude that it is probably not true.

If the news is something that requires some action on your part, you will act according to your assessment of the information.

If the news is important to you, and you cannot decide which way to go on prior knowledge, you will presumably go and get corroborative information, hopefully in some sense objective information.

This describes hypothesis testing almost exactly; the difference is a matter of formalism.

Next - a statistical hypothesis test compares two probability models of 'reality'. If you are interested in the possible difference between two populations on some numeric variable - for example, between heights of men and heights of women in some population group - and you choose to express the difference in terms of means, you are comparing a model which says

- height of a randomly chosen individual = overall mean + random fluctuation

with one which says

- height of a randomly chosen individual = overall mean + factor due to sex + random fluctuation

You then make assumptions about the 'random fluctuations'.

Note that one of these models is embedded within the other - the first model is a particular case of the second. It is only in this situation that standard hypothesis testing is applicable.
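The nested pair above can be sketched in code. Below is a rough Python illustration - all heights simulated, and every number invented for the sketch - of how the classical pooled two-sample t statistic measures whether the 'overall mean + factor due to sex' model earns its extra term over the plain 'overall mean' model.

```python
import math
import random

random.seed(1)

# Hypothetical simulated heights in cm -- invented purely for the sketch
men = [random.gauss(178, 7) for _ in range(200)]
women = [random.gauss(165, 7) for _ in range(200)]

def pooled_two_sample_t(a, b):
    """Classical pooled two-sample t statistic.

    A large |t| says the 'overall mean + sex effect' model describes
    the data noticeably better than the single 'overall mean' model."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Pooled variance: the common 'random fluctuation' both models assume
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sp2 = (ssa + ssb) / (na + nb - 2)
    se = math.sqrt(sp2 * (1 / na + 1 / nb))
    return (ma - mb) / se

t = pooled_two_sample_t(men, women)
print(f"t = {t:.2f}")
```

Note that the statistic only ranks the two models against each other; it does not certify either one as 'true'.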

Neither of these models is 'true' - but either or both may be good descriptions of the two populations. Good in the sense that if you do start to randomly select individuals, the results agree acceptably well with what the model predicts. The role of hypothesis testing is to help you decide which of these is (PROBABLY) the better model - or if neither is.

In standard hypothesis testing, one of these models is 'privileged' in that it is assumed 'true' - that is, if neither model is better, then you will use the privileged model. In most cases, this means the SIMPLER model.

More accurately - if you decide that the models are equally good (or bad) you are saying that you cannot distinguish between them on the basis of the information and the statistical technique used! To decide between them you will need either to use a different technique, or more realistically, some other criterion. For example, in a court case, if you cannot decide between the models 'Guilty' and 'Innocent', you may always choose 'Innocent'.

There is no reason why one model should be thus privileged. In my paper I stressed my belief that this approach reflects our (and Fisher's) cultural heritage rather than any need for it to be that way. One can, for example, express the choice as between the embedded model and the model suggested by the data. For a test on the difference between two means, this compares the models mu(diff) = 0 and mu(diff) = xbar. The interesting thing is that this is what we actually do - although it is dressed up in the language and technique of the general model mu(diff) not= 0. (This dressing up is a lot of the reason why students have trouble with hypothesis testing.)
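The claim that we actually compare mu(diff) = 0 with mu(diff) = xbar can be made concrete. The sketch below (Python, with simulated paired differences and a plug-in standard deviation - both assumptions of the illustration, not part of the original argument) checks that twice the log-likelihood gap between the two point models is exactly the square of the usual t statistic.

```python
import math
import random
import statistics

random.seed(2)

# Hypothetical paired differences, simulated only for the illustration
diffs = [random.gauss(0.5, 1.0) for _ in range(50)]

n = len(diffs)
xbar = statistics.fmean(diffs)
s = statistics.stdev(diffs)  # plug-in standard deviation, held fixed below

def loglik(mu):
    """Normal log-likelihood of the data at mean mu, with sd fixed at s."""
    return sum(-0.5 * math.log(2 * math.pi * s ** 2)
               - (x - mu) ** 2 / (2 * s ** 2) for x in diffs)

# Choice between the privileged model mu = 0 and the data's model mu = xbar
lr = 2 * (loglik(xbar) - loglik(0.0))
t = xbar / (s / math.sqrt(n))
print(f"lr = {lr:.4f}, t^2 = {t ** 2:.4f}")
```

With the standard deviation held fixed, the two numbers agree exactly: the textbook test of mu(diff) not= 0 is, underneath, this comparison of the two point models.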

To conclude: hypothesis testing is NECESSARY. We do it all the time. Assessment of effect sizes is also necessary, but the two should not be confused.

Regards,

Alan

--

Alan McLean (alan.mclean@buseco.monash.edu.au)

Department of Econometrics and Business Statistics

Monash University, Caulfield Campus, Melbourne

Tel: +61 03 9903 2102 Fax: +61 03 9903 2007

------------------------------------------------------------------------

From: Alan McLean <alan.mclean@buseco.monash.edu.au>

Subject: Hypothesis testing and magic - episode 2

Date: Thursday, April 13, 2000 7:56 AM

Some more comments on hypothesis testing:

My impression of the ‘hypothesis test controversy’, which seems to exist primarily in the areas of psychology, education and the like (this is coming from someone who has been involved in education for all my working life, but with a scientific/mathematical background), is that it is at least partly a consequence of the sheer difficulty of carrying out quantitative research in those fields. A root of the problem seems to be definitional - specifically, the definition of the variables involved.

In, say, an agricultural research problem it is usually easy enough to define the variables. For a very simple example, if one is interested in comparing two strains of a crop for yield, it is very easy to define the variable of interest. It is reasonably easy to design an experiment to vary fairly obvious factors and to carry out the experiment.

In the ‘soft’ sciences it is easy enough to identify a characteristic of interest – the problem is how to measure it. If I am interested in the relationship between ability in statistics and ethnic background, for example, I measure the statistics ability using a test of some sort; I measure ethnic background by defining a set of ethnicities. There are literally an infinite number of combinations that I can use – infinitely many different tests, all purporting to measure ‘statistics ability’ (even if I change only one word in a test, I cannot be absolutely certain of its effect, so it is a different test!), and a very large number of definitions of ‘ethnicity’.

This is of course not news to anyone reading this. But I am coming to my point. Suppose I carry out an ‘experiment’ – I apply the test to a group of people of varying ethnicity, score them on the test and analyse the results, including a hypothesis test to decide if statistics ability is related to ethnicity. This test might be a simple ANOVA, or a Kruskal-Wallis or a chi-square test, depending on how I score the test.
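As a rough sketch of the ANOVA route - with the scores and group definitions entirely invented, and the F statistic hand-rolled only to keep the sketch self-contained:

```python
import random

random.seed(3)

# Hypothetical statistics-test scores for three arbitrarily defined groups
groups = [
    [random.gauss(60, 10) for _ in range(30)],
    [random.gauss(63, 10) for _ in range(30)],
    [random.gauss(66, 10) for _ in range(30)],
]

def one_way_F(groups):
    """One-way ANOVA F statistic: between-group variance over within-group
    variance. A large F favours the 'scores differ by group' model."""
    all_x = [x for g in groups for x in g]
    n, k = len(all_x), len(groups)
    grand = sum(all_x) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

print(f"F = {one_way_F(groups):.2f}")
```

Whatever F (and its p-value) turns out to be, it speaks only about this particular test and this particular grouping - not about 'statistics ability' or 'ethnicity' in the abstract.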

As I said earlier, a hypothesis test only helps the user to decide which of two models is probably better. The point of the above paragraphs is this: the definition of the models being compared includes the definition of the variables used. If I reject the null model (a label I prefer to ‘null hypothesis’) – that is, I decide that the alternative model is likely to work better – I am NOT saying that there is a relationship between statistics ability and ethnicity. All I am saying is that there is a relationship between the two variables I used.

Please note that the test is not saying this – I am. The test merely gives me a measure of the strength of the evidence provided by the data (‘significant at 1%’ or ‘p-value of .0135’); this measure is only relevant if the models I have used are appropriate. I can use other evidence (experience is what we usually use! but there may be related tests that help) to decide if the model is appropriate.

So there are three levels at which judgement is used to make decisions:

- deciding what variables are to be used to measure the characteristics of interest, and how any relationship between them relates to the characteristics

- deciding on the model to be used, and how to test it

- deciding the conclusion for the model

In each of these there is evidence we use to help us make the decision. The hypothesis test itself provides the test for the third.

Finally (at least for the moment) – whether we choose the null or alternative model, it IS a decision. In research, accepting the null means that we decide to accept it at least for the moment, so it is not necessarily a committed decision. On the other hand, if a line of investigation is not yielding results, the researcher is likely not to continue on that line – so it is a decision which does lead to an action.

For non-research applications such as quality control, accepting the null model is quite clearly a decision to act on that basis. For example, with a bottle filling machine which is periodically tested as to the mean contents, the null is that the machine is filling the bottles correctly. Rejecting the null entails stopping the machine; accepting it means the machine will not be stopped.
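A minimal sketch of that decision rule, assuming a hypothetical 500 ml target, a sample of 20 bottles, and the two-sided 5% critical t value for 19 degrees of freedom (all details invented for the illustration):

```python
import math
import random

random.seed(4)

TARGET_ML = 500.0  # hypothetical nominal fill
T_CRIT = 2.093     # two-sided 5% critical t value, df = 19

def stop_machine(sample, target, t_crit):
    """One-sample t test of the null model 'the machine fills correctly'.

    Returning True (reject the null) IS the decision to stop the machine."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    t = (m - target) / (s / math.sqrt(n))
    return abs(t) > t_crit

# A hypothetical day's sample of 20 bottles
sample = [random.gauss(500.2, 1.5) for _ in range(20)]
print("stop machine:", stop_machine(sample, TARGET_ML, T_CRIT))
```

Accepting the null here is just as much a decision: the machine keeps running.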

Traditional hypothesis testing does incorporate a decision-theoretic element – the significance level against which the p-value is judged.

Regards again,

Alan


------------------------------------------------------------------------