Some of this document is reasonably polished, but other parts are rough notes only.

I have also put up a webpage on "How to do research", aimed at students beginning research (e.g., for Honours or PhD degrees).

My name is Paul Hutchinson, and I work in the Department of Psychology, Macquarie University; but in previous existences I've worked in Statistics and Civil Engineering departments, and I think most of what I say here is quite widely true. This document is aimed particularly at those who are taking their first course in statistics (most of whom are specialising in some other subject). See also a page by Michael Hughes of Miami University, Ohio. Anyway, here is my advice on

How to study statistics

Do well in it --- if you do well, you are much more likely to enjoy it. (And in later years you may even make some money tutoring other students!)

If you have to work hard in order to do well --- so be it. Finding subjects difficult and needing to put in a lot of hours of study is usual.

Perhaps you're really interested in some other subject, and are only taking a statistics course because you have to. That's a nuisance, and I sympathise with you, but remember this: statistics is not something imposed from outside (by mathematicians, perhaps); statistics as a subject exists because psychologists and agriculturalists and engineers and medical researchers and economists built it, and they built it because they needed it. You will find that you need it, too.

Overview of introductory statistics.

A typical introductory statistics course is in three parts.

• Data description. For example:
• Pictorial presentation of data (principally, a single numeric variable).
• Calculation of summary statistics (principally, a single numeric variable).
• Two numeric variables: scatterplots, correlation, regression.
• Probability. For example:
• Rules for doing calculations with probabilities. Diagrams that help with the calculations.
• Discrete probability distributions. The binomial distribution. The Poisson distribution.
• Continuous probability distributions. The uniform distribution. The exponential distribution. The normal distribution.
• Expectations and variances of random variables.
• Inference. Here, we try to draw conclusions about the population from which our sample came.
• The standard error of the mean. The Central Limit Theorem.
• The concept of using a sample statistic to estimate a population parameter. Criteria for choosing a good method.
• The general idea of hypothesis tests. Examples of hypothesis tests for particular situations.
• The general idea of confidence intervals. Examples of constructing confidence intervals.
• Inference in the linear regression context.
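The standard error of the mean and the Central Limit Theorem, listed above, can be seen directly in a small simulation. This is only a sketch (in Python, with invented sample sizes): repeated samples are drawn from a uniform population, and the spread of the sample means is compared with the theoretical standard error.

```python
import random
import statistics

random.seed(1)

# Draw many samples of size n from a uniform(0, 1) population.
# That population has standard deviation sqrt(1/12).
n = 25
num_samples = 2000
sample_means = [
    statistics.mean(random.random() for _ in range(n))
    for _ in range(num_samples)
]

# The Central Limit Theorem says the sample means are approximately
# normally distributed, with standard error sigma / sqrt(n).
theoretical_se = (1 / 12) ** 0.5 / n ** 0.5   # sqrt(1/12)/5, about 0.0577
observed_se = statistics.stdev(sample_means)

print(round(theoretical_se, 4), round(observed_se, 4))
```

The observed spread of the 2000 sample means comes out very close to the theoretical value, which is the practical content of "standard error of the mean".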
Hypothesis testing can come to dominate a statistics course, and even whole areas of the application of statistics. This is rather a pity, (a) because students ought not to get into the habit of thinking that hypothesis testing is the be-all and end-all of statistical analysis, and (b) because it is quite controversial just what hypothesis tests mean and how they should be used.
(These few sentences are rather advanced; don't worry if you're not yet at the point where you appreciate them.) In particular, the output from a calculation is an indication of the probability of the data conditional upon a particular (null) hypothesis being true. What an investigator wants is not this, but an indication of the probability of the hypothesis being true, conditional upon the data that was observed. A good many philosophers and statisticians would say this latter "probability" is a meaningless concept.
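The distinction between the two conditional probabilities can be made concrete with Bayes' theorem, provided one is willing to assume a prior probability for the hypothesis (which is exactly the controversial step). The numbers below are invented purely for illustration:

```python
# Invented numbers, purely to illustrate the difference between
# P(data | hypothesis) and P(hypothesis | data).
p_data_given_h0 = 0.04   # probability of data like ours, if H0 is true
p_data_given_h1 = 0.30   # probability of such data under an alternative
p_h0 = 0.50              # assumed prior probability of H0 (the contested step!)

# Bayes' theorem: P(H0 | data) = P(data | H0) * P(H0) / P(data)
p_data = p_data_given_h0 * p_h0 + p_data_given_h1 * (1 - p_h0)
p_h0_given_data = p_data_given_h0 * p_h0 / p_data

print(round(p_h0_given_data, 3))  # about 0.118 -- quite different from 0.04
```

The point of the arithmetic is only that the two quantities need not be anywhere near each other, and that getting from one to the other requires an assumption (the prior) that the data alone does not supply.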
One can imagine statistical inference in its present form going entirely out of fashion, and being replaced by strength of evidence or cost-benefit analysis or something else. (It is much harder to imagine data description or probability becoming obsolete.)

Missing topics --- but important.

Notice a couple of things that are absent from the above list, though they are important when doing statistics "for real".

Rather little is usually said about the process of collecting the data. I think there are a variety of reasons for this, some of them good and some of them bad. One good reason is that there tends to be a lot of detail that is specific to the variable being measured and the purpose for which the measurement is made. Another is that students often do not appreciate the rights and wrongs of data collection (it all appears too easy, for one thing) until they have had some experience with describing data and drawing inferences.

Nor is there much said about the strategy used when approaching a dataset. By this, I mean things like:

• Quality control of the data,
• Building one's understanding of the data by:
• Looking at variables one at a time (graphical and numerical summaries),
• Deciding just what are your research questions, and which variables should be thought of as independent variables and which ones should be thought of as dependent,
• Looking at variables two at a time (scatterplots, correlation coefficients, etc.),
• Looking at variables three at a time (e.g., plotting y vs. x with different values of z distinguished),
• and so on.
• Possibly, getting rid of interaction by transformation of the dependent variable,
• Considering whether it is appropriate to test hypotheses, or whether this is impracticable (because, for instance, the sample is too far from random),
• Possibly, deciding to explore only a randomly-selected half of the dataset, using this to generate hypotheses, and using the other half to test these hypotheses.
(It is common to find most of the above in an introductory course; the point I am making is that it is rare to find much emphasis on putting them together.)
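The last item in that list, splitting the dataset into an exploration half and a confirmation half, is simple to do in practice. A minimal sketch (in Python, with a stand-in dataset; in reality each record would be a row of several variables):

```python
import random

random.seed(42)

# A stand-in dataset: plain numbers are enough to show the idea.
records = list(range(100))

# Shuffle a copy, then split into two halves at random.
shuffled = records[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
explore, confirm = shuffled[:half], shuffled[half:]

# Hypotheses are generated by poking around freely in `explore`;
# only `confirm` is then used for the formal hypothesis tests,
# so the tests are not contaminated by the exploration.
```

The design point is that a hypothesis suggested by a dataset cannot honestly be tested on that same data; randomly withholding half of it preserves a clean test.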

Two directions of approach.

Much of the statistical work associated with scientific research can be put into one of two categories --- modelling the mechanism and data analysis.

• With modelling, assumptions and deductions from those assumptions are prominent. Many simple "models" have been invented, and you need to be familiar with the basic repertoire. Some of them refer to the deterministic aspect of the situation, and some of them refer to the stochastic (random) aspect.
Suppose we are concerned with the number of events of a particular type. The deterministic part of the model might say that the expected (average) number of events is obtained by adding together two independent variables, appropriately weighted. The stochastic part might say that we can assume (a) the probability of an event happening is constant, and (b) the occurrences of events are independent. (It is useful to know that the binomial and Poisson probability distributions follow from this pair of assumptions --- it is unnecessary to re-invent these distributions every time these assumptions are appropriate!)
• With data analysis, the emphasis is on trying to understand what the data is telling us. We try to follow wherever the data leads us, passively, without preconceptions about what the appropriate model is.
• Good modelling keeps in close touch with data; good analysis leads us to models of what is going on; and the statistician is constantly switching from one to the other.
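The remark in the modelling bullet above, that the binomial distribution follows from the constant-probability and independence assumptions, and that the Poisson is a good approximation to it when the per-trial probability is small, can be checked directly. A sketch in Python (the particular n and p are invented for illustration):

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k events in n independent trials, each with probability p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Many trials, small probability per trial: the Poisson distribution
# with mean n*p closely approximates the binomial.
n, p = 1000, 0.003
for k in range(5):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, n * p), 4))
```

Running this shows the two columns of probabilities agreeing to three or four decimal places, which is why one reaches for the ready-made Poisson distribution rather than re-deriving it.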

My feeling is that there is rather too little probability in introductory statistics courses these days. It could well be useful for you to take a course specifically in the subject. I know in practice a lot of students select courses partly by eliminating the subjects they hate (or think they hate). Try to overcome this as much as you can.

• Even if you hate mathematics, make sure you have a good grasp of algebra --- without it, you'll always be struggling in your statistics. (Many universities have some sort of bridging course available for students who have neglected or forgotten their algebra, and then find they need it.)
• Even if you hate the humanities, make sure you can write well in English --- an important part of statistics is the presentation of your results to your audience, and this necessitates explaining yourself.

Textbooks? Rather to my regret, it seems to be the fashion these days for students to study only the textbook that the lecturer recommends, rather than finding one that suits the way they themselves think. So there doesn't seem to be much point in my putting down some notes on texts that I like. I have also written a booklet that is intended as a supplement to any conventional introductory text and as a revision aid.

Should you work alone, or collaborate with your classmates? My view is that everyone needs to find their own individual way of understanding things, and this can only be done by struggling on their own. But I have met people whom I respect who say that for them the interchange with others is a vital part of learning. (I have no doubt that the good student benefits from explaining something to poorer students. What worries me is a feeling that this may actually do more harm than good to the poorer students, because they haven't struggled through to the answer themselves.)

Perhaps the most obvious point of all --- to even start to learn something, you need four exposures to it: the lecture, the textbook, the tutorial, and the homework. So go to all lectures, tutorials, and practical classes, do all the homework assigned to you and a bit more. Your course is designed on the assumption you do all of these things (and is designed to be passable by the average student who does them, not only by some genius). It will very quickly become very difficult if you start missing things.

"Just put the numbers into the formula." This is a very undesirable way of doing statistics, but if all else fails, it may be better than nothing. For one thing, you might choose the right formula and get the right answer and pass the exam. For another, doing things without understanding sometimes leads later to understanding.

Most statistics lecturers are pleasant enough, and you should not be afraid to go to them for help with the course, if you need it. They may even announce certain hours when they'll be in their office and available to assist. But be reasonable about this --- if you have missed a class and therefore haven't got whatever was handed out then, you're responsible for your problem, and you should solve it by copying the paper from a friend.

In the olden days, poor lecturers used to defend themselves by saying that by teaching badly, they forced students to think for themselves and learn. This is mostly nonsense, but nevertheless does contain just an element of truth, I feel. There are at least two dangers with a course that is too well-organised and slickly presented. (a) There may be an over-emphasis on training, as contrasted with education. For example, students may learn how to smoothly and competently tackle a problem of a certain type, yet not recognise it when the wording is changed slightly. (b) Students may overestimate how much they are learning and understanding.

When you're doing an assessed piece of work, what you do will naturally be driven by what will get you the marks. But at an earlier stage, when you're actually learning the stuff, it will be helpful to think of partial answers that you could give if you didn't know the full answer. For example, suppose you recognise a particular question as requiring you to perform a nonpaired (that is, independent groups) t-test; as well as doing this, think about the partial answers you could have given at earlier stages of your course.

• If you only knew about techniques of descriptive statistics, you might draw a box-and-whisker plot to compare the two samples.
• If you knew about the standard error of the mean, you might draw a picture showing mean plus or minus 2 s.e. of one sample beside mean plus or minus 2 s.e. of the other sample.
• If you knew that the variance of a difference is the sum of the variances (provided the random variables are independent), you might work out that the difference between the two means is such-and-such, and the corresponding standard error is so-and-so, and notice that the difference is or is not much larger than its s.e.
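That last partial answer can be written out numerically. A sketch in Python, with two made-up samples standing in for the independent groups:

```python
import statistics

# Two invented samples (independent groups).
a = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.3, 5.0]
b = [4.2, 4.6, 4.0, 4.8, 4.4, 4.1, 4.7, 4.3]

diff = statistics.mean(a) - statistics.mean(b)

# Variance of a difference = sum of the variances (independent samples),
# so the standard error of the difference of the two means is:
se_diff = (statistics.variance(a) / len(a)
           + statistics.variance(b) / len(b)) ** 0.5

# Is the difference much larger than its standard error?
print(round(diff, 3), round(se_diff, 3), diff > 2 * se_diff)
```

With these numbers the difference comes out several times its standard error, so even without the full t-test machinery one could already say something sensible about the two groups.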

It is curious that even quite experienced teachers of statistics can find first-year exam questions set by someone else quite difficult --- the language is just sufficiently different that they're not altogether sure what the question is getting at. (I'm not sure what to make of this, except perhaps statistics is a difficult subject to teach and therefore statistics lecturers deserve a pay rise.)

Finally, try not to be too hard on your lecturers: "The young have aspirations that never come to pass, the old have reminiscences of what never happened. It's only the middle-aged who are really conscious of their limitations --- that is why one should be so patient with them."