
Methods - The perils of statistics by numbers

Thom Baguley (Nottingham Trent University) is wary of magic cut-offs; plus Daniel Wright on an improved analysis of covariance.

24 March 2008

Statistical practice is peppered with examples of 'magic' numbers such as the infamous α = .05 of null hypothesis significance testing. Similarly, sample size calculations typically assume a desired statistical power of .80 and nearly always use 'canned' effect sizes such as d = 0.5 (Lenth, 2001; Baguley, 2004). Some of the most interesting cases are commonly used cut-off criteria, such as keeping factors with eigenvalues > 1 in factor analysis, requiring reliability > .70 in scale construction, or demanding a goodness-of-fit index (GFI) > .90 in structural equation modelling (Lance et al., 2006).

What's interesting is that these 'magic' numbers all have several things in common. First, they all seem to have arisen by a process of academic Chinese whispers (see Vicente & Brewer, 1993, for empirical evidence of this kind of effect in the use of citations). Take the requirement for reliability > .70. Lance et al. (2006) show that where a source is given for this cut-off it is almost invariably Nunnally's Psychometric Theory. However, the closest Nunnally comes to saying this is:

In the early stages of research… one saves time and energy by working with instruments that have only modest reliability, for which purpose reliabilities of .70 or higher will suffice… (Nunnally, 1978, p.245)

Nunnally even notes that reliabilities in excess of .90 may be inadequate in many situations. The problem is that while early citations of Nunnally's book were largely accurate, later authors cited him without consulting the original text. Fairly soon the nuance and context of the original source is lost and the citation becomes the ritual acknowledgement of a common authority. Eventually reliability > .70 becomes commonly accepted wisdom – more accurately myth or urban legend (Vandenberg, 2006) – and a citation is no longer required.

Second, they are – as far as I can tell, without exception – wrong. To be fair, they do differ in degree of wrongness: ranging from mildly misleading to terminally myopic. In addition, they are all wrong for more or less the same reason. In every case they involve taking a continuous quantity and reducing it to a single number. Take, for instance, the α = .05 threshold. As Rosnow & Rosenthal put it:

Surely, God loves the .06 nearly as much as the .05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p? (Rosnow & Rosenthal, 1989, p.1277)
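Their point can be put in a couple of lines of R (the test statistic values and degrees of freedom below are purely illustrative): the two-tailed p-value slides smoothly through .05 as the test statistic grows, with no cliff at the threshold.

t_vals <- seq(1.90, 2.10, by = 0.05)          # t statistics either side of the df = 58 critical value
round(2 * pt(-t_vals, df = 58), 3)            # two-tailed p-values: a smooth slide through .05, not a cliff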

This isn't just the prerogative of null hypothesis significance testing (though it can claim some of the most widespread and most egregious cases). For example, in a recent paper on alternative (Bayesian) approaches to statistical inference, Wagenmakers (2007) wryly comments that 'people apparently find it difficult to deal with continuous levels of evidence' before discussing the lumping of Bayes factors into discrete categories, such as that suggested by Raftery (1995). Wagenmakers points out that 'statistically, this desire [to reduce a Bayes factor to a dichotomous decision] is entirely unfounded'.

I would argue that this conclusion readily generalises to other statistics. It isn't always a good idea to present findings as the output of such a dichotomous decision. Even when it is necessary (or convenient) to reach a decision in this way, it makes no sense to use the same threshold or cut-off in all situations.

My own favourite example is collinearity in multiple regression. Collinearity does not simply arise once two predictors are correlated > .90 (though very high correlations can make computing the coefficients numerically tricky, depending on the software you use).

As long as there is any shared variance between predictors it will be somewhat difficult to tease their effects apart, and the precision with which the individual effects are estimated will suffer (and thus statistical power to detect them is compromised).
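The smoothness of the problem is easy to see in numbers. For two standardised predictors, the standard error of each slope is inflated, relative to the uncorrelated case, by the square root of the variance inflation factor, 1/(1 − r²). Two lines of R show that nothing special happens at .90:

r <- c(0, .30, .60, .90, .99)
round(sqrt(1 / (1 - r^2)), 2)   # 1.00 1.05 1.25 2.29 7.09 – a smooth climb, no step at .90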

It isn't all doom and gloom, though. I recognise that people sometimes find simple rules of thumb useful, and many of the examples I've used started off in just this way. Cohen's effect size guidelines and recommendations for power calculations are a case in point (e.g. Cohen, 1969). In particular, estimating statistical power for supposedly 'small' (d = .2), 'medium' (d = .5) or 'large' (d = .8) effects was at first intended only as a method of last resort. The problem is that when these rules of thumb become articles of faith in the rituals of statistics, the benefits they provide are lost.
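For readers who do use these defaults as a starting point, base R's power.t.test() makes the consequences concrete, and makes it just as easy to substitute an effect size estimated from your own literature (the d values below are illustrative, not recommendations):

power.t.test(delta = 0.5, sd = 1, sig.level = .05, power = .80)
# about 64 participants per group for the 'medium' canned effect
power.t.test(delta = 0.35, sd = 1, sig.level = .05, power = .80)
# a more modest, field-specific estimate roughly doubles the required n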

The solution is simple. Just remember that although these 'magic' numbers can occasionally be useful, they are always wrong. Don't end up doing statistics by numbers.

References 

Baguley, T. (2004). Understanding statistical power in the context of applied research. Applied Ergonomics, 35, 73–80.

Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press.

Hastie, T. & Tibshirani, R. (1990). Generalized additive models. London: Chapman and Hall.

Lance, C.E., Butts, M.M. & Michels, L.C. (2006). The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods, 9, 202–220.

Lenth, R.V. (2001). Some practical guidelines for effective sample size determination. The American Statistician, 55, 187–193.

Mosteller, F. & Boruch, R. (Eds.) (2002). Evidence matters: Randomized trials in education research. Washington, DC: The Brookings Institution.

Nunnally, J.C. (1978). Psychometric Theory (2nd edn). New York: McGraw-Hill.

R Development Core Team (2007). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.

Raftery, A.E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–196.

Rosnow, R.L. & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284.

Vandenberg, R.J. (2006). Introduction: Statistical and methodological myths and urban legends: Where, pray tell, did they get this idea? Organizational Research Methods, 9, 194–201.

Vicente, K.J. & Brewer, W.F. (1993). Reconstructive remembering of the scientific literature. Cognition, 46, 101–128.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p-values. Psychonomic Bulletin & Review, 14, 779–804.

Wood, S.N. (2006). Generalized additive models: An introduction with R. London: Chapman & Hall/CRC.

Wright, D.B. (2006). Causal and associative hypotheses in psychology. Psychology, Public Policy, and Law, 12, 190–213.

Wright, D.B. & London, K. (2008). Modern regression techniques: Examples for psychologists. London: Sage.

A new improved analysis of covariance

Daniel B. Wright (University of Sussex) on the GAMCOVA

Analysis of covariance (ANCOVA) is one of the most widely used statistical procedures in psychology. It allows you to measure an association between two variables after controlling for one or more covariates. How this controlling is done is important: it depends on many assumptions, and has consequences for any conclusions.

One assumption is that the covariate and the response variable are linearly related. New statistical procedures allow more flexible ANCOVAs to be conducted. This flexibility increases the chances of observing a statistically significant effect, and is simple to use with freely available software.

Consider a typical ANCOVA. Researchers want to see whether some new teaching method works. They allocate half of their sample to a New Method condition and half to a Control condition, and measure attainment afterwards (call this time2). While random allocation makes causal inference easier (Wright, 2006), it is not always done in this type of research (Mosteller & Boruch, 2002). Because the pupils in one of the groups may begin with more skill, the researchers take some measure of skill prior to the intervention (call this time1) and use it to try to equate, statistically, the two groups.

There are several ways of 'equating', but the most popular is the traditional ANCOVA. This assumes that straight lines with equal slopes represent the data patterns for the different groups. The group effect is the distance between the lines. (The parallel-slopes assumption can be relaxed by allowing an interaction between the covariate and the grouping variable, but here we concentrate on the 'straight line' assumption.)
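As a concrete sketch, here is the traditional ANCOVA in R, with simulated stand-in data for the teaching example (the sample size, effect sizes and variable values are our own invention, not taken from any of the studies discussed):

set.seed(1)
n     <- 100
group <- factor(rep(c("Control", "NewMethod"), each = n / 2))
time1 <- rnorm(n, mean = 50, sd = 10)                    # baseline skill
time2 <- 30 + 0.01 * time1^2 +                           # a gently curved 'true' relationship
         5 * (group == "NewMethod") + rnorm(n, sd = 5)   # group effect plus noise
m_lin <- lm(time2 ~ time1 + group)   # parallel straight lines
summary(m_lin)                       # the group coefficient is the distance between the lines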

The new and improved ANCOVA allows the relationship between the time1 and time2 scores to be modelled with a smooth curve called a 'spline'. The mathematics is complex.

The early work is described in Hastie and Tibshirani (1990), and more recent developments in Wood (2006). Neither of these is for the faint of heart. We have provided an introduction suitable for psychologists elsewhere (Wright & London, 2008).

The logic of splines is that a small number of curves are fitted together to form one continuous curve. The individual pieces are often cubic polynomials (i.e. y = β₀ + β₁x + β₂x² + β₃x³) and the points where they connect are called 'knots'. Thanks to really clever maths they connect so smoothly that it is difficult to see where the knots are. When splines are embedded in a statistical model it is often called a 'generalised additive model', or GAM. We use the term GAMCOVA, an amalgam of GAM and ANCOVA, for the procedure we describe.

GAMs are a powerful and flexible tool, because you can tell the computer what kind of curves you want, how many knots to have, and where the knots should be. These models can handle response variables that are continuous, as usually assumed with ANCOVA, but also those that are binary, proportions, or counts. As an alternative to the traditional ANCOVA it is usually fine to use a very simple spline. While a single cubic polynomial requires three pieces of information (or three degrees of freedom in statistical jargon), adding each extra cubic section only requires one further piece of information. This is because forcing the curves to meet smoothly constrains certain aspects of the curves.

Thus, this approach uses only four pieces of information for a flexible covariate rather than the one required for the covariate in the traditional ANCOVA. The equation for GAMCOVA is:

time2ᵢ = β₀ + bs(time1ᵢ) + β₂ groupᵢ + eᵢ

where bs stands for a type of spline called a B-spline. There are several types, but B-splines work fine for our purposes. We tell the computer what degree of polynomial curve is desired (1 = linear, 2 = quadratic, 3 = cubic, etc.) and how many degrees of freedom the total spline should have, which is the degree of the individual curves plus the number of curves. With just two curves the default location for the knot is the median; and because the default polynomial in the package we use is cubic, all we need to say is: bs(time1, df=4).
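Continuing the simulated data from the sketch above, the whole GAMCOVA is then one line, using the bs() function from the splines package that ships with R:

library(splines)
m_gam <- lm(time2 ~ bs(time1, df = 4) + group)   # cubic B-spline covariate, knot at the median
summary(m_gam)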

There are several packages that can do these analyses, but a comprehensive and free one is R (R Development Core Team, 2007). Details of how to download R, access some example data, make the graphs and run the analyses are available at www.sussex.ac.uk/Users/danw/gamcova.html. If we compare the two approaches using our example data, the group effect accounts for about the same sum of squares in both procedures. But with the GAMCOVA the flexible covariate accounts for much more of the residual sum of squares, so there is less 'error' left in the model. The mean squared error usually shrinks, and therefore the F value will usually be larger: you are more likely to find a significant group effect with the GAMCOVA than with the traditional ANCOVA.
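With the simulated data above, this comparison can be run directly. Because a straight line is a special case of what the cubic B-spline can represent, the two models are nested and an F test applies:

anova(m_lin, m_gam)   # does the flexible covariate soak up extra residual error?
anova(m_gam)          # sequential sums of squares for the spline and group terms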

Bite size
Problem:
Traditional ANCOVA assumes that the covariates and the outcome (DV) are linearly related, and that the outcome is continuous.
Solution: GAMCOVA
Benefits: It can deal with different sorts of data, and it often reduces the amount of error in the model (thereby increasing the likelihood of detecting significant group differences).
What software do you need? R (R Development Core Team, 2007)
In what type of study might you use this analysis? Where you want to include a covariate that violates the assumptions of ANCOVA: for example, when you want to control for baseline levels, or for a variable that may be unduly influencing your results (e.g. individual difference variables).
Glossary
GAM – Generalised additive model
ANCOVA – Analysis of covariance
Spline – A smooth curve, built from simpler pieces, used to model the relationship between variables
Knot – The point where adjacent pieces of a spline join

Contribute to the new 'Methods' section
As a psychologist in academia, I was pretty awful when it came to methods. I had the research ideas, and when it came to the discussion section you couldn't shut me up. But the methods / results section usually found me tearing my hair out, or besieging the department 'methods monkey' (there always was one).

For a few years I got by. But I started to worry. What if methods monkey left? Or, as unlikely as it seemed, got sick of me calling him methods monkey and failing to grasp the finer points of multiple regression? Perhaps most worryingly, what if I was stumbling blindly down a blind alley, blindfolded by my lack of methodological muscle? So I jumped ship for The Psychologist, where I got to mess about with mixed metaphors and alliteration to my heart's content, and my mind was spared the regular numbing by numbers.

But deep down I knew that wasn't right either. Surely there were other people out there just like me, consigned to correlations for all eternity, burying their head in the sand over grounded theory… in essence, just not knowing what they don't know. Perhaps The Psychologist could help them.

So I made a new 'Methods' section a key part of my planning for the 2008 redesign. The brief is simple: we are looking for cutting-edge thinking and practical assistance in methods in their widest sense: qualitative, quantitative, mixed, etc., etc. In fact, it's wider than that: it will hopefully include ethics, and absolutely anything else about the process of conducting research and practice. It's the 'how' of psychology.

The first two pieces deal broadly with quantitative methods, but don't panic: other pieces are in the pipeline. But not many, and that's where you come in. I need to keep up a regular flow of short (500- to 1000-word) pieces, truly reflecting the scope of the discipline and the subject. If you are interested, or know of anyone else who might be, please get in touch on [email protected].
Dr Jon Sutton, Editor