Commentary: Null points—has interpretation of significance tests improved?

Type: Letter

Publication Date: 2003-10-01

Citations: 4

DOI: https://doi.org/10.1093/ije/dyg274

Abstract

Perhaps the most striking impression gained in reading Berkson’s piece,1 more than 60 years since its publication, is the author’s struggle with questions of interpretation that still plague those conducting and interpreting statistical analyses today. Berkson seems to make little progress with solutions to the problems he presents, so it is of interest to see how statisticians today might deal with them. A P-value (significance level) is used to assess evidence against a null hypothesis. If, as Berkson states, we do not ‘find experimentalists typically engaged in disproving things’, then why does the formulation of statistical questions in terms of null hypotheses and their falsification remain so pervasive? Of course, the idea of science as a process of falsification was articulated in detail by Popper, and remains an attractive explanation of why, for example, Newton’s laws of mechanics were accepted until Einstein showed that there were circumstances in which they did not hold. Nonetheless, Berkson argues forcefully that the usual discussion of evidence takes the form of positive statements (‘Someone has been murdered’) rather than negative ones (‘… evidence against the null hypothesis that no one is dead’).

Why, in medical and psychological statistics, do we remain so attached to the formulation of null hypotheses? In the context of randomized trials, it still seems reasonable to demand that those proposing that resources be spent on a particular treatment should, as a minimum, provide evidence against there being no treatment effect at all. Similarly, so many factors have been postulated over the years, in the pages of this and other epidemiology journals, to be associated with a multitude of disease outcomes that some quantification of the possible role of chance in explaining observed results is enduringly useful.
Further, in choosing a statistical model it is inevitable that we make decisions about the inclusion or otherwise of different covariates, different forms of these covariates (linear, non-linear, categorical), and interactions between them. Such a process is difficult to conduct without some recourse to null hypotheses stating that certain parameter values in a more complex model are zero, and hence that a simpler model is appropriate. It thus remains the case that there are genuine reasons for considering, in the reporting of our statistical analyses, the extent to which the data are compatible with particular null hypotheses. However, confusion still reigns over how this should be assessed, often manifesting itself in a muddled mixture of Fisherian significance testing and Neyman–Pearson hypothesis testing.2 As discussed in more detail in the commentary by Stone,3 Fisher emphasized that research workers interpret significance levels in the light of their wider knowledge of the subject.4 In contrast, Neyman and Pearson attempted to replace the subjective interpretation of P-values with an objective, decision-theoretic interpretation of results.5 However, both methods are misused: Fisher’s in that null hypotheses are mechanistically rejected if P < 0.05, and Neyman and Pearson’s in that results are interpreted without consideration of the Type II error rate that should be used in defining the critical region within which values of the test statistic lead to rejection of the null hypothesis. As shown by Oakes,6 interpretation of P-values depends both on the power of tests and on the proportion of null hypotheses that are truly false.7

How would modern medical statistics deal with the problems that Berkson raises? There has been a struggle since the 1970s, in the pages of general medical journals, against the misinterpretation of ‘non-significant’ differences (referred to by Berkson as ‘middle P’s’) as providing evidence in favour of the null hypothesis.8 We now understand that P-values alone cannot be used to interpret statistical analyses: we need to consider the magnitude of estimated associations, and to examine confidence intervals in conjunction with P-values to prevent ourselves from being misled.9 For example, well-reported analyses of the ‘experiences’ presented in Berkson’s Table 1 might note that the odds of success in judging the sex of a fetus were 1.5 (95% CI: 0.42, 5.32) in Experience 1, compared with 1.02 (95% CI: 0.90, 1.15) in Experience 2. Examination of these results would lead most people to agree with Berkson’s informally derived conclusions, though we might disagree with Berkson in noting that Experience 2 leaves open the possibility of the physician being able to discriminate modestly better (or worse) than by chance alone. A Bayesian statistician might present the posterior distribution for the odds of success, based on combining the data with her prior distribution—(s)he would
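The kind of confidence-interval reporting described above can be sketched with a standard Wald interval on the log-odds scale. The counts below are illustrative assumptions (Berkson’s original table is not reproduced here); 6 correct judgements against 4 incorrect happen to give odds of 1.5, matching the first figure quoted:

```python
import math

def odds_with_ci(successes, failures, z=1.96):
    """Odds of success with an approximate 95% Wald CI on the log-odds scale."""
    odds = successes / failures
    se = math.sqrt(1 / successes + 1 / failures)  # standard error of log(odds)
    lo = math.exp(math.log(odds) - z * se)
    hi = math.exp(math.log(odds) + z * se)
    return odds, lo, hi

# Illustrative counts (assumed, not taken from Berkson's paper):
odds, lo, hi = odds_with_ci(6, 4)
print(f"odds = {odds:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```

A wide interval such as this one makes the point in the text directly: the data are compatible both with no discriminative ability at all and with substantial ability, so the point estimate alone is uninformative.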

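Oakes’s point that the interpretation of P-values depends on both power and the proportion of truly false null hypotheses can be illustrated with a short calculation. The numbers below are assumed for illustration, not taken from the commentary:

```python
def ppv(power, alpha, prior):
    """Proportion of 'significant' results for which the null is truly false.

    power: probability of rejecting the null when it is false
    alpha: significance threshold (Type I error rate)
    prior: proportion of tested null hypotheses that are actually false
    """
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

# Assumed values: 80% power, alpha = 0.05, 1 in 10 tested nulls false.
# Even then, only about 64% of 'significant' findings are true positives.
print(round(ppv(0.8, 0.05, 0.1), 2))
```

Lowering the prior proportion of false nulls (as when many speculative exposures are screened) drives this proportion down further, which is why mechanical rejection at P < 0.05 misleads.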
Locations

  • International Journal of Epidemiology
  • PubMed

