Commentary: Null points—has interpretation of significance tests improved?

Type: Letter

Publication Date: 2003-10-01

Citations: 4

DOI: https://doi.org/10.1093/ije/dyg274

Abstract

Perhaps the most striking impression gained in reading Berkson’s piece,1 more than 60 years since its publication, is the author’s struggle with questions of interpretation that still plague those conducting and interpreting statistical analyses today. Berkson seems to make little progress with solutions to the problems he presents, so it is of interest to see how statisticians today might deal with them. A P-value (significance level) is used to assess evidence against a null hypothesis. If, as Berkson states, we do not ‘find experimentalists typically engaged in disproving things’, then why does the formulation of statistical questions in terms of null hypotheses and their falsification remain so pervasive? Of course, the idea of science as a process of falsification was articulated in detail by Popper, and remains an attractive explanation of why, for example, Newton’s laws of mechanics were accepted until Einstein showed that there were circumstances in which they did not hold. Nonetheless, Berkson argues forcefully that the usual discussion of evidence takes the form of positive statements (‘Someone has been murdered’) rather than negative ones (‘… evidence against the null hypothesis that no one is dead’).

Why, in medical and psychological statistics, do we remain so attached to the formulation of null hypotheses? In the context of randomized trials, it still seems reasonable to demand that those proposing that resources be spent on a particular treatment should, as a minimum, provide evidence against there being no treatment effect at all. Similarly, so many factors have been postulated over the years, in the pages of this and other epidemiology journals, to be associated with a multitude of disease outcomes that some quantification of the possible role of chance in explaining observed results is enduringly useful.
Further, in choosing a statistical model it is inevitable that we make decisions about the inclusion or otherwise of different covariates, different forms of these covariates (linear, non-linear, categorical), and interactions between them. Such a process is difficult to conduct without some recourse to null hypotheses stating that certain parameter values in a more complex model are zero, and hence that a simpler model is appropriate. It thus remains the case that there are genuine reasons for considering, in the reporting of our statistical analyses, the extent to which the data are compatible with particular null hypotheses. However, confusion still reigns over how this should be assessed, often manifesting itself in a muddled mixture of Fisherian significance testing and Neyman–Pearson hypothesis testing.2 As discussed in more detail in the commentary by Stone,3 Fisher emphasized that research workers interpret significance levels in the light of their wider knowledge of the subject.4 In contrast, Neyman and Pearson attempted to replace the subjective interpretation of P-values with an objective, decision-theoretic interpretation of results.5 However, both methods are misused: Fisher’s in that null hypotheses are mechanistically rejected if P < 0.05, and Neyman and Pearson’s in that results are interpreted without consideration of the Type II error rate that should be used in defining the critical region within which values of the test statistic lead to rejection of the null hypothesis. As shown by Oakes,6 interpretation of P-values depends both on the power of tests and on the proportion of null hypotheses that are truly false.7

How would modern medical statistics deal with the problems that Berkson raises? There has been a struggle since the 1970s, in the pages of general medical journals, against the misinterpretation of ‘non-significant’ differences (referred to by Berkson as ‘middle P’s’) as providing evidence in favour of the null hypothesis.8 We now understand that P-values alone cannot be used to interpret statistical analyses: we need to consider the magnitude of estimated associations, and to examine confidence intervals in conjunction with P-values to prevent ourselves from being misled.9 For example, well-reported analyses of the ‘experiences’ presented in Berkson’s Table 1 might note that the odds of success in judging the sex of a fetus were 1.5 (95% CI: 0.42, 5.32) in Experience 1, compared with 1.02 (95% CI: 0.90, 1.15) in Experience 2. Examination of these results would lead most people to agree with Berkson’s informally derived conclusions, though we might disagree with Berkson in noting that Experience 2 leaves open the possibility of the physician being able to discriminate modestly better (or worse) than by chance alone. A Bayesian statistician might present the posterior distribution for the odds of success, based on combining the data with her prior distribution—(s)he would
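The kind of confidence-interval reporting described above can be sketched with a standard Wald interval on the log-odds scale. The counts below are illustrative assumptions (Berkson’s original table is not reproduced here); 6 correct judgements against 4 incorrect happen to give odds of 1.5, matching the first figure quoted:

```python
import math

def odds_with_ci(successes, failures, z=1.96):
    """Odds of success with an approximate 95% Wald CI on the log-odds scale."""
    odds = successes / failures
    se = math.sqrt(1 / successes + 1 / failures)  # standard error of log(odds)
    lo = math.exp(math.log(odds) - z * se)
    hi = math.exp(math.log(odds) + z * se)
    return odds, lo, hi

# Illustrative counts (assumed, not taken from Berkson's paper):
odds, lo, hi = odds_with_ci(6, 4)
print(f"odds = {odds:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```

A wide interval such as this one makes the point in the text directly: the data are compatible both with no discriminative ability at all and with substantial ability, so the point estimate alone is uninformative.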

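Oakes’s point that the interpretation of P-values depends on both power and the proportion of truly false null hypotheses can be illustrated with a short calculation. The numbers below are assumed for illustration, not taken from the commentary:

```python
def ppv(power, alpha, prior):
    """Proportion of 'significant' results for which the null is truly false.

    power: probability of rejecting the null when it is false
    alpha: significance threshold (Type I error rate)
    prior: proportion of tested null hypotheses that are actually false
    """
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

# Assumed values: 80% power, alpha = 0.05, 1 in 10 tested nulls false.
# Even then, only about 64% of 'significant' findings are true positives.
print(round(ppv(0.8, 0.05, 0.1), 2))
```

Lowering the prior proportion of false nulls (as when many speculative exposures are screened) drives this proportion down further, which is why mechanical rejection at P < 0.05 misleads.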
Locations

  • International Journal of Epidemiology
  • PubMed

