Type: Letter
Publication Date: 2003-10-01
Citations: 4
DOI: https://doi.org/10.1093/ije/dyg274
Perhaps the most striking impression gained in reading Berkson's piece,1 more than 60 years after its publication, is the author's struggle with questions of interpretation that still plague those conducting and interpreting statistical analyses today. Berkson seems to make little progress with solutions to the problems he presents, so it is of interest to see how statisticians today might deal with them.

A P-value (significance level) is used to assess evidence against a null hypothesis. If, as Berkson states, we do not 'find experimentalists typically engaged in disproving things', then why does the formulation of statistical questions in terms of null hypotheses and their falsification remain so pervasive? Of course, the idea of science as a process of falsification was articulated in detail by Popper, and remains an attractive explanation of why, for example, Newton's laws of mechanics were accepted until Einstein showed that there were circumstances in which they did not hold. Nonetheless, Berkson argues forcefully that the usual discussion of evidence takes the form of positive statements ('Someone has been murdered') rather than negative ones ('… evidence against the null hypothesis that no one is dead').

Why, then, in medical and psychological statistics, do we remain so attached to the formulation of null hypotheses? In the context of randomized trials, it still seems reasonable to demand that those proposing that resources be spent on a particular treatment should, as a minimum, provide evidence against there being no treatment effect at all. Similarly, so many factors have been postulated over the years, in the pages of this and other epidemiology journals, to be associated with a multitude of disease outcomes that some quantification of the possible role of chance in explaining observed results is enduringly useful. Further, in choosing a statistical model it is inevitable that we make decisions about the inclusion or otherwise of different covariates, about the form these covariates take (linear, non-linear, categorical), and about interactions between them. Such a process is difficult to conduct without some recourse to null hypotheses stating that certain parameter values in a more complex model are zero, and hence that a simpler model is adequate. It thus remains the case that there are genuine reasons to consider, in the reporting of our statistical analyses, the extent to which the data are compatible with particular null hypotheses.

However, confusion still reigns over how this should be assessed, often manifesting itself in a muddled mixture of Fisherian significance testing and Neyman-Pearson hypothesis testing.2 As discussed in more detail in the commentary by Stone,3 Fisher emphasized that research workers interpret significance levels in the light of their wider knowledge of the subject.4 In contrast, Neyman and Pearson attempted to replace the subjective interpretation of P-values with an objective, decision-theoretic interpretation of results.5 In practice, both methods are misused: Fisher's, in that null hypotheses are mechanistically rejected if P < 0.05; and Neyman and Pearson's, in that results are interpreted without consideration of the Type II error rate that should be used in defining the critical region within which values of the test statistic lead to rejection of the null hypothesis. As shown by Oakes,6 the interpretation of P-values depends both on the power of tests and on the proportion of null hypotheses that are truly false.7
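To make Oakes' point concrete, the short sketch below (an illustration added here, not part of the original letter) applies Bayes' theorem to compute the probability that a null hypothesis is truly false given a 'significant' result; the values chosen for the significance level, the power, and the proportion of false nulls are illustrative assumptions only.

```python
# Sketch: how the meaning of a "significant" result depends on test power
# and on the proportion of tested null hypotheses that are truly false.
# All numerical values are illustrative assumptions.

def prob_null_false_given_significant(prop_false, power, alpha=0.05):
    """P(null is false | P < alpha), by Bayes' theorem.

    prop_false: proportion of tested null hypotheses that are truly false
    power:      probability of P < alpha when the null is false
    alpha:      probability of P < alpha when the null is true
    """
    true_positives = prop_false * power
    false_positives = (1 - prop_false) * alpha
    return true_positives / (true_positives + false_positives)

if __name__ == "__main__":
    for prop_false in (0.1, 0.5, 0.9):
        for power in (0.2, 0.5, 0.8):
            ppv = prob_null_false_given_significant(prop_false, power)
            print(f"proportion of false nulls={prop_false:.1f}  "
                  f"power={power:.1f}  P(null false | significant)={ppv:.2f}")
```

When power is low and few tested nulls are truly false, most 'significant' results are false positives, which is precisely why a P-value cannot be interpreted in isolation.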
How would modern medical statistics deal with the problems that Berkson raises? There has been a struggle since the 1970s, in the pages of general medical journals, against the misinterpretation of 'non-significant' differences (referred to by Berkson as 'middle P's') as providing evidence in favour of the null hypothesis.8 We now understand that P-values alone cannot be used to interpret statistical analyses: we need to consider the magnitude of estimated associations, and to examine confidence intervals in conjunction with P-values, to prevent ourselves from being misled.9

For example, well-reported analyses of the 'experiences' presented in Berkson's Table 1 might note that the odds of success in judging the sex of a fetus were 1.5 (95% CI: 0.42, 5.32) in Experience 1, compared with 1.02 (95% CI: 0.90, 1.15) in Experience 2. Examination of these results would lead most people to agree with Berkson's informally derived conclusions, though we might disagree with Berkson in noting that Experience 2 leaves open the possibility of the physician being able to discriminate modestly better (or worse) than by chance alone. A Bayesian statistician might present the posterior distribution for the odds of success, based on combining the data with her prior distribution—she would
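As a concrete companion to the interval estimates quoted above, the sketch below (added here, not part of the original letter) computes odds of success with Wald confidence intervals on the log-odds scale, and ends with the conjugate Beta posterior a Bayesian might summarize. The counts are back-calculated so as to reproduce the quoted odds and intervals; they, and the flat prior, are assumptions rather than Berkson's actual Table 1 figures.

```python
# Sketch: Wald confidence intervals on the log-odds scale, applied to
# counts back-calculated to match the quoted odds and intervals. The
# counts and the Beta prior are assumptions, not Berkson's Table 1 data.
import math

def odds_with_wald_ci(successes, failures, z=1.96):
    """Odds of success with an approximate 95% CI on the log-odds scale."""
    odds = successes / failures
    se = math.sqrt(1 / successes + 1 / failures)  # SE of the log odds
    lo = math.exp(math.log(odds) - z * se)
    hi = math.exp(math.log(odds) + z * se)
    return odds, lo, hi

def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Conjugate Beta posterior for the probability of a correct judgement."""
    a, b = prior_a + successes, prior_b + failures
    return a, b, a / (a + b)  # posterior parameters and posterior mean

# Assumed counts: 6 correct vs 4 incorrect reproduces odds 1.5 (0.42, 5.32);
# 516 vs 506 reproduces odds 1.02 (0.90, 1.15).
for label, (s, f) in [("Experience 1", (6, 4)), ("Experience 2", (516, 506))]:
    odds, lo, hi = odds_with_wald_ci(s, f)
    a, b, mean_p = beta_posterior(s, f)  # flat Beta(1, 1) prior
    print(f"{label}: odds {odds:.2f} (95% CI: {lo:.2f}, {hi:.2f}); "
          f"posterior Beta({a:.0f}, {b:.0f}), mean p = {mean_p:.2f}")
```

The wide interval for Experience 1 and the narrow interval straddling 1 for Experience 2 make Berkson's informal contrast between the two 'experiences' immediately visible, in a way a P-value alone would not.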