If a scientist flips a coin 10 times and it lands heads-up every time, intuition suggests that the coin is weighted. Indeed, the probability that a fair coin lands heads-up 10 times in a row is (0.5)^10, or about 0.1 percent. It’s a small, but nonetheless non-zero, number. How should the scientist responsibly report these results? Does it suffice to say that the coin is maybe, possibly, probably weighted according to intuition?
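That 0.1 percent figure is easy to check. A quick sketch in Python (the variable names are ours, chosen for illustration):

```python
# Each flip is independent, so the probability of 10 straight
# heads from a fair coin is 0.5 multiplied by itself 10 times.
p_heads = 0.5
n_flips = 10

prob_all_heads = p_heads ** n_flips
print(prob_all_heads)  # 0.0009765625, i.e. roughly 0.1 percent
```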
Thankfully, scientists need not rely on such slippery language when reporting their results. Instead, they can leverage the concept of “statistical significance” to attach a quantitative measure to the likelihood that experimental results are meaningful rather than the product of random chance. Significance reporting has been the standard for the last century, providing uniformity across expansive experimentation and data analysis. Recently, however, some scientists are rising up against the convention. To understand why, it helps to start at the beginning.
The power of the P-value
In his 1925 book “Statistical Methods for Research Workers,” statistician Sir Ronald Aylmer Fisher devised a method for testing the strength of evidence in favor of or against a scientific guess. A researcher must first break that scientific question into two hypotheses. The null hypothesis typically claims that the two test groups in the experiment are virtually the same, with any differences arising through random chance. Conversely, the alternative hypothesis affirms a meaningful distinction. To determine what constitutes a truly meaningful difference, Fisher concocted a metric called the “p-value,” and he suggested the following interpretation: If the p-value is calculated to be smaller than 0.05, the null hypothesis is rejected. As the p-value climbs above 0.05 toward its upper limit of one, the null hypothesis fails to be rejected, and the experimental results are condemned to statistical insignificance.
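Fisher’s decision rule can be sketched on the opening coin example. The helper below is our own illustration, not Fisher’s notation, and uses only the Python standard library to compute an exact one-sided binomial p-value:

```python
from math import comb

def binomial_p_value(heads, flips, p=0.5):
    """One-sided exact p-value: the probability of observing at least
    `heads` heads in `flips` flips if the null hypothesis (a fair
    coin) is true."""
    return sum(comb(flips, k) * p**k * (1 - p)**(flips - k)
               for k in range(heads, flips + 1))

ALPHA = 0.05  # Fisher's conventional significance level

p_val = binomial_p_value(10, 10)  # 10 heads out of 10 flips
if p_val < ALPHA:
    print(f"p = {p_val:.5f} < {ALPHA}: reject the null hypothesis")
else:
    print(f"p = {p_val:.5f} >= {ALPHA}: fail to reject the null hypothesis")
```

Ten heads out of ten gives a p-value of about 0.001, well below 0.05, so the fair-coin null is rejected; a less lopsided outcome like five heads out of ten yields a large p-value and fails to reject it.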
Fisher’s convention is still the golden rule of hypothesis testing nearly one hundred years later. The ubiquity of the 0.05 benchmark is somewhat perplexing given its arbitrary origins. Fisher himself addresses the foolishness of using p < 0.05 as a one-size-fits-all significance level in his 1956 book “Statistical Methods and Scientific Inference.” Discussing the need for discretion in determining significance, Fisher writes, “the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
Fisher was thorough in qualifying the p-value. The circumstances that make results meaningful vary with subject matter and time, and a static significance level fails to take that context into account. So while Fisher saw 0.05 as a reasonable baseline, he advised researchers to adjust their significance benchmark and to supplement conclusions with a healthy dose of human interpretation. That way, no single number is left to speak for itself.
The pushback against the p-value
Despite what Fisher may have intended, the p-value has emerged as the sole arbiter of scientific merit, a gatekeeper to publication in top journals. These same top journals, however, are recognizing growing discontent with the dictatorial reign of p < 0.05. In their Nature comment piece published in March of this year, Valentin Amrhein, Sander Greenland, Blake McShane, and over 800 other signatories make the case for retiring statistical significance. The scientists, hailing from a variety of disciplines, warn that “statistically significant” and “statistically non-significant” are too often seen as conflicting conclusions. Indeed, in some cases two studies could yield the same observed effect, yet different significance levels would lead them to opposite conclusions.
Amrhein’s letter goes on to discuss the interpretive issues that arise for non-significant results. In a study that combed through 791 articles spanning five journals, a little more than half (51 percent) wrongly interpreted a failure to find a relationship as evidence that no relationship exists. In reality, a non-significant result simply indicates that the null hypothesis fails to be rejected. To extrapolate any further meaning from such results is pure conjecture.
Redefining statistical significance
Amrhein and his colleagues argue that statistical significance should be abandoned altogether. They believe that any dichotomous system would necessarily be misused, as it makes it easy to forget that significance exists on a spectrum. In place of statistical significance, Amrhein et al. advocate the adoption of “compatibility intervals,” which is admittedly more of a rebranding than a revolution. The novelty of this alternative lies in its interpretation: All values within the interval are considered “reasonably compatible” with the point estimate, with compatibility a function of distance from that central number. This more inclusive system, they believe, would thwart unnecessary conflicts between experiments that seek to replicate results but arrive at different bounds of confidence.
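A minimal sketch of how such an interval might be computed and read, assuming a normal approximation around a sample mean; the function name and the data are illustrative, not drawn from the paper:

```python
from math import sqrt
from statistics import mean, stdev

def compatibility_interval(sample, z=1.96):
    """Approximate 95% interval around the point estimate (the sample
    mean). Under the compatibility reading, every value inside is
    'reasonably compatible' with the data, more so near the center."""
    m = mean(sample)
    standard_error = stdev(sample) / sqrt(len(sample))
    return m - z * standard_error, m + z * standard_error

measurements = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 5.4, 4.7]  # made-up data
low, high = compatibility_interval(measurements)
print(f"point estimate {mean(measurements):.2f}, "
      f"compatible with values in ({low:.2f}, {high:.2f})")
```

Nothing about the arithmetic changes relative to a confidence interval; the shift is in the reading, with no bright line separating “inside” from “outside.”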
Some, like John Ioannidis of Stanford University, do not see the merit in this proposed change. In his response to Amrhein et al.’s paper, Ioannidis postulates that compatibility intervals would be “potentially confusing—and biases could render the entire interval incompatible with truth.” Abandoning the p-value certainly makes drawing conclusions a more subjective than quantitative endeavor, and it is not yet known whether that freedom would be abused in practice.
Amrhein et al. are not the only ones proposing changes to the significance convention. Some actually aim to make significance more exclusive. In a 2017 paper by another large interdisciplinary group of scientists, Benjamin et al. make the case for lowering the significance threshold from 0.05 to 0.005. Results falling between 0.005 and 0.05 would instead be given the new categorization of “suggestive,” a soft way of expressing uncertainty in either direction. A more exclusive benchmark would demand more convincing evidence from all experiments, and while Benjamin et al. reason that this higher burden of proof would enhance both the credibility and reproducibility of conclusions, the same issue of context that Fisher tackled in 1956 arises. A threshold of 0.005 is as blind to time and place as 0.05, so credibility may not improve in all cases.
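The proposed three-way split can be written down directly. A sketch, where the label strings are our shorthand for the categories Benjamin et al. describe:

```python
def categorize(p_value):
    """Label a result under the Benjamin et al. (2017) proposal:
    below 0.005 is significant, between 0.005 and 0.05 is merely
    'suggestive,' and anything larger is non-significant."""
    if p_value < 0.005:
        return "statistically significant"
    if p_value < 0.05:
        return "suggestive"
    return "not statistically significant"

print(categorize(0.003))  # statistically significant
print(categorize(0.02))   # suggestive
print(categorize(0.2))    # not statistically significant
```

Note that a result deemed significant under the current convention, say p = 0.02, would be demoted to merely suggestive under the stricter threshold.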
Statistical significance has grown into an issue as complex as the research it seeks to categorize. Hypothesis testing is among the first principles taught in any statistics course. Scientists are taught, either outright or through practice, that they cannot escape the p-value. If small enough, the p-value grants funding and publication. Otherwise, it can be a death sentence. It is precisely the omnipresence of statistical significance that would make it difficult to upend. Nonetheless, we can expect scientists like Amrhein, Ioannidis, Benjamin, and their coauthors to continue advocating for improved alternatives. Such is a testament to the passion and ingenuity of the scientific community, a phenomenon whose significance requires no test.
Fisher, R.A. (1925) Statistical Methods for Research Workers, 13th ed., Edinburgh: Oliver and Boyd.
Fisher, R.A. (1956) Statistical Methods and Scientific Inference, Oxford, England: Hafner Publishing Co.
Amrhein, V., Greenland, S., McShane, B. (2019) ‘Scientists rise up against statistical significance.’ Nature, 20 Mar, available: https://www.nature.com/articles/d41586-019-00857-9
Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E., Berk, R., … Johnson, V. (2017) ‘Redefine statistical significance.’ PsyArXiv, 22 July, available: https://psyarxiv.com/mky9j/?_ga=2.83175444.36461547.1556912282-865891723.1556912282