Will Lowering the P-Value Improve Quality of Research Standards?

Should the P-value threshold for statistical significance be lowered?

In a controversial article posted July 22, 2017, on the preprint server PsyArXiv, a group of 72 well-established researchers proposed to improve “statistical standards of evidence” by lowering the P-value threshold for statistical significance from P <.05 to P <.005 in the biomedical and social sciences.1 The group, drawn from the same number of institutions across the United States, Europe, Canada, and Australia and from departments as diverse as psychology, statistics, social sciences, and economics, was led by Daniel J. Benjamin, PhD, from the Center for Economic and Social Research and Department of Economics, University of Southern California, Los Angeles. The article was published in September 2017 as a comment in Nature Human Behaviour.2

Statistical significance set at P <.05 results in high rates of false-positives, note the authors, “even in the absence of other experimental, procedural and reporting problems,” and may underlie commonly encountered issues of lack of reproducibility.
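The arithmetic behind this claim can be illustrated with a minimal sketch. It is not the authors' own calculation: the function name and the inputs (1 in 11 tested hypotheses being true, 80% power held constant at both thresholds) are hypothetical values chosen only to show how the significance threshold drives the share of significant results that are false positives.

```python
def false_positive_risk(alpha, power, prior_true):
    """Share of 'significant' findings that are false positives.

    alpha      -- significance threshold (chance a true null is rejected)
    power      -- chance a real effect reaches significance
    prior_true -- assumed fraction of tested hypotheses that are real effects
    """
    false_pos = alpha * (1.0 - prior_true)  # true nulls that reach significance
    true_pos = power * prior_true           # real effects that reach significance
    return false_pos / (false_pos + true_pos)


# Hypothetical inputs, not taken from the article:
PRIOR_TRUE = 1 / 11  # 1 real effect for every 10 true nulls
POWER = 0.80         # assumed identical at both thresholds

risk_05 = false_positive_risk(0.05, POWER, PRIOR_TRUE)
risk_005 = false_positive_risk(0.005, POWER, PRIOR_TRUE)

print(f"alpha = .05  -> {risk_05:.1%} of significant results are false positives")
print(f"alpha = .005 -> {risk_005:.1%} of significant results are false positives")
```

Under these assumed inputs, roughly 38% of significant results are false positives at P <.05, compared with roughly 6% at P <.005. In practice, lowering the threshold without increasing sample size also reduces power, so the real-world gain would likely be smaller than this sketch suggests.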

In an open science collaboration published in Science in August 2015, 270 psychologists seeking to assess reproducibility in their field attempted to replicate 100 studies published in 2008 in 3 high-impact psychology journals.3

“Reproducibility is a defining feature of science,” remarked the investigators in the Science article’s introduction. Reproducibility was assessed using 5 parameters: “significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes.”

Surprisingly, the researchers found that “replication effects were half the magnitude of original effects, representing a substantial decline.” Replications led to significant results in just 36% of studies, and “47% of original effect sizes were in the 95% confidence interval of the replication effect size.” They conclude that “variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research.”

However, in a comment on this large-scale replication study published several months later, also in Science, Daniel T. Gilbert, PhD, professor of psychology at Harvard University, Cambridge, Massachusetts, and colleagues argued that the article “contains 3 statistical errors, and provides no support for [the low rate of reproducibility in psychology studies].”4 Because results from the replication study were not corrected for error, power, or bias, the comment’s authors contend, “the data are consistent with the opposite conclusion, namely, that the reproducibility of psychological science is quite high.”

Similar issues were encountered by the biotechnology companies Amgen and Bayer Healthcare, which were able to replicate only 11% of 53 “landmark” preclinical studies and 25% of 67 studies (70% in oncology), respectively.5,6 One of the reasons cited in the Bayer study for this lack of reproducibility is “incorrect or inappropriate statistical analysis of results or insufficient sample sizes, which result in potentially high numbers of irreproducible or even false results.”6

Although several measures (including increasing statistical power and addressing multiple testing and P-hacking) have been proposed to tackle the root cause of the lack of reproducibility, whether that lack is justly perceived or not, the authors of the “Redefine statistical significance” article believe that none of these measures, alone or in combination, would adequately address the issue. Under the lower threshold, results with a P-value between .005 and .05 would be classified as “suggestive” rather than significant. The authors add that this is not a novel concept but, rather, one now endorsed by a “critical mass of researchers.”7,8

This new standard for statistical significance is meant for studies uncovering new effects rather than replication studies, and for studies whose statistical analyses test a null hypothesis. Although other options may also be employed to improve reproducibility, lowering the P-value threshold is, according to the authors, a simple measure that would not require additional training of the research community and might therefore gather broad consensus.

This article originally appeared on Clinical Pain Advisor