At only seven years old, this book is still highly relevant in showcasing the inherent problems of psychological science, problems that are rife across other areas of science as well. To me, none of these issues were new, so the book caters more to the educated reader who has not yet read much about them.
Chambers starts by discussing the confirmation bias baked into how psychology journals publish articles, namely a strong preference for novel, positive findings over negative findings and replication attempts (whether successful or not). This biases the literature heavily towards false positives, which can arise in numerous ways: through fluke results from under-powered studies, through HARKing (hypothesizing after the results are known), and through the failure to publish negative findings (the file drawer effect, i.e., publication bias). These problems are now well known among practitioners, but the effort to combat them remains grossly insufficient. One distinction that Chambers helpfully makes is between methodological replication and conceptual replication: whereas the former attempts to replicate a prior experiment using methods as close to the original as possible, the latter attempts to replicate the findings using different methods, thereby assuming the original findings are real discoveries.
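To make the power point concrete, here is a minimal sketch of my own (not from the book) of why a literature built from underpowered studies plus selective publication of positive results ends up stuffed with false positives; the power, alpha, and prior-probability values are illustrative assumptions, not figures Chambers reports.

```python
# Illustrative sketch: what fraction of "significant" findings are true,
# given study power, the alpha threshold, and the prior chance a tested hypothesis is true.
def ppv(power, alpha, prior_true):
    """Positive predictive value of a statistically significant result."""
    true_pos = power * prior_true          # real effects correctly detected
    false_pos = alpha * (1 - prior_true)   # null effects slipping under p < alpha by chance
    return true_pos / (true_pos + false_pos)

# Assumed numbers: 1 in 10 tested hypotheses is true, alpha = 0.05.
for power in (0.80, 0.35):  # a well-powered study vs. a typically underpowered one
    print(f"power = {power:.2f}: PPV = {ppv(power, alpha=0.05, prior_true=0.10):.2f}")
```

With these assumed inputs, dropping power from 0.80 to 0.35 drops the share of true findings among significant results from about 64% to about 44%, and that is before any p-hacking enters the picture.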
Chapter 2 is all about p-hacking: the use of unsound research practices to inflate the chances of obtaining a statistically significant finding. Chambers runs down the myriad ways in which p-hacking can occur, often unintentionally, through exploiting the many “researcher degrees of freedom” available when running a study. The phenomenon is visible in meta-science papers that show clusters of p-values just below the conventional alpha cut-off of 0.05 (the statistical significance threshold), a clustering that apparently grew between 1965 and 2005 (Leggett et al., 2013). Chambers concludes the chapter by briefly touching on methods to combat these many questionable research practices: study pre-registration to combat publication bias, p-curves to combat p-hacking, disclosure statements to limit researcher degrees of freedom, data sharing to improve transparency, and others. He returns to these solutions in much greater depth in the final chapter.
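Since p-hacking is the part of the book closest to my day job, here is a small simulation sketch of my own (not from the book) of how one common researcher degree of freedom, measuring several outcomes and reporting whichever one 'works', inflates the false-positive rate well past the nominal 5%; the sample size and number of outcomes are arbitrary choices.

```python
# Illustrative sketch: p-hacking by cherry-picking among multiple outcome measures.
# No true effect exists anywhere, yet "significant" results appear far more than 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_study(n=20, n_outcomes=5):
    # Two groups with no real difference, measured on several independent outcomes.
    pvals = []
    for _ in range(n_outcomes):
        a, b = rng.normal(size=n), rng.normal(size=n)
        pvals.append(stats.ttest_ind(a, b).pvalue)
    return min(pvals)  # report only the best-looking outcome

best_pvals = np.array([one_study() for _ in range(2000)])
print("false-positive rate:", (best_pvals < 0.05).mean())  # roughly 0.2, not 0.05
```

Independent outcomes are the simplest case; correlated outcomes inflate the rate less dramatically, but the direction of the problem is the same.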
Chapter 3 is about unreliability, and it reads as an extension of the previous two chapters. Chambers demonstrates the hostility that replication efforts face in much of the culture of academic psychology: when three different attempts with larger samples failed to replicate John Bargh’s research on priming, Bargh doubled down, accusing the scientists of being incompetent or ill-informed. The sheer scarcity of replications, the hostility towards them, and the lack of incentives to attempt them do not bode well for the reliability of psychological findings. Chambers then goes deeper into how underpowered studies, inadequate disclosure of methods, and statistical misunderstandings likewise undermine reliability. As an example, he cites a study (Nieuwenhuis et al., 2011) finding that, in top neuroscience journals, the prevalence of a particular statistical error (assuming that because one effect is statistically significant and another is not, the two effects must themselves differ significantly) was 50%!
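Because that 50% figure can sound abstract, here is a worked numerical sketch of my own (the effect sizes and standard errors are made up) of the fallacy Nieuwenhuis et al. describe: one effect clears p < 0.05, the other does not, yet the difference between the two effects is nowhere near significant.

```python
# Illustrative sketch: "significant" vs. "non-significant" does not mean "significantly different".
from scipy import stats

def two_sided_p(z):
    return 2 * stats.norm.sf(abs(z))

# Made-up effect estimates (e.g., mean differences) with their standard errors.
eff_a, se_a = 2.0, 1.0   # z = 2.0 -> p ~ 0.05, "significant"
eff_b, se_b = 1.0, 1.0   # z = 1.0 -> p ~ 0.32, "not significant"

print("effect A:", round(two_sided_p(eff_a / se_a), 3))
print("effect B:", round(two_sided_p(eff_b / se_b), 3))

# The correct move is to test the difference between the two effects directly.
diff = eff_a - eff_b
se_diff = (se_a**2 + se_b**2) ** 0.5
print("A vs B:", round(two_sided_p(diff / se_diff), 3))  # ~0.48: no evidence the effects differ
```

Concluding from the first two lines that condition A “works” while condition B does not is exactly the error Nieuwenhuis and colleagues found in roughly half of the relevant papers.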
Chapter 4 is about data sharing—or rather the lack of it. Though practices have improved, cultural and institutional norms are still clearly not ideal. Most journals still do not make data and code sharing a condition of publication, though some do. It is easy to see how withholding such information harms replicability and reliability, yet many researchers still shirk this ethical responsibility. Towards the end, Chambers produces a list of tongue-in-cheek responses that surely capture many researchers’ actual attitudes, e.g., “I’m not sharing my data because I can’t be bothered to organize the files in a way that makes sense to anyone else—in fact, they probably won’t even make sense to me when I look back at them in six months’ time.” (p. 102)
Chapter 5 dives deep into outright fraud: what distinguishes fraud from lesser scientific misconduct, and misconduct from merely poor research practices. We’re told the now-infamous story of Dutch social psychologist Diederik Stapel, who had over 50 papers retracted after he was discovered to have routinely falsified data, something he has since admitted and apologized for. Chambers suggests that the true rate of fraud and misconduct surely exceeds what has been discovered. This discussion feeds into chapter 6’s treatment of open-access journals: journals that charge authors rather than readers, making all content freely available to anyone. While this model solves some problems, it has contributed to others, such as the litany of predatory open-access journals.
Chapter 7 looks at using scientific principles to assess science itself (meta-science). Chambers examines the metrics used to quantify quality, chief among them the journal impact factor (JIF). As he demonstrates, the JIF has become a distorted signal of quality, invoking Goodhart’s Law that “when a measure becomes a target, it ceases to be a good measure.” As supporting evidence, Chambers notes that the JIF is the average citation count across all articles in a journal, yet the distribution of citations is highly skewed, with most articles receiving few or no citations and a small percentage receiving a great many. Journals can also apparently lobby to inflate their JIF by having certain articles disqualified from counting in the denominator. Chambers further cites evidence (Brembs et al., 2013) that JIF is unrelated to statistical power, an indicator of study quality (and arguably one that ought to be a primary indicator). He concludes that “Despite all the evidence that JIF is more-or-less worthless, the psychological community has become ensnared in a groupthink that lends it value.” This is probably better framed as a collective action problem (which Chambers does allude to later): since everyone regards everyone else as treating JIF as an important marker of reputability, it would be costly for any individual scientist to ignore it by publishing in journals with lower (or no) JIFs.
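To illustrate the skewness point, here is a quick sketch of my own with simulated citation counts (the distribution and its parameters are assumptions chosen only to look heavy-tailed, not data from Chambers): the mean, which is what a JIF-style average reflects, describes almost no individual article.

```python
# Illustrative sketch: a journal-level mean citation count is a poor summary
# when the underlying citation distribution is heavily skewed.
import numpy as np

rng = np.random.default_rng(1)
# Heavy-tailed fake citation counts: most articles barely cited, a few cited enormously.
citations = rng.negative_binomial(n=0.3, p=0.05, size=500)

print("mean (the JIF-style average):", round(citations.mean(), 1))
print("median article:", int(np.median(citations)))
print("share of articles below the mean:", round((citations < citations.mean()).mean(), 2))
```

In a run like this the mean sits far above the median and most articles fall below it, which is precisely why an individual article inherits so little information from its journal’s impact factor.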
Chambers goes on to rebuke the use of grant awards and authorship rank as meaningful indicators of scientific quality as well. The overall picture is clear: psychology, and no doubt many other fields of science, relies on misleading indicators to propel scientific publication, incentivizing the wrong kinds of behaviours and disincentivizing careful, high-quality output.
The final chapter turns to solutions. Preregistering study protocols—via Registered Reports—can be expected (and has been shown) to reduce publication bias for hypothesis-driven science, yet it has often been met with avoidance and even active pushback. If null results populate some journals more than others, won’t those journals be cited less and thus be less competitive in the journal market? One difficulty in transforming publishing norms is that it requires buy-in from most, or at least a large chunk, of journals at once, so that none expects to suffer a first-mover disadvantage. It is hard to argue honestly against this solution, and Chambers rebuts many of the common objections to pre-registration.
Overall, this was a thorough, comprehensive, and meticulous look at the many problems of scientific publishing in psychology, written with the ultimate aim of improving the field’s rigor and reliability. As a statistician, I was very satisfied with Chambers’ treatment of the many statistical errors that plague the field and his lucid explanations of how to improve statistical practice. Very much recommended.