Journal Club · Methods & Evidence
P-values: A Chronic Conundrum
Why “statistically significant” is not the same thing as true, important, or clinically useful.
A short annotated reading note for trainees on Gao’s open-access paper about p-value interpretation, calibrated p-values, and why journal club should spend less time worshipping p < 0.05 and more time asking what the result actually means.
Why This Paper
P-values are everywhere in clinical research, but they are often treated as if they answer a question they were never designed to answer. A small p-value does not tell you the probability that the null hypothesis is true, the probability that the result is due to chance, the probability that a treatment works, or the size of the effect.
That is why this paper is a useful first Journal Club entry. It gives trainees a concrete language for the moment that happens in almost every discussion: someone points to p = 0.04, the room relaxes, and the hard questions stop too early.
A p-value is a measure of how incompatible the observed data are with a specified null model; it is not the probability that the treatment works, the result is real, or the null hypothesis is false.
What The Paper Fixes
1. The Misinterpretation
Gao’s core target is the common clinical translation error: treating a p-value as if it were a direct probability of truth. A result with p = 0.05 does not mean there is only a 5% chance the treatment does not work. It means that, if the null hypothesis and model assumptions were true, data at least this extreme would occur 5% of the time.
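That frequency statement can be checked directly. The short simulation below (my own illustration, not from Gao's paper) draws two groups from the same distribution, so the null hypothesis is true by construction, and counts how often a "significant" result still appears:

```python
import math
import random

random.seed(42)

def two_sided_z_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Two groups drawn from the SAME distribution: the null is true by design.
n, trials = 50, 20_000
hits = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = math.sqrt(2 / n)          # known sigma = 1, so a z-test is exact here
    hits += two_sided_z_p(diff / se) < 0.05

print(hits / trials)               # close to 0.05, as the definition promises
```

The point is not the code but the habit of mind: p < 0.05 happens about one time in twenty even when nothing is going on, which is exactly what the definition says and nothing more.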
2. Fisher vs. Neyman-Pearson
The paper’s most useful teaching move is separating two traditions that are often blended together in medical training. Fisher’s significance testing uses the p-value as a graded measure of evidence against the null. Neyman-Pearson hypothesis testing is a decision framework built around pre-specified long-run error rates: alpha (the type I error rate) and beta (the type II error rate).
The confusion starts when we borrow the p-value from one framework and talk about it as if it were the type I error rate from the other. Alpha is chosen before the study. The p-value is calculated after the data are observed. They are related, but they are not the same object.
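The timing difference is easy to make concrete. In the sketch below (a hypothetical two-sample z-test design of my own construction, not an example from the paper), every Neyman-Pearson quantity is a property of the design, computable before recruitment begins, while the p-value cannot exist until the data do:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

# --- Neyman-Pearson: fixed BEFORE any data exist ---
alpha = 0.05                      # pre-specified long-run type I error rate
z_crit = 1.959964                 # two-sided critical value for alpha = 0.05
effect, n, sigma = 0.5, 64, 1.0   # hypothetical design assumptions
se = sigma * math.sqrt(2 / n)
power = norm_cdf(effect / se - z_crit)   # 1 - beta: a property of the DESIGN

# --- Fisher: the p-value only exists AFTER the data are observed ---
observed_diff = 0.38              # hypothetical observed mean difference
p = math.erfc(abs(observed_diff / se) / math.sqrt(2))

print(f"power = {power:.2f}, p = {p:.3f}")  # p is data-dependent; alpha is not
```

Alpha and power sit on one side of data collection; the p-value sits on the other. Treating the observed p as if it were alpha conflates a post-hoc evidence summary with a pre-specified error guarantee.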
What p-values can do
Flag how surprising the observed data would be under a specified null model, assuming the model and analysis are appropriate.
What p-values cannot do
Measure treatment importance, effect size, causal validity, study quality, reproducibility, or the probability that the result is true.
3. The Calibrated P-value Idea
Gao highlights calibrated p-values as a way to make the gap between “p-value” and “probability the null is true” harder to ignore. Using the Sellke-Bayarri-Berger lower-bound calibration, a conventional p-value maps to a much larger minimum error probability than many clinicians intuitively assume.
| Reported p-value | Lower-bound calibrated probability | Teaching point |
|---|---|---|
| 0.05 | 28.9% | Not “5% chance the treatment does not work.” Much weaker than many readers assume. |
| 0.01 | 11.1% | Stronger evidence, but still not certainty. |
| 0.005 | 6.7% | A stricter threshold reduces overconfidence, but does not replace clinical judgment. |
| 0.001 | 1.8% | Very strong incompatibility with the null model, assuming the study design and analysis are sound. |
This calibration is not a magic replacement for critical appraisal. It is better understood as a teaching tool: it forces the reader to stop treating 0.05 as if it were a clean diagnostic test for truth.
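The table's numbers come from a simple closed form: the Sellke-Bayarri-Berger bound says the Bayes factor in favor of the null is at least −e·p·ln(p) for p < 1/e, which, under equal prior odds, translates into a lower bound on the posterior probability that the null is true. A minimal sketch (function name is mine):

```python
import math

def sbb_lower_bound(p):
    """Sellke-Bayarri-Berger lower bound on P(H0 | data), assuming
    50:50 prior odds between null and alternative; valid for p < 1/e."""
    bf = -math.e * p * math.log(p)   # lower bound on the Bayes factor for H0
    return bf / (1 + bf)

for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p:<5} -> null still at least {sbb_lower_bound(p):.1%} probable")
```

Running this reproduces the table above: p = 0.05 maps to roughly 28.9%, which is the number worth saying out loud in journal club.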
How To Use It In Journal Club
4. The Better Reading Sequence
The best use of this paper is not to make trainees anti-p-value. It is to make them p-value literate. P-values are useful as one signal among many, particularly as a guard against mistaking chance variation for a real effect, but they should sit inside a larger appraisal framework.
Questions To Ask Before Trusting The P-value
- Was this the primary endpoint or one of many outcomes and subgroup comparisons?
- Was the analysis pre-specified, or did the paper wander until it found significance?
- What is the effect size, and is it clinically meaningful?
- What does the confidence interval still allow to be true?
- Are the model assumptions plausible for this design and dataset?
- Do harms, cost, feasibility, and patient preference change the meaning of the result?
- If the p-value were 0.051 instead of 0.049, would the clinical interpretation honestly change?
5. The Clinical Reading Pearl
Medicine is not just a truth-detection problem. It is also a consequence problem. A borderline p-value means something different when the intervention is cheap and harmless than when it is expensive, toxic, irreversible, or life-changing. Gao’s examples make this point clearly: the same evidentiary uncertainty should be weighed differently depending on the downside of being wrong.
6. What To Be Careful About
The calibrated p-value argument is helpful, but it is not universally adopted as a reporting standard. It should not become a new ritual replacing the old ritual. The bigger lesson is the habit: do not dichotomize evidence, do not confuse statistical significance with clinical importance, and do not stop reading at the abstract’s p-value.
Bottom Line
- Use p-values as graded evidence against a specified null model, not as probabilities of truth.
- Separate Fisher-style evidence from Neyman-Pearson long-run decision rules.
- Remember that p = 0.05 is much less reassuring than many clinicians think.
- Never interpret a p-value without the effect size, confidence interval, endpoint hierarchy, and clinical consequence.
- The most important journal club move is often asking, “Would this result change what we do for a patient?”
Selected References
- Gao J. P-values: a chronic conundrum. BMC Medical Research Methodology. 2020;20:167. PMCID: PMC7315482. The article discussed on this page.
- Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70(2):129-133. doi:10.1080/00031305.2016.1154108. The concise consensus statement every trainee should know.
- Goodman S. A dirty dozen: twelve p-value misconceptions. Seminars in Hematology. 2008;45(3):135-140. PMID: 18582619. A useful companion piece for common interpretation errors.