Journal Club · Methods & Evidence
P-values: A Chronic Conundrum
Why “statistically significant” is not the same thing as true, important, or clinically useful.
A short annotated reading note for trainees on Gao’s open-access paper about p-value interpretation, calibrated p-values, and why journal club should spend less time worshipping p < 0.05 and more time asking what the result actually means.
Why This Paper
P-values are everywhere in clinical research, but they are often treated as if they answer a question they were never designed to answer. A small p-value does not tell you the probability that the null hypothesis is true, the probability that the result is due to chance, the probability that a treatment works, or the size of the effect.
That is why this paper is a useful first Journal Club entry. It gives trainees a concrete language for the moment that happens in almost every discussion: someone points to p = 0.04, the room relaxes, and the hard questions stop too early.
A p-value is a measure of how incompatible the observed data are with a specified null model; it is not the probability that the treatment works, the result is real, or the null hypothesis is false.
What The Paper Fixes
1. The Misinterpretation
Gao’s core target is the common clinical translation error: treating a p-value as if it were a direct probability of truth. A result with p = 0.05 does not mean there is only a 5% chance the treatment does not work. It means that, if the null hypothesis and model assumptions were true, data at least this extreme would occur 5% of the time.
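That frequency statement can be checked directly. The short simulation below (my own illustration, not from Gao's paper) draws two groups from the same distribution, so the null hypothesis is true by construction, and counts how often a "significant" result still appears:

```python
import math
import random

random.seed(42)

def two_sided_z_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Two groups drawn from the SAME distribution: the null is true by design.
n, trials = 50, 20_000
hits = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = math.sqrt(2 / n)          # known sigma = 1, so a z-test is exact here
    hits += two_sided_z_p(diff / se) < 0.05

print(hits / trials)               # close to 0.05, as the definition promises
```

The point is not the code but the habit of mind: p < 0.05 happens about one time in twenty even when nothing is going on, which is exactly what the definition says and nothing more.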
2. Fisher vs. Neyman-Pearson
The paper’s most useful teaching move is separating two traditions that are often blended together in medical training. Fisher’s significance testing uses the p-value as a graded measure of evidence against the null. Neyman-Pearson hypothesis testing is a decision framework built around pre-specified long-run error rates: alpha (the type I error rate) and beta (the type II error rate).
The confusion starts when we borrow the p-value from one framework and talk about it as if it were the type I error rate from the other. Alpha is chosen before the study. The p-value is calculated after the data are observed. They are related, but they are not the same object.
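The timing difference is easy to make concrete. In the sketch below (a hypothetical two-sample z-test design of my own construction, not an example from the paper), every Neyman-Pearson quantity is a property of the design, computable before recruitment begins, while the p-value cannot exist until the data do:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

# --- Neyman-Pearson: fixed BEFORE any data exist ---
alpha = 0.05                      # pre-specified long-run type I error rate
z_crit = 1.959964                 # two-sided critical value for alpha = 0.05
effect, n, sigma = 0.5, 64, 1.0   # hypothetical design assumptions
se = sigma * math.sqrt(2 / n)
power = norm_cdf(effect / se - z_crit)   # 1 - beta: a property of the DESIGN

# --- Fisher: the p-value only exists AFTER the data are observed ---
observed_diff = 0.38              # hypothetical observed mean difference
p = math.erfc(abs(observed_diff / se) / math.sqrt(2))

print(f"power = {power:.2f}, p = {p:.3f}")  # p is data-dependent; alpha is not
```

Alpha and power sit on one side of data collection; the p-value sits on the other. Treating the observed p as if it were alpha conflates a post-hoc evidence summary with a pre-specified error guarantee.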
What p-values can do
Flag how surprising the observed data would be under a specified null model, assuming the model and analysis are appropriate.
What p-values cannot do
Measure treatment importance, effect size, causal validity, study quality, reproducibility, or the probability that the result is true.
3. The Calibrated P-value Idea
Gao highlights calibrated p-values as a way to make the gap between “p-value” and “probability the null is true” harder to ignore. Using the Sellke-Bayarri-Berger lower-bound calibration, a conventional p-value maps to a much larger minimum error probability than many clinicians intuitively assume.
| Reported p-value | Lower-bound calibrated probability | Teaching point |
|---|---|---|
| 0.05 | 28.9% | Not “5% chance the treatment does not work.” Much weaker than many readers assume. |
| 0.01 | 11.1% | Stronger evidence, but still not certainty. |
| 0.005 | 6.7% | A stricter threshold reduces overconfidence, but does not replace clinical judgment. |
| 0.001 | 1.8% | Very strong incompatibility with the null model, assuming the study design and analysis are sound. |
This calibration is not a magic replacement for critical appraisal. It is better understood as a teaching tool: it forces the reader to stop treating 0.05 as if it were a clean diagnostic test for truth.
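The table's numbers come from a simple closed form: the Sellke-Bayarri-Berger bound says the Bayes factor in favor of the null is at least −e·p·ln(p) for p < 1/e, which, under equal prior odds, translates into a lower bound on the posterior probability that the null is true. A minimal sketch (function name is mine):

```python
import math

def sbb_lower_bound(p):
    """Sellke-Bayarri-Berger lower bound on P(H0 | data), assuming
    50:50 prior odds between null and alternative; valid for p < 1/e."""
    bf = -math.e * p * math.log(p)   # lower bound on the Bayes factor for H0
    return bf / (1 + bf)

for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p:<5} -> null still at least {sbb_lower_bound(p):.1%} probable")
```

Running this reproduces the table above: p = 0.05 maps to roughly 28.9%, which is the number worth saying out loud in journal club.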
How To Use It In Journal Club
4. The Better Reading Sequence
The best use of this paper is not to make trainees anti-p-value. It is to make them p-value literate. P-values are useful as one signal among many, particularly as a guard against mistaking chance variation for a real effect, but they should sit inside a larger appraisal framework.
Questions To Ask Before Trusting The P-value
- Was this the primary endpoint or one of many outcomes and subgroup comparisons?
- Was the analysis pre-specified, or did the paper wander until it found significance?
- What is the effect size, and is it clinically meaningful?
- What does the confidence interval still allow to be true?
- Are the model assumptions plausible for this design and dataset?
- Do harms, cost, feasibility, and patient preference change the meaning of the result?
- If the p-value were 0.051 instead of 0.049, would the clinical interpretation honestly change?
5. The Clinical Reading Pearl
Medicine is not just a truth-detection problem. It is also a consequence problem. A borderline p-value means something different when the intervention is cheap and harmless than when it is expensive, toxic, irreversible, or life-changing. Gao’s examples make this point clearly: the same evidentiary uncertainty should be weighed differently depending on the downside of being wrong.
6. What To Be Careful About
The calibrated p-value argument is helpful, but it is not universally adopted as a reporting standard. It should not become a new ritual replacing the old ritual. The bigger lesson is the habit: do not dichotomize evidence, do not confuse statistical significance with clinical importance, and do not stop reading at the abstract’s p-value.
Bottom Line
- Use p-values as graded evidence against a specified null model, not as probabilities of truth.
- Separate Fisher-style evidence from Neyman-Pearson long-run decision rules.
- Remember that p = 0.05 is much less reassuring than many clinicians think.
- Never interpret a p-value without the effect size, confidence interval, endpoint hierarchy, and clinical consequence.
- The most important journal club move is often asking, “Would this result change what we do for a patient?”
Selected References
- Gao J. P-values: a chronic conundrum. BMC Medical Research Methodology. 2020;20:167. PMCID: PMC7315482. The article discussed on this page.
- Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70(2):129-133. doi:10.1080/00031305.2016.1154108. The concise consensus statement every trainee should know.
- Goodman S. A dirty dozen: twelve p-value misconceptions. Seminars in Hematology. 2008;45(3):135-140. PMID: 18582619. A useful companion piece for common interpretation errors.