When reviewing the results of a quantitative study, what does a significance of p < .05 indicate?

Conducting your data analysis and drafting your results chapter are important milestones in your dissertation process. The light at the end of the tunnel is finally in sight, and you are winding down. With only two chapters to go, you are finally feeling relieved… until you get the output from your data analysis. What do these numbers mean, and what should you do with them?

With p values, t values, F values, correlation coefficients, and a bunch of other numbers staring at you, it is easy to get discouraged. However, the basic question you need to answer, do I or do I not have statistical significance, can be answered by looking at one simple number: the p value.



Before you can determine if you have rejected or failed to reject your null hypothesis, you must designate the maximum probability of falsely rejecting the null hypothesis that you are willing to accept in your analysis. This is referred to as the alpha level and is typically set at .05 in social science research. An alpha level of .05 means that you are willing to accept up to a 5% chance of rejecting the null hypothesis when the null hypothesis is actually true. Depending on your field of study and the nature of your analysis, you may choose to decrease or increase the alpha level to make the decision point more or less stringent.

Once you conduct your analysis, you will get a p value, also called a significance (Sig.) value; your statistical software package returns this number with the output of your analysis. This number reflects the probability of obtaining results at least as extreme as those you obtained in your sample if the null hypothesis were true.

Let’s give this concept some legs with an example. Our research question asks: are there differences in the number of women hired in higher education institutions by region? The null hypothesis we are testing is: there are no differences in the number of women hired by region. We conducted a one-way ANOVA in order to compare the number of women hired in each region. The alpha level for this analysis is .05. We conducted the analysis in SPSS and got the following output:

ANOVA

                 Sum of Squares   df   Mean Square       F    Sig.
Between Groups       182800.683    2     91400.342   1.132    .375
Within Groups        565205.417    7     80743.631
Total                748006.100    9

We have a p value (Sig.) of .375. This number exceeds our alpha level of .05. Therefore, we fail to reject the null hypothesis. We did not find statistically significant differences in the number of women hired in higher education institutions by region. If the p value had been less than .05, we would have rejected the null hypothesis.
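As a cross-check, the Sig. value in the table can be reproduced outside SPSS from the F statistic and its degrees of freedom. The following is a minimal Python sketch (assuming SciPy is installed); the numbers are taken directly from the output above:

```python
# Reproduce the ANOVA p value (Sig.) from the F statistic and its degrees of freedom.
# Values are taken from the SPSS output above; SciPy is assumed to be installed.
from scipy import stats

f_value = 1.132                 # Between Groups Mean Square / Within Groups Mean Square
df_between, df_within = 2, 7    # degrees of freedom from the table

# p value = probability of an F at least this large when the null hypothesis is true
p_value = stats.f.sf(f_value, df_between, df_within)
print(f"p = {p_value:.3f}")     # ≈ 0.375, matching the Sig. column

alpha = 0.05
print("Reject H0" if p_value < alpha else "Fail to reject H0")   # Fail to reject H0
```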

A heated debate has long been going on in major journals about the p value and statistical significance. On one side are those who “rise up against statistical significance”, as did Amrhein et al. in Nature [1]; on the other side are those who recommend “do not abandon significance”, as did Ioannidis in JAMA [2]. These are two opposite viewpoints on a long-standing topic: the misuse of the p value and the claim of statistical significance or non-significance.

The p value is probably the most ubiquitous and, at the same time, most misunderstood index in all of biomedical research. Hundreds of articles have been published on this topic since at least the 1940s [3], with proposals of alternatives to the classic p value and its historical 0.05 threshold. One of these proposals, which has gained some attention, came in 2018 from Benjamin et al. [4], who suggested lowering the threshold for statistical significance from 0.05 to 0.005. Despite this debate, however, clinical researchers still make wide use of the classic 0.05 threshold.

In this article, we discuss the value of the p value and explain why it should not be abandoned, nor should the conventional threshold of 0.05 be modified. In addition, we show how artificial intelligence applications to radiomics and big data, which involve hundreds or thousands of tested features, prompt reductions in the significance threshold based on well-known corrections for multiple testing.

Opposite opinions on the p value

In 2016, the American Statistical Association (ASA) released a statement warning against the misuse of statistical significance and p values [5]. A special issue in The American Statistician [5] presented 37 papers on “Statistical inference in the 21st century: a world beyond p < 0.05”.

The main ASA points are highlighted in the form of do nots, as follows:

  • Do not base your conclusions solely on whether an association or effect was found to be statistically significant (i.e., the p value passed some arbitrary threshold such as p < 0.05);

  • Do not believe that an association or effect exists just because it was statistically significant;

  • Do not believe that an association or effect is absent just because it was not statistically significant;

  • Do not believe that your p value gives the probability that chance alone produced the observed association or effect or the probability that your test hypothesis is true;

  • Do not conclude anything about scientific or practical importance based on statistical significance (or lack thereof).

According to Wasserstein et al. [6], the statement “statistically significant” has today become meaningless. In 1885, Edgeworth’s original intention for statistical significance was simply to have a tool to indicate when a result warranted further scrutiny; statistical significance was never meant to imply scientific importance, but that idea has been irretrievably lost [7]. More than a century later, the confusion persists. Such doubts can lead to radical choices, such as the one taken by the editors of Basic and Applied Social Psychology, who decided to ban p values in 2015 [8].

Wasserstein et al. [6] also say that “no p value can reveal the plausibility, presence, truth, or importance of an association or effect”. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important, nor “does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant”. Furthermore, this false split into worthy and unworthy results leads to the selective reporting and publishing of results based on their statistical significance, the so-called publication bias. Similarly, Wasserstein et al. [6] suggest no longer using confidence intervals (CIs) as another means of dichotomisation, based on whether a null value falls within the interval. Despite these considerations, however, the ASA does not recommend abandoning the calculation of p values. The two sides of the debate may be reconciled: when p values are used, they should be reported as continuous quantities, without claims of significance or non-significance. The ASA position may be summarised by the sentence “as statistical significance is used less, statistical thinking will be used more”.

Opposite to the ASA position is that of Ioannidis [2] and other authors [9, 10], who exhort us not to abandon statistical significance. We agree with this viewpoint. We acknowledge the importance of embracing uncertainty, avoiding hyped claims, and recognising that the p value is often poorly understood, but statistical significance, in our opinion, has a crucial practical importance. Inferences are unavoidably dichotomous, especially in medicine and healthcare, both in preclinical (experimental) and clinical research. Any intervention, such as a new drug or imaging technique, will either be licensed or not. As such, we do need a methodology that allows for dichotomisation. Although policymakers also base their decisions on other factors (e.g., economics), scientific evidence rooted in statistical significance should help them do so. We believe that there is a need for inductive thinking and statistical tools to support inferences. If the bar were removed, any difference could be claimed to reflect an important effect. Faced with results from a well-conducted study that does not use dichotomisation, policymakers would ask “So what?”.

On the side of those favouring the use of the p value and its dichotomisation, Benjamin et al. [4] and Ioannidis [11] instead proposed lowering the significance threshold from the conventional 0.05 to 0.005. This proposal essentially aims to reduce the false discovery rate, i.e., the rate of claimed associations or effects that are later not replicated. According to Ioannidis [11], “the reduction in the p value threshold may largely do more good than harm, despite also removing an occasional true and useful treatment effect from the coveted significance zone”. In our opinion, reducing the threshold to one tenth of the current value has more disadvantages than advantages, at least for clinical research.

The threshold for significance and its origin

The above-mentioned five points from the ASA are surely relevant. To better understand their meaning, we need to place them in their historical context. In the theoretical system proposed by Ronald A. Fisher (1890–1962), the p value was to be considered only as a rough guide to the strength of evidence against the null hypothesis. In other words, the meaning of p < 0.05 was merely that one should repeat the experiment. If subsequent studies also yielded significant p values, one could conclude that the observed effects were unlikely to be solely the result of chance.

For decades, 0.05 (5%, i.e., 1 in 20) has been conventionally accepted as the threshold to discriminate significant from non-significant results, inappropriately translated into existing versus non-existing differences or phenomena. The origin of this cutoff is peculiar. Early in the 1900s, statistics textbooks reported many tables with long series of p values. Fisher shortened the tables previously published by Karl Pearson (1857–1936), not only for reasons of editorial space but probably also for copyright reasons (it seems that Fisher and Pearson were not on good terms). Some p values were selected and became more important than others, as Fisher wrote for researchers (the users) and not for experts in statistics (the theoreticians). Fisher himself provided a selection of probabilities that simplified the choice and helped in decision-making [12], and he attributed a special status to 0.05, asserting explicitly that “the value for which p = 0.05, or 1 in 20, is 1.96 or nearly 2. It is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not” [13].
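Fisher’s “1.96 or nearly 2” is simply the standard normal deviate that leaves a total of 5% in the two tails. A quick check in Python (assuming SciPy is available):

```python
# Fisher's "1.96, or nearly 2": the standard normal deviate leaving 5% in the two tails.
from scipy import stats

alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)   # two-sided: 2.5% in each tail
print(round(z, 2))                  # 1.96

# Conversely, a deviation of about 1.96 standard errors corresponds to p ≈ 0.05
p = 2 * stats.norm.sf(1.96)
print(round(p, 3))                  # ≈ 0.05
```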

Research methodology in medicine is such that a comparison between two or more datasets (groups) is typically performed in terms of a given endpoint. Examples include the comparison of the efficacy of a new drug or treatment versus an established drug, treatment, or placebo in two different animal or human population samples, or the comparison of the diagnostic accuracy of different imaging techniques. For one reason or another, a study commonly comes up with different values for the measured endpoints in the two (or more) groups, and researchers need to ascertain whether the observed difference is merely due to random sampling or, instead, reflects a real difference between the groups [14].

These comparisons are typically carried out through one or more statistical tests. Briefly, statistical tests are based on the following question: if there were no real difference between the two compared groups, what would be the probability of obtaining the observed difference, or a larger one, due to random sampling alone? Because the answer to this question is a probability, the famous p value, it is never 1 (100%) or 0 (0%), but always something in between. As such, it does not provide the probability that the null hypothesis (i.e., that there is no real difference between the two groups) is true, but rather reflects the degree of compatibility of the data with the null hypothesis. Thus, the decision to reject the null hypothesis must necessarily be based on a threshold defined a priori. In practice, the smaller the calculated p value, the more we consider the null hypothesis improbable; consequently, the smaller the p value, the more we consider the alternative hypothesis (i.e., that the groups are indeed different) to be probable [14].
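One way to make this question concrete is a permutation test: repeatedly shuffling the group labels enforces the null hypothesis of no real difference and shows how often random sampling alone produces a difference at least as large as the one observed. The sketch below uses made-up data purely for illustration:

```python
# A permutation test makes the question above operational: shuffle group labels
# (which enforces "no real difference") and count how often the shuffled data
# produce a mean difference at least as large as the observed one.
# The two groups below are hypothetical illustration data.
import numpy as np

rng = np.random.default_rng(0)
group_a = np.array([12.1, 9.8, 11.4, 10.9, 13.0, 10.2])
group_b = np.array([10.5, 9.1, 9.9, 10.8, 9.4, 10.0])

observed = abs(group_a.mean() - group_b.mean())
pooled = np.concatenate([group_a, group_b])

n_perm, count = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = abs(pooled[:len(group_a)].mean() - pooled[len(group_a):].mean())
    count += diff >= observed

p_value = count / n_perm   # probability of a difference at least this large under H0
print(f"observed difference = {observed:.2f}, p ≈ {p_value:.3f}")
```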

The threshold: to lower or not to lower?

Ioannidis based his proposal to reduce the threshold to 0.005 on a previous article in PLoS Medicine [15] presenting a theoretical framework relating the post-study probability that a research hypothesis is true to its pre-study probability. The author dramatically stated that “most published research claims are false”, thus receiving extensive attention. Goodman and Greenland [16] had already argued that “the mathematical argument in the PLoS Medicine paper underlying the proof of the title’s claim has a degree of circularity” and that “the claims that the model employed in this paper constitutes a proof that most published medical research claims are false […] are unfounded”.

According to Ioannidis [11], “moving the p-value threshold from .05 to .005 will shift about one-third of the statistically significant results of past biomedical literature to the category of just suggestive”. We think that such a solution makes biomedical research harder and that adopting it does not guarantee an improvement in research quality. Lowering the p value threshold for significance in this way is, at best, a palliative solution. Especially in clinical research, future trials would need to be larger, less feasible, and more expensive. Achieving 80% power with a threshold of 0.005, instead of 0.05, would require a 70% larger sample size for between-subject study designs with two-sided tests (88% for one-sided tests) [17]. Researchers could abandon some good ideas, with the net effect of depressing spontaneous, investigator-initiated research. Only the few treatments with large effect sizes would gain supporting evidence, thanks to the statistical power funded by the medical industry. Conversely, treatments with a small yet clinically appreciable effect would hardly be proven effective. This would be amplified in studies where the outcome of interest is quite infrequent, such as interval cancers in breast cancer screening mammography, cardiovascular events, or cancer recurrence in longitudinal studies, not to mention rare diseases or studies in which the tested treatment or diagnostic tool is invasive and poses ethical or organisational issues.
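The roughly 70% figure quoted from ref. [17] can be checked with the usual normal-approximation formula for a two-sided, two-sample comparison, in which the required sample size per group is proportional to (z_{1-α/2} + z_{power})², so the relative increase does not depend on the effect size. A minimal sketch, assuming SciPy:

```python
# Required sample size per group for a two-sided two-sample z/t test is
# approximately proportional to (z_{1-alpha/2} + z_{power})^2, so the relative
# increase when moving from alpha = 0.05 to alpha = 0.005 is effect-size free.
from scipy import stats

def n_factor(alpha, power=0.80):
    return (stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)) ** 2

increase = n_factor(0.005) / n_factor(0.05) - 1
print(f"sample size increase: {increase:.0%}")   # ≈ 70%, in line with ref. [17]
```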

These arguments are not new. An article signed by 54 authors [18] provided a similar view, with a deeper technical explanation of why the statements by Ioannidis [15] and Benjamin et al. [4] are unjustified. Another article, signed by 88 authors [17], questioned the idea that the significance threshold should be based on the amount of relative evidence indicated by Bayes factors, as done by Benjamin et al. [4], whose assumptions were considered unjustified.

A major point deserving further comment is the lack of replicability/reproducibility of study results [4]. Its causes include multiple testing, p-hacking, publication bias, and underpowered studies. As Gregg Easterbrook said, “Torture numbers, and they will confess to anything” (https://todayinsci.com/E/Easterbrook_Gregg/EasterbrookGregg-Quotations.htm). A famous example of non-replication came in 2006, when a group of researchers presented an algorithm using genomic microarray data to predict which cancer patients would respond to chemotherapy [19]. This paper drew immediate attention. Two statisticians later obtained the publicly available data and attempted to apply the algorithm [20]. What they found was a very poorly conducted data analysis, with errors ranging from trivial to devastating. It was not until 2011 that the original study was retracted from Nature Medicine.

Interestingly, the probability of replicating a study showing significant results has been estimated to be only 62–67% for a statistical power of 80%, or 69–76% for a statistical power of 90% [21]. Of note, 2/3 (i.e., 66.7%) “is the probability Laplace derived for repeating a successful event when the first event emerged against a background of perfect ignorance” [21].

To better understand the impact of lowering the significance threshold from 0.05 to 0.005, we show some examples of articles reporting 0.005 < p < 0.05 that were later considered relevant evidence guiding clinical practice.

Examples of studies leading to 0.005 < p < 0.05

From the table of evidence in support of the recommendations of the 2019 ACC/AHA Guideline on the Primary Prevention of Cardiovascular Disease, a report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines [22], we extracted the two following examples (overall, this guideline contains dozens of original articles relying on a 0.005 < p < 0.05). The Multi-Ethnic Study of Atherosclerosis [23] demonstrated, among other things, that early menopause (below 46 years) is a moderate independent predictor of coronary heart disease and stroke in a diverse population of women in the USA. In particular, the incidence of hard coronary events was 7.33/1,000/year in women with early menopause and 3.22/1,000/year in women without, for a fully adjusted hazard ratio (HR) of 1.85 (95% CI 1.01–3.37), p = 0.045. The fully adjusted HR for stroke was 2.03 (95% CI 1.00–4.10), p = 0.049. Notably, the study prospectively enrolled 2,509 women but would have had to enrol about 4,265 women (2,509 increased by 70%; see ref. [17]) had the significance threshold been lowered to 0.005, with inherently higher costs and more time needed to complete the study. The Women’s Health Initiative Observational Study [24] demonstrated that, among apparently healthy subjects aged 50–79 years, sitting time ≥ 10 h/day (versus ≤ 5 h/day) was moderately associated with the risk of coronary heart disease, for an adjusted HR of 1.13 (95% CI 1.01–1.26), p = 0.04. For stroke, the HR was 1.18 (95% CI 1.04–1.34), p = 0.008.

From the European Society of Breast Imaging (EUSOBI) recommendations for women’s information on breast magnetic resonance imaging, we extracted the following example. The High Breast Cancer Risk Italian study [25] evaluated 501 women at high risk of breast cancer and demonstrated that the incidence in women with a previous history of breast cancer (29/674, 4.3%, 95% CI 2.9–6.1%) was significantly higher than that in women without (23/918, 2.5%, 95% CI 1.6–3.7%), p = 0.045. As this population is hard to recruit, the study involved 18 centres for more than 7 years. It would not have been feasible at all had the significance threshold been set at 0.005.

The updated NICE guidelines on cardiac computed tomography as the first-line test for coronary artery disease were based, among other things, on the article by Williams et al. [26]. This randomised trial of 4,146 patients showed that, when preventive therapies were implemented ≥ 50 days after computed tomography, the rate of fatal and non-fatal myocardial infarction was halved in the patients allocated to computed tomography (HR = 0.50; p = 0.020) [26]. Because this was the result of a subgroup analysis, it would not have been found had the significance threshold been set at 0.005.

These few examples clearly show the difficulty of reaching statistical significance in clinical research. Similar problems are also faced in preclinical research on animal models, with specific ethical concerns [27]. In particular, the use of animal models should be discouraged and kept to a minimum, a criterion that contradicts the need for larger sample sizes following a reduction of the significance threshold.

Beyond the p value: secondary evidence and data sharing

Regardless of the misuse of the p value and the lack of reproducibility, too much importance is given to the p value threshold rather than to biases, selective reporting, and non-transparency in published studies. There are many stages from the original idea of a study to its data analysis, with the p value being the very last. Decisions made before the p value is even discussed have a greater impact on results, including design, lack of adjustment for confounding factors, and simple measurement errors. Biases may force significant findings to come out, with spurious effect sizes that are later rebutted. To a certain degree, biases may lead to a p value lower than any threshold. As acknowledged by Ioannidis [11], malicious researchers would easily sidestep the obstacle by defining, perhaps a posteriori, weak surrogate endpoints. Simply reducing the significance threshold would probably not attenuate these problems.

More importantly, healthcare policymakers typically base their decisions on secondary evidence, such as systematic reviews and meta-analyses or cost-effectiveness analyses, which summarise the available evidence while taking into consideration the methodological quality of the analysed studies. Systematic reviews “dismiss mostly the noise” referred to by Ioannidis [11]. Policymakers do not usually take decisions based on a single study of a few patients reaching a p value just below the 0.05 threshold. We should never forget that “science is built up of facts, as a house is with stones. But a collection of facts is no more a science than a heap of stones is a house”, as Jules Henri Poincaré said (https://www.brainyquote.com/quotes/henri_poincare_164238).

As Steven Goodman said, “there is no number generated by standard methods that tells us the probability that a given conclusion is right or wrong”. Efforts should instead be devoted to researchers’ training and to policies that may further improve research quality, such as a priori registration of trial protocols, the involvement of a professional statistician in data analysis, and data sharing [6]. Data sharing, in particular, allows the results presented in a given publication to be verified by independent authors [28]. When data are shared, they may be used by other researchers to perform alternative or supplementary analyses. An independent analysis may show results in support of the initial findings or may instead reveal errors or inconsistencies in the original research. Finally, data sharing could potentially optimise the time and costs of clinical research by preventing the duplication of trials.

The peculiar case of radiomics

Concerns about the so-called multiple testing burden hold for any research dealing with numerous variables, and significance thresholds must be adapted according to established methods. For example, genome-wide association studies (GWAS) are such that millions of statistical tests are typically performed. Keeping the significance threshold at the conventional value of 0.05 would lead to a large number of false-positive results. For example, if 1,000,000 tests are carried out, then 5% of them (that is, 50,000 tests) are expected to yield p < 0.05 by chance when the null hypothesis is actually true for all of them. The multiple testing burden has led to the adoption of stringent significance thresholds in GWAS, such as 5 × 10−8, reflecting a Bonferroni correction for the one million independent tests performed in a GWAS [29, 30]. Application of this threshold has increased the robustness and reproducibility of claimed associations [31]. However, it is important to highlight that this lowered threshold actually derives from the conventional 0.05 threshold after the Bonferroni correction [29, 30], not from an arbitrary reduction. Further details can be found in the review by Sham and Purcell [32].
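The arithmetic behind these GWAS numbers is straightforward, as the short sketch below shows (expected false positives at an uncorrected 0.05 threshold, and the Bonferroni-corrected per-test threshold):

```python
# The GWAS numbers above are simple arithmetic: expected false positives at an
# uncorrected 0.05 threshold, and the Bonferroni-corrected per-test threshold.
n_tests = 1_000_000
alpha = 0.05

expected_false_positives = alpha * n_tests   # if H0 is true for all tests
bonferroni_threshold = alpha / n_tests       # keeps the family-wise error rate at 5%

print(expected_false_positives)   # 50000.0
print(bonferroni_threshold)       # 5e-08, the conventional GWAS threshold
```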

The integration into clinical research of enormous datasets, as typically happens with radiomics, may create a situation not unlike that of genomics. Radiomics is defined as the extraction of a large number of quantitative features from medical images, with the distinct advantage of assessing tissue heterogeneity. Radiomic features offer quantitative measurements of tissues from three-dimensional images, including texture, intensity, heterogeneity, and morphology information, allowing a comprehensive phenotype analysis.

Radiomics may involve a high number of statistical tests, with the potential for a high false discovery rate. In a typical radiomic study, hundreds to thousands of features can be extracted and correlated with different outcomes. The number of features is often greater than the number of analysed patients (the so-called large p, small n problem), leading to false-positive results and overfitting [33, 34]. For these studies, the significance threshold should be corrected for multiple hypothesis testing using the Bonferroni method or a false discovery rate controlling procedure, such as the Benjamini-Hochberg method [34]. Interestingly, both searching for the optimal threshold by keeping the false discovery rate at a desired level (e.g., ≤ 5%) and using the Bonferroni method yield overly conservative thresholds [35]. Moreover, as the optimal threshold can vary across datasets, the results may not be generalisable. Especially for survival analysis, selection of an optimal threshold is not recommended [34, 36]. In any case, regardless of the significance threshold, false discoveries in radiomics can be identified by external validation on at least one large independent dataset, an essential condition for declaring any diagnostic or prognostic value of machine/deep learning applications to clinical imaging [37].
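As an illustration of how such corrections are applied in practice, the following sketch compares the Bonferroni and Benjamini-Hochberg procedures on a small set of made-up feature p values, assuming the statsmodels package is available:

```python
# Multiple-testing correction for a (made-up) set of radiomic feature p values:
# Bonferroni controls the family-wise error rate, Benjamini-Hochberg controls
# the false discovery rate. statsmodels is assumed to be installed.
from statsmodels.stats.multitest import multipletests

p_values = [0.0001, 0.0008, 0.004, 0.012, 0.030, 0.049, 0.210, 0.480, 0.730, 0.950]

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejections:        ", reject_bonf.sum())
print("Benjamini-Hochberg rejections:", reject_bh.sum())
```

With uncorrected testing, six of these ten p values would fall below 0.05; the corrected procedures retain fewer, with Benjamini-Hochberg typically less conservative than Bonferroni.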

Alternatives to the p value

Several alternatives to the p value have been proposed. A comprehensive discussion is beyond the aim of this article; further details may be found in Wasserstein et al. [6]. Briefly, proposals coming from the frequentist school revolve around a hybrid combination of the p value, a meaningful/clinically important effect size, 95% CIs, and other metrics. On the other side, the Bayesian school offers the so-called Bayes factors, with their inherent strengths and weaknesses. The Bayesian school is based on Bayes’ theorem (also called the theorem of conditional probability), whereby the probability that the result of an examination is associated with the presence or absence of a disease depends on the pre-examination probability and the power of the examination. As a matter of fact, none of these methods has really seen widespread use in clinical research.
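The diagnostic reading of Bayes’ theorem mentioned above can be written out directly: the post-test probability of disease follows from the pre-test probability together with the sensitivity and specificity of the examination. The sketch below uses illustrative numbers only:

```python
# Post-test probability of disease after a positive result, via Bayes' theorem.
# Pre-test probability, sensitivity, and specificity are illustrative values only.
def post_test_probability(pre_test, sensitivity, specificity):
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity
    p_positive = pre_test * p_pos_given_disease + (1 - pre_test) * p_pos_given_healthy
    return pre_test * p_pos_given_disease / p_positive

# Same examination (90% sensitivity, 90% specificity), different pre-test probabilities:
# e.g., at 1% prevalence the post-test probability after a positive result is only ~8%.
for prevalence in (0.01, 0.10, 0.50):
    print(prevalence, round(post_test_probability(prevalence, 0.90, 0.90), 3))
```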

In this regard, a simulation study [38] established some cornerstones, summarised as follows:

  1. The p value-only approach to inferential statistics for properly powered studies is associated with the highest false discovery rate (63% or 47% when 0.05 or 0.005 is used as the significance threshold, respectively).

  2. The effect size-only method falters in terms of false discovery rate (43%) for low-power studies, or when the effect size is comparatively small.

  3. A method that combines p values with the effect size can generate heuristic, but non-definitive, evidence.

In summary, all of the many methods compared in the study by Goodman et al. [38] have strengths and weaknesses, and none of them generates an automatic final answer for a definitive inference. These methods are not so different from one another and often provide similar interpretations of uncertainty [39]. However, there are cases where these statistical methods do not clearly align. Closer inspection of such cases likely reveals problems with sample size, study design, or implementation of protocols.

An interesting advancement is the so-called second-generation p value, based on an expanded null hypothesis [40]. The idea is to use a composite null hypothesis that takes into consideration the limits of experimental precision in outcome measurement as well as clinical relevance. The null interval should contain, in addition to the precise point null hypothesis, all other points that are practically/clinically equivalent to it. An example of an interval null hypothesis for an odds ratio (OR) may be H0: 0.95 ≤ OR ≤ 1.05 instead of H0: OR = 1, as typically used in clinical research. In other words, rejecting the null hypothesis that two effects are identical is not helpful, because they could still be nearly identical for all practical purposes. Indeed, this method resembles the one behind the well-known non-inferiority studies.
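As a rough sketch of the idea, following the definition given by Blume et al. [40], the second-generation p value is essentially the fraction of the interval estimate (e.g., a 95% CI) that overlaps the interval null, with a correction factor applied when the interval estimate is more than twice as wide as the null interval; a value of 0 supports a clinically meaningful effect, while a value of 1 supports practical equivalence. The confidence intervals below are made up for illustration:

```python
# Rough sketch of a second-generation p value for an odds ratio: the fraction of
# the interval estimate that overlaps the interval null, with the correction used
# when the interval estimate is more than twice as wide as the null interval
# (after Blume et al. [40]). The 95% CIs below are made-up examples.
def second_generation_p(ci_low, ci_high, null_low=0.95, null_high=1.05):
    interval_len = ci_high - ci_low
    null_len = null_high - null_low
    overlap = max(0.0, min(ci_high, null_high) - max(ci_low, null_low))
    correction = max(interval_len / (2 * null_len), 1.0)
    return (overlap / interval_len) * correction

print(second_generation_p(0.80, 1.10))  # wide CI covering the null interval -> 0.5 (inconclusive)
print(second_generation_p(1.20, 1.90))  # CI entirely outside the null interval -> 0.0
```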

What does a significance of p ≤ .05 mean?

A statistically significant test result (p ≤ 0.05) means that, if the null hypothesis were true, results at least as extreme as those observed would occur no more than 5% of the time. It does not prove that the test hypothesis is false, and a p value greater than 0.05 does not mean that no effect exists, only that the data are compatible with the null hypothesis.

Is 0.05 a significant p value?

If the p value is 0.05 or lower, the result is conventionally labelled significant; if it is higher than 0.05, the result is labelled non-significant and tends to be passed over in silence.

What does a .05 level of significance mean?

A .05 level of significance means that there is at most a 5 percent chance of obtaining the observed result through random sampling error when the null hypothesis is true.

When p is less than 0.05, is it significant?

A p-value less than 0.05 is typically considered to be statistically significant, in which case the null hypothesis should be rejected. A p-value greater than 0.05 means that deviation from the null hypothesis is not statistically significant, and the null hypothesis is not rejected.