Which of the following is a correct interpretation of a p-value that is not very small?

Because of your comments, I will make two separate sections:

p-values

In statistical hypothesis testing you can find 'statistical evidence' for the alternative hypothesis. As I explained in What follows if we fail to reject the null hypothesis?, it is similar to 'proof by contradiction' in mathematics.

So if we want to find 'statistical evidence', then we assume the opposite, which we denote $H_0$, of what we try to prove, which we call $H_1$. After this we draw a sample, and from the sample we compute a so-called test-statistic (e.g. a t-value in a t-test).

Then, as we assume that $H_0$ is true and that our sample is randomly drawn from the distribution under $H_0$, we can compute the probability of observing values that exceed or equal the value derived from our (random) sample. This probability is called the p-value.

If this value is 'small enough', i.e. smaller than the significance level that we have chosen, then we reject $H_0$ and we consider $H_1$ to be 'statistically proven'.
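As a concrete illustration, here is a minimal Python sketch of this procedure for a one-sample t-test; the data, the null value $\mu = 0$ and the level $\alpha = 0.05$ are made-up choices, not part of the argument above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=30)  # the (random) sample

# Test H0: mu = 0 against H1: mu != 0; ttest_1samp returns the
# test-statistic (a t-value) and the corresponding p-value.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

alpha = 0.05  # significance level chosen before looking at the data
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0, 'evidence' for H1")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")
```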

Several things are important in this procedure:

  • we have derived probabilities under the assumption that $H_0$ is true
  • we have taken a random sample from the distribution that was assumed under $H_0$
  • we decide to have found evidence for $H_1$ if the test-statistic derived from the random sample has a low probability of being exceeded. So it is not impossible that it is exceeded while $H_0$ is true; in these cases we make a type I error.

So what is a type I error? A type I error is made when the sample, randomly drawn under $H_0$, leads to the conclusion that $H_0$ is false while in reality it is true.

Note that this implies that a p-value is not the probability of a type I error. Indeed, a type I error is a wrong decision by the test, and a decision can only be made by comparing the p-value to the chosen significance level. With the p-value alone one cannot make a decision, and as long as no decision is made, the type I error is not even defined.

What then is the p-value? The potentially wrong rejection of $H_0$ is due to the fact that we draw a random sample under $H_0$: it could be that we have 'bad luck' in drawing the sample, and that this 'bad luck' leads to a false rejection of $H_0$. So the p-value (although this is not fully correct) is more like the probability of drawing a 'bad sample'. The correct interpretation of the p-value is that it is the probability, under $H_0$, that the test-statistic exceeds or equals the value of the test-statistic derived from a randomly drawn sample.
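This interpretation can be checked numerically. The sketch below (with made-up data and a two-sided criterion of 'at least as extreme') estimates that probability by repeatedly drawing samples that really follow $H_0$, and compares the estimate to the p-value reported by the t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30

# One observed sample and its test-statistic (H0: mu = 0)
observed = rng.normal(loc=0.3, scale=1.0, size=n)
t_obs, p_analytic = stats.ttest_1samp(observed, popmean=0.0)

# Draw many samples that really follow H0 and record their t-statistics
t_null = np.array([
    stats.ttest_1samp(rng.normal(loc=0.0, scale=1.0, size=n), 0.0).statistic
    for _ in range(20000)
])

# Fraction of null samples with a statistic at least as extreme (two-sided)
p_mc = np.mean(np.abs(t_null) >= abs(t_obs))
print(f"analytic p = {p_analytic:.4f}, Monte Carlo estimate = {p_mc:.4f}")
```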


False discovery rate (FDR)

As explained above, each time the null hypothesis is rejected, one considers this as 'statistical evidence' for $H_1$. So we have found new scientific knowledge, and therefore it is called a discovery. Also explained above is that we can make false discoveries (i.e. falsely reject $H_0$) when we make a type I error. In that case we have a false belief of a scientific truth. We only want to discover things that are really true, and therefore one tries to keep the false discoveries to a minimum, i.e. one controls for type I errors. It is not so hard to see that the probability of a type I error is the chosen significance level $\alpha$. So in order to control for type I errors, one fixes an $\alpha$-level reflecting one's willingness to accept 'false evidence'.

Intuitively, this means that if we draw a huge number of samples, and with each sample we perform the test, then a fraction $\alpha$ of these tests will lead to a wrong conclusion. It is important to note that we're 'averaging over many samples'; so same test, many samples.
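A small simulation makes this concrete (all numbers are illustrative): when $H_0$ is true, the long-run fraction of rejections approaches $\alpha$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n, n_samples = 0.05, 30, 10000

rejections = 0
for _ in range(n_samples):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)  # H0 really holds
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    rejections += p < alpha                          # a false rejection

print(f"fraction of false rejections: {rejections / n_samples:.3f}")  # ~0.05
```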

If we use the same sample to do many different tests, then we have a multiple testing problem (see my answer on Family-wise error boundary: Does re-using data sets on different studies of independent questions lead to multiple testing problems?). In that case one can control the $\alpha$ inflation using techniques that control the family-wise error rate (FWER), such as a Bonferroni correction.
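As an illustration, here is a minimal sketch of a Bonferroni correction on some made-up p-values; each p-value is compared to $\alpha/m$, and the statsmodels helper gives the same decisions:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.012, 0.041, 0.2])  # made-up p-values
alpha, m = 0.05, len(pvals)

reject_manual = pvals < alpha / m                    # alpha/m = 0.01 here
reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="bonferroni")
print(reject_manual)  # [ True  True False False False]
print(reject)         # same decisions
```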

A different approach from controlling the FWER is to control the false discovery rate (FDR). In that case one controls the (expected) fraction of false discoveries (FD) among all discoveries (D), i.e. one controls $\frac{FD}{D}$, where $D$ is the number of rejected $H_0$'s.
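For comparison, here are the same made-up p-values handled with the Benjamini-Hochberg procedure, which controls the FDR instead of the FWER; note that it rejects a third $H_0$ that the Bonferroni correction above did not:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.012, 0.041, 0.2])  # same made-up p-values

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(reject)  # [ True  True  True False False]: one more discovery
print(p_adj)   # BH-adjusted p-values; reject where p_adj <= 0.05
```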

So the type I error probability has to do with executing the same test on many different samples. For a huge number of samples, the fraction of samples that lead to a false rejection will converge to the type I error probability.

The FDR has to do with many tests on the same sample. For a huge number of tests, the number of tests where a type I error is made (i.e. the number of false discoveries) divided by the total number of rejections of $H_0$ (i.e. the total number of discoveries) will converge to the FDR.

Note that, comparing the two paragraphs above:

  1. The context is different; one test and many samples versus many tests and one sample.
  2. The denominator for computing the type I error probability is clearly different from the denominator for computing the FDR. The numerators are similar in a way, but have a different context.

The FDR tells you that, if you perform many tests on the same sample and you find 1000 discoveries (i.e. rejections of $H_0$), then with an FDR of 0.38 you should expect $0.38 \times 1000 = 380$ false discoveries.

Which of the following is the correct interpretation of the p-value?

The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. Here we observed a 100% difference between the two versions, so the observed result is 100%.

Is p ≤ 0.05 significant?

If the p-value is 0.05 or lower, the result is trumpeted as significant, but if it is higher than 0.05, the result is non-significant and tends to be passed over in silence.

What does the p-value tell you?

A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.

What does a large p-value mean?

High p-values indicate that your evidence is not strong enough to suggest that an effect exists in the population. An effect might exist, but it is possible that the effect size is too small, the sample size is too small, or there is too much variability for the hypothesis test to detect it.
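To make this concrete, here is a hypothetical power calculation for a two-sample t-test (the effect size and group sizes are illustrative assumptions): with a small effect and a small sample, the test will usually return a large p-value even though the effect is real:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in (20, 80, 320):
    # Probability of detecting a small effect (Cohen's d = 0.2) at alpha = 0.05
    power = analysis.power(effect_size=0.2, nobs1=n, alpha=0.05)
    print(f"n = {n:3d} per group -> power = {power:.2f}")

# With n = 20 per group, power is far below 0.5, so failing to reject H0
# says little about whether the effect exists.
```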