# What is the p-value when testing hypotheses

## The interpretation of the *p*-value: basic misunderstandings


The *p*-value is often considered as the gold standard in inferential statistics. The standard approach for evaluating empirical evidence is to equate low *p*-values with a high degree of credibility and to refer to findings with *p*-values below certain thresholds (e.g., 0.05) as *statistically significant*. The *p*-value is also referred to as *error probability*. Both terms are problematic as they invite serious misconceptions. In addition, researchers ’fixation on obtaining statistically significant results may introduce biases and increase the *rate of false discoveries*. Misinterpretations of the *p*-value as well as the introduction of bias through arbitrary analytical choices (*p-hacking*) have been critically discussed in the literature for decades. Nonetheless, they seem to persist in empirical research and criticisms of inappropriate approaches have increased in the recent past - mainly due to the non-replicability of many studies. Unfortunately, the critical concerns that have been raised in the literature are not only scattered over many academic disciplines but often also linguistically confusing and differing in their main reasons for criticisms. Against this background, our methodological comment systematizes the most serious flaws and discusses suggestions of how best to prevent future misuses.

### 1 Introduction

The *p*-value is widely regarded as the gold standard for inferential conclusions. *p*-values are generally understood as a safeguard against type I errors, i.e., against concluding that an effect exists even though it does not. In statistical practice, the convention has emerged of demanding the lowest possible *p*-values and of calling results with values below certain thresholds (e.g., 0.05) *statistically significant*. The *p*-value is also often referred to as the *probability of error*. Both terms are problematic because they encourage misunderstandings.

First, a semantic misunderstanding arises when the term "significant" is equated with "large" or "important". Second, fallacies can arise when the interpretation of statistically non-significant results uses formulations that disregard the law of the excluded middle and suggest a confirmation of the null hypothesis (no effect). Third, there is a risk that researchers produce biases through so-called *p-hacking* and publish only what has "worked" in terms of producing significant results. ^{[1]} Fourth, the term "probability of error" encourages the semantic misinterpretation that the *p*-value denotes the *false discovery rate*, i.e., the probability of making a mistake when one rejects the null hypothesis. ^{[2]}

The above problems have been discussed again and again for decades, especially in medicine and psychology. ^{[3]} Over the past decade, awareness of the problems has grown. ^{[4]} In addition to misinterpretations, this is due, among other things, to the *p-hacking*-related problem of non-reproducibility (replication crisis). In a drastic response to the *p*-value crisis, the editors of the journal *Basic and Applied Social Psychology* completely banned the use of *p*-values in publications in early 2015 (Trafimow/Marks 2015). ^{[5]} This ban, as well as a multitude of *p*-value-critical articles in high-ranking journals up to *Nature* (see Nuzzo 2014) and *ScienceNews* (cf. Siegfried 2014), has generally increased awareness of the problem in the empirical sciences. In early March 2016, the American Statistical Association (ASA) even issued an official statement on how *p*-value-related errors are to be avoided (Wasserstein/Lazar 2016). Interestingly, the reception of and participation in the debate seems to be rather weak in economics. Notable exceptions are the contributions by Ziliak and McCloskey (2008) and Krämer (2011), which document widespread *p*-value-related misinterpretations in articles of the *American Economic Review* and the *German Economic Review*, respectively. One can only speculate about the reasons for the low reception of the *p*-value debate in economics. In view of the literature, which is scattered across disciplines and often focuses on individual aspects, a systematic overview of the problem complex may be missing in many cases. There may also be fundamental deficits in training. ^{[6]} Against this background, this methodological comment presents the most important of the problems discussed in the literature systematically and clearly.

### 2 Problems and possible solutions

### 2.1 Incorrectly equating "significant" with "large"

**Problem description**: Low *p*-values are usually labeled "statistically significant". A semantic misinterpretation occurs when this term is colloquially read as a synonym for "large" or "important". The danger is particularly high if the adjective "statistically" is omitted and one speaks only of "significant" and "non-significant" effects. As a result, one often finds formulations that contrast a significant result with a non-significant one using adjectives such as "stronger" or "more". That is wrong (Motulsky 2014). ^{[7]} If a variable *X* has a statistically significant influence (effect) on a variable *Y*, this does not mean that it is a large or important influence. Rather, it means that the probability is low that the observed result would appear as a chance finding if there were no effect. "Significant" merely means that there is a low probability of finding the effect in the data if it is not present at all.

Although large samples are often perceived as advantageous without further reflection, equating (statistically) "significant" with "important" is a problem especially with very large samples. This is because *p*-values decrease, ceteris paribus, with increasing *N*. That is, any effect, no matter how small and substantively meaningless, becomes statistically significant at some point as *N* grows. A mini-effect that is meaningless in terms of content, however, never becomes an important effect, even with large samples (Wasserstein/Lazar 2016).
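This can be illustrated numerically. The following sketch (a hypothetical example; the effect size of 0.05 standard deviations and the sample sizes are our assumptions, not taken from the text) computes the two-sided *p*-value of a one-sample z-test for a fixed, tiny effect at growing *N*:

```python
import math

def two_sided_p(effect, sigma, n):
    """Two-sided p-value of a one-sample z-test for a fixed true effect.

    z = effect / (sigma / sqrt(n)); for z >= 0 the two-sided p-value
    equals erfc(z / sqrt(2)).
    """
    z = effect / (sigma / math.sqrt(n))
    return math.erfc(z / math.sqrt(2))

# A substantively tiny effect: 0.05 standard deviations.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(effect=0.05, sigma=1.0, n=n))
```

The effect itself never changes, but the *p*-value crosses any significance threshold once *N* is large enough: significance says nothing about importance.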

**Proposed solution**: To avoid linguistic misinterpretations, Armstrong (2007), Colquhoun (2014) and Motulsky (2014) suggest avoiding the word "significant" in scientific publications altogether. Given the long tradition of the term, it is questionable whether this is enforceable. It may be more practicable to address this problem intensively in teaching and to systematically encourage young researchers (i) not to use the word "significant" as a synonym for "large" or "important" and (ii) always to use it with the qualifier "statistically" if there is a risk of misunderstanding (Mittag/Thompson 2000). At the journal level, reviewers could be explicitly asked to watch for this misinterpretation and to correct problematic formulations. In connection with an obligation (e.g., in journal guidelines) to discuss effect sizes, this would be a step forward that, according to Goodman (2008), could be achieved with little effort. As recommended by the American Psychological Association (APA 2010), the guidelines could also require the use of confidence intervals when dealing with meaningful units of measurement. A confidence interval communicates the range of the effect size in an easily comprehensible form without having to forgo the significance information.
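As a minimal sketch of this recommendation (the point estimate of 0.8 and the standard error of 0.3 are invented for illustration), a 95% confidence interval can be reported alongside the *p*-value:

```python
import math

def z_test_report(estimate, se, z_crit=1.96):
    """Return (lower, upper, two-sided p) under a normal approximation.

    The 95% interval uses the conventional critical value 1.96.
    """
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return estimate - z_crit * se, estimate + z_crit * se, p

# Hypothetical effect: 0.8 units with standard error 0.3.
lo, hi, p = z_test_report(0.8, 0.3)
print(f"effect = 0.80, 95% CI [{lo:.2f}, {hi:.2f}], p = {p:.4f}")
```

The interval shows both that the effect is distinguishable from zero (it excludes 0) and, in the units of measurement, how large or small the effect could plausibly be.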

### 2.2 False conclusions when the significance level is exceeded

**Problem description**: In a regression, the coefficients of the regressors are usually tested for significance. In a linear regression, the *p*-value of a coefficient denotes, for example, the probability of finding the influence expressed by the coefficient (or an even greater one) as a chance finding if it did not exist at all. Meeting a significance level of 0.05 is often accepted as the criterion for rejecting the null hypothesis. If one follows that convention, the question arises how *p*-values above 5% (statistically non-significant results) are to be interpreted. Here, too, there are occasional errors of reasoning in which linguistic formulations play a role. For a *p*-value above the significance level of 0.05, the substantively correct and linguistically unambiguous formulation is as follows:

The null hypothesis that the regressor X_{1} has no influence on Y cannot be rejected at the usually required significance level of at most 0.05.

This formulation respects the law of the excluded middle (*tertium non datur*), according to which a statement must be formulated such that either it or its negation holds. The statement "Hans is either blond or not blond" is therefore correct. The statement "Hans is either blond or black-haired" (or analogously: "If Hans is not blond, he is black-haired"), on the other hand, violates the law of the excluded middle. It creates a false dichotomy that ignores the fact that there can be a third possibility, namely that Hans has a hair color other than blond or black. An analogous fallacy threatens the interpretation of *p*-values above the accepted significance level when lax but widespread formulations such as the following are used:

The influence of X on Y is not statistically significant. [n-s-s]
The influence of X on Y is statistically not significant. [s-n-s]
The influence of X on Y is not significant. [n-s]

From the last formulation, which already suggests that it has been established that there is no (relevant) effect, it is only a short way to the wrong conclusion:

Our study shows that a (relevant) influence of X on Y does not exist. [n]

It is correct that the null hypothesis (no effect) could not be rejected at the usually required significance level of 0.05. However, the conclusion that the null hypothesis has been confirmed is incorrect (Sedlmeier/Gigerenzer 1989; Wasserstein/Lazar 2016). The danger of this fallacy arises when linguistic formulations derive from *p*-values below and above 0.05 the wrong dichotomy "*either* rejection of the null hypothesis *or* acceptance of the null hypothesis". Interpreting results that are not statistically significant as confirmation of the null hypothesis is a fallacy that can also be found in formulations where it is not obvious at first glance. For example, non-significant results are occasionally commented on as "being in contradiction to theoretical predictions" that suggest the existence of the effect. That is an inadmissible conclusion. One could only say this if one interpreted *p*-values above the significance level as confirmation of the null hypothesis. ^{[8]}

**Proposed solution**: Since the reasoning error described is easy to grasp on the basis of logic, students and young scientists should be specifically familiarized with the law of the excluded middle. At the journal level, reviewers should consistently object to all formulations that, for *p* > 0.05, suggest the fallacy of "confirmation of the null hypothesis". Since this fallacy is easy to identify, a strict standard can readily be enforced in the scientific review process. Outside of science, however, false dichotomies can be a virulent problem. When research results are received by the interested public or used in policy advice, it is often difficult to convey to users (journalists, politicians) that "*no significant effect*" does not mean that one has found an indication that there is no effect (or only an insignificant one). The particular problem with the public reception of research results is possibly that, in the competition for public attention, many participants prefer an interesting-sounding (albeit incorrect) message such as *"X has no influence on Y!"* to the "boring" message that no statement can be made. Particularly when study results are received incorrectly in the context of important public debates, researchers repeatedly face the task of correcting hasty interpretations.

### 2.3 *p-hacking*

**Problem description**: Statistical analyses involve a large number of "design options" that can lead to biases and ultimately to wrong conclusions. ^{[9]} If researchers use such interventions in a targeted manner in order to achieve "publishable" *p*-values, this constitutes *p-hacking*. ^{[10]} Since the selection of data and analysis methods in the research process is seldom a matter of unambiguous decisions, *p-hacking* is difficult to identify. A well-founded selection of the most informative data and the most adequate analysis method for the research question does not constitute *p-hacking*. A selective, non-transparently communicated presentation of those analysis variants that "work" best at producing low *p*-values, by contrast, is *p-hacking*. It is therefore difficult for outsiders to distinguish whether a certain approach prevents or downright causes bias. For example, it makes sense to correct obviously nonsensical values (e.g., a car fuel consumption of 95 l/100 km) or to remove the affected records from the data set entirely due to a lack of reliability. If, on the other hand, one deliberately removes the lowest and/or highest 10% of observations from the data set and then checks whether a statistically significant result can be produced, that is *p-hacking*. The targeted search for evaluation methods that lead to the desired statistically significant results is not perceived as a problem by many researchers, although every selective presentation of an analysis that "works" amounts to producing a bias (Simmons et al. 2011). Figure 1 gives an impression of the "design options" that can be used for *p-hacking* in statistical analyses.

### Figure 1: Different ways of *p*-*hacking*

- a) *Reduction of the sample size without substantive justification*: There are two starting points for sample-size-based *p-hacking*. First, as already mentioned, one can clean the sample of "outliers" and try out how the *p*-values change as a result. ^{[11]} Second, especially with high *N*, one can try out how separate analyses of data subsets affect the *p*-values. If 20 subgroups are analyzed separately, it is almost to be expected that even a non-existent effect will show up as significant once purely by chance. If one selectively reports exactly this result, one has a severe case of *p-hacking*. ^{[12]}
- b) *Transformation of the data without substantive justification*: Even if the sample size is fixed, one can engage in *p-hacking* by trying out whether the *p*-values decrease if the data are transformed in some way. This includes reducing the scale level (e.g., income classes instead of income) and creating new variables, e.g., in the form of relative values (e.g., weight divided by height). In principle, each of these measures can be justified in terms of content. The targeted search for, and selective reporting of, the variant that produces significant results to the desired extent leads to an overestimation of the empirical evidence.
- c) *Inclusion/removal of variables without substantive justification*: There are also "design options" in the selection of the variables included in the estimation model. This applies first of all to the control variables, whose number and type can be varied in order to see which set of variables best achieves the desired statistical significance. The possibility of expanding/reducing the set of variables, or of substituting certain variables with others, also exists for the manifest variables with which the latent variables (theoretical constructs) are operationalized in a hypothesis-driven approach. Imagine, for example, that you want to test whether attitudes towards organic agriculture influence the willingness to pay for organic products. The attitude (= latent variable) was measured in a survey using various questions (items). If you keep trying until you have found an attitude item that yields a significant result in the model, you have produced a bias. ^{[13]} However, this does not become obvious if, for marketing reasons, only the analysis that produced a significant result is published.
- d) *Use of statistical tests and estimation models without substantive justification*: The leeway in testing distributional assumptions and deciding on an estimation model can also be abused for *p-hacking*. Imagine that it is not clear beforehand whether to use a simple OLS estimate or a panel data model. It becomes *p-hacking* if one tries out both estimation models and then selectively reports the one that best delivers the desired significance. Transparency in scientific communication is lost if the different models are not explicitly compared and discussed. In other words: *p-hacking* in the selection of tests and estimation models also leads to a bias and an inflation of the empirical evidence.
- e) *Increase of the sample size without substantive justification*: Similar to the search for another model that "works" is the search for a larger data set that may "work" if the original data set did not yield "satisfactory" significance. Suppose an economic experiment with the original sample size *N* found no statistically significant results. In such a case, it is often perceived as unproblematic to enlarge the sample ad hoc and then, if applicable, to publish the significant results obtained in the enlarged sample (Motulsky 2014). The problem is that this, too, produces a bias and overestimates the empirical evidence, since one only carries out subsequent data collection if one has not found significance in the original data set.
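The subgroup problem in a) can be illustrated with a small simulation. The sketch below assumes only that, under a true null hypothesis, each test's *p*-value is uniformly distributed on [0, 1]; the 20 subgroups and the 0.05 threshold follow the example in the text:

```python
import random

def chance_of_false_positive(n_tests=20, alpha=0.05, trials=100_000, seed=1):
    """Estimate P(at least one p < alpha) when all null hypotheses are true.

    Under a true null, a p-value is uniform on [0, 1], so the p-values
    of the n_tests subgroup analyses can be simulated as uniform draws.
    """
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(trials)
    )
    return hits / trials

estimate = chance_of_false_positive()
exact = 1 - (1 - 0.05) ** 20  # analytic value, roughly 0.64
print(f"simulated {estimate:.3f} vs. exact {exact:.3f}")
```

Roughly two out of three times, at least one subgroup shows a "significant" effect although none exists; selectively reporting that one subgroup is exactly the bias described above.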

When evaluating the *p-hacking* problem, one must ask whether one is dealing with an exploratory study for generating hypotheses or a confirmatory study for testing hypotheses. An exploratory search for correlational relationships that enables the generation of hypotheses is a sensible and indispensable step in the research process. *p-hacking* can thus be understood as a problem that arises when exploratory and confirmatory data analysis are not clearly distinguished from one another. A test of hypotheses must be carried out with new data; and terms such as "hypothesis *test*" and "statistically significant" should be avoided for exploratory approaches, which only provide initial indications for the *formation* of hypotheses (Gigerenzer/Marewski 2015). This would be in line with Fisher's dictum that a low *p*-value actually only means "worthy a second look" (Nuzzo 2014: 151) and should be seen as an *indication* of whether further studies are worthwhile.

**Proposed solution**: Dealing with *p-hacking* is difficult because it does not concern errors of reasoning, but a careless handling of good scientific practice, or outright scientific misconduct, in the research process itself. In addition to raising the awareness of students and young researchers, various requirements for publication practice are being discussed. (i) A first requirement is that every investigation should state explicitly whether it is an exploratory study aimed at identifying correlational relationships and *generating* hypotheses, or a confirmatory study aimed at *testing* hypotheses. The two must not be mixed (Marino 2014; Motulsky 2014). A further-reaching requirement would be to demand an internal replication with new data for every study. (ii) A second requirement is not only to make all raw data accessible at the journal level, but also to provide precise and transparent documentation and publication of all analytical steps (including recoding and transformation of the data) (Simmons et al. 2011), or even to register the research design and all data material before a study is carried out. This suggestion raises the question of who will, or can, take the time to look through and check the material on file. (iii) A third requirement is to demand an explicit *no-p-hacking* declaration from the authors (Simmons et al. 2012). The hope is that this will strengthen the normative appeal of good scientific practice. The problem with this is that, while there are extreme approaches that clearly constitute *p-hacking*, the selection of analysis methods is often not a matter of substantively imperative decisions. It is therefore difficult to compile a catalog of "proscribed" approaches. In view of the systemic pressure to publish, it is also questionable whether an intensified appeal to norms is sufficient to solve the problem.
(iv) A fourth demand ties in with the general discussion on biases in scientific publication practice and calls for contributions with negative results, as well as replication studies, to be given a higher scientific status and a chance of publication. ^{[14]}

### 2.4 Equating the "probability of error" with the false discovery rate

**Problem description**: Another reason for the replication crisis is seen in the fact that the *p*-value *as such* is often misinterpreted and understood as the probability of the null hypothesis. ^{[15]} In other words: in addition to the confusion of "significant" with "important", there is another semantic misunderstanding, triggered by the convention of referring to the *p*-value as the "probability of error" or the "probability of a type I error". Despite this naming, the *p*-value is *not* the probability, here called the *false discovery rate*, of committing a type I error if the null hypothesis is rejected. The *p*-value is calculated as a conditional probability under the *assumption* that the null hypothesis is correct. A conclusion about the *probability* of the null hypothesis cannot be drawn from the *p*-value (Kline 2013; Nuzzo 2014). That is why the statement that *p*-values can be used to test hypotheses is only partially correct. Despite the term "hypothesis test", *p*-values do not evaluate hypotheses; rather, *p*-values show how compatible the data are with the statistical model specified by the null hypothesis (Wasserstein/Lazar 2016).

The point can be illustrated with a coin-tossing example in which one draws a manipulated coin [*P(head)* = 0.75] with 1% probability and a non-manipulated coin [*P(head)* = 0.5] with 99% probability. Now you toss the coin five times and observe heads five times. If the coin were ideal (= no effect), five heads would be expected in only 3.125% (= 0.5^{5}) of many repetitions of the experiment "five coin tosses". This conditional probability, also known as the false-positive rate, corresponds to the *p*-value. However, it is not the probability of the null hypothesis "ideal coin", and thus also not the probability of making a mistake if the null hypothesis is rejected. For this, one also has to know how probable five heads are with the manipulated coin. This probability, also known as the true-positive rate or power, is 23.73% (= 0.75^{5}). One also has to take into account the a priori probabilities of 1% and 99%, also known as "priors", that one initially drew a manipulated or an ideal coin. According to Bayes' theorem, one arrives at the posterior probability, the *false discovery rate*, of 92.88% [= 0.03125 · 0.99 / (0.03125 · 0.99 + 0.2373 · 0.01)] of committing an error if one rejects the null hypothesis "ideal coin". Despite the low *p*-value, one will therefore not reject the null hypothesis. The informational content of the data obtained by the tossing experiment merely leads to a revision of the a priori probability of 99%: a posteriori (i.e., *after* evaluating the experimental data), one assumes with only a 92.88% probability that one is dealing with an ideal coin.
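The arithmetic of this example can be reproduced directly with Bayes' theorem (all numbers are the ones given in the text):

```python
def false_discovery_rate(p_data_null, p_data_alt, prior_null, prior_alt):
    """Posterior probability that the null hypothesis is true given the data.

    Bayes' theorem: P(H0 | data) =
        P(data | H0) P(H0) / [P(data | H0) P(H0) + P(data | H1) P(H1)]
    """
    num = p_data_null * prior_null
    return num / (num + p_data_alt * prior_alt)

p_value = 0.5 ** 5   # P(5 heads | ideal coin) = 0.03125
power = 0.75 ** 5    # P(5 heads | manipulated coin), about 0.2373
fdr = false_discovery_rate(p_value, power, prior_null=0.99, prior_alt=0.01)
print(f"p-value = {p_value:.5f}, false discovery rate = {fdr:.4f}")
```

Although the *p*-value of 0.03125 clears the conventional 0.05 threshold, rejecting the null hypothesis would be wrong in about 93% of such cases, because the prior probability of a manipulated coin is so low.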

To avoid the misunderstandings that can understandably arise from the misleading but common designation of the *p*-value as the "probability of error" or the "probability of a type I error", the following must be noted:

Despite its name, the "probability of error" is *not* the probability of making a mistake by rejecting the null hypothesis. In other words: the *p*-value, although it is also referred to as the "probability of a type I error", is *not* the really interesting *false discovery rate*, i.e., the a posteriori probability, resulting from the analysis, of making a type I error if the null hypothesis is rejected.

To determine the *false discovery rate*, one needs, in addition to the *p*-value (also known as the false-positive rate), the true-positive rate (power) of a concrete alternative hypothesis; and for the alternative hypothesis and the null hypothesis, one needs probability information from outside the sample in the form of priors. Without power and priors, determining the *false discovery rate* is fundamentally *not* possible (Motulsky 2014).