Significance Testing in Natural and Biological Systems: A Review

It is generally accepted that the reproducibility crisis in the natural and biological sciences is due in part to the misuse of Null Hypothesis Significance Testing (NHST). We review the shortcomings in the use of NHST and then go beyond these to consider additional issues. Many natural systems are time-varying and some are scale-free, which calls for the design of new methods for such cases. We also consider the problem from the perspective of information efficiency; since three-way logic is superior to two-way logic in this sense, we argue that adding a third hypothesis may be beneficial in certain applications.


Introduction
The basis of human judgment is rational thinking within the framework of chosen beliefs, and it is clear that the judgment can be no better than the beliefs. One might think that data-based decisions, as in statistical reasoning or machine intelligence, do not suffer from shortcomings related to beliefs, but, in reality, that is not so. Seemingly neutral data-driven methods that are routinely used in many fields can lead to false judgments if used inappropriately.
The dominant statistical method in biomedical, social science and psychological research is null hypothesis significance testing (NHST). Scholars believe it is partly responsible for the replication crisis in fields such as social science, psychology, cognitive neuroscience, and biomedical science [1][2][3], yet it continues to be the default method [4][5][6].
The Open Science Collaboration (2015) attempted to replicate 100 landmark studies in the field of psychology, and in this set fewer than half yielded results sufficiently similar to the original claims. Though 97% of the original studies produced statistically significant results, only 36% of the replication studies did so [7]. Efforts to replicate research in other fields such as behavioral economics [8], medicine [9], genetics [10] and neuroscience [11] likewise produce poor results. A survey published in Nature in 2016 reported that more than 70% of researchers had tried and failed to reproduce another scientist's experiments [12]. Just a few years prior, researchers at the biotechnology firm Amgen reported that only 11% of the preclinical cancer studies they examined could be replicated [13].
It has been shown that it is surprisingly simple to increase the probability of obtaining positive results for false hypotheses using research practices that are considered conventional. One can easily be led to confirmation of the hypothesis, and "[i]n many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not." [14] In 2016, the American Statistical Association issued a statement of principles regarding the misuses and misinterpretations of NHST [15], stressing that "p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. Neither do they measure the size of an effect or the importance of a result, or provide a good measure of evidence regarding a model or hypothesis."
But the problem is more complicated than just the use of NHST, which assumes that population properties remain unchanged. In reality, diseases and physiological processes have periodicities [15], immune function changes with the seasons [16][17], and pathogens also peak at different times of the year [18]. Furthermore, many biological and natural processes are scale-invariant [19][20][21][22], so they do not follow the normal distribution that is implicitly assumed in many hypothesis tests; this requires the use of maximum-likelihood methods and other tests [23].
This article reviews these issues, which should be helpful in formulating new tests for hypothesis verification in systems with varying characteristics. It also reviews the mathematical basis of the result that three-way logic is superior to two-way logic [24][25][26] and, by extension, that testing three hypotheses is better than testing two. This provides the rationale for adding a third hypothesis to the experiment, which may be beneficial in certain natural systems applications, much as is done in many engineering applications.

Background
There are complex social reasons why popular significance tests continue to be used in spite of their well-known shortcomings. Amongst these reasons is that the investigators want quick confirmation of their hypothesis using a method that is widely used, which is consistent with the wish of the sponsors to monetize the claimed innovation as soon as possible.
Marcia Angell, who was for two decades the editor of The New England Journal of Medicine, said [27]: "It is simply no longer possible to believe much of the clinical research that is published, or to rely on the judgment of trusted physicians or authoritative medical guidelines." She believes that the drug companies are mainly responsible for this situation: "Over the past two decades the pharmaceutical industry has moved very far from its original high purpose of discovering and producing useful new drugs. Now primarily a marketing machine to sell drugs of dubious benefit, this industry uses its wealth and power to co-opt every institution that might stand in its way, including the US Congress, the FDA, academic medical centers, and the medical profession itself." [28] More recently, Richard Horton, editor of The Lancet, wrote that "The case against [medical] science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness." [29]
It is generally agreed that the biggest reasons behind irreproducible research are bad design and inherent biases. In a well-known paper, John Ioannidis claimed that most published results are false [30], and that they merely present the investigator's biases in a dressed-up manner. The way significance testing is used, it is hard to separate the good findings from the bad, which has implications for the usefulness of clinical research. "[E]xamination of the meta-analyses appearing in Psychological Bulletin from 1978 to 2006 shows that most employ a statistically inappropriate model for meta-analysis (the fixed effects model) and that 90% do not correct for the biasing effects of measurement error." [31]
To deal with inherent biases, of which some researchers may not even be consciously aware, there is a need for a new kind of design that approaches the hypothesis from different perspectives. It has been suggested that there should be a clear identification of the underlying uncertainty model associated with the scientific study [32][33][34][35], and some have even argued that NHST should be abandoned [36][37].

Data and hypothesis testing
A statistical hypothesis is a conjecture concerning the unknown probability distribution associated with the observed data X. The significance test is used to check the tenability of the hypotheses. Generally, the method requires reducing the data to a single numerical statistic T whose marginal probability distribution is closely connected to a main question of interest.
Before the start of the experiment, two hypotheses are defined:
1. Null Hypothesis, H0: there is no significant effect
2. Alternative Hypothesis, Ha: there is some significant effect
The data is used to calculate a p-value that is then compared with the critical value α (the significance level), which is set before starting the experiment. If the p-value is less than α, then the effect is deemed significant and the null hypothesis is rejected. If the p-value is greater than α, it is concluded that there is no significant effect and the null hypothesis is not rejected.
There are four possible outcomes, with two representing correct decisions and two representing errors (a type I error, or false positive, and a type II error, or false negative). The likelihood that a test will be able to detect a property in the data depends on the strength of that property in the population. In a research project, the investigator does not know this strength, for the estimation of this value may itself be one of the purposes of the study. Instead, the investigator must choose the size of the sample so as to be able to detect the property in the sample. In other words, the size of the sample is often correlated with the presence or absence of the effect.
One can reduce the risk of committing a type I error by using a lower value for α. However, by doing so one will increase the probability of false negatives (a type II error).
The probability of making a type II error is β, and this is related to the power of the statistical test (power = 1-β). One can decrease the risk of a type II error by choosing a sample size large enough to give the test sufficient power.
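The interplay of α, β, and sample size can be illustrated with a minimal sketch for a one-sided z-test with known population standard deviation. The function names are hypothetical, the data are assumed to be normally distributed, and the numerical values in the usage note are chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test of H0: mu = mu0 vs Ha: mu > mu0.

    `effect` is the assumed true shift mu - mu0 and `sigma` is the known
    population standard deviation (both are assumptions of this sketch).
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)   # rejection threshold on the z scale
    shift = effect * sqrt(n) / sigma          # mean of the z statistic under Ha
    return 1 - STD_NORMAL.cdf(z_alpha - shift)


def sample_size_one_sided_z(effect, sigma, alpha=0.05, power=0.8):
    """Smallest n for which the test reaches the requested power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / effect) ** 2)


if __name__ == "__main__":
    n = sample_size_one_sided_z(effect=8, sigma=16)
    print("required n:", n)
    print("power at alpha = 0.05:", round(power_one_sided_z(8, 16, n), 3))
    print("power at alpha = 0.01:", round(power_one_sided_z(8, 16, n, alpha=0.01), 3))
```

For a fixed n, lowering α from 0.05 to 0.01 reduces the computed power, which is the trade-off between type I and type II errors described above.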
The working of the test is clear if the probabilities associated with the population are known (as would be the case in an engineered system, but it is very unlikely in a biological or social system). Thus, in a hypothesis about the mean of a population, one can choose a sample size n, find the sample mean, and calculate the sample standard deviation if the population standard deviation is unknown. The population may be normally distributed with known or unknown variance, in which case n can be small, or it may not be normal, with known or unknown variance, in which case n must be large (typically ≥ 30). Clearly, one needs a larger n if the distribution is unknown.
One speaks of three forms of the hypothesis about the mean: the two-sided form, H0: μ = μ0 against Ha: μ ≠ μ0, and the two one-sided forms, in which Ha is μ > μ0 or μ < μ0. In each case the decision rule is:
Reject H0 if p-value < α
As an example, let there be a sample of size n = 16 to determine if the sample is representative of the larger population (H0), which has a mean of 100. Formally, H0: μ = 100, which means that we must take the p-value to be $2P(Z > |Z_c|)$, that is, one integrates both tails of the distribution. The larger population is known to have a standard deviation of 16. The sample has a mean of 108 and we use α = 0.05 (which is a popular choice in many social science experiments):
$$Z_c = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{108 - 100}{16/\sqrt{16}} = 2$$
The tail probability corresponding to Zc = 2 is the area under the normal distribution to the right of Z = 2 (Figure 1). This probability P(Z ≥ 2) may be read off from a table and it equals 0.023. Since 2 × 0.023 = 0.046 is less than α = 0.05, the null hypothesis may be rejected. This means that the sample is not representative of the larger population.
On the other hand, if the sample size were 8, then
$$Z_c = \frac{108 - 100}{16/\sqrt{8}} = 1.414$$
The tail probability corresponding to this is 0.079, and twice this value is larger than 0.05, so the null hypothesis is not rejected this time. A sample mean of 108 is more likely in the smaller sample of 8 than in the larger sample of 16, for in the smaller sample a value far from the mean can change the sample mean much more easily. If the hypothesis is defined in terms of just greater than or less than the statistic, only one tail of the standard normal distribution has to be integrated.
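The two cases of the example can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and the population is assumed to be normal with a known standard deviation of 16.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0 with known population sd `sigma`.

    Returns the z statistic, the two-sided p-value, and whether H0 is rejected.
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Two-sided p-value: twice the area in the tail beyond |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


if __name__ == "__main__":
    for n in (16, 8):
        z, p, reject = two_sided_z_test(108, 100, 16, n)
        print(f"n={n:2d}  z={z:.3f}  p={p:.3f}  reject H0: {reject}")
```

With the same sample mean of 108, the decision flips from rejection at n = 16 to non-rejection at n = 8, which is the sensitivity to sample size discussed in the text.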
This example shows how the choice of the specific members of the test sample can lead to vastly varying results. In an experiment where the investigator is looking for a new effect as represented by the experimental finding, there will be a tendency to privilege readings with the property over those that do not satisfy it. In other words, the hypothesis may be considered confirmed by the investigator's bias in favor of it and by casting out readings that go against the hypothesis.
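The effect of casting out unfavorable readings can be illustrated by simulation. The sketch below is an assumption-laden toy model, not a description of any particular study: readings are drawn from a normal population for which H0 is true (mean 100, standard deviation 16), a hypothetical investigator discards the lowest few readings, and a two-sided z-test is then applied to what remains.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def false_positive_rate(drop_lowest, trials=5000, n=16, mu0=100.0,
                        sigma=16.0, alpha=0.05, seed=1):
    """Fraction of trials that reject a true H0: mu = mu0.

    Before testing, the `drop_lowest` smallest readings are discarded,
    mimicking the casting out of readings that go against a hoped-for
    increase. All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        readings = sorted(rng.gauss(mu0, sigma) for _ in range(n))
        kept = readings[drop_lowest:]                 # biased selection step
        z = (fmean(kept) - mu0) / (sigma / sqrt(len(kept)))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print("no readings discarded :", false_positive_rate(0))
    print("two lowest discarded  :", false_positive_rate(2))
```

Without selection, the rejection rate stays near the nominal α; discarding even two of sixteen readings raises it noticeably, although no real effect is present.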
If the statistics are not normal, then one needs a larger sample. For unknown distributions, one can use distribution-free bounds such as Hoeffding's inequality or variations thereof, which provide an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount [38]. There will also be more complex situations where the elements switch characteristics [39], or where the underlying phenomenon is fractal or fractal-like [40]. Such cases will require investigation within the context of the application.
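As an illustration of how such a bound translates into a sample size, the sketch below uses the two-sided form of Hoeffding's inequality for the sample mean of n independent readings bounded in [lower, upper], namely P(|mean − E[mean]| ≥ t) ≤ 2 exp(−2nt²/(upper − lower)²). The function names and the numerical values are assumptions made for illustration.

```python
from math import ceil, exp, log


def hoeffding_bound(n, t, lower, upper):
    """Upper bound on P(|sample mean - expected mean| >= t) for n
    independent readings, each bounded in [lower, upper]."""
    return min(1.0, 2 * exp(-2 * n * t ** 2 / (upper - lower) ** 2))


def hoeffding_sample_size(t, delta, lower, upper):
    """Smallest n for which the bound above is at most delta."""
    return ceil((upper - lower) ** 2 * log(2 / delta) / (2 * t ** 2))


if __name__ == "__main__":
    # Readings in [0, 1]; mean wanted within 0.1 with probability >= 0.95.
    n = hoeffding_sample_size(t=0.1, delta=0.05, lower=0.0, upper=1.0)
    print("required n:", n)
    print("bound at that n:", round(hoeffding_bound(n, 0.1, 0.0, 1.0), 4))
```

The bound makes no assumption about the shape of the distribution, and the price is a larger sample than a test that assumes normality would need for the same guarantee.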

The power law
If the statistics follow a power law, which is often encountered in natural systems, then one also needs to define the hypotheses, and the corresponding significance test, in a different way.
Experiments have shown (e.g. [23] [25]) that many phenomena approximately follow the power law below for large values of x:
$$p(x) = a\,x^{-\alpha}$$
where $\alpha$ is a parameter whose value is typically in the range $2 < \alpha < 3$ (it is distinct from the significance level α used earlier), and a is a constant that is needed to normalize the distribution. The main characteristic of this distribution is its heavy-tailed nature, but the value for small x can have considerable variation with measurable effects (Figure 2). The distribution could be discrete or continuous.
For an experiment, the hypothesis could be whether the sample belongs to the green ($\alpha = 2.7$), the blue ($\alpha = 2.9$), or the orange ($\alpha = 2.1$) populations. In the case of a mixed population where all these samples are present in known or unknown proportions, the null hypothesis could, for example, be: Is the sample predominantly green? In this case, of course, the term "predominant" will have to be suitably defined.
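For the case where there is a lower bound $x_{\min}$ associated with the distribution, the power law may be written as below after finding the value of a that normalizes the distribution:
$$p(x) = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha}$$
Its moment $E(X^m)$ is easily found to be:
$$E(X^m) = \int_{x_{\min}}^{\infty} x^m\, p(x)\, dx = \frac{\alpha - 1}{\alpha - 1 - m}\, x_{\min}^{m}, \qquad m < \alpha - 1$$
It is clear that all moments for which $m \ge \alpha - 1$ diverge. Specifically, this power law has a well-defined mean only if $\alpha > 2$, and it has a finite variance only if $\alpha > 3$. This means that one cannot apply traditional statistics based on standard deviation.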
Another specific distribution, with an exponent close to 1, is Zipf's law for a discrete variable k, where N is the number of elements, k is the rank, and s is the exponent:
$$f(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}} \qquad (6)$$
In hypotheses related to the power law one could, in principle, determine modes of behavior correlated with the exponent. For a reliable estimate of the exponent of the power law distribution, maximum likelihood methods are used.
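A minimal sketch of maximum likelihood estimation for the exponent follows. It assumes a continuous power law with density proportional to x^(−α) for x ≥ x_min, with x_min known, for which the standard estimator is α̂ = 1 + n / Σ ln(x_i / x_min) with approximate standard error (α̂ − 1)/√n. The function names, the synthetic data, and the parameter values are illustrative assumptions.

```python
import random
from math import log, sqrt


def sample_power_law(alpha, x_min, n, rng):
    """Draw n values from a continuous power law by inverse-transform sampling."""
    # rng.random() lies in [0, 1), so the base (1 - u) lies in (0, 1].
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


def fit_alpha_mle(data, x_min):
    """Maximum likelihood estimate of the exponent and its approximate
    standard error, using only observations with x >= x_min."""
    tail = [x for x in data if x >= x_min]
    n = len(tail)
    alpha_hat = 1.0 + n / sum(log(x / x_min) for x in tail)
    return alpha_hat, (alpha_hat - 1.0) / sqrt(n)


if __name__ == "__main__":
    rng = random.Random(0)
    data = sample_power_law(alpha=2.5, x_min=1.0, n=20000, rng=rng)
    alpha_hat, std_err = fit_alpha_mle(data, x_min=1.0)
    print(f"estimated exponent: {alpha_hat:.3f} +/- {std_err:.3f}")
```

The estimate and its standard error could then be used to ask which of several candidate exponents is compatible with a sample. In practice x_min usually has to be estimated as well, which this sketch does not attempt.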

Multiple hypotheses
There is a deeper problem with hypothesis testing that requires attention. This is the question related to the nature of the population on which the null and the alternative hypotheses are being framed.
Binary hypotheses seem intuitive and natural: a person has a specific disease or does not have it. Populations are described in binary fashion in terms of gender or as adults versus children, although this can be enlarged to three or more classes. On the other hand, age is a continuous variable that can be mapped into hypotheses that are non-binary, where one might be looking for effects that are age-specific.
If one were dealing with a disease, in the general case, one can divide a sample of subjects into at least three classes:
1. Those who respond to the drug versus those who don't
2. Those who respond to placebo versus those who don't
3. Those who get well on their own versus those who don't
These classes may be seen as arising out of the mind-body connection [41][42] and the unknown aspects of the working of the immune system that lead to the placebo and nocebo effects [43][44][45].
Let us label the chosen hypotheses appropriate for the population under study as 1, 2, 3, …, d. If the hypotheses are well defined, then from considerations related to maximization of entropy, their individual probabilities should be equal to $1/d$. Therefore, the information associated with each hypothesis is $\ln d$. This argument has been generalized from the problem of dimensions where it has recently been used with surprising results [46][47][48].
Clearly, this information increases as d increases. But this increase must be balanced against the cost of using a larger number of hypotheses. Information efficiency per hypothesis is:
$$E(d) = \frac{\ln d}{d}$$
Its maximum value is obtained by taking the derivative of $E(d)$ and equating it to zero, that is, $(1 - \ln d)/d^2 = 0$. This yields $d = e = 2.71828\ldots$. In the consideration of hypotheses (as against dimensions of space, where noninteger dimensions are mathematically possible) we can only count in integers, and of the two integers adjacent to e, E(3) exceeds E(2). In other words:
Theorem. The optimal number of hypotheses based on information efficiency considerations is 3.
Table 1 gives the value of E(d) in bits (where the measure is $\log_2 d$) for d ranging from 2 to 5.
Table 1. Information efficiency E(d) in bits
d       2      3      4      5
E(d)    0.500  0.528  0.500  0.464
Use of three classes, rather than two, improves the information efficiency by about 5.6 percent.
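The efficiency values can be reproduced with a few lines of code. The sketch below assumes the definition E(d) = log2(d)/d, that is, the information per hypothesis in bits divided by the number of hypotheses; the function name is hypothetical.

```python
from math import log2


def information_efficiency(d):
    """Information efficiency in bits for d equiprobable hypotheses,
    taken here to be log2(d) / d."""
    return log2(d) / d


if __name__ == "__main__":
    for d in range(2, 6):
        print(f"d={d}  E(d)={information_efficiency(d):.3f} bits")
    best = max(range(2, 6), key=information_efficiency)
    gain = information_efficiency(3) / information_efficiency(2) - 1
    print("most efficient integer d:", best)
    print(f"gain of d=3 over d=2: {100 * gain:.2f} percent")
```

Because E(d) has a single maximum at d = e, only the neighboring integers 2 and 3 need to be compared to find the best integer value.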
It must be stressed that since the result is based on information, it can only be probabilistically true. What that means is that not all experiments will be better off with three hypotheses, and it is quite possible that a specific experiment will do better with two. This means that considerations of design will have to go into deciding what is the best course of action, unless mathematical criteria can be determined that lead to this judgment.
As an aside, it is significant that a ternary classification of patients as well as processes is used in at least one system of medicine (Ayurveda) [49][50][51]. A genomic correlation with the ternary classification has been demonstrated for that system [52]. The three-hypothesis approach may be seen to work in different ways. It could be yes/no/maybe or yes/no/indeterminate for a pathology, or it could be framed in terms of some other trichotomy.
Obviously, multiple hypotheses can present challenges of design. One will have to examine the analytical implications of modifying the regime of two testing classes to three. Various questions about this third class may be asked. What are the conditions under which the use of the third class is justified? The third class may make it easy to triangulate the study so that biases are minimized.
As an example, if seasonality of pathogens is a factor in an investigation, then in addition to the group of subjects who get the drug under study and the group that gets the placebo, there could also be a third group that gets neither the drug nor the placebo. This will make the seasonal relationship an important parameter of the investigation.
In some three-hypothesis test problems, the null hypothesis may concern the medial result. In textile engineering, the three-hypothesis test problem could be to decide whether the difference in strength between two yarns is zero (the null hypothesis), positive, or negative [53]. Although the null hypothesis is what one seeks here, the two alternatives provide information on the risks and costs. Sequential probability tests based on Bayesian optimality and generalized likelihood ratios have been devised [54][55][56] that may be useful in certain applications.
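The zero/positive/negative structure can be illustrated with a simple fixed-sample decision rule. The sketch below is not the sequential procedure of the cited works; it is a directional z-test under assumed conditions (two groups of equal size n, a common known standard deviation, normally distributed readings), and the function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def three_way_decision(mean_a, mean_b, sigma, n, alpha=0.05):
    """Classify the difference mean_a - mean_b as 'positive', 'negative',
    or 'zero' (meaning the null hypothesis of no difference is not rejected).

    Assumes two groups of n readings each with common known sd `sigma`.
    """
    z = (mean_a - mean_b) / (sigma * sqrt(2.0 / n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    if z > z_crit:
        return "positive"
    if z < -z_crit:
        return "negative"
    return "zero"


if __name__ == "__main__":
    # Illustrative yarn-strength means in arbitrary units.
    print(three_way_decision(11.0, 10.0, sigma=1.0, n=30))
    print(three_way_decision(10.0, 11.0, sigma=1.0, n=30))
    print(three_way_decision(10.1, 10.0, sigma=1.0, n=30))
```

The rule returns one of three labels rather than a binary reject/accept outcome, so the direction of a detected difference is part of the decision itself.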

Conclusions
This paper is a review of issues with significance testing for natural systems, such as those in biological and social science applications. In experiments on previously unknown effects, where the distribution related to the effect is unknown, the investigator's confirmation bias can easily affect null hypothesis significance testing and, therefore, its use can lead to erroneous results.
Most methods assume that the data is representative of stable characteristics associated with the population, but this may not be true, as in the case of seasonal variation of pathogens or the immune response. Issues with significance testing for power law distributions were described. The paper also looked at the problem of hypothesis testing from the perspective of information efficiency and argued that, at least for some problems, the use of a third hypothesis may be called for since three-way logic is superior to two-way logic. How these three hypotheses may be defined would depend on the nature of the problem and would require further theoretical and experimental investigations.