Psychology’s Theory Crisis Must Be Addressed

Many published psychological findings cannot be replicated. Diagnoses of and solutions to this “replication crisis” have focused largely on the methods of data collection, analysis, and reporting. We contend that a further contributor to low replicability is the often weak logical link between theories and the hypotheses tested empirically. Discovery-oriented research should be distinguished from theory-testing research. In discovery-oriented research, theories do not strongly imply hypotheses that can be rigorously tested; rather, they define a search space for the discovery of effects that would support them. Failure to find these effects does not cast doubt on the theory. This endeavor inevitably produces Type I errors, which result in published findings that cannot be replicated. In contrast, theory-testing research relies on theories that strongly imply hypotheses, so that disconfirming a hypothesis provides evidence against the theory. Theory-testing research is less prone to Type I errors. Theories form a strong link with hypotheses when they are formalized as computational models. In light of this distinction, some of the proposals for addressing the “replication crisis,” including the preregistration of hypotheses and analysis plans, should be revisited.

Psychology has a problem. Over the last decade, many previously accepted findings have failed to replicate (Marsman et al., 2017; Open Science Collaboration, 2015). Numerous recommendations have been made on how to deal with this “replication crisis” (Asendorpf et al., 2013; Munafò et al., 2017). Most of these recommendations focus on our methods of data collection, analysis, and publication. In this paper we argue that the replication crisis is exacerbated by the prevalence of theories that have only a weak logical connection to the hypotheses tested empirically. We therefore believe that current recommendations do not go far enough, and that dealing with the crisis requires attending to theories at least as much as to the way we generate data.

We use theories (T) to derive hypotheses, often referred to as predictions, which claim that some empirical generalizations X, Y, or Z are true. Empirical results in turn license inferences about theories: theories are supported when results are consistent with the hypothesized outcomes and cast into doubt when they are not.

It is perhaps not surprising that much of the discussion of the root causes of the replication crisis, and of how to fix it, focuses on formalizing inductive inference at the empirical level: Several reasons have been given for the lack of trustworthiness of the empirical generalizations we draw from data, including studies with insufficient power (Button et al., 2013), problems with null-hypothesis significance testing (Wagenmakers, 2007), p-hacking (Simmons, Nelson, & Simonsohn, 2011), and publication bias (Ferguson & Heene, 2012). These critical reflections on this part of the scientific reasoning cycle are extremely valuable. At the same time, we contend that the weaknesses of our inferences on the theoretical level have received far too little attention (for two commendable exceptions, see Fiedler, 2017; Muthukrishna & Henrich, 2019), and that these theoretical weaknesses exacerbate the replication crisis.

Our goal in this paper is to shed light on the steps in the research process that lead to results that cannot be reproduced. The first step is to start from a theory that leads researchers to assume that phenomena of a certain kind can be observed under certain conditions. A theory of embodiment priming can illustrate our case. The theory incorporates the following core assumptions: (1) All abstract concepts are grounded in bodily states, sensations, or movements. (2) Experiencing the bodily states, sensations, or movements that ground a given concept activates (primes) that concept. (3) The activated concept influences the behavior associated with it. From this theory it is hypothesized that the concepts primed by a bodily state, sensation, or movement can bias people’s judgments and decisions.

To test this hypothesis empirically, we must search a vast space of possible tests. There are many abstract concepts that could be grounded in bodily states, sensations, and movements, and each such presumed embodiment can be experimentally induced in numerous ways. In addition, each abstract concept could influence many different judgments or decisions, each of which could be examined for the predicted bias. Every combination of these possibilities is a potential test of the embodiment priming theory (EPT). For instance, researchers could test whether having people turn kitchen-paper rolls clockwise activates an orientation toward the future, priming the concept of novelty so that they score higher on the personality scale “openness to experience” (Topolinski & Sparenberg, 2012). Or they could test whether people who hold a cup of hot coffee for just a few seconds subsequently rate another person as “warmer” than people who hold a cup of iced coffee (Williams & Bargh, 2008).

If the theory is true, it does not imply that the predicted bias will be found in every possible test in that space; it implies only that the bias is present in some subset of possible tests. Each test revealing the predicted bias therefore supports the theory, but a test failing to reveal the bias does not disconfirm it: the failure may merely show that the particular combination of a concept, its assumed grounding, the chosen manipulation, and the chosen judgment is not an informative test of the theory. When a test fails, researchers are therefore inclined to ask “what went wrong” with the study rather than to revise the theory: Perhaps the assumption about how the concept in question is embodied was incorrect, or the manipulation failed to elicit the relevant bodily state. In either case, the researcher can write the failure off as uninformative and move on to another region of the search space.

We would be wrong to dismiss this type of research just because the examples cited above did not hold up in replications (Lynott et al., 2014; Wagenmakers et al., 2015). Searching for exoplanets, screening for new drugs, and searching for the neural correlates of a psychological phenomenon are research programs that follow the same principle. We refer to this mode of research as discovery-oriented research. What causes this type of research to yield nonreplicable results is the second step in the sequence: conducting an extensive search through the vast space of possible tests while evaluating the evidence from each test by conventional statistical standards (e.g., a p value < .05; Pashler & Harris, 2012). Those inferential standards, however, were designed for a different kind of research: theory testing.

To clarify the difference between the two types of research, we introduce some notation: T denotes the underlying theory, X denotes a testable hypothesis (i.e., the proposition that a particular experimental effect or correlation exists in the population), and x denotes evidence from an individual study supporting hypothesis X (e.g., an experiment yielding a significant effect as expected from X).

Now consider theory-testing research. This type of research begins with a theory that licenses strong inferences to hypotheses. Whereas discovery-oriented research needs theories only to delineate a search space, theory-testing research needs theories that strongly imply testable hypotheses. Consider a temporal-context theory of episodic memory such as SIMPLE (Brown, Neath, & Chater, 2007). It implies that extending the (filled or unfilled) delay between encoding and retrieval reduces a memory list’s temporal distinctiveness, which makes accurate retrieval less likely. SIMPLE formalizes the core assumptions of temporal-context theories as a set of equations, from which this hypothesis can be derived mathematically. Because of this strong logical link between theory and hypothesis, establishing X as an empirical generalization supports theory T, and showing X to be false provides evidence against T (see, e.g., Lewandowsky, Duncan, & Brown, 2004, for evidence against the prediction from SIMPLE mentioned above). This is why this type of research deserves the label theory testing: it can yield strong evidence both for and against a theory.
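
To make the logical link between SIMPLE’s assumptions and the delay hypothesis concrete, here is a minimal sketch of the kind of distinctiveness computation used in SIMPLE-style models: items are represented by the logarithm of the time elapsed since their presentation, and an item becomes harder to retrieve as its log-temporal distance to other items shrinks. The function name and parameter value are our own illustrative choices, not code from the original model.

```python
import numpy as np

def distinctiveness(times_since_presentation, c=1.0):
    """Temporal distinctiveness of each list item, SIMPLE-style.

    Items live on a logarithmic temporal dimension; the similarity of
    items i and j is exp(-c * |log T_i - log T_j|), and an item's
    distinctiveness is its self-similarity (1) divided by its summed
    similarity to all items. The value of c is an assumed illustration.
    """
    log_t = np.log(times_since_presentation)
    dist = np.abs(log_t[:, None] - log_t[None, :])   # pairwise log-temporal distances
    similarity = np.exp(-c * dist)                   # similarity matrix
    return 1.0 / similarity.sum(axis=1)              # higher = more retrievable

# Five items presented 2 s apart, retrieved after a short vs. a long delay.
positions = np.arange(5) * 2.0
for delay in (5.0, 60.0):
    times = delay + (positions.max() - positions)    # time since each item's presentation
    print(f"delay = {delay:5.1f} s, mean distinctiveness = {distinctiveness(times).mean():.3f}")
```

Under these assumptions, lengthening the delay compresses the items’ log-temporal distances and lowers their distinctiveness, which is exactly the hypothesis derived from the theory above.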

Conventional criteria of evidence for an effect, such as α = .05 and power of .80, result in an acceptable rate of false positives in theory-testing research but an unacceptable rate in discovery-oriented research. False positives, by their nature, are unlikely to be replicated. Some of the replication problems in psychology therefore arise because many studies in the field are discovery-oriented, yet the evidentiary criteria used in those studies are suited to theory testing rather than to discovery.
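
The arithmetic behind this claim can be made explicit with a hypothetical calculation. If only a small fraction of the tests explored in a discovery-oriented search probe a real effect, then even with α = .05 and power of .80 a substantial share of the significant results will be false positives; the prior probabilities below are illustrative assumptions, not estimates from the literature.

```python
# False-discovery arithmetic for a single significance criterion.
# alpha: Type I error rate; power: probability of detecting a real effect;
# prior: proportion of tested hypotheses that are actually true (assumed values).
alpha, power = 0.05, 0.80

for prior in (0.5, 0.1, 0.02):            # theory testing vs. increasingly blind discovery
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    fdr = false_positives / (true_positives + false_positives)
    print(f"prior P(effect) = {prior:4.2f} -> "
          f"share of significant results that are false positives = {fdr:.2f}")
```

With half of the tested hypotheses true, only about 6% of significant results are false positives; with 2% true, the clear majority are.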

There are good reasons why reducing the risk of Type I errors, or “false positives,” figures prominently in current proposals to address the “replication crisis”: that focus is appropriate for discovery-oriented research. Theory-testing researchers, in contrast, should want to make the most of the evidence at hand, and they should be as interested in disconfirming a hypothesis as in confirming it. When a hypothesis is disconfirmed, theory-testing research provides evidence against the theory in question (right panel in Fig. 2). To exploit this high disconfirmatory diagnosticity, we need methods that can deliver evidence against a hypothesis. Null-hypothesis significance testing does not provide such tools, because it cannot deliver evidence for the null hypothesis.
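
One family of tools that can express evidence for a null hypothesis is the Bayes factor. The sketch below uses the BIC approximation discussed by Wagenmakers (2007), cited earlier, to compare a null model (one common mean) with an alternative (two group means) on simulated null data; the data and model setup are our own illustration, not an analysis from the sources cited.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=50)          # two groups drawn from the SAME distribution
b = rng.normal(0.0, 1.0, size=50)
y = np.concatenate([a, b])
n = y.size

def bic_gaussian(residuals, n_params, n):
    """BIC of a Gaussian model, computed from its residual sum of squares."""
    rss = np.sum(residuals ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)

# H0: one common mean; H1: separate means for the two groups.
bic_h0 = bic_gaussian(y - y.mean(), n_params=1, n=n)
bic_h1 = bic_gaussian(np.concatenate([a - a.mean(), b - b.mean()]), n_params=2, n=n)

# BIC approximation to the Bayes factor in favor of H0 (Wagenmakers, 2007).
bf01 = np.exp((bic_h1 - bic_h0) / 2)
print(f"BF01 = {bf01:.2f}  (values > 1 favor the null hypothesis)")
```

A nonsignificant p value on the same data would merely fail to reject the null, whereas the Bayes factor quantifies how strongly the data favor it.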

Another common suggestion for dealing with the replication crisis is to draw a clear distinction between exploratory and confirmatory research, with the understanding that only the latter can provide strong evidence for a hypothesis (Wagenmakers et al., 2018a; Wagenmakers et al., 2012). This recommendation is often accompanied by a call to preregister hypotheses and analysis plans. Although this exploratory-confirmatory contrast resembles our distinction between discovery-oriented and theory-testing research, it is usually defined in a different and, we argue, unhelpful way. Results are considered confirmatory only if the hypotheses and the data-analysis plan were fixed before the data were analyzed; hypotheses and analysis decisions made after looking at the data are considered exploratory. The rationale is that exploratory research invites confirmation bias, whether the researcher intends it or not: when the data inform which hypothesis to test, or which combination of data transformation and statistical procedure to use, scientists eager to report effects may be tempted to choose whichever option confirms an expected effect. A common target of this criticism is HARKing, the practice of presenting post hoc hypotheses as if they had been formulated in advance (Kerr, 1998).

This is a valid criticism. In our view, however, its focus on the chronological order in which researchers specify their hypotheses and analysis plans relative to when they interrogate the data is counterproductive (see also Rubin, 2017b, for a similar critique). Fixing hypotheses and analysis plans only after analyzing the data, thereby reversing the prescribed order, is treated as bad practice, and preregistration is promoted as the solution because it certifies that the steps were carried out in the “correct” order. This way of defining the problem and its solution remains superficial, because it uses an easily observable criterion (temporal order) as a proxy for the distinction that actually matters (the distinction between justified and arbitrary hypotheses and analysis procedures).

In the philosophy of science, the question of whether the temporal order of theory and evidence has any bearing on how strongly the evidence supports the theory is known as the paradox of predictivism (Barnes, 2008). The paradox arises from two conflicting intuitions. First, a theory appears to receive more support from a new finding it predicted in advance than from an existing finding it merely explains after the fact. Second, the evidential value of a finding for a theory should not depend on historical accidents, such as when a theorist first learned of an empirical finding and when she first thought of a theory that predicts or explains that finding. The second intuition reflects the view that the degree to which a piece of empirical evidence supports a theory is a logical relation between the two, which cannot depend on such historical contingencies.

What, then, is preregistration good for? Preregistration serves to limit “researcher degrees of freedom,” that is, researchers’ choices among large sets of equally defensible hypotheses to test and analysis plans for testing them. In the classical framework of null-hypothesis testing, the dominant statistical approach in psychology, such freedom has been argued to inflate the Type I error rate through uncontrolled multiple testing (de Groot, 1956/2014). More generally, when researchers choose hypotheses or analysis paths (e.g., data preprocessing and statistical model choices) that lead to a desired outcome, they open the door to inadvertent biases (Wagenmakers et al., 2012).

By limiting researcher freedom, preregistration plays an important role in preventing a number of fallacies of scientific inference. But when it is applied mechanically, by preregistering hypotheses and analysis plans with little regard to their justification, it remains a cure for the symptoms rather than a solution to the root problem. This raises the question: Where do the excessive degrees of researcher freedom come from, and can we reduce them systematically rather than through an arbitrarily chosen decision that we privilege by uploading it to a preregistration repository?

Researcher degrees of freedom exist at both levels of scientific inference, the empirical and the theoretical (see Fig. 1). On the empirical level, numerous data transformations and data-analysis tools are at our disposal, and many of them can be justified. On the theoretical level, degrees of freedom arise when a theory is compatible with a wide range of hypotheses, some of them even contradictory. On both levels, we contend, there are more principled remedies than preregistering an arbitrary choice, but the remedies differ between levels, so we discuss them separately. We begin with the specific problem of Type I error inflation in null-hypothesis testing and then turn to the more general problem that researcher freedom invites inadvertent bias; in each case we consider first the choice among data-analysis options and then the choice of hypotheses to test.

Multiple testing comes in two forms (Rubin, 2017a) that map onto our two levels of inference, the empirical and the theoretical. In Case 1, a researcher conducts multiple analyses of the same hypothesis in the hope of finding a statistically significant result. This case concerns the misuse of researcher freedom on the empirical level: known as “p-hacking,” it inflates the Type I error rate for the hypothesis under investigation. Replacing null-hypothesis testing with Bayesian statistics does not resolve the issue: bias is equally inevitable when a researcher runs multiple analyses of the same hypothesis and reports the one yielding the largest Bayes factor for the preferred hypothesis. In both statistical approaches, preregistration of an analysis plan avoids this bias by reducing the number of tests to one.
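
A small simulation can make the inflation concrete. In the hypothetical scenario below there is no true effect, yet a researcher who tries several defensible analysis variants (different outlier cutoffs) and reports whichever yields the smallest p value rejects the null noticeably more often than the nominal 5%; the cutoffs and sample size are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 5000, 30
cutoffs = [None, 1.5, 2.0, 2.5]        # candidate outlier-exclusion rules (in SD units)

def p_value(a, b, cutoff):
    """Two-sample t test after optionally trimming |z| > cutoff within each group."""
    if cutoff is not None:
        a = a[np.abs(stats.zscore(a)) < cutoff]
        b = b[np.abs(stats.zscore(b)) < cutoff]
    return stats.ttest_ind(a, b).pvalue

hits_single = hits_best = 0
for _ in range(n_sims):
    a = rng.normal(size=n_per_group)   # both groups drawn from the same null distribution
    b = rng.normal(size=n_per_group)
    ps = [p_value(a, b, c) for c in cutoffs]
    hits_single += ps[0] < 0.05        # sticking with one preregistered analysis
    hits_best += min(ps) < 0.05        # cherry-picking the most favorable analysis

print(f"single preregistered analysis: false-positive rate = {hits_single / n_sims:.3f}")
print(f"best of {len(cutoffs)} analyses: false-positive rate = {hits_best / n_sims:.3f}")
```

In typical runs of this sketch the best-of-four rate climbs above the nominal level while the single preregistered analysis stays near it.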

When there are multiple equally justifiable paths to a statistical inference goal (e.g., testing a given hypothesis), the best solution is to run all equally justifiable analyses and record the extent to which they lead to the same results. If the number of options is too large to run them all, one can run a sample of different analysis plans (similar to a sensitivity analysis; Thabane et al., 2013). The consistency of results across analyses shows how robust the conclusions are to analytic choices that should not matter. A good illustration of this approach is the “multiverse” analysis of Steegen, Tuerlinckx, Gelman, and Vanpaemel (2016), which examined the robustness of inferences from a data set across a variety of data-preprocessing decisions. We believe this strategy should be applied to all analysis-related decisions (e.g., concerning outlier treatment, the statistical model to be tested, and the inclusion of independent variables and covariates). With a multiverse analysis, the researcher avoids Type I error inflation by drawing conclusions not from a single test but from the full set of tests that were run.
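
Continuing the hypothetical example above, a minimal multiverse analysis would run every combination of the defensible preprocessing decisions and report all of the results together rather than selecting one; the two decision axes below (outlier rule and log transformation) are assumed for illustration and are not taken from Steegen et al. (2016).

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.lognormal(mean=0.0, sigma=0.5, size=40)    # illustrative skewed data, group A
b = rng.lognormal(mean=0.2, sigma=0.5, size=40)    # group B with a small true difference

outlier_cutoffs = [None, 2.5, 3.0]                 # exclude |z| above cutoff, or keep all
transforms = {"raw": lambda x: x, "log": np.log}   # analyze raw or log-transformed values

def preprocess(x, cutoff, transform):
    x = transform(x)
    if cutoff is not None:
        x = x[np.abs(stats.zscore(x)) < cutoff]
    return x

# Run the full multiverse and keep every result, not just the most favorable one.
results = []
for cutoff, (t_name, t_fun) in itertools.product(outlier_cutoffs, transforms.items()):
    p = stats.ttest_ind(preprocess(a, cutoff, t_fun), preprocess(b, cutoff, t_fun)).pvalue
    results.append((t_name, cutoff, p))
    print(f"transform={t_name:3s}  cutoff={str(cutoff):4s}  p={p:.3f}")

significant = sum(p < 0.05 for _, _, p in results)
print(f"{significant} of {len(results)} specifications significant at .05")
```

Reporting the whole grid, or a summary of it, lets readers judge whether a conclusion hinges on preprocessing choices that should not matter.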

Contrast this with the currently recommended gold standard of preregistering a single analysis plan and relying solely on it as strong “confirmatory” evidence (Nosek, Ebersole, DeHaven, & Mellor, 2018). If multiple analysis plans are equally justifiable before looking at the data, preregistering one of them is an arbitrary choice. The single preregistered analysis might miss an interesting pattern that, say, 90% of the other equally justifiable analysis plans would have revealed; if we stick to the preregistered strategy, we will never know. And if we deviate from that strategy, we enter “exploratory” territory, where the results carry less weight in the eyes of researchers who treat preregistration as a mark of quality; indeed, within the framework of null-hypothesis testing, exploratory analyses have no evidential value for the hypothesis under investigation (de Groot, 1956/2014).

Researcher degrees of freedom must be reduced at both levels of scientific inference. On the empirical level, we recommend that researchers probe the robustness of their inferences by running them through the set of equally justifiable analysis decisions, and that they make their raw data publicly available whenever possible (see Lewandowsky & Bishop, 2016, for boundary conditions). On the theoretical level, there are two routes to reducing researcher degrees of freedom. The first is to do discovery-oriented research, but to do it right: researchers who follow this route acknowledge that their hypotheses have a low prior probability, because the current state of theorizing in their field does not permit strong inferences to them, and they therefore establish any new empirical generalization with large samples and/or direct replications before treating it as sufficiently credible. The second route is theory-testing research. Researchers who pursue this route aim to link the theory and the hypotheses derived from it as tightly as possible, which is greatly aided by expressing the theory formally. On this route, direct replications are less important than successive tests of different hypotheses derived from the theory.

Theory-testing research may appear out of reach for many psychologists, because some subdisciplines of psychology have a more established tradition of formal modeling than others, and formulating a theory precise enough to permit strong inferences to hypotheses can be difficult. We contend, however, that researchers can always take steps toward formalizing their theoretical ideas. Formal models can be specified at various levels of abstraction, so a formal theory need not describe mechanisms and processes in great detail. For instance, a path diagram or a Bayesian network (Glymour, 2003) can make explicit the assumed monotonic causal links between continuous variables, or the probabilistic dependencies between discrete variables. Such a model would also have to include assumed moderator variables, boundary conditions, and other auxiliary assumptions explicitly. Theorists often hesitate to make these additional assumptions explicit because they are so uncertain about them that spelling them out feels like arbitrary guesswork.

In these situations, the uncertainty should be incorporated into the model, and the Bayesian modeling framework is the most natural way to do so: uncertainty can be expressed through priors. Priors are typically placed on free parameters that can take a wide range of values, expressing our degree of uncertainty about a given quantity. But priors can also express uncertainty about discrete modeling choices, such as the choice among different functional forms for the relation between two variables (e.g., linear, exponential, or a power function); the prior is then a probability distribution over the set of discrete options. Uncertainty about model assumptions is therefore not the same as vagueness: uncertainty can be expressed explicitly and formally, whereas vagueness cannot, and so uncertainty is no excuse for leaving model assumptions vague.
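
As a sketch of what a prior over discrete modeling choices might look like, the toy model below treats the functional form linking a predictor to an outcome (linear, exponential, or power) as an unknown with a uniform prior, and updates that prior by comparing each form’s fit to simulated data via a crude BIC approximation; the candidate forms, priors, and data-generating settings are all illustrative assumptions, not a model from the literature.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)
x = np.linspace(1.0, 5.0, 60)
y = 2.0 * x ** 0.5 + rng.normal(0.0, 0.4, size=x.size)   # data secretly generated by a power law

# Three candidate functional forms for the x-y relation, each with two free parameters.
forms = {
    "linear":      lambda x, a, b: a + b * x,
    "exponential": lambda x, a, b: a * np.exp(b * x),
    "power":       lambda x, a, b: a * x ** b,
}
prior = {name: 1.0 / len(forms) for name in forms}        # uniform prior over the discrete options

def log_marglik_bic(f, x, y, k=3):
    """Crude BIC-based approximation to a model's log marginal likelihood."""
    params, _ = curve_fit(f, x, y, p0=[1.0, 0.5], maxfev=10000)
    rss = np.sum((y - f(x, *params)) ** 2)
    n = x.size
    bic = n * np.log(rss / n) + k * np.log(n)             # k parameters incl. noise SD
    return -bic / 2

log_post = {name: np.log(prior[name]) + log_marglik_bic(f, x, y) for name, f in forms.items()}
norm = np.logaddexp.reduce(list(log_post.values()))
for name, lp in log_post.items():
    print(f"P({name:11s} | data) ~ {np.exp(lp - norm):.3f}")
```

A fuller treatment would integrate over the parameter priors rather than lean on the BIC shortcut, but the structure is the point: the choice of functional form is itself part of the model, with its uncertainty stated explicitly rather than left vague.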
