2  Benchmarks

What makes an effect size “large” or “small” depends entirely on the context of the study in question. Still, loose criteria can be useful for guiding researchers in communicating effect size estimates. Jacob Cohen (1988), a pioneer of power analysis and effect size estimation, suggested many of the conventional benchmarks (i.e., ways of referring to an effect size other than using a number) that we use today. However, Cohen (1988) noted that labels such as “small”, “medium”, and “large” are relative, and that in referring to the size of an effect, the discipline, the research context, and the research method and goals should take precedence over benchmarks whenever possible. Effect sizes generally differ across disciplines, and within each discipline they differ depending on study designs, research methods (Schäfer and Schwarz 2019), and goals; as Glass, McGaw, and Smith (1981) explain:

Depending on what benefits can be achieved at what cost, an effect size of 2.0 might be “poor” and one of .1 might be “good.”

Therefore, it is crucial to recognize that benchmarks are only general guidelines and are, importantly, divorced from context. They also tend to attract controversy (Glass, McGaw, and Smith 1981; Kelley and Preacher 2012; Harrell 2020). Note that researchers have suggested field-specific empirical benchmarks. For social psychology, these alternative benchmarks, obtained by meta-analyzing the literature, are typically smaller than what Cohen put forward. Although such field-specific effect size distributions provide an overview of the observed effect sizes, they do not by themselves support interpreting the magnitude of an effect (see Panzarella, Beribisky, and Cribbie 2021). To examine the magnitude of an effect, the specific context of the study at hand needs to be taken into account (Cohen 1988, pp. 532–535). Please refer to the table below:

| Effect Size | Reference | Small | Medium | Large |
|---|---|---|---|---|
| **Mean Differences** | | | | |
| Cohen’s \(d\) or Hedges’ \(g\) | Cohen (1988)¹ | 0.20 | 0.50 | 0.80 |
| | | 0.18 | 0.37 | 0.60 |
| | Lovakov and Agadullina (2021)² | 0.15 | 0.36 | 0.65 |
| Correlation Coefficient (\(r\)) | Cohen (1988) | .10 | .30 | .50 |
| | Richard, Bond Jr., and Stokes-Zoota (2003)³ ⁴ | .10 | .20 | .30 |
| | Lovakov and Agadullina (2021) | .12 | .24 | .41 |
| | Paterson et al. (2016) | .12 | .20 | .31 |
| | Bosco et al. (2015) | .09 | .18 | .26 |
| Cohen’s \(f^2\) | Cohen (1988) | .02 | .15 | .35 |
| Eta-squared (\(\eta^2\)) | Cohen (1988) | .01 | .06 | .14 |
| Cohen’s \(f\) | Cohen (1988) | .10 | .25 | .40 |
| Cohen’s \(w\) | Cohen (1988) | 0.10 | 0.30 | 0.50 |
| Phi (\(\phi\)) | Cohen (1988) | .10 | .30 | .50 |
| Cramer’s \(V\) | ⁵ | | | |
| Cohen’s \(h\) | Cohen (1988) | 0.2 | 0.5 | 0.8 |
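Because the table above is just a set of cutoffs, applying a given row is a simple threshold lookup. The sketch below is a hypothetical Python helper (the function name and defaults are illustrative, not from the source); the default thresholds are Cohen’s (1988) conventions for \(d\), and any other row of the table can be swapped in:

```python
# Hypothetical helper: attach a verbal label to an effect size
# using a benchmark row from the table above. Remember that these
# labels are loose guidelines, not context-aware judgments.

def label_effect(value, thresholds=(0.20, 0.50, 0.80),
                 labels=("negligible", "small", "medium", "large")):
    """Return the verbal label for |value| given ascending thresholds."""
    magnitude = abs(value)
    label = labels[0]
    for cutoff, name in zip(thresholds, labels[1:]):
        if magnitude >= cutoff:
            label = name
    return label

print(label_effect(0.45))  # Cohen's d, Cohen (1988) row -> "small"
# Same value judged against the Lovakov and Agadullina (2021) r row:
print(label_effect(0.22, thresholds=(0.12, 0.24, 0.41)))  # -> "small"
```

Note how the same number can earn a different label under a different row of the table, which is exactly why the reference column matters.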

It should be noted that small/medium/large effects do not necessarily have small/medium/large practical implications (for details see Coe 2012; Pogrow 2019). These benchmarks are more relevant for guiding our expectations; whether an effect has practical importance depends on context. To assess practical importance, it is always desirable to translate standardized effect sizes into increases/decreases in raw units (or any meaningful units) or into a Binomial Effect Size Display (roughly, a difference in proportions, such as the success rate before and after an intervention). Reporting unstandardized effect sizes is not only beneficial for interpretation; they are also more robust and easier to compute (Baguley 2009). Additionally, useful ways to examine the magnitude of, for example, a Cohen’s d include U3, percentage overlap, probability of superiority, and the number needed to treat (for nice visualizations see https://rpsychologist.com/cohend/, Magnusson 2023).
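Under the usual assumptions of normality and equal variances, several of these translations of Cohen’s \(d\) have closed forms: U3 is \(\Phi(d)\), the overlapping proportion is \(2\Phi(-|d|/2)\), and the probability of superiority is \(\Phi(d/\sqrt{2})\). A minimal Python sketch of these conversions, plus the Binomial Effect Size Display for a correlation \(r\) (function names are my own, not from the source):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def cohens_d_views(d):
    """Translations of Cohen's d, assuming normality and equal variances."""
    return {
        # Share of the treated group scoring above the control mean.
        "U3": phi(d),
        # Overlapping proportion of the two distributions.
        "overlap": 2.0 * phi(-abs(d) / 2.0),
        # P(a random treated score exceeds a random control score).
        "prob_superiority": phi(d / sqrt(2.0)),
    }

def besd(r):
    """Binomial Effect Size Display: 'success rates' implied by r."""
    return {"treatment": 0.50 + r / 2.0, "control": 0.50 - r / 2.0}

print(cohens_d_views(0.5))  # U3 ~ .69, overlap ~ .80, superiority ~ .64
print(besd(0.30))           # treatment .65 vs control .35
```

For a “medium” \(d\) of 0.5, the two distributions still overlap by roughly 80%, which illustrates why a verbal label alone can mislead about practical impact.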

To further assess the practical importance of observed effect sizes, it is necessary to establish the smallest effect size of interest (SESOI; Lakens, Scheel, and Isager 2018) for each specific field. Cohen’s benchmarks, field-specific benchmarks, and published findings are not recommended for establishing the SESOI because they do not convey information about the practical relevance or magnitude of an effect size (Panzarella, Beribisky, and Cribbie 2021). Recent work in various areas of psychology has established the SESOI through anchor-based methods (Anvari and Lakens 2021), consensus methods (Riesthuis et al. 2022), and cost-benefit analyses (see Otgaar et al. 2022, 2023). These approaches are frequently implemented successfully in medical research (e.g., Heijde et al. 2001), and the recommendation is, ideally, to implement the various methods simultaneously to obtain a precise estimate of the smallest effect size of interest (termed the minimally clinically important difference in the medical literature; Bonini et al. 2020). Interestingly, the minimally clinically important difference (MCID; the smallest effect that patients perceive as beneficial [or harmful]; McGlothlin and Lewis 2014) is sometimes even deemed a low bar, and other measures are encouraged, such as the patient acceptable symptomatic state (PASS; the level of symptoms a patient will tolerate while still accepting their symptom state, which can be used to examine whether a treatment leads to a state patients consider acceptable; Daste et al. 2022), substantial clinical benefit (SCB; an effect that leads patients to self-report significant improvement; Wellington et al. 2023), and maximal outcome improvement (MOI; similar to MCID, PASS, and SCB, except that scores are normalized by the maximal improvement possible for each patient; Beck et al. 2020; Rossi, Brand, and Lubowitz 2023).

Please also note that only zero means no effect. An effect of size .01 is an effect, but a very small (Sawilowsky 2009), and likely unimportant, one. It makes sense to say “we failed to find evidence for rejecting the null hypothesis,” “we found evidence for only a small/weak-to-no effect,” or “we did not find a meaningful effect.” It does not make sense to say “we found no effect.” Given the random nature of our universe, it is hard to imagine obtaining an exactly zero effect. This is also related to the crud factor, the idea that “everything correlates with everything else” (Orben and Lakens 2020, 1; Meehl 1984); the practical implications of very weak/small correlations between some variables may nonetheless be limited, and whether such effects are reliably detected depends on statistical power.

  1. Sawilowsky (2009) expanded Cohen’s benchmarks to include very small effects (\(d\) = 0.01), very large effects (\(d\) = 1.20), and huge effects (\(d\) = 2.0). Note, however, that very large and huge effects are very rare in experimental social psychology.↩︎

  2. According to Lovakov and Agadullina (2021), a meta-analysis of effect sizes in social psychology studies: “It is recommended that correlation coefficients of .1, .25, and .40 and Hedges’ \(g\) (or Cohen’s \(d\)) of 0.15, 0.40, and 0.70 should be interpreted as small, medium, and large effects for studies in social psychology.”↩︎

  3. Note that, for paired samples, this does not refer to the probability of an increase/decrease within pairs but rather to the probability that a randomly sampled value of X exceeds a randomly sampled value of Y. This is also referred to as the “relative” effect in the literature. Therefore, the results will differ from the concordance probability provided below.↩︎

  4. These benchmarks are also recommended by Gignac and Szodorai (2016). Funder and Ozer (2019) expanded them to also include very small effects (\(r\) = .05) and very large effects (\(r\) = .40 or greater). According to them, “[…] an effect-size \(r\) of .05 indicates an effect that is very small for the explanation of single events but potentially consequential in the not-very-long run, an effect-size \(r\) of .10 indicates an effect that is still small at the level of single events but potentially more ultimately consequential, an effect-size \(r\) of .20 indicates a medium effect that is of some explanatory and practical use even in the short run and therefore even more important, and an effect-size \(r\) of .30 indicates a large effect that is potentially powerful in both the short and the long run. A very large effect size (\(r\) = .40 or greater) in the context of psychological research is likely to be a gross overestimate that will rarely be found in a large sample or in a replication.” Note, however, that this paper has attracted some controversy.↩︎

  5. The benchmarks for Cramer’s \(V\) depend on the size of the contingency table from which the effect is calculated. According to Cohen, divide the benchmarks for the phi coefficient by the square root of the smaller dimension minus one. For example, a medium effect for a Cramer’s \(V\) from a 4 × 3 table would be \(.30 / \sqrt{3 - 1} \approx .21\).↩︎
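The adjustment described in the footnote above is a one-line computation. A minimal Python sketch (the function name is illustrative, not from the source):

```python
from math import sqrt

def cramers_v_benchmark(phi_benchmark, n_rows, n_cols):
    """Adjust a phi-coefficient benchmark for an r x c contingency table,
    following Cohen's rule: divide by sqrt(min(rows, cols) - 1)."""
    smaller_dim = min(n_rows, n_cols)
    return phi_benchmark / sqrt(smaller_dim - 1)

# Medium effect for a 4 x 3 table: .30 / sqrt(3 - 1) ~= .21
print(round(cramers_v_benchmark(0.30, 4, 3), 2))  # -> 0.21
```

For a 2 × 2 table the divisor is \(\sqrt{2 - 1} = 1\), so the benchmarks reduce to the phi benchmarks, as expected.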