2  Benchmarks

Keywords

collaboration, confidence interval, effect size, open educational resource, open scholarship, open science

2.1 Introduction to benchmarks

What makes an effect size “large” or “small” depends entirely on the context of the study in question. However, it can be useful to have some loose criteria to guide researchers in effectively communicating effect size estimates. Jacob Cohen (1988), a pioneer of statistical power analysis, suggested many of the conventional benchmarks (i.e., verbal labels for an effect size rather than a number) that we currently use. However, Cohen (1988) noted that labels such as “small”, “medium”, and “large” are relative, and that in referring to the size of an effect, the discipline, the context of research, and the research method and goals should take precedence over benchmarks whenever possible. There are general differences in effect sizes across disciplines, and within each discipline, effect sizes differ depending on study designs, research methods (Schäfer and Schwarz 2019), and goals; as Glass, McGaw, and Smith (1981) explain:

Depending on what benefits can be achieved at what cost, an effect size of 2.0 might be “poor” and one of .1 might be “good.”

Therefore, it is crucial to recognize that benchmarks are only general guidelines and are, importantly, divorced from context. They also tend to attract controversy (Glass, McGaw, and Smith 1981; Kelley and Preacher 2012; Harrell 2020). Note that researchers have suggested field-specific empirical benchmarks. For social psychology, alternative benchmarks obtained by meta-analyzing the literature (see, for example, the meta-analytic rows in the table below) are typically smaller than those Cohen put forward. Although such field-specific effect size distributions can provide an overview of observed effect sizes, they do not by themselves provide a good interpretation of the magnitude of an effect (see Panzarella, Beribisky, and Cribbie 2021). To examine the magnitude of an effect, the specific context of the study at hand needs to be taken into account (Cohen 1988, pp. 532-535). Please refer to the table below:

| Effect Size | Reference | Small | Medium | Large |
|---|---|---|---|---|
| Mean Differences | | | | |
| Cohen’s \(d\) or Hedges’ \(g\) | Cohen (1988)¹ | 0.20 | 0.50 | 0.80 |
| | | 0.18 | 0.37 | 0.60 |
| | Lovakov and Agadullina (2021)² | 0.15 | 0.36 | 0.65 |
| Correlational | | | | |
| Correlation coefficient (\(r\)) | Cohen (1988) | .10 | .30 | .50 |
| | Richard, Bond Jr., and Stokes-Zoota (2003)³ ⁴ | .10 | .20 | .30 |
| | Lovakov and Agadullina (2021) | .12 | .24 | .41 |
| | Paterson et al. (2016) | .12 | .20 | .31 |
| | Bosco et al. (2015) | .09 | .18 | .26 |
| Cohen’s \(f^2\) | Cohen (1988) | .02 | .15 | .35 |
| Eta-squared (\(\eta^2\)) | Cohen (1988) | .01 | .06 | .14 |
| Cohen’s \(f\) | Cohen (1988) | .10 | .25 | .40 |
| Categorical | | | | |
| Cohen’s \(w\) | Cohen (1988) | 0.10 | 0.30 | 0.50 |
| Phi | Cohen (1988) | .10 | .30 | .50 |
| Cramer’s \(V\) | see footnote⁵ | | | |
| Cohen’s \(h\) | Cohen (1988) | 0.20 | 0.50 | 0.80 |

It should be noted that small/medium/large effects do not necessarily have small/medium/large practical implications (for details see Coe 2012; Pogrow 2019). These benchmarks are more relevant for guiding our expectations; whether an effect has practical importance depends on context. To assess practical importance, it is always desirable to translate standardized effect sizes into increases/decreases in raw units (or any meaningful units) or into a Binomial Effect Size Display (roughly, differences in proportions, such as the success rate before and after an intervention). Reporting unstandardized effect sizes is not only beneficial for interpretation; they are also more robust and easier to compute (Baguley 2009). Additionally, a useful way to gauge the magnitude of, for example, a Cohen’s \(d\) is to examine U3, the percentage of overlap, the probability of superiority, and the number needed to treat (for nice visualizations see https://rpsychologist.com/cohend/, Magnusson 2023).
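For readers who want to compute these translations directly, the sketch below (plain Python with SciPy; the function name `interpret_d` is ours) derives U3, the percentage of overlap, and the probability of superiority from a Cohen’s \(d\), under the usual assumption of two normal distributions with equal variances. The number needed to treat is omitted because it additionally requires an assumed control-group event rate.

```python
# Minimal sketch: interpreting a Cohen's d via U3, overlap, and the
# probability of superiority, assuming two normal distributions with
# equal variances (the same assumptions used at rpsychologist.com/cohend).
from scipy.stats import norm

def interpret_d(d: float) -> dict:
    return {
        # U3: proportion of the treatment group above the control-group mean
        "U3": norm.cdf(d),
        # Overlap (OVL): proportion of overlap between the two distributions
        "overlap": 2 * norm.cdf(-abs(d) / 2),
        # Probability of superiority (common-language effect size):
        # chance that a random treated case exceeds a random control case
        "prob_superiority": norm.cdf(d / 2**0.5),
    }

print(interpret_d(0.5))
# {'U3': 0.691..., 'overlap': 0.802..., 'prob_superiority': 0.638...}
```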

To further assess the practical importance of observed effect sizes, it is necessary to establish the smallest effect size of interest (SESOI) for each specific field (Lakens, Scheel, and Isager 2018). Cohen’s benchmarks, field-specific benchmarks, and published findings are not the preferred basis for establishing the SESOI because they do not convey information about the practical relevance/magnitude of an effect size (Panzarella, Beribisky, and Cribbie 2021). In various areas of psychological research, recent efforts have been made to establish the SESOI through anchor-based methods (Anvari and Lakens 2021), consensus methods (Riesthuis et al. 2022), and cost-benefit analyses (see Otgaar et al. 2022, 2023). These approaches are frequently implemented successfully in medical research (e.g., Heijde et al. 2001), and the recommendation is, ideally, to implement the various methods simultaneously to obtain a precise estimate of the smallest effect size of interest (termed the minimal clinically important difference in the medical literature, Bonini et al. 2020). Interestingly, the minimal clinically important difference (MCID; the smallest effect that patients perceive as beneficial [or harmful], McGlothlin and Lewis 2014) is sometimes even deemed a low bar, and other measures are encouraged, such as the patient acceptable symptom state (PASS; the level of symptoms a patient is willing to accept while still considering their symptom state satisfactory, which can be used to examine whether a certain treatment leads to a state that patients consider acceptable, Daste et al. 2022), substantial clinical benefit (SCB; the effect that leads patients to self-report significant improvement, Wellington et al. 2023), and maximal outcome improvement (MOI; similar to MCID, PASS, and SCB, except that scores are normalized by the maximal improvement possible for each patient, Beck et al. 2020; Rossi, Brand, and Lubowitz 2023).

Please also note that only zero means no effect. An effect of size .01 is still an effect, albeit a very small (Sawilowsky 2009) and likely unimportant one. It makes sense to say that “we failed to find evidence for rejecting the null hypothesis,” “we found evidence for only a small/little/weak-to-no effect,” or “we did not find a meaningful effect.” It does not make sense to say “we found no effect.” Given the inherent randomness of our universe, it is hard to imagine obtaining a result that is exactly zero. This is also related to the crud factor, the idea that “everything correlates with everything else” (Orben and Lakens 2020, 1; Meehl 1984); however, the practical implications of very weak/small correlations between some variables may be limited, and whether such effects are reliably detected depends on statistical power.

2.2 Why Contextualizing Effect Sizes Matters

Interpreting effect sizes is not straightforward. Many researchers default to benchmarks, labeling an effect size as “small,” “medium,” or “large” based on conventional cut-offs (e.g., Cohen’s \(d \approx 0.2, 0.5, 0.8\)) or the corresponding correlation values (\(r \approx 0.1, 0.3, 0.5\)).

While Cohen’s effect size conventions are widely taught and convenient, they can be misleading if applied uncritically. Even Cohen himself cautioned that these cut-offs were arbitrary and intended as a last resort in the absence of domain-specific guidance (Cohen 1988; Lakens 2013). What qualifies as a “small” effect in one discipline might be average or even large in another. For instance, although \(r = 0.30\) is classified as a medium correlation by Cohen’s rule of thumb, empirical surveys in applied psychology suggest that typical effects in the field often fall between \(r = 0.20\) and \(r = 0.30\), making such values relatively large in context (Richard, Bond Jr., and Stokes-Zoota 2003; Funder and Ozer 2019).

Although fixed benchmarks can be a useful rule of thumb, relying on them alone across disciplines risks misrepresenting findings. Many published effects that meet Cohen’s “medium” threshold would be considered large when compared to empirical distributions within their subfields. Conversely, effects below a Cohen’s \(d\) of 0.20 (or an \(r\) of .10) can still be meaningful, particularly when aggregated over time or applied broadly (as later sections will illustrate). Ultimately, the importance of an effect size depends on its research context, including measurement scales, base rates, and what is typical for the domain (Hill et al. 2008).
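To make the relationship between the \(d\) and \(r\) scales concrete, the following minimal sketch applies the standard point-biserial conversion for two equally sized groups, \(r = d / \sqrt{d^2 + 4}\), and its inverse. It also shows that the mapping between the two sets of conventional cut-offs is only approximate.

```python
# Minimal sketch: converting between Cohen's d and the point-biserial r,
# assuming two groups of equal size (the simplest textbook conversion).
import math

def d_to_r(d: float) -> float:
    return d / math.sqrt(d**2 + 4)

def r_to_d(r: float) -> float:
    return 2 * r / math.sqrt(1 - r**2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d:.1f}  ->  r = {d_to_r(d):.2f}")
# d = 0.2  ->  r = 0.10
# d = 0.5  ->  r = 0.24
# d = 0.8  ->  r = 0.37
```

Note that this conversion assumes equal group sizes; with markedly unequal groups, the correspondence between \(d\) and \(r\) shifts further.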

Example 2.1 (Elections): Why a “Small” Effect Can Be a Big Deal. To grasp why context is crucial, consider a voter turnout scenario. Suppose a new get-out-the-vote message increases voter turnout by only 2 percentage points among those who receive it. In raw terms, this is a small effect: most people’s behavior didn’t change. Yet in a national election, a 2-point swing can decide the winner. A marginal change that seems trivial in percentage terms can have enormous real-world consequences when an outcome (like an election) is on a knife’s edge. Similarly, consider a medical example: a daily aspirin regimen might reduce the absolute risk of heart attack by only a fraction of a percent for an individual (a tiny effect size, say \(r \approx 0.03\)). But if millions of people take aspirin, thousands of heart attacks could be prevented. What looks “small” by statistical convention can be life-saving at scale. These examples illustrate that magnitude labels (small/medium/large) are not value judgments: a “small” effect can matter a great deal if the context amplifies its impact (a large population, repetition over time, a high-stakes decision), whereas a “large” effect in a trivial context might not matter at all. Throughout this chapter, we will see many such cases where small effects accumulate into big outcomes, and why researchers must interpret effect sizes with context in mind.
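The arithmetic behind the aspirin illustration can be made explicit. The numbers below (baseline risk, treated risk, and population size) are hypothetical, chosen only to show how a tiny absolute risk reduction scales up across a large population.

```python
# Hypothetical illustration: a tiny absolute risk reduction applied at scale.
# Baseline risk, treated risk, and population size are made-up numbers chosen
# only to mirror the aspirin example in the text.
baseline_risk = 0.010          # 1.0% yearly risk of a heart attack (assumed)
risk_with_treatment = 0.008    # 0.8% yearly risk with treatment (assumed)
population = 10_000_000        # number of people taking the treatment (assumed)

absolute_risk_reduction = baseline_risk - risk_with_treatment   # 0.002
events_prevented = absolute_risk_reduction * population         # 20,000
number_needed_to_treat = 1 / absolute_risk_reduction            # 500

print(f"Absolute risk reduction: {absolute_risk_reduction:.3%}")
print(f"Events prevented per year: {events_prevented:,.0f}")
print(f"Number needed to treat: {number_needed_to_treat:.0f}")
```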

2.2.1 Lessons from Empirical Benchmarks

Hill, Bloom, Black, and Lipsey (2008) argue that these generic classifications lack empirical grounding and advocate for context-driven benchmarks. Their work offers compelling alternatives: evaluating effect sizes relative to normative expectations, policy-relevant performance gaps, and observed effects from similar interventions.

2.2.1.1 Normative Expectations for Growth

One approach to contextualizing effect sizes is to compare them with the growth expected in the absence of an intervention. Using nationally normed standardized test data, the authors examined average student learning gains across K–12 education. The results highlight how expected academic progress varies markedly by grade level:

  • Grade 1–2: Average annual gain of \(d=0.97\) (reading) and \(d=1.03\) (math).
  • Grade 5–6: Gains decline to \(d=0.32\) (reading) and \(d = 0.41\) (math).
  • Grade 11–12: Minimal expected gains of \(d=0.06\) (reading) and \(d=0.01\) (math).

These empirical benchmarks suggest that an intervention producing a given effect may be negligible relative to normal growth in the early grades but relatively substantial in high school. Contextualizing effect sizes within natural developmental trajectories is crucial for assessing their substantive significance.
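As a rough illustration of this point, the sketch below expresses a hypothetical intervention effect of \(d = 0.10\) as a share of the annual reading gains reported above; the gain values are taken from the bullet list, while the intervention effect is an assumed example, not a result from Hill et al. (2008).

```python
# Minimal sketch: expressing a hypothetical intervention effect of d = 0.10
# as a share of the expected annual reading gains reported by Hill et al. (2008).
annual_reading_gain = {        # average yearly growth in d units (from the text)
    "grade 1-2": 0.97,
    "grade 5-6": 0.32,
    "grade 11-12": 0.06,
}
intervention_effect = 0.10     # hypothetical intervention effect (d)

for grade, gain in annual_reading_gain.items():
    share = intervention_effect / gain
    print(f"{grade}: equivalent to {share:.0%} of a typical year's growth")
# grade 1-2: equivalent to 10% of a typical year's growth
# grade 5-6: equivalent to 31% of a typical year's growth
# grade 11-12: equivalent to 167% of a typical year's growth
```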

2.2.1.2 Policy-Relevant Performance Gaps

Another way to interpret effect sizes is by comparing them to existing disparities in educational outcomes. Using National Assessment of Educational Progress (NAEP) data, the authors quantified achievement gaps by race, socioeconomic status, and gender:

  • Black-White gap in reading: \(d=−0.83\) (Grade 4), \(d=−0.67\) (Grade 12).
  • SES gap (free vs. reduced-price lunch eligibility) in math: \(d=−0.85\) (Grade 4).
  • Gender gap in reading: \(d=−0.18\) (Grade 4), \(d=−0.44\) (Grade 12).

If an educational intervention produces an effect of \(d=0.10\), it must be interpreted relative to these gaps. A \(d=0.10\) effect would make little headway in closing the Black-White achievement gap but would represent a meaningful share of the gender gap in reading.
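The same logic can be made explicit by expressing a hypothetical \(d = 0.10\) effect as a share of the Grade 4 gaps listed above (using absolute values of \(d\)); the gap values come from the bullet list, and the intervention effect is an assumed example.

```python
# Minimal sketch: a hypothetical intervention effect of d = 0.10 expressed as
# a share of the Grade 4 NAEP gaps quoted above (absolute values of d).
grade4_gaps = {
    "Black-White reading gap": 0.83,
    "SES math gap": 0.85,
    "Gender reading gap": 0.18,
}
intervention_effect = 0.10     # hypothetical intervention effect (d)

for gap_name, gap in grade4_gaps.items():
    print(f"{gap_name}: closes about {intervention_effect / gap:.0%} of the gap")
# Black-White reading gap: closes about 12% of the gap
# SES math gap: closes about 12% of the gap
# Gender reading gap: closes about 56% of the gap
```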

2.2.1.3 Observed Effect Sizes from Similar Interventions

Finally, the authors suggest using historical effect sizes from randomized controlled trials (RCTs) and meta-analyses as a benchmark for assessing new interventions. Their synthesis of 61 RCTs found that:

  • Elementary school interventions typically yield \(d=0.33\).
  • Middle school interventions average \(d=0.51\).
  • High school interventions produce smaller effects, averaging \(d=0.27\).

A separate meta-analysis of 76 meta-analyses found that across all grades, effect sizes cluster between \(d=0.20\) and \(d=0.30\). This suggests that any new intervention achieving an effect size within this range is performing on par with prior research, while larger effects may indicate an exceptionally successful intervention.

2.2.2 Implications for Effect Size Interpretation

Hill et al.’s (2008) framework underscores the importance of contextualizing effect sizes within empirical reality rather than relying on arbitrary thresholds. Their approach offers three key takeaways:

  1. Compare intervention effects to natural growth rates: a given impact may be trivial in the early grades, while the same effect in high school is more meaningful.
  2. Evaluate effect sizes against real-world disparities: a policy-relevant benchmark (e.g., racial achievement gaps) provides a clearer sense of whether an intervention is impactful.
  3. Situate new findings within prior research: meta-analytic evidence helps gauge whether an observed effect is typical or remarkable within a given field.

By embedding effect sizes in empirical benchmarks, researchers can move beyond rigid classifications and provide a more nuanced, context-sensitive interpretation of intervention impacts.


  1. Sawilowsky (2009) expanded Cohen’s benchmarks to include very small effects (\(d\) = 0.01), very large effects (\(d\) = 1.20), and huge effects (\(d\) = 2.0). It has to be noted that very large and huge effects are very rare in experimental social psychology.↩︎

  2. According to this recent meta-analysis of effect sizes in social psychology studies, “It is recommended that correlation coefficients of .1, .25, and .40 and Hedges’ \(g\) (or Cohen’s \(d\)) of 0.15, 0.40, and 0.70 should be interpreted as small, medium, and large effects for studies in social psychology.”↩︎

  3. Note that, for paired samples, this does not refer to the probability of an increase/decrease within pairs, but rather to the probability associated with a randomly sampled value of X; this is also referred to as the “relative” effect in the literature. Therefore, the results will differ from the concordance probability provided below.↩︎

  4. These benchmarks are also recommended by Gignac and Szodorai (2016). Funder and Ozer (2019) expanded them to also include very small effects (\(r\) = .05) and very large effects (\(r\) = .40 or greater). According to them, “[…] an effect-size \(r\) of .05 indicates an effect that is very small for the explanation of single events but potentially consequential in the not-very-long run, an effect-size \(r\) of .10 indicates an effect that is still small at the level of single events but potentially more ultimately consequential, an effect-size \(r\) of .20 indicates a medium effect that is of some explanatory and practical use even in the short run and therefore even more important, and an effect-size \(r\) of .30 indicates a large effect that is potentially powerful in both the short and the long run. A very large effect size (\(r\) = .40 or greater) in the context of psychological research is likely to be a gross overestimate that will rarely be found in a large sample or in a replication.” Note, however, that this paper has attracted some controversy.↩︎

  5. The benchmarks for Cramer’s \(V\) depend on the size of the contingency table on which the effect is calculated. According to Cohen, use the benchmarks for the phi coefficient divided by the square root of (the smaller table dimension minus 1). For example, a medium effect for a Cramer’s \(V\) from a 4 by 3 table would be .3 / sqrt(3 - 1) = .21.↩︎