Keywords: collaboration, confidence interval, effect size, open educational resource, open scholarship, open science
T-tests are the most commonly used statistical tests for examining differences between group means, or for comparing a group mean against a constant. Calculating effect sizes for t-tests is fairly straightforward. Nonetheless, there are cases where the statistical information needed to calculate an effect size is missing (which happens quite often in older articles), so we also document methods that make use of partial information (e.g., only the mean and standard deviation, or only the t-statistic and degrees of freedom). There are multiple types of effect sizes used to calculate standardized mean differences (\(d\)), yet researchers very often do not identify which type of \(d\) value they are reporting (see Lakens 2013). Here we document the equations and code necessary for calculating each type of \(d\) value, compiled across multiple sources (Becker 1988; Cohen 1988; Lakens 2013; Caldwell 2022; Glass, McGaw, and Smith 1981). A \(d\) value calculated from a sample will also contain sampling error, so we also include equations to calculate the standard error, which in turn allows us to calculate the confidence interval. For each variant of \(d\) in the sections below, the 95% confidence interval is calculated as,
\[
CI_d = d \pm 1.96\times SE
\tag{7.1}\]
Lastly, we supply example R code so you can apply each method to your own data.
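For instance, given any \(d\) value and its standard error, the confidence interval in Equation 7.1 can be computed directly in base R (a minimal sketch with placeholder values):

```r
# minimal sketch of Equation 7.1 with placeholder (hypothetical) values
d  <- 0.40   # hypothetical effect size
SE <- 0.15   # hypothetical standard error

# 95% confidence interval
c(lower = d - 1.96*SE, upper = d + 1.96*SE)
```

```
lower upper 
0.106 0.694 
```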
Here is a summary of every effect size discussed in this chapter:

\(d_p\) - Pooled standard deviation

Uses the average within-group standard deviation to standardize the mean difference. Can be calculated directly from an independent samples t-test. Assumes homogeneity of variance between groups.

\(d_{\Delta}\) - Control group standard deviation

Uses the standard deviation of the control group to standardize the mean difference (often referred to as Glass’s Delta). Does not assume homogeneity of variance between treatment/intervention and control group.

\(d_z\) - Difference score standard deviation

Uses the standard deviation of difference scores (also known as change scores) to standardize the within-person mean difference (i.e., pre/post change).

\(d_{rm}\) - Repeated measures (within-person) standard deviation

Uses the within-person standard deviation, which applies a correction to \(d_z\) to reduce the impact of the pre/post correlation on the effect size. Assumes homogeneity of variance between conditions.

\(d_{av}\) - Average variance

Uses the variance pooled across conditions (pre/post-test). Does not use the correlation between conditions. Assumes homogeneity of variance between conditions.

\(d_b\) - Pre-test standard deviation (Becker’s \(d\))

Uses the pre-test standard deviation to standardize the pre/post mean difference. Does not assume homogeneity of variance between pre-test and post-test.

\(d_{PPC1}\) - Separate pre-test standard deviations

Defined as the difference between Becker’s \(d\) values for the treatment and control group; that is, the mean pre/post change in each group is standardized by that group’s pre-test standard deviation.

\(d_{PPC2}\) - Pooled pre-test standard deviation

Standardizes the difference in mean changes between treatment and control group by the pooled pre-test standard deviation. Assumes homogeneity of variance between the pre-test of the control and treatment condition.

\(d_{PPC3}\) - Pooled pre-test and post-test standard deviation

Pools the standard deviation across pre-test and post-test in the treatment and control condition. Assumes homogeneity of variance between pre/post-test scores and treatment and control conditions. Confidence intervals are not easy to compute.
Whatever effect size you choose to report, you can report it alongside the t-test statistics (i.e., the t-value and p-value). For example,
The treatment group had a significantly higher mean than the control group (t = 2.76, p = .009, n = 35, d = 0.47, 95% CI [0.11, 0.81]).
7.2 Single Group Designs
For a single-group design, we want to compare the mean of that group to some constant, \(C\) (i.e., a target value). The standardized mean difference for a single group can be calculated by (equation 2.3.3, Cohen 1988),
\[
d_s = \frac{M-C}{S},
\tag{7.2}\]
where the standardizer (\(S\)) is the sample standard deviation. The interpretation of \(d_s\) is therefore how many standard deviations the mean is away from the target value, \(C\). A positive \(d_s\) value indicates that the mean is larger than the target value \(C\), whereas a negative \(d_s\) value indicates that the mean is less than \(C\). The corresponding standard error for \(d_s\) can then be calculated with (see documentation for Caldwell 2022),
\[
SE_{d_s} = \sqrt{\frac{1}{n} + \frac{d_s^2}{2n}},
\]
where \(n\) denotes the sample size. In R, we can use the d.single.t() function from the MOTE package (Buchanan et al. 2019) to calculate the single group standardized mean difference.
```r
# Install packages if not already installed:
# install.packages('MOTE')

# Cohen's d for one group
# For example:
# Sample Mean = 30.4, SD = 22.53, N = 96
# Target Value, C = 15
library(MOTE)

stats <- d.single.t(m = 30.4,
                    u = 15,
                    sd = 22.53,
                    n = 96)

# standard error of d
SE <- sqrt(1/stats$n + stats$d^2 / (2*stats$n))

# print just the d value and confidence intervals
data.frame(d = round(stats$d, 3),
           SE = round(SE, 3),
           ci.lb = round(stats$dlow, 3),
           ci.ub = round(stats$dhigh, 3))
```

```
      d    SE ci.lb ci.ub
1 0.684 0.113  0.46 0.904
```
As you can see, the output shows that the effect size is \(d_s\) = 0.68, 95% CI [0.46, 0.90].
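As a quick check of Equation 7.2, the point estimate can be reproduced by hand from the summary statistics (the by-hand value is not bias-corrected; see Section 7.6):

```r
# reproduce the d_s point estimate of Equation 7.2 by hand
M <- 30.4   # sample mean
C <- 15     # target value
S <- 22.53  # sample standard deviation

(M - C) / S
```

```
[1] 0.6835331
```

which matches the value reported above.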
7.3 Two Independent Groups Design
7.3.1 Standardize by Pooled Standard Deviation \((d_p)\)
For a design that consists of two independent groups (we can denote these as group \(A\) and group \(B\)), the standardized mean difference can be calculated by (equation 5.1, Glass, McGaw, and Smith 1981),
\[
d_p = \frac{M_A-M_B}{S_p},
\]
where \(S_p\) is the pooled standard deviation,
\[
S_p = \sqrt{\frac{(n_A-1)S_A^2 + (n_B-1)S_B^2}{n_A+n_B-2}}.
\]
Using the pooled standard deviation as the standardizer characterizes the classic formulation of Cohen’s \(d\). This formulation requires the assumption that the variances (likewise the standard deviations) are equal between groups in the population. If this assumption is not met then it is recommended to use some of the other \(d\) value formulations later in this section. We can interpret the \(d_p\) value as the number of standard deviations the mean of group A is away from the mean of group B. A positive \(d_p\) value would indicate that the mean of group \(A\) is larger than the mean of group \(B\) and vice versa for a negative \(d_p\) value.
Cohen’s \(d_p\) is related to the t-statistic from an independent samples t-test. In fact, we can calculate the \(d_p\) value from the \(t\)-statistic with the following formula (equation 5.3, Glass, McGaw, and Smith 1981),
\[
d_p = t\sqrt{\frac{1}{n_A}+\frac{1}{n_B}}.
\]
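As a quick numerical check of this conversion, using the same t-value and sample sizes as the escalc() example further below:

```r
# convert an independent-samples t-statistic to d_p by hand
t  <- 2.954
nA <- 96
nB <- 96

t * sqrt(1/nA + 1/nB)
```

```
[1] 0.4263732
```

escalc() reports 0.4247 for the same inputs because it additionally applies the small-sample correction described in Section 7.6.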
In R, we can use the escalc() function from the metafor package to calculate the two group standardized mean difference (using the measure = "SMD" argument).
```r
# use metafor package
library(metafor)

## Standardized mean difference for two independent groups
# Given means and SDs
# For example:
# Group A Mean = 30.4, SD = 22.53, N = 96
# Group B Mean = 21.4, SD = 19.59, N = 96
stats <- escalc(measure = "SMD",
                m1i = 30.4,
                m2i = 21.4,
                sd1i = 22.53,
                sd2i = 19.59,
                n1i = 96,
                n2i = 96,
                # add informative labels
                var.names = c("d", "variance"))

# print output
summary(stats)
```

```
       d variance    sei     zi   pval  ci.lb  ci.ub
1 0.4246   0.0213 0.1460 2.9093 0.0036 0.1386 0.7107
```
```r
# Given t-test statistics
# For example:
# t = 2.954, nA = 96, nB = 96
stats <- escalc(measure = "SMD",
                ti = 2.954,
                n1i = 96,
                n2i = 96,
                # add informative labels
                var.names = c("d", "variance"))

# print output
summary(stats)
```

```
       d variance    sei     zi   pval  ci.lb  ci.ub
1 0.4247   0.0213 0.1460 2.9097 0.0036 0.1386 0.7108
```
The output from both examples shows that the effect size is \(d_p\) = 0.42, 95% CI [0.14, 0.71].
7.3.2 Standardize by Control Group Standard Deviation (\(d_{\Delta}\))
When two groups differ substantially in their standard deviations, we can instead standardize by just one of the two groups’ standard deviations, typically that of the control or reference group. In our scenario, let us suppose that group \(B\) is the control/reference group; we can therefore use the standard deviation of group \(B\) (\(S_B\)) as the standardizer, such that,
\[
d_{\Delta} = \frac{M_A-M_B}{S_B}.
\tag{7.8}\]
This formulation is commonly referred to as Glass’ \(\Delta\) (Glass 1981). The standard error for \(d_{\Delta}\) can be defined as,
\[
SE_{d_\Delta} = \sqrt{\frac{n_A+n_B}{n_A n_B} + \frac{d_\Delta^2}{2(n_B-1)}}.
\]
Standardizing by the control group standard deviation rather than pooling (as we did in the previous section with \(d_p\)) results in fewer degrees of freedom (\(df=n_B-1\)) and therefore a larger standard error. In R, we can use the escalc() function from the metafor package to calculate \(d_\Delta\) (using the measure = "SMD1" argument). Since we have already loaded the metafor package, we do not need to load it again.
```r
# Glass' delta (standardize by control group SD)
# given means and SDs
# For example:
# Group A Mean = 30.4, SD = 22.53, N = 96
# Group B Mean = 21.4, SD = 19.59, N = 96
stats <- escalc(measure = "SMD1",
                m1i = 30.4,
                m2i = 21.4,
                sd2i = 19.59,  # Note: sd2i is the SD of the group used as the standardizer
                n1i = 96,
                n2i = 96,
                # add informative labels
                var.names = c("d", "variance"))

# print the SMD value and confidence intervals
summary(stats)
```

```
       d variance    sei     zi   pval  ci.lb  ci.ub
1 0.4558   0.0219 0.1480 3.0788 0.0021 0.1656 0.7459
```
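As a check, the uncorrected point estimate of \(d_\Delta\) can be computed by hand; the difference from the escalc() output (0.4558) reflects the small-sample correction that escalc() applies by default (see Section 7.6):

```r
# Glass' delta by hand (uncorrected)
(30.4 - 21.4) / 19.59
```

```
[1] 0.4594181
```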
7.4 Repeated Measures Designs
In a repeated-measures design, the same subjects (or items, etc.) are measured on two or more separate occasions, or in multiple conditions within a single session, and we want to know the mean difference between those occasions or conditions (Baayen, Davidson, and Bates 2008; Barr et al. 2013). An example of this would be in a pre/post comparison where subjects are tested before and after undergoing some treatment (see Figure 7.1 for a visualization). A standardized mean difference in a repeated-measures design can take on a few different forms that we define below.
Figure 7.1: Figure displaying simulated data of a repeated measures design, the x-axis shows the condition (e.g., pre-test and post-test) and y-axis is the scores. Lines indicate within person pre/post change.
7.4.1 Difference Score \(d\) (\(d_z\))
Instead of comparing the means of two sets of scores, a within-subject design allows us to subtract the scores obtained in condition 1 from the scores in condition 2. The means and standard deviations of these difference scores (\(X_{\text{diff}}=X_2-X_1\)) can be treated similarly to those of a single group design (with a target value of zero, i.e., \(C=0\)), such that (equation 2.3.5, Cohen 1988),
\[
d_z = \frac{M_{\text{diff}}}{S_{\text{diff}}},
\]
where \(M_{\text{diff}}\) and \(S_{\text{diff}}\) are the mean and standard deviation of the difference scores. A positive \(d_z\) value indicates that the scores in condition 2 are, on average, larger than the scores in condition 1, and vice versa for a negative \(d_z\) value. A convenient aspect of \(d_z\) is that it has a straightforward relationship with the paired \(t\)-statistic, \(d_z=\frac{t}{\sqrt{n}}\), which makes it very useful for power analyses. If the standard deviation of the difference scores is not accessible, it can be calculated from the standard deviation of condition 1 (\(S_1\)), the standard deviation of condition 2 (\(S_2\)), and the correlation between conditions (\(r\)) (equation 2.3.6, Cohen 1988):
\[
S_{\text{diff}}=\sqrt{S^2_1 + S^2_2 - 2 r S_1 S_2}
\tag{7.11}\]
It is important to note that when the correlation between conditions is large, the \(d_z\) value will also be larger, whereas a small correlation will return a smaller \(d_z\) value. The standard error of \(d_z\) can be calculated similarly to the single group design such that,
\[
SE_{d_z} = \sqrt{\frac{1}{n} + \frac{d_z^2}{2n}}.
\]
In R, we can use the escalc() function from the metafor package to calculate \(d_z\) (using the measure = "SMCC" argument).
```r
# Cohen's dz for difference scores
# From paired t-test
# For example:
# t = 10.70, N = 96
stats <- escalc(measure = "SMCC",
                ti = 10.70,
                ni = 96,
                # add informative labels
                var.names = c("d", "variance"))

# print output
summary(stats)
```

```
       d variance    sei     zi   pval  ci.lb  ci.ub
1 1.0834   0.0165 0.1286 8.4267 <.0001 0.8314 1.3354
```
```r
# given difference score means and SDs
# For example:
# Difference Score Mean = 21.4, SD = 19.59, N = 96
stats <- escalc(measure = "SMCC",
                m1i = 21.4,
                m2i = 0,      # per documentation, this value should be set to zero
                sd1i = 19.59,
                sd2i = 0,     # per documentation, this value should be set to zero
                ri = 0,       # per documentation, this value should be set to zero
                ni = 96,
                # add informative labels
                var.names = c("d", "variance"))

# print output
summary(stats)
```

```
       d variance    sei     zi   pval  ci.lb  ci.ub
1 1.0837   0.0165 0.1286 8.4283 <.0001 0.8317 1.3358
```
The output shows that the effect size is \(d_z\) = 1.08, 95% CI [0.83, 1.34].
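As a quick check of the \(d_z=\frac{t}{\sqrt{n}}\) relationship noted above:

```r
# d_z from the paired t-statistic by hand (uncorrected)
10.70 / sqrt(96)
```

```
[1] 1.092064
```

escalc() reports 1.0834 because it additionally applies the small-sample correction (Section 7.6).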
7.4.2 Repeated Measures \(d\) (\(d_{rm}\))
For a within-group design, we want to compare the means of scores obtained from condition 1 and condition 2. The repeated measures standardized mean difference between the two conditions can be calculated by (equation 9, Lakens 2013),
\[
d_{rm} = \frac{M_2-M_1}{S_w}.
\tag{7.13}\]
The standardizer here is the within-subject standard deviation, \(S_w\). The within-subject standard deviation can be defined as,
\[
S_{w}=\sqrt{\frac{S^2_1 + S^2_2 - 2 r S_1 S_2}{2(1-r)}}.
\tag{7.14}\]
We can also express \(S_w\) in terms of the standard deviation of the difference scores (\(S_{\text{diff}}\)),
\[
S_{w}=\frac{S_{\text{diff}}}{\sqrt{2(1-r)}}.
\]
Ultimately, \(d_{rm}\) is more appropriate as an effect size estimate for use in meta-analysis, whereas \(d_z\) is more appropriate for power analysis (Lakens 2013). The standard error for \(d_{rm}\) can be computed as,
\[
SE_{d_{rm}} = \sqrt{\left(\frac{1}{n} + \frac{d_{rm}^2}{2n}\right)2(1-r)}.
\]
In R, we can use the d.dep.t.rm() function from the MOTE package to calculate the repeated measures standardized mean difference (\(d_{rm}\)).
```r
# Cohen's d for repeated measures
# given means, SDs, and the correlation
library(MOTE)

# For example:
# Condition 1 Mean = 30.4, SD = 22.53, N = 96
# Condition 2 Mean = 21.4, SD = 19.59, N = 96
# Correlation between conditions: r = .40
stats <- d.dep.t.rm(m1 = 30.4,
                    m2 = 21.4,
                    sd1 = 22.53,
                    sd2 = 19.59,
                    r = .40,
                    n = 96,
                    a = 0.05)

# standard error of d_rm
SE <- sqrt((1/stats$n + stats$d^2 / (2*stats$n)) * 2*(1 - stats$r))

# print just the d value and confidence intervals
data.frame(d = round(stats$d, 3),
           SE = round(SE, 3),
           ci.lb = round(stats$dlow, 3),
           ci.ub = round(stats$dhigh, 3))
```

```
      d    SE ci.lb ci.ub
1 0.425 0.117 0.215 0.633
```
The output shows that the effect size is \(d_{rm}\) = 0.43, 95% CI [0.22, 0.63].
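The same value can be reproduced by hand from Equation 7.13 and Equation 7.14 using the example summary statistics:

```r
# reproduce d_rm by hand from the example values
M1 <- 30.4; S1 <- 22.53
M2 <- 21.4; S2 <- 19.59
r  <- .40

S_w <- sqrt((S1^2 + S2^2 - 2*r*S1*S2) / (2*(1 - r)))  # within-subject SD
round((M1 - M2) / S_w, 3)
```

```
[1] 0.425
```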
7.4.3 Average Variance \(d\) (\(d_{av}\))
The problem with \(d_{z}\) and \(d_{rm}\) is that they require the correlation between conditions, which in practice is frequently not reported. An alternative estimator of \(d\) in a repeated measures design is to simply use the classic variation of Cohen’s \(d_p\) (i.e., the pooled standard deviation). In a repeated measures design, the sample size does not change between conditions, therefore weighting the variances of condition 1 and condition 2 by their respective degrees of freedom is an unnecessary step. Instead, we can standardize by the square root of the average of the variances of conditions 1 and 2 (see equation 5, Algina and Keselman 2003):
\[
d_{av} = \frac{M_2-M_1}{\sqrt{\frac{S_1^2+S_2^2}{2}}}.
\]
This formulation is especially convenient when the correlation between conditions is not available; however, without the correlation it fails to take into account the consistency of change between conditions. The consistency of scores is, however, taken into account in the standard error of \(d_{av}\) (see equation 9, Algina and Keselman 2003).
As we might expect, the higher the correlation (the more consistent the change in scores between conditions) the smaller the standard error. In R, we can use the escalc() function from the metafor package to calculate the average variance standardized mean difference (\(d_{av}\); using the measure = "SMCRP" argument).
```r
# Cohen's d for repeated measures (average variance)
# given means and SDs
# For example:
# Condition 1 Mean = 30.4, SD = 22.53, N = 96
# Condition 2 Mean = 21.4, SD = 19.59, N = 96
# Correlation between conditions: r = .50
stats <- escalc(measure = "SMCRP",
                m1i = 30.4,
                m2i = 21.4,
                sd1i = 22.53,
                sd2i = 19.59,
                ri = .50,
                ni = 96,
                # add informative labels
                var.names = c("d", "variance"))

# print just the d value and confidence intervals
summary(stats)
```

```
       d variance    sei     zi   pval  ci.lb  ci.ub
1 0.4242   0.0110 0.1049 4.0442 <.0001 0.2186 0.6298
```
The output shows that the effect size is \(d_{av}\) = 0.42, 95% CI [0.22, 0.63].
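The point estimate can also be reproduced by hand from the \(d_{av}\) formula above; the small difference from the escalc() value (0.4242) reflects the small-sample bias correction that escalc() applies by default (Section 7.6):

```r
# d_av by hand (uncorrected)
round((30.4 - 21.4) / sqrt((22.53^2 + 19.59^2) / 2), 3)
```

```
[1] 0.426
```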
7.4.4 Becker’s \(d\) (\(d_b\))
An even simpler variant of the repeated measures \(d\) value comes from Becker (1988). In a pre/post design, Becker’s \(d\) standardizes simply by the pre-test standard deviation (we denote the pre-test as condition 1),
\[
d_b = \frac{M_2-M_1}{S_1}.
\tag{7.20}\]
A convenient aspect of Becker’s \(d\) is its use of the raw score standard deviation (\(S_1\)) as the standardizer. This allows us to interpret \(d_b\) in units of standard deviations of the pre-test scores, whereas for \(d_z\) and \(d_{rm}\) the interpretation of the effect size units is less clear.
We can also obtain the standard error with (equation 13, Becker 1988),
\[
SE_{d_b} = \sqrt{\frac{2(1-r)}{n} + \frac{d_b^2}{2n}}.
\]
Notice that even though the formula for calculating \(d_b\) does not include the correlation coefficient, the standard error does. Using the escalc() function, we can calculate Becker’s formulation of the standardized mean difference (using the measure = "SMCR" argument).
```r
# Cohen's d for repeated measures standardized with the pre-test SD (Becker's d)
# given means, the pre-test SD, and the correlation
# For example:
# Pre-test Mean = 21.4, SD = 22.53, N = 96
# Post-test Mean = 30.4, N = 96
# Correlation between conditions: r = .50
# NOTE: MEANS FLIPPED SO THAT M2 - M1
# (by default escalc does M1 - M2)
stats <- escalc(measure = "SMCR",
                m1i = 30.4,   # post-test mean
                m2i = 21.4,   # pre-test mean
                sd1i = 22.53, # pre-test SD (the standardizer)
                ri = .50,
                ni = 96,
                # add informative labels
                var.names = c("d", "variance"))

# print just the d value and confidence intervals
summary(stats)
```

```
       d variance    sei     zi   pval  ci.lb  ci.ub
1 0.3963   0.0112 0.1060 3.7389 0.0002 0.1886 0.6040
```
The output shows that the effect size is \(d_b\) = 0.40, 95% CI [0.19, 0.60].
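As before, the uncorrected point estimate can be reproduced by hand; the value reported by escalc() (0.3963) includes the small-sample correction (Section 7.6):

```r
# Becker's d by hand (uncorrected): pre/post mean change over the standardizer SD
round((30.4 - 21.4) / 22.53, 3)
```

```
[1] 0.399
```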
7.4.5 Comparing Repeated Measures \(d\) values
Figure 7.2 shows repeated measures designs with a high (\(r=\) .95) and a low (\(r=\) .05) correlation between conditions. Let us fix the standard deviations and means for both conditions and vary only the correlation. We can then compare the repeated measures estimators under the two conditions shown in Figure 7.2:
High correlation:
\(d_z=1.24\)
\(d_{rm}=0.39\)
\(d_{av}=0.43\)
\(d_{b}=0.40\)
Low correlation:
\(d_z=0.31\)
\(d_{rm}=0.43\)
\(d_{av}=0.43\)
\(d_{b}=0.40\)
We notice that the correlation influences \(d_z\) far more than any other estimator. The \(d_{rm}\) value changes very little, whereas \(d_{av}\) and \(d_{b}\) do not take the correlation into account at all.
Figure 7.2: Figure displaying simulated data of a repeated measures design, the x-axis shows the condition (e.g., pre-test and post-test) and y-axis is the scores. Left panel shows a high pre/post correlation (\(r\) = .95) and right panel shows a low correlation condition (\(r\) = .05). Lines indicate within person pre/post change.
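To see how these estimators diverge as the correlation changes, here is a small helper sketch (the function name is ours, not from any package) that computes \(d_z\), \(d_{rm}\), \(d_{av}\), and \(d_b\) from summary statistics:

```r
# illustrative helper (hypothetical function, not from a package):
# computes the four repeated measures d values from summary statistics
compare_rm_d <- function(m1, m2, sd1, sd2, r) {
  s_diff <- sqrt(sd1^2 + sd2^2 - 2*r*sd1*sd2)        # SD of difference scores
  c(d_z  = (m2 - m1) / s_diff,                       # difference score d
    d_rm = (m2 - m1) / (s_diff / sqrt(2*(1 - r))),   # repeated measures d
    d_av = (m2 - m1) / sqrt((sd1^2 + sd2^2) / 2),    # average variance d
    d_b  = (m2 - m1) / sd1)                          # Becker's d (pre-test SD)
}

# e.g., the same (hypothetical) means/SDs with a high vs. low pre/post correlation
round(compare_rm_d(m1 = 20, m2 = 25, sd1 = 6, sd2 = 7, r = .95), 2)
round(compare_rm_d(m1 = 20, m2 = 25, sd1 = 6, sd2 = 7, r = .05), 2)
```

```
 d_z d_rm d_av  d_b 
2.19 0.69 0.77 0.83 
 d_z d_rm d_av  d_b 
0.56 0.77 0.77 0.83 
```

Holding the means and standard deviations fixed, only \(d_z\) (and, to a lesser extent, \(d_{rm}\)) moves with the correlation, mirroring the pattern described above.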
7.5 Pretest-Posttest-Control Group Designs
In many areas of research, both between- and within-group factors are incorporated. For example, in research examining the effects of an intervention, a sample is often randomized into two separate groups (intervention and control) and then measured on the outcome of interest both before (pre-test) and after (post-test) the intervention/control period. In these types of 2x2 (group x time) study designs, it is usually the difference between the standardized mean change for the intervention/treatment (\(T\)) and control (\(C\)) groups that is of interest. For a visualization of a pretest-posttest-control group design see Figure 7.3.
Morris (2008) details three effect sizes for this pretest-posttest-control (PPC) design.
Figure 7.3: Illustration of a pre-post control design. Left panel shows the pre-post difference in the control group and right panel shows the pre-post difference in the intervention/treatment group. Lines indicate within person pre/post change.
7.5.1 PPC1 - separate pre-test standard deviations
The separate pre-test (i.e., baseline) standard deviations are used to standardize the pre/post mean difference in the intervention group and the control group, respectively (see equation 4, Morris 2008),
\[
d_{T} = \frac{M_{T,\text{post}}-M_{T,\text{pre}}}{S_{T,\text{pre}}}, \qquad d_{C} = \frac{M_{C,\text{post}}-M_{C,\text{pre}}}{S_{C,\text{pre}}}.
\]
Note that these effect sizes are identical to Becker’s \(d\) formulation of the SMD (see Section 7.4.4). Therefore the pretest-posttest-control group effect size is simply the difference between the intervention and control pre/post SMDs (equation 15, Becker 1988),
\[
d_{PPC1} = d_T - d_C
\tag{7.24}\]
The asymptotic standard error of \(d_{PPC1}\) was first derived by Becker (1988) and can be expressed as the square root of the sum of the sampling variances (equation 16, Becker 1988),
\[
SE_{d_{PPC1}} = \sqrt{SE_{d_T}^2 + SE_{d_C}^2},
\]
where \(SE_{d_T}\) and \(SE_{d_C}\) are the standard errors of Becker’s \(d\) in the treatment and control group. Note that this is an approximate formula for the standard error; for an exact solution see Morris (2000). We can calculate \(d_{PPC1}\) and its confidence interval using the metafor package:
```r
# Example:
# Control Group (N = 90)
## Pre-test Mean = 20, SD = 6
## Post-test Mean = 25, SD = 7
## Pre/post correlation = .50
# Intervention Group (N = 90)
## Pre-test Mean = 20, SD = 5
## Post-test Mean = 27, SD = 8
## Pre/post correlation = .50

# calculate the observed standardized mean difference
# treatment group effect
dT <- escalc(measure = "SMCR",
             m1i = 27,
             m2i = 20,
             sd1i = 8,
             sd2i = 5,
             ri = .50,
             ni = 90)

# control group effect
dC <- escalc(measure = "SMCR",
             m1i = 25,
             m2i = 20,
             sd1i = 7,
             sd2i = 6,
             ri = .50,
             ni = 90)

# calculate d and SE
dPPC1 <- dT$yi - dC$yi
SE <- sqrt(dT$vi + dC$vi)

# print the d value and confidence intervals
data.frame(d = round(dPPC1, 3),
           SE = round(SE, 3),
           ci.lb = round(dPPC1 - 1.96*SE, 3),
           ci.ub = round(dPPC1 + 1.96*SE, 3))
```

```
      d    SE  ci.lb ci.ub
1 0.159 0.171 -0.176 0.494
```
The output shows a pre-post intervention effect of \(d_{PPC1}\) = 0.16 95% CI [-0.18, 0.49].
7.5.2 PPC2 - pooled pre-test standard deviations
The pooled pre-test (i.e., baseline) standard deviations can be used to standardize the difference in pre/post change between intervention and control groups such that (equation 8, Morris 2008),
\[
d_{PPC2} = \frac{(M_{T,\text{post}}-M_{T,\text{pre}})-(M_{C,\text{post}}-M_{C,\text{pre}})}{\sqrt{\frac{(n_T-1)S_{T,\text{pre}}^2+(n_C-1)S_{C,\text{pre}}^2}{n_T+n_C-2}}},
\]
with the corresponding standard error,
\[
SE_{d_{PPC2}} = \sqrt{2(1-\hat{\rho})\frac{n_T+n_C}{n_T n_C}\cdot\frac{n_T+n_C-2}{n_T+n_C-4}\left(1+\frac{d_{PPC2}^2}{2(1-\hat{\rho})\frac{n_T+n_C}{n_T n_C}}\right)-\frac{d_{PPC2}^2}{CF^2}}.
\]
Note that the original equation shown in the paper by Morris (2008) uses the population pre/post correlation \(\rho\); in the equations above we replace \(\rho\) with the sample size weighted average of the Pearson correlations in the treatment and control group (i.e., \(\hat{\rho} = \frac{n_T r_T + n_C r_C}{n_T + n_C}\)). Also, \(CF\) is the correction factor that can be found in the following section on small sample bias.
We can use base R to obtain \(d_{PPC2}\) and confidence intervals:
```r
# Example:
# Control Group (N = 90)
## Pre-test Mean = 20, SD = 6
## Post-test Mean = 25, SD = 7
## Pre/post correlation = .50
M_Cpre <- 20
M_Cpost <- 25
SD_Cpre <- 6
SD_Cpost <- 7
rC <- .50
nC <- 90

# Intervention Group (N = 90)
## Pre-test Mean = 20, SD = 5
## Post-test Mean = 27, SD = 8
## Pre/post correlation = .50
M_Tpre <- 20
M_Tpost <- 27
SD_Tpre <- 5
SD_Tpost <- 8
rT <- .50
nT <- 90

# calculate the observed standardized mean difference
dPPC2 <- ((M_Tpost - M_Tpre) - (M_Cpost - M_Cpre)) /
  sqrt(((nT - 1)*(SD_Tpre^2) + (nC - 1)*(SD_Cpre^2)) / (nT + nC - 2))

# calculate the standard error
rho <- (nT*rT + nC*rC) / (nT + nC)
CF <- gamma((nT+nC-2)/2) / (sqrt((nT+nC-2)/2) * gamma(((nT+nC-2)-1)/2))
SE <- sqrt(2*(1-rho) * (nT+nC)/(nT*nC) * (nT+nC-2)/(nT+nC-4) *
             (1 + (dPPC2^2 / (2*(1 - rho) * ((nT+nC)/(nT*nC))))) -
             dPPC2^2/CF^2)

# print the d value and confidence intervals
data.frame(d = round(dPPC2, 3),
           SE = round(SE, 3),
           ci.lb = round(dPPC2 - 1.96*SE, 3),
           ci.ub = round(dPPC2 + 1.96*SE, 3))
```

```
      d    SE ci.lb ci.ub
1 0.362 0.151 0.066 0.658
```
The output shows a pre-post intervention effect of \(d_{PPC2}\) = 0.36 95% CI [0.07, 0.66].
7.5.3 PPC3 - pooled pre- and post-test
The two previous effect sizes (PPC1 and PPC2) only use the pre-test standard deviation and ignore the post-test standard deviation. However, if we are happy to assume that pre-test and post-test variances are homogeneous¹, the pooled pre-test and post-test standard deviations can be used to standardize the difference in pre/post change between intervention and control groups, such that (equation 8, Morris 2008),
\[
d_{PPC3} = \frac{(M_{T,\text{post}}-M_{T,\text{pre}})-(M_{C,\text{post}}-M_{C,\text{pre}})}{\sqrt{\frac{(n_T-1)\left(S_{T,\text{pre}}^2+S_{T,\text{post}}^2\right)+(n_C-1)\left(S_{C,\text{pre}}^2+S_{C,\text{post}}^2\right)}{2(n_T+n_C-2)}}}.
\]
The standard error for \(d_{PPC3}\) is currently unknown. One option to estimate it is a non-parametric or parametric bootstrap: repeatedly resample the raw data or, if the raw data are not available, resample from simulated data. We can do this in base R by simulating pre/post data using the mvrnorm() function from the MASS package (Venables and Ripley 2002):
```r
# Install the package below if not done so already
# install.packages("MASS")

# Example:
# Control Group (N = 90)
## Pre-test Mean = 20, SD = 6
## Post-test Mean = 25, SD = 7
## Pre/post correlation = .50
M_Cpre <- 20
M_Cpost <- 25
SD_Cpre <- 6
SD_Cpost <- 7
rC <- .50
nC <- 90

# Intervention Group (N = 90)
## Pre-test Mean = 20, SD = 5
## Post-test Mean = 27, SD = 8
## Pre/post correlation = .50
M_Tpre <- 20
M_Tpost <- 27
SD_Tpre <- 5
SD_Tpost <- 8
rT <- .50
nT <- 90

# simulate data
set.seed(1) # set seed for reproducibility
boot_dPPC3 <- c()

for (i in 1:1000) {
  # simulate control group pre-post data
  data_C <- MASS::mvrnorm(n = nC,
                          # input observed means
                          mu = c(M_Cpre, M_Cpost),
                          # input observed covariance matrix
                          Sigma = data.frame(pre = c(SD_Cpre^2, rC*SD_Cpre*SD_Cpost),
                                             post = c(rC*SD_Cpre*SD_Cpost, SD_Cpost^2)))

  # simulate intervention group pre-post data
  data_T <- MASS::mvrnorm(n = nT,
                          # input observed means
                          mu = c(M_Tpre, M_Tpost),
                          # input observed covariance matrix
                          Sigma = data.frame(pre = c(SD_Tpre^2, rT*SD_Tpre*SD_Tpost),
                                             post = c(rT*SD_Tpre*SD_Tpost, SD_Tpost^2)))

  # calculate the mean difference in pre/post change (the numerator)
  MeanDiff <- (mean(data_T[,2]) - mean(data_T[,1])) - (mean(data_C[,2]) - mean(data_C[,1]))

  # calculate the pooled pre-post standard deviation (the denominator)
  S_Pprepost <- sqrt(((nT - 1)*(sd(data_T[,1])^2 + sd(data_T[,2])^2) +
                        (nC - 1)*(sd(data_C[,1])^2 + sd(data_C[,2])^2)) /
                       (2*(nT + nC - 2)))

  # calculate the standardized mean difference for each bootstrap iteration
  boot_dPPC3[i] <- MeanDiff / S_Pprepost
}

# calculate bootstrapped standard error
SE <- sd(boot_dPPC3)

# calculate the observed standardized mean difference
dPPC3 <- ((M_Tpost - M_Tpre) - (M_Cpost - M_Cpre)) /
  sqrt(((nT - 1)*(SD_Tpre^2 + SD_Tpost^2) + (nC - 1)*(SD_Cpre^2 + SD_Cpost^2)) /
         (2*(nT + nC - 2)))

# print the d value and confidence intervals
data.frame(d = round(dPPC3, 3),
           SE = round(SE, 3),
           ci.lb = round(dPPC3 - 1.96*SE, 3),
           ci.ub = round(dPPC3 + 1.96*SE, 3))
```

```
      d    SE ci.lb ci.ub
1 0.303 0.153 0.003 0.604
```
The output shows a pre-post intervention effect of \(d_{PPC3}\) = 0.30 95% CI [0.003, 0.60].
7.6 Small Sample Bias in \(d\) values
All the estimators of \(d\) listed above are biased estimates of the population \(d\) value; specifically, they all over-estimate the population value in small samples. To adjust for this bias, we can apply a correction factor based on the degrees of freedom, which depend on the estimator used. The degrees of freedom for each estimator are listed below:
Single Group design (\(d_s\)): \(df = n-1\)
Between Groups - Pooled Standard Deviation (\(d_p\)): \(df = n_A+n_B-2\)
Between Groups - Control Group Standard Deviation (\(d_\Delta\)): \(df = n_B-1\)
Pretest-Posttest-Control Separate Standard Deviation (\(d_{PPC1}\)): \(df_C=n_C−1,\; df_T=n_T−1\)
Pretest-Posttest-Control Pooled Pretest Standard Deviation (\(d_{PPC2}\)): \(df=n_T+n_C−2\)
Pretest-Posttest-Control Pooled Pretest and Posttest Standard Deviation (\(d_{PPC3}\)): \(df=2(n_T+n_C−2)\)
With the appropriate degrees of freedom, we can use the following correction factor, \(CF\), to obtain an unbiased estimate of the population standardized mean difference:
\[
CF = \frac{\Gamma\left(\frac{df}{2}\right)}{\sqrt{\frac{df}{2}}\;\Gamma\left(\frac{df-1}{2}\right)},
\]
where \(\Gamma(\cdot)\) is the gamma function. An approximation of this complex formula given by Hedges (1981) can be written as \(CF\approx 1-\frac{3}{4\cdot df -1}\). In R, the correction factor can be calculated as follows:
```r
# Example (independent groups d_p):
# Group 1 sample size = 20
# Group 2 sample size = 18
n1 <- 20
n2 <- 18

# calculate degrees of freedom
df <- n1 + n2 - 2

# calculate correction factor
CF <- gamma(df/2) / (sqrt(df/2) * gamma((df-1)/2))

# print
CF
```

```
[1] 0.9789964
```
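For comparison, Hedges’ approximation applied to the same degrees of freedom gives nearly the same value:

```r
# Hedges' approximation to the correction factor (using df from the block above)
1 - 3/(4*df - 1)
```

```
[1] 0.979021
```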
This correction factor can then be applied to any of the standardized mean difference variants mentioned above,
\[
d^* = d\times CF
\tag{7.32}\]
The corrected \(d\) value, \(d^*\), is commonly referred to as Hedges’ \(g\) or just \(g\). To avoid notation confusion we will just add an asterisk to \(d\) to denote the correction. Note that in the case of \(d_{PPC1}\), we must apply \(CF\) to both \(d_C\) and \(d_T\) such that, \(d^*_{PPC1} = d_T\times CF_T - d_C\times CF_C\). We also need to correct the standard error for \(d^*\) using the same correction factor,
\[
SE_{d^*} = SE_{d} \times CF
\tag{7.33}\]
These standard errors can then be used to calculate the confidence interval of the corrected \(d\) value,
\[
CI_{d^*} = d^* \pm 1.96\times SE_{d^*}.
\]
It is very important to note that the escalc() function automatically applies the small sample correction by default; therefore, for any code that utilizes escalc(), do not apply the correction factor again.
```r
# Example:
# Cohen's d = .50, SE = .10
d <- .50
SE <- .10

# correct the d value and CI for small sample bias
d_corrected <- d * CF
SE_corrected <- SE * CF
ci.lb_corrected <- d_corrected - 1.96*SE_corrected
ci.ub_corrected <- d_corrected + 1.96*SE_corrected

# print just the d value and confidence intervals
data.frame(d = round(d_corrected, 3),
           SE = SE_corrected,
           ci.lb = round(ci.lb_corrected, 3),
           ci.ub = round(ci.ub_corrected, 3))
```

```
      d         SE ci.lb ci.ub
1 0.489 0.09789964 0.298 0.681
```
The output shows that the corrected effect size is \(d^*\) = 0.49, 95% CI [0.30, 0.68].
7.7 Ratios of Means
Another common approach, particularly within the fields of ecology and evolution, is to take the natural logarithm of the ratio between two means; the so-called Response Ratio (\(LRR\)). This is sometimes preferable because the various standardized mean differences use the standard deviation in some form as the denominator, and studies are often less well powered for estimating standard deviations than for estimating mean magnitudes (Yang et al. 2022). For the \(LRR\), by contrast, the standard deviation only affects the variance estimate and not the point estimate. A limitation of the \(LRR\), however, is that it only applies to data observed on a ratio scale (i.e., the data have an absolute zero and values are related ordinally and additively, meaning both means will be positive).
Although strictly speaking the \(LRR\) is not a difference in means in the additive sense that the standardized mean difference effect sizes above are, it can be considered to reflect the difference in means on the multiplicative scale. In fact, after calculation it is often transformed to reflect the percentage difference or change between means: \(100\times (\exp(LRR)-1)\). However, this can introduce transformation-induced bias, because a non-linear transformation of a mean value is not generally equal to the mean of the transformed values. In the context of meta-analysis, when combining \(LRR\) estimates across studies, a correction factor can be applied: \(100\times (\exp(LRR+0.5 S^2_\text{total})-1)\), where \(S^2_\text{total}\) is the variance of all \(LRR\) values.
Similarly to the various standardized mean differences, there are varied calculations for the \(LRR\) dependent upon the study design being used (see Senior, Viechtbauer, and Nakagawa 2020).
7.7.1 Response Ratio for Independent Groups (\(LRR_\text{ind}\))
When calculating the response ratio for two independent groups (group \(A\) and group \(B\)), the \(LRR\) can be calculated as follows,
\[
LRR_\text{ind} = \ln\left(\frac{M_A}{M_B}\right),
\]
with standard error,
\[
SE_{LRR_\text{ind}} = \sqrt{\frac{S_A^2}{n_A M_A^2} + \frac{S_B^2}{n_B M_B^2}}.
\]
Using R we can easily calculate this effect size using the escalc() function in the metafor package (Viechtbauer 2010):
```r
# LRR for two independent groups
# given means and SDs
# For example:
# Group A Mean = 30.4, Standard deviation = 22.53, Sample size = 96
# Group B Mean = 21.4, Standard deviation = 19.59, Sample size = 96

# calculate LRRind and standard error
LRRind <- escalc(measure = "ROM",
                 m1i = 30.4,
                 m2i = 21.4,
                 sd1i = 22.53,
                 sd2i = 19.59,
                 n1i = 96,
                 n2i = 96)

summary(LRRind)
```

```
      yi     vi    sei     zi   pval  ci.lb  ci.ub
1 0.3511 0.0145 0.1202 2.9203 0.0035 0.1154 0.5867
```
The example shows a response ratio of \(LRR_\text{ind}\) = 0.35 95% CI [0.12, 0.59].
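As described earlier, the \(LRR\) can be re-expressed as a percentage difference between the means:

```r
# express the LRR as a percentage difference between means
round(100 * (exp(0.3511) - 1), 1)
```

```
[1] 42.1
```

That is, the mean of group A is roughly 42% larger than the mean of group B (indeed, 30.4/21.4 ≈ 1.42).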
7.7.2 Response Ratio for Dependent Groups (\(LRR_\text{dep}\))
When we have dependent samples (e.g., a pre/post comparison), the \(LRR\) can be calculated as follows,
\[
LRR_\text{dep} = \ln\left(\frac{M_1}{M_2}\right),
\]
with standard error,
\[
SE_{LRR_\text{dep}} = \sqrt{\frac{S_1^2}{n M_1^2} + \frac{S_2^2}{n M_2^2} - \frac{2 r S_1 S_2}{n M_1 M_2}},
\]
where \(r\) is the correlation between the two sets of scores.
Using R we can easily calculate this effect size using the escalc() function from the metafor package as follows:
```r
# LRR for two dependent groups
# given means and SDs
# For example:
# Mean 1 = 30.4, Standard deviation 1 = 22.53
# Mean 2 = 21.4, Standard deviation 2 = 19.59
# Sample size = 96
# Correlation = 0.4

# calculate LRRdep and standard error
LRRdep <- escalc(measure = "ROMC",
                 m1i = 30.4,
                 m2i = 21.4,
                 sd1i = 22.53,
                 sd2i = 19.59,
                 ni = 96,
                 ri = .40)

summary(LRRdep)
```

```
      yi     vi    sei     zi   pval  ci.lb  ci.ub
1 0.3511 0.0088 0.0938 3.7429 0.0002 0.1672 0.5349
```
The example shows a response ratio of \(LRR_\text{dep}\) = 0.35 95% CI [0.17, 0.53].
Algina, James, and H. J. Keselman. 2003. “Approximate Confidence Intervals for Effect Sizes.” Educational and Psychological Measurement 63 (4): 537–53. https://doi.org/10.1177/0013164403256358.
Baayen, R. Harald, Douglas J. Davidson, and Douglas M. Bates. 2008. “Mixed-Effects Modeling with Crossed Random Effects for Subjects and Items.” Journal of Memory and Language 59 (4): 390–412.
Barr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and Language 68 (3): 255–78.
Becker, Betsy J. 1988. “Synthesizing Standardized Mean-Change Measures.” British Journal of Mathematical and Statistical Psychology 41 (2): 257–78. https://doi.org/10.1111/j.2044-8317.1988.tb00901.x.
Buchanan, Erin M., Amber Gillenwaters, John E. Scofield, and K. D. Valentine. 2019. MOTE: Measure of the Effect: Package to Assist in Effect Size Calculations and Their Confidence Intervals. http://github.com/doomlab/MOTE.
Hedges, Larry V. 1981. “Distribution Theory for Glass’s Estimator of Effect Size and Related Estimators.” Journal of Educational Statistics 6 (2): 107–28. https://doi.org/10.3102/10769986006002107.
Morris, Scott B. 2000. “Distribution of the Standardized Mean Change Effect Size for Meta-Analysis on Repeated Measures.” British Journal of Mathematical and Statistical Psychology 53 (1): 17–29.
———. 2008. “Estimating Effect Sizes From Pretest-Posttest-Control Group Designs.” Organizational Research Methods 11 (2): 364–86. https://doi.org/10.1177/1094428106291059.
Senior, Alistair M., Wolfgang Viechtbauer, and Shinichi Nakagawa. 2020. “Revisiting and Expanding the Meta-Analysis of Variation: The Log Coefficient of Variation Ratio.” Research Synthesis Methods 11 (4): 553–67. https://doi.org/10.1002/jrsm.1423.
Viechtbauer, Wolfgang. 2010. “Conducting Meta-Analyses in R with the metafor Package.” Journal of Statistical Software 36 (3): 1–48. https://doi.org/10.18637/jss.v036.i03.
Yang, Yefeng, Helmut Hillebrand, Malgorzata Lagisz, Ian Cleasby, and Shinichi Nakagawa. 2022. “Low Statistical Power and Overestimated Anthropogenic Impacts, Exacerbated by Publication Bias, Dominate Field Studies in Global Change Biology.” Global Change Biology 28 (3): 969–89. https://doi.org/10.1111/gcb.15972.
¹ Note, this may not be the case, especially where there is a mean-variance relationship and one (usually the intervention) group has a higher posttest mean score.