5  Classical Measurement Errors

5.1 Introduction

Measurements are often calibrated to their true value such that errors over repeated measurements tend to cancel out. In cases like this, averaging across the outcomes of multiple measurements may provide a more precise value. However, not all measures are created equal; some perform better than others. Ideally, a measurement procedure should produce consistent scores upon repeated measurements. This repeatability is referred to as the reliability of a measure, and a high-quality measure should produce highly reliable scores. This section will review what reliability is in theory, how to estimate reliability, and how to correct effect sizes for measurement error.

5.2 Classical Measurement Error Calibration

The values of a target random variable are often referred to as true scores and can be defined by some real characteristic or quality of an individual that exists independently of the measurement procedure (e.g., the height of a person); we can consider this a realist approach. However, if the underlying data generating process is unknown, one could instead define the true score as the expected value of measurement outcomes (e.g., average reaction time over repeated trials), which can be thought of as an operationalist approach. If we take the operationalist approach, a convenient way to define the true score for each individual is as follows:

\[ X := \mathbb{E}\left[\widetilde{X}\;\middle|\; \mathcal{M}_\ell \right], \tag{5.1}\]

where \(\mathcal{M}_\ell\) is defined as the \(\sigma\)–field generated by the assignment-to-individual function \(\ell\),

\[ \mathcal{M}_\ell = \left\lbrace\ell^{-1}(\psi) \;\middle|\; \psi\in\Psi \right\rbrace \subset \mathcal{F}_{\Omega} \]

where \(\psi\in\Psi\) is an individual \(\psi\) in the population of interest \(\Psi\) and \(\mathcal{F}_{\Omega}\) is the \(\sigma\)–field over the outcome space \(\Omega\). Put more simply, the true score for an individual \(\psi\in\Psi\) is defined as the expected value over all possible measurement outcomes for that individual. This is the definition of true scores typically attributed to Classical Test Theory (Donald W. Zimmerman 1975). As a consequence, this definition of true scores produces the following properties (Kroc and Zumbo 2020),

\[\begin{align} &\mathrm{I.}\;\;\;\sigma_{XE_X}=0 \\[.3em] &\mathrm{II.}\;\;\;\mathbb{E}\!\left[E_X\right]\!=0 \end{align} \tag{5.2}\]

where \(\sigma_{XE_X}\) is the covariance between the true score \(X\) and the associated error \(E_X\). Property \(\mathrm{I}\) ensures that there is no correlation between true scores and errors, whereas property \(\mathrm{II}\) ensures that errors have a mean of zero in the population. A realist approach to measurement posits that \(X\) is not defined by the expectation of measurement outcomes; rather, it is defined by a measurement-independent attribute or characteristic. Under this definition of true scores, properties \(\mathrm{I}\) and \(\mathrm{II}\) are not automatically true, though the realist can assume them when reasonable, in which case Equation 5.1 follows as a necessary consequence. In sum, the important difference between the realist and operationalist is that the properties described in Equation 5.2 must be assumed by the realist, whereas they are a consequence of the definition of true scores for the operationalist.

Let’s denote a single instance of a measurement outcome with \(\widetilde{X}_{\mathsf{m}}:=\widetilde{X}(\omega)\) and a single instance of the true score \(X:=X(\omega)\) for some outcome \(\omega\in\ell^{-1}(\psi)\) for each individual \(\psi\in\Psi\). The subscript \(\mathsf{m}\) indexes the outcome of a single measurement instance. Since the true score is fixed over all possible measurement outcomes within an individual, this index is not needed for \(X\). For a measurement \(\mathsf{m}\), we can relate the true score \(X\) and the observed score \(\widetilde{X}_\mathsf{m}\) with the following algebraic structure,

\[ \widetilde{X}_{\mathsf{m}} = X + E_{X\mathsf{m}} \tag{5.3}\]

where the error is defined as \(E_{X\mathsf{m}} = \widetilde{X}_{\mathsf{m}} - X\). Since we have established that the covariance between \(X\) and \(E_X\) is zero, we can easily partition the variance of the three components in Equation 5.3,

\[ \sigma^2_\widetilde{X} = \sigma^2_X + \sigma^2_{E_X}. \]

We can see that the variance of observed scores is a simple summation of the variance of true scores and the variance of errors, which tells us that if the measure were perfect (i.e., no measurement error) then the error variance would be zero and \(\sigma^2_\widetilde{X} = \sigma^2_X\).
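To make the decomposition concrete, here is a minimal simulation sketch (simulated data with arbitrary parameter values) verifying that the observed score variance equals the sum of the true score and error variances:

# simulate true scores and classical measurement error
set.seed(42)
n <- 1e5
X <- rnorm(n, mean = 100, sd = 15)  # true scores
E <- rnorm(n, mean = 0, sd = 5)     # errors: mean zero, independent of X
X_obs <- X + E                      # observed scores

var(X_obs)       # approximately 15^2 + 5^2 = 250
var(X) + var(E)  # matches up to sampling error
cov(X, E)        # approximately 0 (property I)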

5.3 Reliability

Reliability is defined as the variance of true scores over the variance of observed proxy scores \(\widetilde{X}\) (i.e., an intra-class correlation),

\[ \rho_{\widetilde{X}\widetilde{X}'} = \frac{\sigma^2_X}{\sigma^2_\widetilde{X}}= \frac{\sigma^2_X}{\sigma^2_X + \sigma^2_{E_X}} \tag{5.4}\]

However, we can also express it as the correlation between two parallel measurements. Strictly parallel measurements have the following four properties (p. 69, Haertel 2006):

  1. Measurements have identical specifications. That is, each measurement uses the same measurement procedure.
  2. The distribution of observed scores for each measurement are identical: \(f_{\widetilde{X}_1} = f_{\widetilde{X}_2} = \dots\)
  3. Any set of two measurements are assumed to covary the same as any other set of two measurements: \(\sigma_{\widetilde{X}_1 \widetilde{X}_2} = \sigma_{\widetilde{X}_2 \widetilde{X}_3} = \sigma_{\widetilde{X}_1 \widetilde{X}_3} = \ldots\)
  4. Each measurement equally covaries with any other variable: \(\sigma_{\widetilde{X}_1 Y} = \sigma_{\widetilde{X}_2 Y} = \ldots\)

These properties will be especially useful when we discuss composite scores. The correlation between parallel measurements can also be expressed as the square of the correlation between observed proxy scores and true scores, \(\rho_{X\widetilde{X}}^2\). To understand why this is the case, note that the covariance between parallel forms of a measure is equivalent to the covariance between observed scores and true scores, \(\sigma_{\widetilde{X}X}=\)\(\sigma_{(X+E_X)X}=\)\(\sigma^2_X + \sigma_{XE_X}=\)\(\sigma^2_X = \sigma_{\widetilde{X}\widetilde{X}'}\) (Haertel 2006),

\[ \rho_{\widetilde{X}\widetilde{X}'} = \frac{\sigma_X^2}{\sigma_\widetilde{X}^2} = \frac{\left(\sigma_X^2\right)^2}{\sigma_\widetilde{X}^2 \sigma_X^2}= \frac{\sigma_{\widetilde{X}X}^2}{\sigma^2_\widetilde{X}\sigma^2_X} = \rho^2_{X\widetilde{X}}. \]

Estimating reliability is straightforward if a researcher has access to parallel measurements: simply compute the correlation between the two sets of observed scores. This is not always the case, however, so we will cover different ways of calculating the reliability of a measure depending on the available data.
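To illustrate these identities, the following simulation sketch (simulated data, arbitrary error variance) generates two parallel forms and checks that their correlation matches both the variance-ratio definition of reliability (Equation 5.4) and the squared true score correlation:

# simulate two parallel measurements of the same true scores
set.seed(7)
n <- 1e5
X <- rnorm(n)                 # true scores
X1 <- X + rnorm(n, sd = .75)  # parallel form 1
X2 <- X + rnorm(n, sd = .75)  # parallel form 2 (equal error variance)

var(X) / var(X1)  # reliability by definition: 1/(1 + .75^2) = .64
cor(X1, X2)       # correlation between parallel forms, ~ .64
cor(X1, X)^2      # squared true score correlation, ~ .64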

5.3.1 Reliability of Composites

Composite measurements average across multiple sub-measurements (i.e., items, trials, ratings) to obtain a more precise score. Let \(\mathsf{\tilde{x}}_q\) denote the outcome of sub-measurement \(q\) such that,

\[ \mathsf{\tilde{x}}_q= \mathsf{x}_q + \mathsf{e}^\mathsf{x}_q \tag{5.5}\]

where \(q=1,\dots,\kappa\) indexes the sub-measurements, \(\mathsf{x}_q\) denotes the true score, and \(\mathsf{e}^\mathsf{x}_q\) denotes the error of the \(q\)th sub-measurement. Each of the three components in Equation 5.5 is defined similarly to Equation 5.3, with the exception that the true score (the expected value) is allowed to vary between sub-measurements. A composite score reduces measurement error by combining multiple sub-measurements. Therefore, measurement \(\mathsf{m}\) of some individual \(\psi\in\Psi\) is a weighted average of sub-measurement outcomes,

\[ \widetilde{X}_\mathsf{m} = \frac{\sum_{q=1}^{\kappa}\mathsf{w}_q\mathsf{\tilde{x}}_q}{\sum_{q=1}^{\kappa}\mathsf{w}_q} \tag{5.6}\]

Clearly, if the weights are uniform, \(\mathsf{w}_1=\mathsf{w}_2=...=\mathsf{w}_\kappa\), then the composite becomes a unit-weighted average. Note that the weighted average of true sub-scores \(\mathsf{x}_q\) is equal to the true score \(X\).

Estimating the reliability of the composite \(\widetilde{X}\) is the primary purpose of this section. Before we can do that, we have to acknowledge that for certain reliability coefficients, assumptions must be made that are analogous to the four assumptions about strictly parallel measurements. Depending on the reliability coefficient, not all four of the assumptions of strictly parallel sub-measurements have to be met. Instead, we can talk about three conditions that hold different assumptions (Haertel 2006, 71):

  • Tau-equivalence: this is the strongest of the three conditions, where true scores are fixed across sub-measurements, \(\mathsf{x}_{1} = \mathsf{x}_{2} = \dots\), and true variances are equivalent, \(\sigma^2_\mathsf{x_1} = \sigma^2_\mathsf{x_2} = \dots\). However, the variances of observed scores are allowed to vary between sub-measurements; in other words, the error variances may differ.

  • Essential tau-equivalence: this condition is looser than tau-equivalence as it allows the true score to vary between sub-measurements. Sub-measurements are related to each other by an additive constant such that \(\mathsf{x}_{1} = \mathsf{x}_{2} + \mathsf{c}_{12}\) (note that this constant is the same for all individuals). The true variances must still be equivalent, \(\sigma^2_\mathsf{x_1} = \sigma^2_\mathsf{x_2}\), while the variances of observed scores are still allowed to vary between sub-measurements.

  • Congeneric measurements: this is the loosest of the three conditions. Congeneric measurements do not require that true variances are equal between sub-measurements because true scores are related between measurements by a linear function, \(\mathsf{x}_{1} = \mathsf{b}_{12}\mathsf{x}_{2} + \mathsf{c}_{12}\). This allows both the mean and variance of the true scores to differ across sub-measurements.

Figure 5.1: Observed scores upon 10 sub-measurements and the composite observed score (dark point + error bars). The left panel shows a score with low reliability and the right panel shows an observed score with high reliability.

Coefficient Alpha

The most common reliability coefficient in the psychological and social sciences is coefficient alpha (also referred to as Cronbach’s alpha, Cronbach 1951). Coefficient alpha assumes the sub-measurements meet the conditions described by essential tau-equivalence. If the sub-measurements are not essentially tau-equivalent, then coefficient alpha is a lower bound (something to keep in mind when we are correcting effect sizes). The equation for the unstandardized version of coefficient alpha is,

\[ \rho_{\widetilde{X}\widetilde{X}'} = \frac{\kappa \,\bar{\sigma}_{\mathsf{\tilde{x}}_q\mathsf{\tilde{x}}_j}} {\bar{\sigma}^2_{\mathsf{\tilde{x}}_q}+ (\kappa-1)\bar{\sigma}_{\mathsf{\tilde{x}}_q\mathsf{\tilde{x}}_j}}\, , \;\;\;\;q\neq j. \tag{5.7}\]

where \(\bar{\sigma}_{\mathsf{\tilde{x}}_q\mathsf{\tilde{x}}_j}\) is the average of the off-diagonal elements of the covariance matrix (i.e., \(q\neq j\)) and \(\bar{\sigma}^2_{\mathsf{\tilde{x}}_q}\) is the average observed score variance across the \(\kappa\) sub-measurements. Covariance matrices are not always available, so instead we can compute a standardized alpha coefficient using the correlation matrix (Charles Spearman 1910; Brown 1910),

\[ \rho_{\widetilde{X}\widetilde{X}'} = \frac{\kappa \,\bar{\rho}_{\mathsf{\tilde{x}}_q\mathsf{\tilde{x}}_j}} {1+ (\kappa-1)\bar{\rho}_{\mathsf{\tilde{x}}_q\mathsf{\tilde{x}}_j}}\, , \;\;\;\;q\neq j. \tag{5.8}\]

Another situation where standardized alpha is preferable is when sub-measurements are on completely different scales and scores therefore had to be standardized before combining them. The equations simplify further if there are only two sub-measurements. For the unstandardized variant of coefficient alpha, two sub-measurements reduce it to a split-half reliability (Rulon 1939, eq. 53),

\[ \rho_{\widetilde{X}\widetilde{X}'} = \frac{4\,\bar{\sigma}_{\mathsf{\tilde{x}}_1\mathsf{\tilde{x}}_2}}{\sigma^2_{\mathsf{\tilde{x}}_1} + \sigma^2_{\mathsf{\tilde{x}}_2} + 2\bar{\sigma}_{\mathsf{\tilde{x}}_1\mathsf{\tilde{x}}_2}}. \tag{5.9}\] whereas the standardized split-half formula is even simpler,

\[ \rho_{\widetilde{X}\widetilde{X}'} = \frac{2\,\bar{\rho}_{\mathsf{\tilde{x}}_1\mathsf{\tilde{x}}_2}}{1 + \bar{\rho}_{\mathsf{\tilde{x}}_1\mathsf{\tilde{x}}_2}}. \tag{5.10}\]

It is worth noting that the Kuder-Richardson-20 reliability is identical to Equation 5.7, with the difference being that the observed sub-scores are Bernoulli random variables (i.e., 0/1, correct/incorrect, yes/no valued), which simplifies the formula (see Kuder and Richardson 1937).
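As a quick illustration of Equation 5.9 and Equation 5.10, here is a sketch that computes both split-half formulas by hand on simulated half-test scores (the data and error variances are arbitrary):

# simulate two half-test composites sharing a common true score
set.seed(1)
n <- 1000
t <- rnorm(n)                # common true score
x1 <- t + rnorm(n, sd = .8)  # first half
x2 <- t + rnorm(n, sd = .8)  # second half

s12 <- cov(x1, x2)
# unstandardized split-half (Equation 5.9)
4*s12 / (var(x1) + var(x2) + 2*s12)
# standardized split-half (Equation 5.10)
2*cor(x1, x2) / (1 + cor(x1, x2))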

Reliability of Agreeableness Mean-Score

The psych package (William Revelle 2023) contains data on the big 5 personality traits from a large sample of 2800 individuals. We can calculate the reliability coefficient for the Agreeableness scale from its five question items (each item is on a 6-point scale). Let’s first load in the data set and then clean it up:

library(psych)
library(psychmeta)
library(tidyverse)
library(missRanger)

## Clean data
df <- bfi |>
  # select only the Agreeableness items
  dplyr::select(A1,A2,A3,A4,A5) |>
  # reverse-score item A1 (keyed in the opposite direction)
  mutate(A1 = 6-A1) |>
  # impute missing values univariately
  imputeUnivariate()

Using the alpha function from the psych package, we can compute the standardized and unstandardized coefficient alpha. Note that the argument check.keys=TRUE allows the function to automatically flip reverse-scored items; however, it is best to first flip the items to the desired direction rather than letting alpha() try to detect reversed items.

# compute reliability
reliability <- psych::alpha(df)

# only display the total scale reliabilities
reliability$total[1:2] 
 raw_alpha std.alpha
 0.6980189 0.7077484

The reliability for the sample is estimated to be \(\rho_{\widetilde{X}\widetilde{X'}}=.70\) for the unstandardized (raw) alpha coefficient and \(\rho_{\widetilde{X}\widetilde{X'}}=.71\) for the standardized alpha coefficient.

If the raw data is not available to us, but the correlation matrix in Table 5.1 is, we can use the standardized formula of alpha to calculate the reliability.

Table 5.1: item correlation matrix
Item A1 A2 A3 A4 A5
A1 1.00 0.33 0.26 0.14 0.18
A2 0.33 1.00 0.47 0.33 0.38
A3 0.26 0.47 1.00 0.36 0.50
A4 0.14 0.33 0.36 1.00 0.31
A5 0.18 0.38 0.50 0.31 1.00

From this correlation matrix we can use Equation 5.8 to calculate the standardized alpha coefficient. In R, the compute_alpha() function in the psychmeta package (Dahlke and Wiernik 2019) takes in a correlation matrix (or a covariance matrix) as an argument.

corr_matrix <- cor(df)
# lower triangle of the correlation matrix, rounded (see Table 5.1):
# 1
# .33 1
# .26 .47 1
# .14 .33 .36 1
# .18 .38 .50 .31 1

compute_alpha(sigma = corr_matrix)
[1] 0.7077484

As expected, the reliability is identical to the standardized coefficient alpha computed from the raw data.
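We can also verify Equation 5.8 by hand: take the average off-diagonal correlation and plug it into the standardized alpha formula,

# average off-diagonal correlation
k <- ncol(corr_matrix)
r_bar <- mean(corr_matrix[lower.tri(corr_matrix)])

# standardized alpha (Equation 5.8), matches compute_alpha()
(k * r_bar) / (1 + (k - 1) * r_bar)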

Mosier’s Reliability

Mosier’s (1943) reliability formulation allows us to take the sub-measurement-level reliability as well as non-unit weights into account. In matrix notation,

\[ \rho_{\widetilde{X}\widetilde{X}'} = \frac{\mathbf{w}^\intercal(\mathbf{R}_\mathsf{\tilde{x}} - \mathbf{I} + \boldsymbol{\rho}_{\mathsf{\tilde{x}}\mathsf{\tilde{x}}'}\mathbf{I} )\mathbf{w}}{\mathbf{w}^\intercal\mathbf{R}_\mathsf{\tilde{x}}\mathbf{w} }. \]

To describe the numerator: the correlation matrix between observed sub-scores, \(\mathbf{R}_\mathsf{\tilde{x}}\), has a square \(\kappa\times\kappa\) identity matrix \(\mathbf{I}\) (1s on the diagonal, 0s in the off-diagonals) subtracted from it, in effect removing the diagonal 1s of the correlation matrix; the diagonal is then replaced with \(\boldsymbol{\rho}_{\mathsf{\tilde{x}}\mathsf{\tilde{x}}'}\mathbf{I}\), a diagonal matrix whose elements are the within-sub-measurement reliability coefficients. The denominator just contains the observed sub-score correlation matrix \(\mathbf{R}_\mathsf{\tilde{x}}\). The vector of weights \(\mathbf{w}\) acts as a weighted summation for the numerator and denominator. If the weights are simply unit weights, then the formula simplifies to,

\[ \rho_{\widetilde{X}\widetilde{X}'} = \frac{\bar{\rho}_{\mathsf{\tilde{x}} \mathsf{\tilde{x}}'} + (\kappa - 1)\bar{\rho}_{\mathsf{\tilde{x}}_q \mathsf{\tilde{x}}_j}}{1+(\kappa-1)\bar{\rho}_{\mathsf{\tilde{x}}_q \mathsf{\tilde{x}}_j}}. \]

where \(\bar{\rho}_{\mathsf{\tilde{x}} \mathsf{\tilde{x}}'}\) is the average reliability across all \(\kappa\) sub-measurements. Mosier’s reliability, as opposed to coefficient alpha, is an upper bound. One of the potential issues with using coefficient alpha for effect size corrections is that it assumes that the correlations between the true sub-scores of each sub-measurement are perfect, \(\rho_{\mathsf{x}_q \mathsf{x}_j}=1\). If the true scores do not correlate perfectly, then there may be true sub-score variance that is not captured by alpha. Mosier’s reliability therefore incorporates the true variance of each sub-measurement by including its reliability. An example of this dilemma: if we want to estimate the reliability of the full-scale IQ score from a battery of cognitive tests (working memory, visuo-spatial reasoning, vocabulary knowledge, etc.), then estimating coefficient alpha from the inter-correlations between these sub-test scores will index only the true variance that is shared between working memory, reasoning, and knowledge, without considering the true variance that is unique to each sub-test. Luckily, these sub-tests usually consist of sub-measurements themselves (i.e., test items), so calculating reliability for each sub-test is not a problem. However, this can become problematic in the case of Likert items: it is quite reasonable to suspect that there is item-specific true score variance, yet the reliability of a single item is often impossible to estimate.

Mosier’s Reliability in R

The psychTools package (William Revelle 2024) contains data on cognitive abilities from the International Cognitive Ability Resource (ICAR) database. There are four sub-tests consisting of basic reasoning, letter series, matrix reasoning, and spatial rotation. Let’s first load in the data set and impute missing values using predictive mean matching (Mayer 2023):

library(psychTools)
library(psychmeta)
library(tidyverse)
library(missRanger)

## Clean data
df <- as.data.frame(ability) |>
  # impute missing values multivariately
  missRanger(verbose=F,pmm.k = 2)

Using the composite_rel_matrix function from the psychmeta package (Dahlke and Wiernik 2019) we can compute Mosier’s reliability. First we need to compute the coefficient alpha for each sub-test:

# initialize empty vector
reliability <- c()

# compute alpha for reasoning test
reliability["reasoning"] <- compute_alpha(
  sigma = cov(df[,ability.keys$reasoning]),
  standardized = FALSE)

reliability["letters"] <- compute_alpha(
  sigma = cov(df[,ability.keys$letters]),
  standardized = FALSE)

reliability["matrix"] <- compute_alpha(
  sigma = cov(df[,ability.keys$matrix]),
  standardized = FALSE)

reliability["rotate"] <- compute_alpha(
  sigma = cov(df[,ability.keys$rotate]),
  standardized = FALSE)

reliability
reasoning   letters    matrix    rotate 
0.6419589 0.6702677 0.5484750 0.7625838 

The reliabilities of the individual sub-tests are poor, but our goal is to create a composite full-scale IQ score, which should be more reliable than any single sub-test. Let’s compute the composite scores:

composite_scores <-  df |>
  # compute subtest scores
  mutate(reasoning = reason.4 + reason.16 + reason.17 + reason.19,
         letters = letter.7 + letter.33 + letter.34 + letter.58,
         matrix = matrix.45 + matrix.46 + matrix.47 + matrix.55,
         rotate = rotate.3 + rotate.4 + rotate.6 + rotate.8) |>
  # keep only the composite scores
  dplyr::select(reasoning, letters, matrix, rotate) |>
  # compute IQ scores
  mutate(IQ = reasoning + letters + matrix + rotate)

head(composite_scores)
   reasoning letters matrix rotate IQ
5          0       1      1      0  2
6          1       2      0      1  4
7          2       1      2      0  5
8          1       1      1      0  3
9          3       1      2      0  6
10         4       4      3      3 14

Then we can compute Mosier’s reliability for the IQ scores using the composite_rel_matrix() function in the psychmeta package:

# data frame with just the subtest scores (without IQ)
subtests <- composite_scores |>
  dplyr::select(reasoning,letters,matrix,rotate)

# correlation matrix between subtest scores
corr_matrix <- cor(subtests)

# create a vector of 1s to get standardized reliability
sds <- rep(1,4)

# compute Mosier's reliability
composite_rel_matrix(rel_vec = reliability,
                     r_mat = corr_matrix,
                     sd_vec = sds)
[1] 0.8498067

As expected, the reliability is much larger for the full-scale IQ scores than for the individual sub-tests. Let’s briefly compare this value with coefficient alpha, which does not take the reliability of each sub-test into account:

compute_alpha(cor(subtests))
[1] 0.7514907

The reliability is much lower. This is because alpha’s estimate of the total true score variance does not include the true score variance that is exclusive to each sub-test.
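As a check on the matrix formulation of Mosier’s reliability, we can implement it directly: with unit weights, the sub-test correlation matrix, and the sub-test reliabilities on the diagonal, the result should reproduce the composite_rel_matrix() value above,

# Mosier's reliability from the matrix formula with unit weights
w <- rep(1, 4)          # unit weights
R <- corr_matrix        # observed sub-test correlation matrix
D <- diag(reliability)  # diagonal matrix of sub-test reliabilities

num <- t(w) %*% (R - diag(4) + D) %*% w
den <- t(w) %*% R %*% w
num / den  # ~ .85, matching composite_rel_matrix()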

5.3.2 Coefficient of Stability

Transient errors represent fluctuations in observed scores over time. These fluctuations, even if they are systematic (e.g., fatigue over the course of a single day), add extraneous variance that can mask individual differences. Whether transient fluctuations count as error depends on the research goal, so researchers should take care in deciding which sources of variance should be considered error in their study. A coefficient of stability (also referred to as test-retest reliability) is simply the correlation between observed scores upon repeated measurements. Note that calculating the Pearson correlation coefficient between time-points ignores systematic changes (e.g., practice effects). Test-retest reliability is visualized in Figure 5.2.

Figure 5.2: Illustrating test-retest reliability. Top-left and top-right panels show the correlation between observed scores at both time-points for a measure that has low and high reliability, respectively. Bottom-left and bottom-right panels show the within-person change from time-point 1 to time-point 2 for scores with low and high reliability, respectively.

5.3.3 Alpha Coefficient of Stability

If we want to capture both transient error and sub-measurement error variance, then we can also calculate the alpha coefficient of stability (also known as the test-retest alpha). This reliability estimate is the average split-half correlation across two time points. Specifically, the procedure is as follows,

  1. Split the set of sub-measurements at each time point into two halves, using the same split for both time points.

  2. Then calculate the split-half reliability using Equation 5.9 or Equation 5.10 between the composite of the first half from time point 1 and the composite of the second half from time point 2.

  3. Repeat steps 1 and 2 for all possible splits; the average across splits is the alpha coefficient of stability (a sketch of this procedure is shown below).
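The hypothetical helper below sketches this procedure, assuming two data frames, time1 and time2, that contain the same \(\kappa\) sub-measurement columns measured at each time point (the function name and inputs are illustrative assumptions):

# average split-half correlation across two time points (a sketch)
test_retest_alpha <- function(time1, time2){
  k <- ncol(time1)
  halves <- combn(1:k, floor(k/2), simplify = FALSE)
  rels <- sapply(halves, function(h){
    comp1 <- rowSums(time1[, h, drop = FALSE])   # one half at time point 1
    comp2 <- rowSums(time2[, -h, drop = FALSE])  # other half at time point 2
    r <- cor(comp1, comp2)
    2*r / (1 + r)                                # Equation 5.10
  })
  mean(rels)  # average over all possible splits
}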

5.4 Correcting Correlations

5.4.1 Defining the Target Correlation

The target correlation for this section can be defined as the population correlation between true scores, \(\rho_{XY}\). This can be related to the sample correlation between observed proxy scores via the relation,

\[ r_{\widetilde{X}\widetilde{Y}} = \alpha \rho_{XY} + \varepsilon_r, \] where \(\alpha\) is the artifact attenuation/inflation factor and \(\varepsilon_r\) is the sampling error term.

5.4.2 Artifact Correction for Correlations

Classical measurement error induces systematic bias in effect size estimates such as correlation coefficients (C. Spearman 1904). In the population, let us assume there is some factor \(\alpha\) that accounts for the systematic bias in observed score correlations (\(\rho_{\widetilde{X}\widetilde{Y}}\)) relative to true score correlations (\(\rho_{XY}\)), such that

\[ \rho_{\widetilde{X}\widetilde{Y}} = \alpha \rho_{XY}. \tag{5.11}\]

Since the correlation is defined as the covariance standardized by the standard deviations, the population correlation between true scores, \(X\) and \(Y\), is defined as,

\[ \rho_{XY}=\frac{\sigma_{XY}}{\sigma_{X} \sigma_{Y}}. \tag{5.12}\]

Likewise, the correlation between the observed scores, \(\widetilde{X}\) and \(\widetilde{Y}\), is the observed covariance divided by the observed standard deviations,

\[ \rho_{\widetilde{X}\widetilde{Y}} =\frac{\sigma_{\widetilde{X}\widetilde{Y}}}{\sigma_\widetilde{X} \sigma_\widetilde{Y}}. \]

However, if we assume that there is no covariance between errors \(\sigma_{E_X E_Y} = 0\), then the covariance between observed scores is only attributable to the covariance between true scores, therefore \(\sigma_{\widetilde{X}\widetilde{Y}} = \sigma_{XY}\). This means that the observed score correlation can be expressed with the true score covariance in the numerator,

\[ \rho_{\widetilde{X}\widetilde{Y}} =\frac{\sigma_{XY}}{\sigma_\widetilde{X} \sigma_\widetilde{Y}}. \tag{5.13}\]

Now the only difference between the observed score correlation (Equation 5.13) and the true score correlation (Equation 5.12) is the standard deviations in the denominator. In the presence of measurement error, the observed score standard deviations (\(\sigma_\widetilde{X}\) and \(\sigma_\widetilde{Y}\)) will be larger than the true score standard deviations (\(\sigma_X\) and \(\sigma_Y\)). Since reliability is defined as the ratio of true variance to total observed variance (see Equation 5.4), we can show how the observed variance is inflated as a function of reliability,

\[\begin{aligned} \sigma^2_\widetilde{X} &=\sigma^2_{X} \left(\frac{\sigma^2_\widetilde{X}}{\sigma^2_X} \right)\\[.3em] &= \sigma^2_X\left(\frac{1}{\rho_{\widetilde{X}\widetilde{X}'}} \right)\\[.3em] &= \frac{\sigma^2_{X}}{\rho_{\widetilde{X}\widetilde{X}'}}. \end{aligned}\]

Therefore the observed standard deviation is,

\[ \sigma_\widetilde{X} = \frac{\sigma_{X}}{\sqrt{\rho_{\widetilde{X}\widetilde{X}'}}}. \tag{5.14}\]

Since the reliability, and its square root, will be less than 1, the observed score standard deviation will be larger than the true score standard deviation, \(\sigma_\widetilde{X} \geq \sigma_X\). If we use the definition of an observed score correlation in Equation 5.13, then we can replace \(\sigma_\widetilde{X}\) and \(\sigma_\widetilde{Y}\) with \(\frac{\sigma_X}{\sqrt{\rho_{\widetilde{X}\widetilde{X}'}}}\) and \(\frac{\sigma_Y}{\sqrt{\rho_{\widetilde{Y}\widetilde{Y}'}}}\), respectively. Now we can show mathematically how the observed score correlation differs from the true score correlation:

\[\begin{aligned} \rho_{\widetilde{X}\widetilde{Y}} &= \frac{\sigma_{XY}}{\left[\frac{\sigma_X}{\sqrt{\rho_{\widetilde{X}\widetilde{X}'}}} \right] \left[ \frac{\sigma_Y}{\sqrt{\rho_{\widetilde{Y}\widetilde{Y}'}}} \right] } \\[.3em] &= \sqrt{\rho_{\widetilde{X}\widetilde{X}'}}\sqrt{\rho_{\widetilde{Y}\widetilde{Y}'}}\,\times\frac{\sigma_{XY}}{\sigma_X\sigma_Y} \\[.3em] &= \underset{\alpha}{\underbrace{\sqrt{\rho_{\widetilde{X}\widetilde{X}'}}\sqrt{\rho_{\widetilde{Y}\widetilde{Y}'}}} } \times \rho_{XY} \end{aligned}\]

Therefore the attenuation factor described in Equation 5.11 is,

\[ \alpha = \sqrt{\rho_{\widetilde{X}\widetilde{X}'}}\sqrt{\rho_{\widetilde{Y}\widetilde{Y}'}} \tag{5.15}\]

This attenuation formula was first derived by Charles Spearman (1904). See Figure 5.3 for a visualization of the attenuation.

Figure 5.3: Visualizing the attenuation of an observed score correlation (\(\rho_{\widetilde{X}\widetilde{Y}}\)) due to measurement error. Top left panel shows a scatter plot of the correlation between the true scores of the independent and dependent variable. Top right panel shows a scatter plot of the correlation between the observed scores of the independent and dependent variable. The bottom left panel shows the reliability of a single variable and its relationship with the observed score correlation while varying the true score correlation (true score correlations are represented as the red dots, i.e., when reliability is perfect). The bottom right panel shows the attenuation of a true score correlation when both \(X\) and \(Y\) are affected by measurement error. Darker red indicates a larger observed correlation, peaking at an observed score correlation of .50 (when the reliability of both \(X\) and \(Y\) is 1), which is equivalent to the true score correlation.
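To see the attenuation formula in action, the simulation sketch below (arbitrary parameter values) induces a true score correlation of .50 and a reliability of .80 for both variables, then compares the observed correlation to the prediction from Equation 5.15:

# simulate attenuation of a correlation due to measurement error
set.seed(3)
n <- 1e5
X <- rnorm(n)
Y <- .5*X + rnorm(n, sd = sqrt(1 - .5^2))  # true score correlation = .50
X_obs <- X + rnorm(n, sd = .5)             # reliability = 1/1.25 = .80
Y_obs <- Y + rnorm(n, sd = .5)             # reliability = .80

cor(X_obs, Y_obs)        # ~ .40
.5 * sqrt(.8) * sqrt(.8) # predicted attenuated correlation = .40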

Let us recall that we can express the study sample correlation between observed scores as a function of the true score population correlation (i.e., the target correlation, \(\rho_{XY}\)),

\[ r_{\widetilde{X}\widetilde{Y}} = \alpha\rho_{XY} + \varepsilon_r. \]

Now that we have derived the attenuation factor \(\alpha\), we can plug in its actual form from Equation 5.15,

\[ r_{\widetilde{X}\widetilde{Y}} = \left(\sqrt{\rho_{\widetilde{X}\widetilde{X}'}}\sqrt{\rho_{\widetilde{Y}\widetilde{Y}'}}\right) \rho_{XY} + \varepsilon_r. \]

We could use the \(\alpha\) coefficient to correct the correlation by dividing both sides of the above equation; however, the population reliabilities \(\rho_{\widetilde{X}\widetilde{X}'}\) and \(\rho_{\widetilde{Y}\widetilde{Y}'}\) are unknown in practice, so we have to use sample estimates. Since we must use sample estimates of reliability, we only have access to a sample estimate of \(\alpha\), which we denote with the English letter \(a\). Therefore we can correct the observed score study correlation by dividing by the estimated attenuation factor,

\[ r_{XY} = \frac{r_{\widetilde{X}\widetilde{Y}}}{a} = \frac{r_{\widetilde{X}\widetilde{Y}}}{\sqrt{r_{\widetilde{X}\widetilde{X}'}}\sqrt{r_{\widetilde{Y}\widetilde{Y}'}}} \]

The corrected correlation coefficient is a consistent estimator of the target correlation, \(\rho_{XY}\).

Correcting Correlations in R

A study reports a correlation between mean reaction time and exam grades of \(r_{\widetilde{X}\widetilde{Y}}=.40\), with reliability estimates of \(r_{\widetilde{X}\widetilde{X}'}=.80\) and \(r_{\widetilde{Y}\widetilde{Y}'}=.70\) and a sample size of \(n=100\).

library(psychmeta)

rXY <- .40  # correlation between X and Y
rXX <- .80  # reliability of X
rYY <- .70  # reliability of Y
n <- 100    # sample size

# correct correlation
correct_r(correction = 'meas',
          rxyi = rXY,
          rxx = rXX,
          ryy = rYY)
Correlations Corrected for Measurement Error:
---------------------------------------------------------------------------------------
  rxy   rxp   rty   rtp
1 0.4 0.478 0.447 0.535

The psychmeta output shows the correction for each combination of variables: rxy is the observed correlation, rxp is corrected for measurement error in \(Y\) only, rty is corrected for measurement error in \(X\) only, and rtp is the correlation corrected for measurement error in both variables. Therefore the sample estimate of the true score correlation is \(r_{XY} = .535\).

Sampling Variance of Corrected Correlation

The next step is to obtain the sampling variance of the corrected correlation coefficient. The sampling variance of the uncorrected correlation, \(\mathrm{var}(r_{\widetilde{X}\widetilde{Y}})\), can be adjusted directly. If we treat the reliability coefficients as fixed, known values, then the sampling variance can be calculated simply as,

\[ \mathrm{var}(r_{XY}) = \frac{\mathrm{var}(r_{\widetilde{X}\widetilde{Y}})}{r_{\widetilde{X}\widetilde{X}'}\,r_{\widetilde{Y}\widetilde{Y}'}} \]

This formula generally performs well; however, there are two additional methods that are preferable: bootstrapping and Taylor series approximations. If the raw data is available, then a sampling-with-replacement procedure (i.e., bootstrapping) can provide a non-parametric estimate of the sampling variance.

Bootstrapping Corrected Correlations in R

As an example in R, we can calculate the sampling variance of the corrected correlation between Extraversion and Agreeableness sum scores from the bfi data set in the psych package (William Revelle 2023).

library(tidyverse)
library(psychmeta)
library(psych)
library(missRanger)

set.seed(1)

# clean data and compute composites
df <- bfi |>
  dplyr::select(A1,A2,A3,A4,A5,
                E1,E2,E3,E4,E5) |>
  mutate(A1 = 6-A1, E1 = 6-E1, E2 = 6-E2) |>
  mutate(A = A1 + A2 + A3 + A4 + A5,
         E = E1 + E2 + E3 + E4 + E5) |>
  imputeUnivariate()

# initialize corrected correlation vector
rAE_corrected <- c()

# loop through 1500 bootstrap iterations
for(i in 1:1500){
  
  # sample the data with replacement
  boot = sample(1:nrow(df),nrow(df),replace = TRUE)
  boot_df <- df[boot,]
  
  # compute bootstrap correlation matrix for each trait
  corr_mat_A <- cor(boot_df[,c("A1","A2","A3","A4","A5")])
  corr_mat_E <- cor(boot_df[,c("E1","E2","E3","E4","E5")])

  # compute reliability
  rAA <- compute_alpha(corr_mat_A)
  rEE <- compute_alpha(corr_mat_E)
  
  # compute correlation between composites
  rAE <- cor(boot_df$A, boot_df$E)
  
  # correct correlation
  rAE_corrected[i] <- rAE / sqrt(rAA*rEE)
}

# bootstrapped variance estimate
var(rAE_corrected)
[1] 0.0004839236
## compare bootstrapped variance with closed form
var_observed <- (1 - cor(df$A,df$E)^2)^2 / (nrow(df)-1)

rAA <- compute_alpha(cor(df[,c("A1","A2","A3","A4","A5")]))
rEE <- compute_alpha(cor(df[,c("E1","E2","E3","E4","E5")]))

var_corrected <- var_observed / (rAA * rEE)

# closed form variance estimate
var_corrected
[1] 0.0004384132

In this example, the bootstrapped sampling variance and the closed form are nearly equal.

Bobko and Rieck (1980) derived sampling variance formulas from Taylor series approximations for three different situations. However, Bobko and Rieck (1980) only derived them for correcting the correlation for the dependent variable \(Y\). Using the same methodology, I derive the variance of the corrected correlation for the case where the reliability estimates and the correlation come from independent samples (e.g., reliability comes from a test manual rather than from the study sample). The formula requires the sample size of the study sample, \(n\), as well as of the samples in which the reliability coefficients were estimated, \(n_{r_{\widetilde{X}\widetilde{X}'}}\) and \(n_{r_{\widetilde{Y}\widetilde{Y}'}}\).

\[\begin{aligned} \mathrm{var}(r_{XY}) =\;& \frac{\mathrm{var}(r_{\widetilde{X}\widetilde{Y}})}{r_{\widetilde{X}\widetilde{X}'}r_{\widetilde{Y}\widetilde{Y}'}} \\ &+\frac{r^2_{\widetilde{X}\widetilde{Y}}\left(1-r^2_{\widetilde{X}\widetilde{X}'}\right)^2}{4(n_{r_{\widetilde{X}\widetilde{X}'}}-1)r_{\widetilde{X}\widetilde{X}'}^3r_{\widetilde{Y}\widetilde{Y}'}} \\ &+\frac{r^2_{\widetilde{X}\widetilde{Y}}\left(1-r^2_{\widetilde{Y}\widetilde{Y}'}\right)^2}{4(n_{r_{\widetilde{Y}\widetilde{Y}'}}-1)r_{\widetilde{X}\widetilde{X}'}r_{\widetilde{Y}\widetilde{Y}'}^3} \end{aligned} \tag{5.16}\]

The full derivation can be found in the Appendix of this chapter.
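A direct implementation of Equation 5.16 is straightforward; the helper below is a hypothetical function (not part of psychmeta) that takes the observed correlation, the two reliability estimates, and the sample sizes of the independent reliability samples:

# sampling variance of a corrected correlation (Equation 5.16)
var_r_corrected <- function(var_rxy, rxy, rxx, ryy, n_rxx, n_ryy){
  var_rxy / (rxx * ryy) +
    (rxy^2 * (1 - rxx^2)^2) / (4 * (n_rxx - 1) * rxx^3 * ryy) +
    (rxy^2 * (1 - ryy^2)^2) / (4 * (n_ryy - 1) * rxx * ryy^3)
}

# example with arbitrary inputs
var_r_corrected(var_rxy = .007, rxy = .40, rxx = .80, ryy = .70,
                n_rxx = 250, n_ryy = 250)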

5.4.3 Differential Measurement Error

Before we move on to correcting standardized mean differences, I have to confess, I snuck in a major assumption earlier in this section:

However, if we assume that there is no covariance between errors \(\sigma_{E_X E_Y} = 0\), then the covariance between observed scores is only attributable to the covariance between true scores, therefore \(\sigma_{\widetilde{X}\widetilde{Y}} = \sigma_{XY}\).

The problem with this assumption is that the corrected correlation will be under-estimated if \(\sigma_{E_XE_Y}<0\) or over-estimated if \(\sigma_{E_XE_Y}>0\). A conservative approach would be to use Mosier’s reliability in the correction procedure, allowing for sub-measurement-specific true variance and reducing the likelihood that the correction is an over-estimate. A better correction procedure is the one introduced by Donald W. Zimmerman and Williams (1977),

\[ r_{XY} = \frac{r_{\widetilde{X}\widetilde{Y}}}{\sqrt{r_{\widetilde{X}\widetilde{X}'}}\sqrt{r_{\widetilde{Y}\widetilde{Y}'}}} -\frac{r_{E_XE_Y} \sqrt{1-r_{\widetilde{X}\widetilde{X}'}}\sqrt{1-r_{\widetilde{Y}\widetilde{Y}'}}}{\sqrt{r_{\widetilde{X}\widetilde{X}'}}\sqrt{r_{\widetilde{Y}\widetilde{Y}'}}} \tag{5.17}\] Notice that the first term is identical to the correction formula described earlier; in fact, if \(r_{E_XE_Y}\) is set to zero, we recover the original correction formula. The problem with Equation 5.17 is that it requires an estimate of \(r_{E_XE_Y}\), which is rarely reported directly and is only attainable if the raw data is available.
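Equation 5.17 is simple to apply once an estimate of \(r_{E_XE_Y}\) is in hand; the sketch below (a hypothetical helper) confirms that it reduces to the standard correction when the error correlation is zero:

# Zimmerman & Williams (1977) correction (Equation 5.17)
correct_r_zw <- function(rxy, rxx, ryy, r_exey){
  rxy / (sqrt(rxx) * sqrt(ryy)) -
    (r_exey * sqrt(1 - rxx) * sqrt(1 - ryy)) / (sqrt(rxx) * sqrt(ryy))
}

correct_r_zw(rxy = .40, rxx = .80, ryy = .70, r_exey = 0)   # ~ .535, as before
correct_r_zw(rxy = .40, rxx = .80, ryy = .70, r_exey = .10) # positively correlated errors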

5.5 Correcting Standardized Mean Differences (SMD)

5.5.1 Defining the Target SMD

Prior to correcting for measurement error let us define our target effect size. Our target effect size is the SMD between groups 0 and 1 with respect to the true scores, \(Y\), of our dependent variable,

\[ \delta_{GY} = \frac{\mu_{Y|G=1} - \mu_{Y|G=0}}{\sigma_{Y|G}}. \]

where \(\mu_{Y|G=1}\) and \(\mu_{Y|G=0}\) are the conditional population means for each group (\(G\)). If we are given a sample observed score SMD, then we can express the relationship between the population true score SMD (\(\delta_{GY}\)) and our observed SMD with our effect size model:

\[ d_{GY} = \alpha\delta_{GY}+\varepsilon_d. \]

Here, \(\alpha\) is the attenuation factor induced by measurement error and \(\varepsilon_d\) denotes sampling error.

5.5.2 Artifact Correction for SMDs

The population means of true scores and observed scores are identical, since classical measurement error, as defined here, only affects the variance of scores. Therefore the population mean difference is equivalent for true scores and observed scores,

\[ \underset{\textsf{observed score mean diff}}{\underbrace{\mu_{\widetilde{Y}|G=1} - \mu_{\widetilde{Y}|G=0}}}= \underset{\textsf{true score mean diff}}{\underbrace{\mu_{Y|G=1} - \mu_{Y|G=0}}}. \]

This means that an unstandardized mean difference is not biased by classical measurement error. Taking advantage of this fact, we can express an observed score SMD by replacing the observed score mean difference in the numerator with the true score mean difference such that,

\[ \delta_{G\widetilde{Y}} = \overset{\textsf{observed score mean diff}}{\overbrace{\frac{\mu_{\widetilde{Y}|G=1} - \mu_{\widetilde{Y}|G=0}}{\sigma_{\widetilde{Y}|G}}}} = \overset{\textsf{true score mean diff}}{\overbrace{\frac{ \mu_{Y|G=1} - \mu_{Y|G=0}}{\sigma_{\widetilde{Y}|G}}}} \tag{5.18}\]

Similar to Equation 5.14, the observed score within-group standard deviation \(\sigma_{\widetilde{Y}|G}\) can be expressed in terms of the true score within-group standard deviation \(\sigma_{Y|G}\) and the within-group reliability \(\rho_{\widetilde{Y}\widetilde{Y}'|G}\),

\[ \sigma_{\widetilde{Y}|G}=\frac{\sigma_{Y|G}}{\sqrt{\rho_{\widetilde{Y}\widetilde{Y}'|G}}} \]

We can then plug this in for \(\sigma_{\widetilde{Y}|G}\) in the observed score SMD formula (Equation 5.18) which will allow us to recover the true score SMD and parse out the artifact attenuation coefficient \(\alpha\),

\[\begin{aligned} \delta_{G\widetilde{Y}} &=\frac{\mu_{Y|G=1} - \mu_{Y|G=0}}{\sigma_{\widetilde{Y}|G}} \\[.3em] &= \frac{\mu_{Y|G=1} - \mu_{Y|G=0}}{\left[\frac{\sigma_{Y|G}}{\sqrt{\rho_{\widetilde{Y}\widetilde{Y}'|G}}}\right]} \\[.3em] &= \underset{\alpha}{\underbrace{\sqrt{\rho_{\widetilde{Y}\widetilde{Y}'|G}}} } \times \delta_{GY}. \end{aligned}\]

Therefore the artifact attenuation factor \(\alpha\) is,

\[ \alpha = \sqrt{\rho_{\widetilde{Y}\widetilde{Y}'|G}}. \tag{5.19}\]

For a given sample, the within-group reliability, and thus the artifact attenuation factor, must be estimated. The within-group reliability can be estimated by pooling the reliabilities in group 0 (\(r_{\widetilde{Y}\widetilde{Y}'|G=0}\)) and group 1 (\(r_{\widetilde{Y}\widetilde{Y}'|G=1}\)),

\[ r_{\widetilde{Y}\widetilde{Y}'|G} = \frac{(n_0-1)r_{\widetilde{Y}\widetilde{Y}'|G=0} + (n_1-1) r_{\widetilde{Y}\widetilde{Y}'|G=1}}{n -2}, \] where \(n = n_0 + n_1\). Using Equation 5.19 we can correct the SMD with the sample estimate \(a=\sqrt{r_{\widetilde{Y}\widetilde{Y}'|G}}\),

\[ d_{GY} = \frac{d_{G\widetilde{Y}}}{a} = \frac{d_{G\widetilde{Y}}}{\sqrt{r_{\widetilde{Y}\widetilde{Y}'|G}}}. \] See Figure 5.4 for an illustration of how SMDs are affected by classical measurement error.

Figure 5.4: Visualizing the attenuation of standardized mean differences. Even though the means (dashed lines) are identical for the true score (top plot, in blue) and observed score distributions (bottom plot, in red), the true score distributions have smaller variance than the observed score distributions. Since the variances, and thus the standard deviations, are larger for the observed scores, the SMD for observed scores is smaller than for true scores.
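A small sketch of this correction (all numeric values are arbitrary illustrations): pool the within-group reliabilities, then divide the observed SMD by the square root of the pooled value,

# pool within-group reliabilities and correct the SMD
n0 <- 50; n1 <- 60
ryy_g0 <- .78; ryy_g1 <- .82  # within-group reliability estimates
d_obs <- .35                  # observed score SMD

ryy_pooled <- ((n0 - 1)*ryy_g0 + (n1 - 1)*ryy_g1) / (n0 + n1 - 2)
d_obs / sqrt(ryy_pooled)      # corrected SMD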

Sampling Variance of Corrected SMD

The sampling variance can be calculated similarly to the corrected correlation. If the raw data is available, the best approach is to compute the sampling variance with a bootstrapping procedure (see the R example on bootstrapping in the previous section on correlations). If we treat the within-group reliability as fixed (or assumed to be known a priori), then the corrected sampling variance is,

\[ \mathrm{var}(d_{GY}) = \frac{\mathrm{var}(d_{G\widetilde{Y}})}{r_{\widetilde{Y}\widetilde{Y}'|G}} \]

The last option, for situations where the reliability and the SMD are estimated from different samples, is a Taylor series approximation,

\[ \mathrm{var}(d_{GY}) = \frac{\mathrm{var}(d_{G\widetilde{Y}})}{r_{\widetilde{Y}\widetilde{Y}'|G}} +\frac{d^2_{G\widetilde{Y}}\left(1-r^2_{\widetilde{Y}\widetilde{Y}'|G}\right)^2}{4\left(n_{r_{\widetilde{Y}\widetilde{Y}'|G}}-1\right)r^3_{\widetilde{Y}\widetilde{Y}'|G}}. \tag{5.20}\]

It is common for studies to report only the full sample reliability and not the reliability within each group. If the groups differ substantially on the dependent variable \(Y\), then the total sample reliability will over-estimate the within-group reliability. When the total sample reliability is all that is available, we can correct the standardized mean difference by first converting the SMD to a point-biserial correlation. To do this we also need the observed proportion of subjects in group 0 or 1 (\(p_G = n_0/n\) or \(p_G = n_1/n\); it does not matter which one is chosen). This three-step process is as follows (a sketch implementing it follows the list):

  1. Convert the SMD to a point-biserial correlation: \[ r_{G\widetilde{Y}} = \frac{d_{G\widetilde{Y}}}{\sqrt{\frac{1}{p_G(1-p_G)} +d^2_{G\widetilde{Y}}} } \]

  2. Correct the point-biserial correlation: \[ r_{GY} = \frac{r_{G\widetilde{Y}} }{\sqrt{r_{\widetilde{Y}\widetilde{Y}'}}} \]

  3. Convert back to an SMD: \[ d_{GY} = \frac{r_{GY}}{\sqrt{p_G(1-p_G)(1-r^2_{GY})}} \]
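A small sketch implementing the three steps (a hypothetical helper, not part of psychmeta); with \(d_{G\widetilde{Y}}=.40\), \(r_{\widetilde{Y}\widetilde{Y}'}=.80\), and equal group sizes, it reproduces the corrected value in the applied example below:

# three-step SMD correction using total-sample reliability
correct_d_total_rel <- function(d_obs, ryy, pG){
  # 1. convert the SMD to a point-biserial correlation
  r_obs <- d_obs / sqrt(1/(pG*(1 - pG)) + d_obs^2)
  # 2. correct the point-biserial correlation
  r_cor <- r_obs / sqrt(ryy)
  # 3. convert back to an SMD
  r_cor / sqrt(pG*(1 - pG)*(1 - r_cor^2))
}

correct_d_total_rel(d_obs = .40, ryy = .80, pG = .50) # ~ .449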

With this correction procedure, the sampling variance must also go through the same three-step procedure. Treating \(a\) as fixed and collapsing the three-step procedure into a single formula, we get,

\[ \mathrm{var}(d_{GY}) = \frac {\mathrm{var}(d_{G\widetilde{Y}}) \times\left(\frac{r_{GY}}{r_{G\widetilde{Y}}}\right)^2} {\left(1+d_{G\widetilde{Y}}^2p_G[1-p_G]\right)^3\left(1-r_{GY}^2\right)^3}. \tag{5.21}\]

Applied Example in R

Imagine we want to test the effectiveness of a drug that is meant to alleviate stress. We conduct a study and randomize a sample of 150 people into a control group (\(n=75\)) and a treatment group (\(n=75\)). After the intervention, the subjects report their stress levels on a self-report scale. The scale was shown to have a full sample reliability of \(r_{\widetilde{Y}\widetilde{Y}'}=.80\). We can correct the observed score standardized mean difference with the correct_d function in the psychmeta package (Dahlke and Wiernik 2019).

library(psychmeta)

# observed values
dGY <- .40
rYY <- .80
n0 <- n1 <- 75

d_correct <- correct_d(correction = 'meas',
                         d = dGY,
                         ryy = rYY,
                         n1 = n0,
                         n2 = n1)

# corrected SMD
d_correct$d_values$dGp$value
[1] 0.4494666

The output of psychmeta shows a slightly larger corrected SMD (\(d_{GY} = 0.45\)). To calculate the sampling variance, we can use the var_error_d() function in the psychmeta package. The correction function not only gave us the corrected \(d\) value but also provided the effective sample size, which is adjusted for the correction procedure. The effective sample size can then be inputted as the sample size for var_error_d(). This way the sampling variance calculation utilizes the effective sample size rather than the actual sample size, which is functionally identical to Equation 5.21:

# compute sampling variance of corrected SMD
var_error_d(d = d_correct$d_values$dGp$value, 
            n1 = d_correct$d_values$dGp$n_effective)
[1] 0.0354002