# Respectful Operationalism, Realism, and Models of Measurement Error in Psychometrics

## Some useful definitions

Let us first note that when I refer to a **measurand** I simply mean the (psychological) attribute that we intend to measure.

A **realist** defines the measurand outside of the measurement procedure, that is, the measurand is some test-independent attribute that is estimated by scores produced by the test.

An **operationalist** would instead suppose that the measurand is defined by the test itself and can not be defined outside of the test (i.e., the measurand is test-dependent). Note that a *strict* operationalist in the form that I discuss in this blog post is a sort of extreme operationalist. In the measurement error modeling section, I also describe a *laxed operationalist*, but it is just a useful descriptor for contrasting two types of operationalist measurement models.

A **respectful operationalist** is an operationalist that defines the measurand in terms of the test, but the test is validated for extra-operational meanings and usability (i.e., meanings and usability beyond the test, Vessonen 2021). So the respectful operationalist is kind of like “sure, yeah we can define the measurand in terms of the test, but the test needs to incorporate common connotations of the target concept and produce usable scores”.

## What might validity look like in each of these frameworks

In **realist** framework, validity may fall under the principles set by Borsboom, Mellenbergh, and Van Heerden (2004), where the psychological attribute (or construct) has to both 1) exist and 2) cause variation in the measurement procedure. This view of validity is centered around construct validity.

A (strict) **operationalist** need not validate a measure, even if the measure is nonsensical, the measurand is still defined by the test.

A **respectful operationalist** can validate a measure using a similar approach to Michael T. Kane’s argument-based approach to validity in which he guides researchers to specify and justify the intended use of the measure. Specifically, Vessonen (2021) proposes that a test is valid if the test incorporates common extra-operational connotations of the target concept. This will be more closely related to test-related types of validity such as content and criterion validity.

## Operationalist and Realist Measurement Error Models

Okay let’s start at the foundation: we observe a test score \(X_p\) for a person, \(p\).

**A strict operationalist** may simply define a person’s *true* score \(T_p\) as the observed score such that, \(T_p = X_p\). Trivially, the observed score can be modeled *as* the true score,

\[ X_p = T_p \tag{1}\]

There is something attractive and so practical about this model, however there are immediate problems of usability. What if we test that person again and they obtain a different score? What if the scoring is done by a rater and a different rater assigns a different score to that person? To answer this, let us use an example where a person’s true score is defined as the observed score \(X_{par}\), for person \(p\), test administration \(a\), and rater \(r\) such that,

\[ T_p = X_{par} \]

However the problem now is that if the observed score varies across test-administrations and/or raters then the true score will not be generalizable across test-administrations or raters,

\[ T_p = X_{par} \neq X_{pa(r+1)} \neq X_{p(a+1)r} \]

This will force us to define a true score for not only each person, but *also* each test administration and rater such that,

\[ T_{par} = X_{par}. \]

However differences in observed scores between test-administration and raters may not be of scientific interest if we only really care about person-level differences. This specific operationalist approach is simply unusable in practice. Although, it is interesting to note, that the reliability of observed scores is necessarily perfect and can be shown by the reliability index, \(\rho_{XT}=1\) (i.e., the correlation between true and observed scores).

A more **laxed operationalist** definition of true scores would be to define it based on the *expectation* of observed scores conditional on a person. Therefore a definition of a true score for a given person can be expressed by,

\[ T_p = \mathbb{E}[X_{par}\mid p] \]

Conditional expectation (\(\mathbb{E}[\cdot]\)) is the average observed score over all possible observed scores for a given person. This also provides the beautiful property that \(X_{p}\) for a given person, is *calibrated* to \(T_p\), since,

\[ T_p - \mathbb{E}[X_p\mid p] = T_p - T_p = 0. \]

So now, even if observed scores vary across raters and/or test administrations, the true score will remain constant. The question at this point is how can we model a single instance of an observed score. Here, we will need to introduce measurement errors. **Measurement errors** can be defined as differences between observed scores and true scores. Remember that in our strict operationalist model (Equation 1) we did not have any measurement errors since the true score was equivalent to the observed score (i.e., if \(T_p = X_p\), then \(T_p - X_p = 0\)). Now that the true score is the conditional expectation of the observed score, a single instance of an observed score *can* differ from the true score. We can thus add a new term to the model described in Equation 1,

\[ X_{p} = T_p + E_{p}, \tag{2}\]

where \(E_{p}\) indicates the error in the observed score for a given person and measurement. Since we started with the definition of true scores being \(T_p = \mathbb{E}[X \mid p]\), an extremely useful consequence of this is that the conditional expectation of measurement errors within a person is zero. We can demonstrate why this is the case with a short derivation (first re-arranging Equation 2 so that \(E_{p}\) is on the left hand side),

\[\begin{align} \mathbb{E}[E_{p}\mid p] &= \mathbb{E}[T_p - X_p\mid p] \\[.3em] &= \mathbb{E}[T_p\mid p] - \mathbb{E}[X_p\mid p] \\[.3em] &= T_p - T_p \\[.3em] &= 0 \\ \end{align}\]

Errors that balance out over all possible observed scores for a given person is what distinguishes the classical test theory model from other measurement models (Kroc and Zumbo 2020). It is important to note that the algebraic formulation of \(X=T+E\) is **not** what defines the classical test theory model, in fact, many measurement error models come in an analogous form. It is the conditional expectations between the components of the model which distinguish classical errors from other measurement error models (e.g., Berkson errors, see Kroc and Zumbo 2020).

**A realist** model may suppose that the true score is a construct score that is an objective value that is defined outside of the measurement operations (the actual value of the construct/attribute being measured; Borsboom and Mellenbergh (2002)). Therefore we can not define true scores in terms of observed scores without knowing how observed scores are calibrated true scores. The realist model for an observed score is superficially identical to Equation 2,

\[ X_{p} = T_p + E_{p}, \tag{3}\]

with the major difference being that \(\mathbb{E}[X|p] \neq T_p\). Note that Equation 3 is *not* the classical test theory model because it does not, by default, meet the assumptions of classical test theory that the laxed operationalist model does (Equation 2). The big difference between this realist model the laxed operationalist model is that the conditional expectation of errors is no longer zero in the realist approach, and thus systematic errors (i.e., biased estimates of true scores) can exist. In order to recover proper calibration of the observed scores to true scores, we would have to just assume that \(\mathbb{E}[E_p\mid p]=0\) (see flaws below).

When it comes to psychological constructs the realist approach has two major flaws to me.

Necessitates the existence of a construct score that is independent of the measure. Without sufficient evidence of a construct existing, we are adding additional complexity to our theory. Therefore it adds increased complexity into our theory.

Even if the construct is real and has a true construct score, we have no idea how the observed scores are calibrated to those true scores. The operationalist model produces a calibrated measure by definition whereas the realist model needs to add an additional assumption (i.e., conditioned on a given person, the expectation of measurement errors is zero) in order for the observed scores to be calibrated.

The operationalist model may have the biggest flaw so far:

- Since the measurand is defined by the measure itself, you can not really draw any meaningful inferences about anything outside of the measure.

There is a third option though!

## Respectful Operationalist Measurement Error Model

As put by the originator of Respectful Operationalism Elina Vessonen (2021) states,

[Respectful Operationalism] is the view that we may define a target concept in terms of a test, as long as that test is validated to incorporate common connotations of the target concept and the usability of the measure

Therefore we need to identify which observed scores are produced by tests that have been validated for extra-operational connotations for the concept of interest. We generally have inter-subjective agreement about the meanings of concepts like intelligence, anxiety, and depression mean outside of the measurement instrument. We can imagine that items contain content that are more or less related to those inter-subjective meanings. The extent to which the item content is relevant to the concept of interest reflects the item’s **content validity**. The true score can be defined as the conditional expectation of observed scores obtained from tests containing items that are sufficiently valid (this could be done from inter-rater assessment of the conceptual relatedness of items and concepts) can define the true scores. that contain sufficient validity. Let’s define an observed score produced by a sufficiently valid test as \(X_p^\star\). As an example, a measure of depression that produces observed scores from responses to the question, “what is your age?”, would not encompass common connotations of depression and therefore those observed scores would *not* in the set of *valid* observed scores. This new model can be defined as,

\[ X^\star_p = T_p + E_p \]

where

\[ T_p = \mathbb{E}[X^\star_p\mid p] \]

In this way, inferences about concepts can be made from observed scores since they hold inherent relevance by virtue of prior validation of the test’s extra-operational meaning.

## My current stance

Respectful operationalism appears to offer the practical advantages of operationalism such as calibrated measurement outcomes, without the limitation of being unable to draw meaningful inferences beyond the test. To consider myself a realist for a particular attribute/construct, then I would have to be sufficiently convinced of the tenants of construct validity described by Borsboom, Mellenbergh, and Van Heerden (2004): (1) the attribute exists (2) the construct causes variation in outcomes of the measurement procedure. Currently, I am not sufficiently convinced this is the case this is the case for most (if not all) psychological constructs; therefore, I am not comfortable making the commitments necessary to take on a realist position. Therefore, to me, a know-nothin grad student, Respectful Operationalism seems to best align with my current worldview.

Further reading, see these papers by Elina Vessonen (2019, 2021).

## References

*Psychological Review*111 (4): 1061–71. https://doi.org/10.1037/0033-295X.111.4.1061.

*Journal of Mathematical Psychology*98 (September): 102372. https://doi.org/10.1016/j.jmp.2020.102372.

*Philosophy Compass*14 (10): e12624. https://doi.org/10.1111/phc3.12624.

*Theory & Psychology*31 (1): 84–105. https://doi.org/10.1177/0959354320945036.