A Response to Jonathan Haidt’s response

sampling error
artifacts
In this post we will discuss how squared effect sizes such as η2 and R2 are distributed and the implications of this when conducting a meta-analysis
Author

Matthew B. Jané

Published

November 6, 2023

Defining a Squared Effect Size

Sometimes squared effect sizes are used in order to interpret results (e.g., R2, η2). However these squared indices are biased in small sample sizes. To see why this is the case we can describe the mathematically. Generally, lets say we have some observed effect size hi for a sample i. This can be modeled as a function of a population correlation θ and sampling error ei,

hi=θ+ei

Where the expectation of hi (i.e., the average) upon repeated sampling is equal to θ. More importantly for us, the expectation of sampling error ei across repeated samples would be zero. For the expectation of ei to be zero, e would need to be randomly distributed around zero with both positive and negative values. Let us see what happens to the model once we square both sides to obtain h2:

hi2=(θ+ei)2

and thus,

hi2=θ2+2θei+ei2

The squared effect sizes is now no longer an unbiased estimate of theta. The bias in hi2 can be seen by taking the expected value of both sides (note that the Var(ei) is the sampling error variance and E[] is the expected value or the mean across repeated samples):

E[hi2]=E[θ2+2θei+ei2]=θ2+2θ×E[ei]+E[ei2]=θ2+E[ei2]=θ2+(Var[ei]+E[ei]2)=θ2+Var[ei]

Since the sampling error variance is always positive (unless the sample size was infinite…), there will be a positive bias in expected value of hi2. It is important to note that the variance in sampling error (i.e., Var(ei)) will be larger with smaller sample sizes.

Simulated Example

Lets assume that in the population, the squared correlation between predicted and observed outcome (R2) is zero, that is, the independent variables (X1,X2, and X3) do not explain any variance in the dependent variable (Y). Lets simulate two studies, where one study has a sample size of 100 and the other has a sample size of 25. Then lets replicate these studies 5,000 times. In the first figure we will see that the mean value of R2 is not zero even though the population value is. And with the smaller sample sizes within each study (N=25), the mean R2 is larger. The plots below show the distribution of R2 using the ggdist package ().

Code
library(ggplot2)
library(ggdist)
library(ThemePark)
library(patchwork)
library(latex2exp)
set.seed(343)

# sample size
N <- 100

R21 <- R22 <- R23 <- c()
for(i in 1:5000){
  # Obtain simulated scores
  x1 <- rnorm(N,0,1)
  x2 <- rnorm(N,0,1)
  x3 <- rnorm(N,0,1)
  y <- rnorm(N,0,1)
  
  # linear regression model
  mdl <- lm(y ~ x1 + x2 + x3)
  R21[i] <- summary(mdl)$r.squared

  # linear regression model
  mdl <- lm(y ~ x1 + x2 + x3,subset = sample(1:N,25))
  R22[i] <- summary(mdl)$r.squared
}


h1 <- ggplot(data=NULL, aes(R21)) + 
  theme_barbie(barbie_font = FALSE) +
  stat_slabinterval(color = barbie_theme_colors['dark'], 
                    fill = barbie_theme_colors['light']) +
  scale_x_continuous(limits = c(0,.5)) +
  theme(aspect.ratio = 1, 
        panel.background = element_rect(fill = '#f1f9fa',color = '#f1f9fa'),
        plot.background = element_rect(fill = '#f1f9fa',color = '#f1f9fa'),
        )+
  labs(y = "",x=TeX("$R^2$"), 
       subtitle =  TeX(paste0('N=', 100, ', Ave $R^2$=', round(mean(R21),2))))

h2 <- ggplot(data=NULL, aes(R22)) + 
  theme_barbie(barbie_font = FALSE) +
  stat_slabinterval(color = barbie_theme_colors['dark'], 
                    fill = barbie_theme_colors['light']) +
  scale_x_continuous(limits = c(0,.5)) +
  theme(aspect.ratio = 1, 
        panel.background = element_rect(fill = '#f1f9fa',color = '#f1f9fa'),
        plot.background = element_rect(fill = '#f1f9fa',color = '#f1f9fa'),
        )+
  labs(y = "",x=TeX("$R^2$"), 
       subtitle = TeX(paste0('N=', 25, ', Ave $R^2$=', round(mean(R22),2))))
  
h1 + h2 & plot_annotation(theme = theme(plot.background = element_rect(color="#f1f9fa",fill ="#f1f9fa")))

If we were to run a meta-analysis on all 5,000 studies and found a meta-analytic average R2 of .13, we may conclude that there is a meaningful prediction/effect, however, this could simply be an artifact of sampling error. The plot below shows how the mean R2 changes as you vary the sample size.

Code
set.seed(343)

# sample size
N <- 100

R2 <- N <- c()
for(i in 1:5000){
  N[i] <- runif(1,10,70)
  # Obtain simulated scores
  x1 <- rnorm(N[i] ,0,1)
  x2 <- rnorm(N[i] ,0,1)
  x3 <- rnorm(N[i] ,0,1)
  y <- rnorm(N[i] ,0,1)
  # linear regression model
  mdl <- lm(y ~ x1)
  R2[i] <- summary(mdl)$r.squared
}


ggplot(data=NULL, aes(N,R2)) + 
  theme_barbie(barbie_font = FALSE) +
  stat_smooth( color = barbie_theme_colors['dark']) + 
  theme(aspect.ratio = 1, 
        panel.background = element_rect(fill = '#f1f9fa',color = '#f1f9fa'),
        plot.background = element_rect(fill = '#f1f9fa',color = '#f1f9fa'),
        )+
  labs(y = TeX("Mean $R^2$"),x="Sample Size", title = "")+ 
  expand_limits(y = 0)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Conclusion

In a meta-analysis, it is always best to use directional effect sizes that can be both positive and negative such as Cohen’s d’s or Pearson correlations. There are other strategies for meta-analyzing R2 values such as using individual person data meta-analysis (using the raw data for each study) or meta-analyzing the correlation matrix between predictors and outcomes first and then running the regression model on the meta-analytic correlation matrix (this would not allow for non-linear terms though). If it is necessary to meta-analyze R2 you will want to use the adjusted R2 value. See the plot below that shows the adjusted R2 across different sample sizes

Code
set.seed(343)

# sample size
N <- 100

R2 <- N <- c()
for(i in 1:5000){
  N[i] <- runif(1,10,70)
  # Obtain simulated scores
  x1 <- rnorm(N[i] ,0,1)
  x2 <- rnorm(N[i] ,0,1)
  x3 <- rnorm(N[i] ,0,1)
  y <- rnorm(N[i] ,0,1)
  # linear regression model
  mdl <- lm(y ~ x1)
  R2[i] <- summary(mdl)$adj.r.squared
}


ggplot(data=NULL, aes(N,R2)) + 
  theme_barbie(barbie_font = FALSE) +
  stat_smooth( color = barbie_theme_colors['dark']) + 
  theme(aspect.ratio = 1, 
        panel.background = element_rect(fill = '#f1f9fa',color = '#f1f9fa'),
        plot.background = element_rect(fill = '#f1f9fa',color = '#f1f9fa'),
        )+
  labs(y = TeX("Mean adj. $R^2$"),x="Sample Size", title = "")+ 
  ylim(c(-.4,.4)) +
  expand_limits(y = 0)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'