When Evidence Synthesis Obscures: A Critique of Meta-Meta-Analyses
The problems in plain English
Stacking flaws on flaws
The foundational problem with meta-meta-analyses is that they stack methodological vulnerabilities at two levels: the primary studies and the meta-analyses themselves. When you synthesize meta-analyses quantitatively, you are not working with clean, raw data–you inherit the full methodological baggage of every meta-analysis you include.
This means your summary estimate is downstream of two levels of potential error: study-level errors (e.g., flawed study designs, biased outcome reporting, inadequate controls, miscalculations) and meta-analytic errors (compounded by meta-analytic decisions, inclusion/exclusion criteria, modeling choices, effect size miscalculations, study coding errors, etc.). It’s not just “garbage in, garbage out”—it’s garbage in, then reprocessed and repackaged by another potentially flawed process, and then aggregated again. Think of it like a synthetic CDO (CDO²) from the 2008 financial crisis. Risky subprime mortgages were bundled into CDOs, which were then bundled again into even more complex financial products—each layer obscuring the original risk (see the Substack article by James Heathers from which I took this analogy). Meta-meta-analyses do the same: the result may look polished, but it’s far removed from the original evidence and just as prone to collapse under scrutiny.
By the time a number is produced at the end of a meta-meta-analysis, it’s often impossible to disentangle where the bias originates or how severely it distorts the result (see Brandolini’s law, which states that the amount of energy needed to refute nonsense is an order of magnitude greater than that needed to produce it). The process compounds these issues but conceals them under the guise of a comprehensive evidence synthesis. Rather than clarifying what we know, meta-meta-analyses risk amplifying what we misunderstand.
Heterogeneity at the wrong level
In a standard meta-analysis, heterogeneity denotes the variation in effect sizes across individual studies—variation we can often explain through study-level characteristics like population differences, intervention types, or methodological quality. But in a meta-meta-analysis, heterogeneity refers to differences between meta-analyses. It doesn’t tell us much–it’s confounded by differences in analytic decisions, inclusion/exclusion criteria, or the year in which the literature search was conducted, rather than by substantive differences between studies. Additionally, if heterogeneity is low, it might give the misleading impression that the literature is consistent and homogeneous, even though the homogeneity is simply a product of an additional layer of abstraction.
In effect, quantifying heterogeneity at the meta-meta-analysis level is meaningless and strips away the meaningful variation across studies.
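To see how this extra layer of abstraction manufactures apparent consistency, here is a minimal simulation sketch (all numbers are invented, and each meta-analysis is pooled with simple inverse-variance weights for brevity). The same sixty heterogeneous primary studies are analyzed twice: once directly, and once after being collapsed into six meta-analytic means. The estimated heterogeneity across the six means comes out roughly an order of magnitude smaller than the heterogeneity across the sixty studies, simply because averaging ten studies at a time shrinks the between-study variance.

```python
# Sketch: the same heterogeneous studies look far more homogeneous once
# they are collapsed into meta-analytic means. All values are simulated.
import numpy as np

rng = np.random.default_rng(1)

def dl_tau2(y, v):
    """DerSimonian-Laird estimate of between-unit variance."""
    w = 1 / v
    mu_fe = np.sum(w * y) / np.sum(w)        # fixed-effect pooled mean
    q = np.sum(w * (y - mu_fe) ** 2)         # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (q - (len(y) - 1)) / c)

# 60 primary studies with substantial true heterogeneity (tau = 0.4)
k = 60
true_effects = rng.normal(0.3, 0.4, k)
v_i = rng.uniform(0.02, 0.08, k)              # within-study sampling variances
y_i = rng.normal(true_effects, np.sqrt(v_i))  # observed effect sizes

print("heterogeneity across primary studies:", round(dl_tau2(y_i, v_i), 3))

# The same studies, partitioned into 6 meta-analyses of 10 studies each
mu_m, var_m = [], []
for block in np.split(np.arange(k), 6):
    w = 1 / v_i[block]
    mu_m.append(np.sum(w * y_i[block]) / np.sum(w))   # pooled mean of the block
    var_m.append(1 / np.sum(w))                       # its (naive) variance

print("heterogeneity across meta-analytic means:",
      round(dl_tau2(np.array(mu_m), np.array(var_m)), 3))
```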
Violated independence
A core assumption of meta-analysis is that the effect sizes being pooled are statistically independent. Meta-meta-analyses routinely violate this by including multiple meta-analyses that draw from overlapping sets of primary studies. This redundancy creates dependencies in the data–artificially inflating the weight of certain studies and distorting both estimates and standard errors.
Study overlap is often assessed using the Corrected Covered Area (CCA) index, which quantifies the degree of duplication across meta-analyses. However, instead of being incorporated into the statistical model, CCA is typically used only as a filter: meta-analyses with “too much” overlap are excluded based on arbitrary thresholds—slight (0–5%), moderate (6–10%), high (11–15%), or very high (>15%).
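For concreteness, here is a minimal sketch of how the CCA is computed from a citation matrix of unique primary studies by meta-analyses (the matrix below is made up for illustration): the index is the number of duplicate inclusions, relative to the maximum number of duplicates that could have occurred.

```python
# Sketch: Corrected Covered Area (CCA) from a hypothetical citation matrix.
# Rows = unique primary studies, columns = meta-analyses,
# 1 = that study is included in that meta-analysis.
import numpy as np

citation_matrix = np.array([
    [1, 1, 1],   # study A appears in all three meta-analyses
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
])

N = citation_matrix.sum()      # total inclusions, counting duplicates
r, c = citation_matrix.shape   # r unique studies, c meta-analyses

cca = (N - r) / (r * c - r)    # duplicate inclusions / maximum possible duplicates
print(f"CCA = {cca:.1%}")      # 40.0% -> "very high" overlap on the usual scale
```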
But filtering is not modeling. It does nothing to correct for the underlying statistical dependence among effect sizes. Even small amounts of undetected or uncorrected overlap can lead to inflated precision and misleading conclusions. Meta-meta-analyses that ignore this problem are reporting uncertainty that is unjustifiably narrow, and conclusions that are more confident than the data allow.
What Are We Even Estimating?
At the heart of any serious statistical analysis is a clearly defined estimand–the specific quantity of scientific interest. In a well-designed meta-analysis, the estimand might be something like the average treatment effect across a defined population of studies, under ideal conditions, or within a particular subgroup. Even then, identifying and justifying the estimand is no trivial task.
In a meta-meta-analysis, the estimand becomes obscure. What exactly does the mean of mean effect sizes represent? It’s not the average effect across all studies–those are hidden inside the meta-analyses. It’s not the average effect across all populations, interventions, or outcomes–those are inconsistently defined and often incommensurable across meta-analyses.
Instead, we’re left with an undefined quantity that bears no clear relationship to any real-world intervention or scientific question. It floats free of context, untethered to a well-specified target.
This lack of a coherent estimand makes interpretation not just difficult, but arguably meaningless. Without knowing what question you’re answering, the answer–however precisely computed–is at best ambiguous, and at worst, misleading.
The only workable solution: meta-analyze the primary studies themselves
If your goal is to synthesize the research in a domain that has already been meta-analyzed multiple times, the most principled approach is also the most labor-intensive: extract all the unique primary studies from each meta-analysis and conduct a standard meta-analysis yourself. This allows for consistent inclusion criteria, uniform effect size calculation, proper handling of duplicate studies, and uniform analytic decisions–none of which can be guaranteed when simply pooling published meta-analyses.
Yes, this takes time. Yes, it requires manual effort. But good science isn’t supposed to be fast. When you synthesize at the study level, you maintain control over what’s included, how it’s coded, and how quality is assessed. You can examine moderator variables properly, assess publication bias, and apply up-to-date modeling strategies. You’re synthesizing evidence, not recycling others’ summaries.
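As a sketch of what that looks like once the de-duplicated study list is in hand, here is a minimal random-effects meta-analysis over the primary studies themselves, using the DerSimonian-Laird estimator (the effect sizes and variances below are placeholders, not real data).

```python
# Sketch: a standard random-effects meta-analysis over de-duplicated primary
# studies (DerSimonian-Laird). Effect sizes and variances are placeholders.
import numpy as np

# One row per unique primary study after de-duplication across meta-analyses.
y = np.array([0.41, 0.15, 0.32, 0.08, 0.55, 0.22])   # effect sizes
v = np.array([0.04, 0.02, 0.05, 0.03, 0.06, 0.02])   # sampling variances

# Between-study variance (tau^2), DerSimonian-Laird
w_fe = 1 / v
mu_fe = np.sum(w_fe * y) / np.sum(w_fe)
q = np.sum(w_fe * (y - mu_fe) ** 2)
c = np.sum(w_fe) - np.sum(w_fe ** 2) / np.sum(w_fe)
tau2 = max(0.0, (q - (len(y) - 1)) / c)

# Random-effects pooled estimate and 95% CI
w_re = 1 / (v + tau2)
mu_re = np.sum(w_re * y) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
print(f"tau^2 = {tau2:.3f}, pooled effect = {mu_re:.3f} "
      f"[{mu_re - 1.96 * se_re:.3f}, {mu_re + 1.96 * se_re:.3f}]")
```

In practice you would reach for an established package (e.g., metafor in R) and, where studies contribute multiple effect sizes, a multilevel model or robust variance estimation; the point of the sketch is only that the unit of analysis is the primary study, not someone else's summary.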
Meta-meta-analyses trade accuracy for expediency. If your aim is rigor, there’s no substitute for doing the hard work.
Modeling the Madness
Let’s formalize a statistical model that attempts to represent what a meta-meta-analysis is doing. Suppose we try to model each mean effect size \(\hat{\mu}_{m}\) from meta-analysis \(m\) as follows:
\[ \hat{\mu}_{m} = \psi + \xi_{m} + \varepsilon_{m} \]
Where:
- \(\psi\) is the mean of the mean effect sizes across meta-analyses (the grand mean),
- \(\xi_m \sim \mathcal{N}(0, \phi^2)\) captures variability between meta-analyses,
- \(\varepsilon_{m} \sim \mathcal{N}(0, \sigma_m^2)\) is the deviation of meta-analysis \(m\) due to within-meta-analysis variance. This term ultimately encompasses both between-study variance and sampling error.
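Mechanically, this model is easy to simulate and fit, which is part of what makes its output look credible. Here is a minimal sketch (all numbers invented) using crude, unweighted method-of-moments estimates of \(\psi\) and \(\phi^2\).

```python
# Sketch: simulate the two-level model and recover psi and phi^2.
# All numbers are invented; estimators are deliberately simple (unweighted).
import numpy as np

rng = np.random.default_rng(7)

M = 12                                  # number of meta-analyses
psi_true, phi_true = 0.30, 0.15         # grand mean, between-meta-analysis SD
sigma_m = rng.uniform(0.05, 0.12, M)    # reported SE of each meta-analytic mean

# mu_hat_m = psi + xi_m + eps_m
mu_hat = psi_true + rng.normal(0, phi_true, M) + rng.normal(0, sigma_m)

# Method-of-moments: the sample variance of the mu_hat's estimates
# phi^2 + mean(sigma_m^2), so subtract the average reported variance.
psi_hat = mu_hat.mean()
phi2_hat = max(0.0, mu_hat.var(ddof=1) - np.mean(sigma_m**2))
se_psi = mu_hat.std(ddof=1) / np.sqrt(M)

print(f"psi_hat = {psi_hat:.3f} (SE {se_psi:.3f}), phi2_hat = {phi2_hat:.3f}")
```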
At first glance, this looks like a reasonable hierarchical model. But in practice, it quickly becomes untenable.
Problem 1: The Estimand Is Unclear
What exactly does \(\psi\) represent? In this model, it’s defined as \(\psi = \mathbb{E}(\hat{\mu}_m)\)—the expected value of mean effect sizes across meta-analyses. But this immediately raises a deeper problem: what is the expectation over? That is, what is the population of meta-analyses being sampled from? Meta-analyses differ in scope, inclusion criteria, populations, outcome definitions, and effect size metrics. There’s no well-defined superpopulation of meta-analyses to anchor this expectation, making \(\psi\) hard to interpret in any meaningful way.
Rather than capturing a coherent treatment effect, \(\psi\) is a statistical artifact, not a well-defined scientific quantity.
Problem 2: \(\phi^2\) Is Unexplainable
In this model, \(\phi^2\) captures the variance in effect size estimates between meta-analyses. In principle, this is analogous to heterogeneity (\(\tau^2\)) in a regular meta-analysis, which we often explore using study-level moderators. But in meta-meta-analysis, we lose that resolution entirely.
First, we typically don’t have access to the raw study data needed to evaluate moderators meaningfully. Second–and more subtly–even when individual meta-analyses stratify by useful variables (e.g., age, dosage, study quality), that information becomes highly coarse or inconsistent at the meta-meta level. One meta-analysis may define “high quality” based on blinding; another based on sample size; a third may not define it at all. These moderator definitions are not harmonized, and their meaning shifts between meta-analyses.
As a result, \(\phi^2\) becomes a kind of unexplained, uninterpretable heterogeneity–absorbing all the differences in methods, populations, metrics, and contexts that vary across meta-analyses. It’s a variance you can estimate, but not interpret. The model gives you a number, but no meaningful insight.
Problem 3: Assumes Independence, Ignores Overlap
Although this point might seem trivial given that both \(\psi\) and \(\phi^2\) already lack clear meaning, it’s worth emphasizing that the model also violates a basic statistical assumption: independence. The model treats each meta-analysis estimate \(\hat{\mu}_m\) as if it were generated from a distinct, non-overlapping set of primary studies. In reality, that’s rarely the case.
Primary studies are often included in multiple meta-analyses, sometimes dozens of them. This overlap introduces dependencies among the \(\hat{\mu}_m\) values that are not modeled—leading to two serious problems. First, duplicated studies are effectively given more weight, biasing the overall estimate \(\psi\). Second, standard errors and confidence intervals become artificially narrow because the model assumes more information than actually exists.
While metrics like the Corrected Covered Area (CCA) can detect this redundancy, they are usually applied only as loose diagnostics—if at all—and not integrated into the analysis itself. Even modest levels of unmodeled overlap can skew results and inflate certainty.
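A small simulation sketch illustrates the size of the problem (the setup is invented and deliberately simplified: equal-precision studies, unweighted means). Eight meta-analyses are built from the same pool of forty primary studies, so they overlap heavily by construction; the naive standard error of the mean of means, which treats the eight estimates as independent, comes out several times smaller than the actual sampling variability of that estimate.

```python
# Sketch: overlapping primary studies make the naive SE of the "mean of means"
# far too small. Setup is simulated and deliberately simplified.
import numpy as np

rng = np.random.default_rng(3)

def one_overview(n_studies=40, n_meta=8, per_meta=25):
    # One pool of primary studies: true effect + sampling error
    y = rng.normal(0.3, 0.3, n_studies) + rng.normal(0, 0.1, n_studies)
    # Each meta-analysis samples 25 of the same 40 studies, so any two
    # meta-analyses share studies by construction.
    mu = np.array([y[rng.choice(n_studies, per_meta, replace=False)].mean()
                   for _ in range(n_meta)])
    psi_hat = mu.mean()                            # "mean of means"
    naive_se = mu.std(ddof=1) / np.sqrt(n_meta)    # treats the 8 means as independent
    return psi_hat, naive_se

reps = np.array([one_overview() for _ in range(2000)])
print("average naive SE of psi_hat:     ", round(reps[:, 1].mean(), 4))
print("actual SD of psi_hat across runs:", round(reps[:, 0].std(ddof=1), 4))
```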
In short, meta-meta-analyses don’t just suffer from vague estimands and uninterpretable variance—they also fail to meet the most basic statistical requirements for valid inference.
Conclusion
By stacking flawed summaries on top of each other, meta-meta-analyses obscure more than they reveal. Without a clear estimand, consistent methods, or proper handling of dependencies, the numbers they produce may look rigorous–but lack real meaning.
If we care about getting the science right, there’s no substitute for going back to the primary studies. Synthesis is valuable, but shortcuts aren’t.