02-08-2025, 02:18 AM
On the limits of fitting complex models of population history to f-statistics
Robert Maier1*†, Pavel Flegontov1,2*†, Olga Flegontova2, Ulaş Işıldak2, Piya Changmai2, David Reich1,3,4,5*
https://elifesciences.org/articles/85492
Abstract
Quote: Our understanding of population history in deep time has been assisted by fitting
admixture graphs (AGs) to data: models that specify the ordering of population splits and mixtures,
which along with the amount of genetic drift and the proportions of mixture, is the only information
needed to predict the patterns of allele frequency correlation among populations. The space
of possible AGs relating populations is vast, and thus most published studies have identified fitting
AGs through a manual process driven by prior hypotheses, leaving the majority of alternative
models unexplored. Here, we develop a method for systematically searching the space of all AGs
that can incorporate non-genetic information in the form of topology constraints. We implement
this findGraphs tool within a software package, ADMIXTOOLS 2, which is a reimplementation of
the ADMIXTOOLS software with new features and large performance gains. We apply this methodology
to identify alternative models to AGs that played key roles in eight publications and find that
in nearly all cases many alternative models fit nominally or significantly better than the published
one. Our results suggest that strong claims about population history from AGs should only be made
when all well-fitting and temporally plausible models share common topological features. Our
re-evaluation of published data also provides insight into the population histories of humans, dogs,
and horses, identifying features that are stable across the models we explored, as well as scenarios
of population relationships that differ in important ways from models that have been highlighted in
the literature.
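To give a rough sense of why the model space is described as vast, the sketch below counts only rooted, leaf-labeled binary trees, which are admixture graphs with zero admixture events. It uses the standard formula (2n - 3)!! for n populations. This is a lower bound on the number of AGs and is not how findGraphs enumerates or searches models. The function name is an illustrative assumption, not part of ADMIXTOOLS 2.

```python
def count_rooted_binary_trees(n_leaves: int) -> int:
    """Count rooted, leaf-labeled binary trees with n_leaves tips.

    Uses the double factorial (2n - 3)!! = 3 * 5 * ... * (2n - 3).
    Hypothetical helper for illustration; admixture graphs with one or
    more admixture events are NOT counted, so this is only a lower bound.
    """
    if n_leaves < 1:
        raise ValueError("n_leaves must be a positive integer")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 6, 8, 10):
        print(n, count_rooted_binary_trees(n))
```

Even without admixture edges the count grows super-exponentially with the number of populations, and every allowed admixture event enlarges the space further.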
Conclusions
Quote: Sampling AG space is a useful method for modeling population histories, but finding robust and accurate
models can be challenging. As we demonstrated by revisiting a handful of published AGs and
re-analyzing the datasets used to fit them, f-statistics are usually insufficient for identifying uniquely
fitting AG models, making it necessary to incorporate other sources of evidence. This provides a
challenge to previous approaches for automated model building. We investigated several published
AG models and, in nearly all cases, found many alternative models, some of which are historically and
geographically plausible but contradict conclusions that were derived from the published models. To
conduct these analyses, we developed a method for automated AG topology optimization that can
incorporate external sources of information as topological constraints. This method is developed in
the ADMIXTOOLS 2 framework, which aside from AG modeling, implements many other methods for
population history inference based on f-statistics.
It is important to recognize that the key concern we have highlighted in this study—the fact
that there can often be thousands of different topologies that are equally good fits to the allele
frequency correlation patterns relating a set of populations—does not invalidate the use of allele
frequency correlation testing in many other contexts in which it has been applied to make inferences
about population history. For example, negative f3-statistics (‘admixture’ f3-statistics) continue to
provide unambiguous evidence for a history of mixture in tested populations, and f4- and D-symmetry
statistics remain powerful ways to evaluate whether a tested pair of populations is consistent with
descending from a common ancestral population since separation from the ancestors of two groups
used for comparison. The qpWave methodology remains a fully valid generalization of f4-statistics,
making it possible to test whether a set of populations is consistent with descending from a specified
number of ancestral populations (which separated at earlier times from a comparison set of populations).
In addition, the qpAdm extension of qpWave (Haak et al., 2015; Harney et al., 2021),
which allows for estimating proportions of mixtures for the tested population under the assumption
that we have data from the source populations for the mixture, remains a valid approach, unaffected
by the concerns identified here. Instead of relying on a specific model of deep population relationships,
qpAdm relies on an empirically measured covariance matrix of f4-statistics
for the analyzed populations, which is highly constraining with respect to estimation of mixture proportions but can
be consistent with a wide range of deep history models. All these methods are implemented in
ADMIXTOOLS 2.
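As a minimal sketch of the f3 and f4 tests mentioned above, the code below uses the textbook definitions f3(C; A, B) = mean[(c - a)(c - b)] and f4(A, B; C, D) = mean[(a - b)(c - d)] over per-SNP allele frequencies. It assumes exact population frequencies and toy data. Real analyses (including ADMIXTOOLS 2) work from finite samples with bias corrections and block-jackknife standard errors, none of which is shown here. The function and variable names are illustrative, not the package's API.

```python
from typing import Sequence


def f3(c: Sequence[float], a: Sequence[float], b: Sequence[float]) -> float:
    """Naive f3(C; A, B): average of (c - a) * (c - b) across SNPs.

    A clearly negative value is the 'admixture f3' signal: C is
    intermediate between A and B at many SNPs.
    """
    return sum((ci - ai) * (ci - bi) for ci, ai, bi in zip(c, a, b)) / len(c)


def f4(a: Sequence[float], b: Sequence[float],
       c: Sequence[float], d: Sequence[float]) -> float:
    """Naive f4(A, B; C, D): average of (a - b) * (c - d) across SNPs.

    A value near zero is consistent with (A, B) and (C, D) forming
    clades relative to each other.
    """
    return sum((ai - bi) * (ci - di)
               for ai, bi, ci, di in zip(a, b, c, d)) / len(a)


# Toy allele frequencies at four SNPs (made-up numbers for illustration).
pop_a = [0.0, 1.0, 0.0, 1.0]
pop_b = [1.0, 0.0, 1.0, 0.0]
pop_c = [0.0, 0.0, 1.0, 1.0]
pop_d = [1.0, 1.0, 0.0, 0.0]
# A hypothetical 50/50 mixture of A and B, with no drift after mixing.
mixed = [0.5 * x + 0.5 * y for x, y in zip(pop_a, pop_b)]

if __name__ == "__main__":
    print("f3(mixed; A, B) =", f3(mixed, pop_a, pop_b))      # negative
    print("f3(A; mixed, B) =", f3(pop_a, mixed, pop_b))      # positive
    print("f4(A, B; C, D)  =", f4(pop_a, pop_b, pop_c, pop_d))
    print("f4(A, C; B, D)  =", f4(pop_a, pop_c, pop_b, pop_d))
```

In this toy data the mixed population yields a negative f3, while f4(A, B; C, D) is zero and the rearranged f4(A, C; B, D) is not. Those are the kinds of contrast these tests are designed to detect. Whether an observed value differs significantly from zero depends on its standard error, which this sketch does not estimate.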
Finally, approaches that use AGs to adjust for the covariance structure relating a set of populations
without insisting that the particular AG model that is proposed is true can be useful, for example
for the purpose of analyzing shared genetic drift patterns of a group of populations that derive from
similar mixtures. One example was a study that attempted to test for different source populations for
Neolithic migrations into the Balkans after controlling for different proportions of hunter–gatherer
admixture (Mathieson et al., 2018). Another example was a study that attempted to study shared
ancestry between different East African forager populations after controlling for different proportions
of deeply divergent source populations (Lipson et al., 2022). However, with respect to the inferences
about deep history produced by AGs themselves, our results highlight the importance of caution in
proposing specific models of population history that relate a set of groups.