On the limits of fitting complex models of population history to f-statistics
#1
Robert Maier, Pavel Flegontov, Olga Flegontova, Ulaş Işıldak, Piya Changmai, David Reich

https://elifesciences.org/articles/85492

Abstract 
Quote: Our understanding of population history in deep time has been assisted by fitting
admixture graphs (AGs) to data: models that specify the ordering of population splits and mixtures,
which along with the amount of genetic drift and the proportions of mixture, is the only information
needed to predict the patterns of allele frequency correlation among populations. The space
of possible AGs relating populations is vast, and thus most published studies have identified fitting
AGs through a manual process driven by prior hypotheses, leaving the majority of alternative
models unexplored. Here, we develop a method for systematically searching the space of all AGs
that can incorporate non-genetic information in the form of topology constraints. We implement
this findGraphs tool within a software package, ADMIXTOOLS 2, which is a reimplementation of
the ADMIXTOOLS software with new features and large performance gains. We apply this methodology
to identify alternative models to AGs that played key roles in eight publications and find that
in nearly all cases many alternative models fit nominally or significantly better than the published
one. Our results suggest that strong claims about population history from AGs should only be made
when all well-fitting and temporally plausible models share common topological features. Our
re-evaluation of published data also provides insight into the population histories of humans, dogs,
and horses, identifying features that are stable across the models we explored, as well as scenarios
of population relationships that differ in important ways from models that have been highlighted in
the literature.

Conclusions
Quote: Sampling AG space is a useful method for modeling population histories, but finding robust and accurate
models can be challenging. As we demonstrated by revisiting a handful of published AGs and
re-analyzing the datasets used to fit them, f-statistics are usually insufficient for identifying uniquely
fitting AG models, making it necessary to incorporate other sources of evidence. This provides a
challenge to previous approaches for automated model building. We investigated several published
AG models and, in nearly all cases, found many alternative models, some of which are historically and
geographically plausible but contradict conclusions that were derived from the published models. To
conduct these analyses, we developed a method for automated AG topology optimization that can
incorporate external sources of information as topological constraints. This method is developed in
the ADMIXTOOLS 2 framework, which aside from AG modeling, implements many other methods for
population history inference based on f-statistics.

It is important to recognize that the key concern we have highlighted in this study—the fact
that there can often be thousands of different topologies that are equally good fits to the allele
frequency correlation patterns relating a set of populations—does not invalidate the use of allele
frequency correlation testing in many other contexts in which it has been applied to make inferences
about population history. For example, negative f3-statistics (‘admixture’ f3-statistics) continue to
provide unambiguous evidence for a history of mixture in tested populations, and f4- and D-symmetry
statistics remain powerful ways to evaluate whether a tested pair of populations is consistent with
descending from a common ancestral population since separation from the ancestors of two groups
used for comparison. The qpWave methodology remains a fully valid generalization of f4-statistics,
making it possible to test whether a set of populations is consistent with descending from a specified
number of ancestral populations (which separated at earlier times from a comparison set of populations).
In addition, the qpAdm extension of qpWave (Haak et al., 2015; Harney et al., 2021),
which allows for estimating proportions of mixtures for the tested population under the assumption
that we have data from the source populations for the mixture, remains a valid approach, unaffected
by the concerns identified here. Instead of relying on a specific model of deep population relationships,
qpAdm relies on an empirically measured covariance matrix of f4-statistics
for the analyzed populations, which is highly constraining with respect to estimation of mixture proportions but can
be consistent with a wide range of deep history models. All these methods are implemented in
ADMIXTOOLS 2.

Finally, approaches that use AGs to adjust for the covariance structure relating a set of populations
without insisting that the particular AG model that is proposed is true can be useful, for example
for the purpose of analyzing shared genetic drift patterns of a group of populations that derive from
similar mixtures. One example was a study that attempted to test for different source populations for
Neolithic migrations into the Balkans after controlling for different proportions of hunter–gatherer
admixture (Mathieson et al., 2018). Another example was a study that attempted to study shared
ancestry between different East African forager populations after controlling for different proportions
of deeply divergent source populations (Lipson et al., 2022). However, with respect to the inferences
about deep history produced by AGs themselves, our results highlight the importance of caution in
proposing specific models of population history that relate a set of groups.
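
For anyone who wants to see what the f3 and f4 statistics mentioned in the quoted conclusions actually compute, here is a minimal sketch in plain Python. It is not how ADMIXTOOLS 2 does it: the allele frequencies are made up, the function names are my own, and it leaves out the finite-sample bias corrections and block-jackknife standard errors that real analyses depend on. It only shows why a population that is a mixture of two others gives a negative f3.

```python
from statistics import fmean

def f3(c, a, b):
    """Naive f3(C; A, B): mean over SNPs of (c - a) * (c - b)."""
    return fmean((ci - ai) * (ci - bi) for ci, ai, bi in zip(c, a, b))

def f4(a, b, c, d):
    """Naive f4(A, B; C, D): mean over SNPs of (a - b) * (c - d)."""
    return fmean((ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d))

# Hypothetical allele frequencies at five SNPs (illustration only).
pop_a = [0.10, 0.80, 0.35, 0.60, 0.05]
pop_b = [0.70, 0.20, 0.55, 0.10, 0.45]
outgroup = [0.30, 0.50, 0.40, 0.20, 0.90]

# Assumption: C is an exact 50/50 mixture of A and B with no drift afterwards.
pop_c = [0.5 * x + 0.5 * y for x, y in zip(pop_a, pop_b)]

f3_c = f3(pop_c, pop_a, pop_b)               # negative: signal of mixture
f4_val = f4(pop_a, pop_b, pop_c, outgroup)   # sign flips if A and B are swapped

print(f"f3(C; A, B) = {f3_c:.4f}")
print(f"f4(A, B; C, O) = {f4_val:.4f}")
```

With an exact 50/50 mixture, each per-SNP term of f3 equals -0.25 * (a - b)^2, so the statistic cannot be positive. Real data add drift after the mixture, which pushes f3 upward, so a negative value is evidence of admixture but a positive one does not rule it out.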
#2
It's rather interesting: on small and medium-sized graphs, as the authors say, automated exploration (find_graphs) will almost always find better-fitting solutions than manually constructed graphs. On the other hand, for large, complex graphs (many populations plus admixture events), find_graphs will sometimes perform worse than manual construction. What happens is that find_graphs gets stuck in local minima, and given the complexity of the graph, getting out would require several changes happening simultaneously, which is too unlikely for it to stumble across by chance. For large graphs, it's best to construct a few graphs testing different hypotheses and then let find_graphs see whether it can find a better solution to each, rather than giving it free rein from the start.
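
To illustrate the local-minimum point, here is a toy sketch in Python. It is not find_graphs or its search algorithm: the integer positions stand in for graph topologies, the made-up `LANDSCAPE` values stand in for fit scores (lower is better), and the seeds stand in for a few hand-built hypothesis graphs. All names are my own.

```python
def hill_climb(score, start, neighbors, max_steps=1000):
    """Greedy local search: step to the best neighbor until none improves."""
    current = start
    for _ in range(max_steps):
        best = min(neighbors(current), key=score)
        if score(best) >= score(current):
            break  # no neighbor is strictly better: a local minimum
        current = best
    return current

# Toy rugged landscape with several local minima (lower = better fit).
LANDSCAPE = [9, 7, 5, 6, 8, 9, 7, 4, 3, 4, 6, 8, 9, 6, 2, 1, 0, 2, 5, 7]

def score(i):
    return LANDSCAPE[i]

def neighbors(i):
    # Only single-step changes are allowed, like one small edit to a graph.
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]

# One unconstrained run from an arbitrary start gets trapped early.
single = hill_climb(score, 0, neighbors)

# Several runs, each seeded from a different "hypothesis", keeping the best.
seeds = [0, 6, 13]
multi = min((hill_climb(score, s, neighbors) for s in seeds), key=score)

print(single, score(single))
print(multi, score(multi))
```

The single run stops at the first dip it reaches because escaping would require passing through worse scores, while seeding from several starting points lets at least one run land in the basin of the best solution.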
#3
Interesting video on aspects of this subject I'm currently watching.

Erin Molloy (Sankararaman Lab), Postdoc, Computer Science
Inst. for Quantitative & Computational Biosciences


“Advancing admixture graph estimation via maximum likelihood network orientation.”
UCLA QCBio Spring 2021 Research Seminars.

https://www.youtube.com/watch?v=Q2MMlSpoj74


Link to the software mentioned in the video.

https://github.com/sriramlab/OrientAGraph

OrientAGraph implements Maximum Likelihood Network Orientation (MLNO) within TreeMix, a popular package for estimating admixture graphs from f-statistics (and related quantities). In our experimental study, we found that MLNO either improved or else did not impact the accuracy of the original TreeMix search heuristic. To learn more, check out this paper with Arun Durvasula and Sriram Sankararaman.