key: cord-0297871-egh6p5k6
authors: Cárdenas, Pablo; Corredor, Vladimir; Santos-Vega, Mauricio
title: Genomic epidemiological models describe pathogen evolution across fitness valleys
date: 2022-05-11
journal: bioRxiv
DOI: 10.1101/2021.12.16.473045
sha: 1358edabaa3f66b5f1779095fa8d00fa41dba737
doc_id: 297871
cord_uid: egh6p5k6

Genomics is fundamentally changing epidemiological research. However, systematically exploring hypotheses in pathogen evolution requires new modeling tools. Models intertwining pathogen epidemiology and genomic evolution can help understand processes such as the emergence of novel pathogen genotypes with higher transmissibility or resistance to treatment. In this work, we present Opqua, a flexible simulation framework that explicitly links epidemiology to sequence evolution and selection. We use Opqua to study determinants of evolution across fitness valleys. We confirm that competition can limit evolution in high transmission environments and find that low transmission, host mobility, and complex pathogen life cycles facilitate reaching new adaptive peaks through population bottlenecks and decoupling of selective pressures. The results show the potential of genomic epidemiological modeling as a tool in infectious disease research.

Genomic epidemiology has become a powerful tool in the study and control of infectious disease spread. By sequencing the genomes of pathogens in the field, researchers can access a living ledger of transmission and evolution in real-time, which aids understanding of disease spread and can inform public health policy. Sequencing of pathogen genomes has been used to monitor evolution, trace local chains of transmission, and study the origins of outbreaks in real-time for Ebola (1-3), malaria (4) (5) (6) , influenza (7, 8) , and COVID-19 (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) , among others.

An arsenal of tools and initiatives have been developed to harness the full potential of data generated through genomic surveillance (4, (20) (21) (22) (23) .

Despite all its advantages, genomic epidemiology is hampered by its ability to experimentally explore hypotheses beyond observational data. This is a problem shared by most epidemiological research. Fortunately, mathematical and computational modeling provide frameworks for investigating the effect of individual variables through the simulation of null models and perfectly-controlled, targeted interventions, which would otherwise be impossible in practice (12, (24) (25) (26) . However, most traditional modeling frameworks are not well-suited to work with genomic data and the kinds of evolutionary questions unlocked by genomic epidemiology. For example, different computational tools have been developed to simulate the evolution of pathogen genetic sequences (27) (28) (29) , but in these cases, the tools are not coupled to epidemiological models that follow disease spread.

More recently, there has been interest in developing tools to combine epidemiological models with genetic sequence simulators [such as mrc-ide.github.io/SIMPLEGEN/index.html, (30) (31) (32) (33) ].

While these approaches can be powerful tools to answer questions about population genetics, they do not consider the effect of genetic sequences on disease dynamics. This makes them unable to account for selection in pathogen evolution. More general tools are available to simulate the evolution of organism populations in forward-time (34) (35) (36) . Although these may be adapted to use in epidemiology, this requires significant work and setup, especially when tailoring to different kinds of disease transmission and intervention. Further recent work has been done to explore evolving epidemiological models in a custom, disease-specific context (37) . Nevertheless, there currently is no out-of-the-box solution for building flexible, easy-to-use simulations of disease spread with pathogens capable of evolving and influencing their epidemiology through natural selection.

Simulating selection and evolution becomes particularly pressing as we ask questions about how epidemiology affects the appearance of advantageous traits within a pathogen population.

Specifically, we are interested in how epidemiological contexts can shape competition and clonal interference between pathogens to hamper the evolution of novel traits separated by Cárdenas, Corredor, & Santos-Vega, post-review preprint 10 May, 2022 : 10.1101 /2021 p. 3/49 fitness valleys. Significant work has examined the evolutionary dynamics of crossing fitness valleys, a process termed stochastic tunneling (38) (39) (40) (41) . Other work has examined the environmental factors that shape these adaptive landscapes (42) . Still, quantitative research into stochastic tunneling in the context of a spreading infection has only recently become an object of study (37) . This is notable given the interest in the evolutionary biology of infectious diseases sparked by the COVID-19 pandemic. The evolutionary trajectory of its causing agent, SARS-CoV-2, has brought interest into the role of epidemiology in shaping evolutionary trajectories across fitness valleys. Thus, epidemiology becomes crucial in understanding future viral evolution and vaccine effectiveness. The emergence of novel, evolutionarily distant SARS-CoV-2 variants such as Omicron is a concerning testament to the importance of these questions, as this new variant bears multiple compensatory mutations and epistatic interactions that individually decrease fitness, but have a synergistic effect on transmission (43, 44) . Studying the eco-epidemiological determinants of evolution is also key to understanding the emergence of drug resistance in diseases such as tuberculosis and malaria, where fitness costs of resistant mutations are offset by a succession of compensatory mutations (45, 46) .

In light of this, we developed Opqua (github.com/pablocarderam/opqua) as an epidemiological modeling framework for pathogen population genetics and evolution. We apply Opqua to explore how different aspects of the epidemiological environment and infection biology constrain or enhance the ability of pathogens to escape local fitness peaks and explore new evolutionary space in adaptive landscapes.

In order to address questions in infectious disease evolution, we developed a library for flexible epidemiological simulations of evolving pathogens. The library is named Opqua, after the word for disease, cause, or reason in the Chibcha family of languages spoken by members of the late Muysca Confederation in modern-day Colombia (47) .

Cárdenas, Corredor, & Santos-Vega, post-review preprint 10 May, 2022 10 May, : 10.1101 10 May, /2021 p. 4/49

Opqua models account for pathogens with arbitrary genomes and genomic alphabets spreading through structured populations of hosts and/or vectors, which may acquire immunity and/or undergo interventions such as drug treatment or vaccination. (B) Eight possible kinds of events can occur in simulations, the rates of which may be influenced by pathogen genome sequence. (C) Models may include interventions at different moments in time, which in this example include addition of new hosts and vectors infected with pathogens of specific genome sequences, altering epidemiological conditions to increase host-vector contact rate, administration of a treatment that kills all pathogens except those with a specific, resistant genotype, and finally, vaccination against the same genotype. (D) Models may have arbitrary, complex metapopulation structures, here showing the spread of disease from two low-transmission populations ("A" and "B") into five interconnected populations with high transmission (labeled 1-5), while an additional isolated population remains disease-free. (E) Models can simulate evolution and selection through different ways, here showing a population of pathogens that evolve through direct intra-host competition and de novo mutation from an initial, low-fitness genotype ("BADD") to a high-fitness, optimal genotype ("BEST").

Cárdenas, Corredor, & Santos-Vega, post-review preprint 10 May, 2022 10 May, : 10.1101 10 May, /2021 Opqua stochastically simulates interconnected populations of agents representing hosts and/or vectors, which may be infected with pathogens ( Fig. 1) . Simulations occur through different kinds of demographic, epidemiological, immunological, and genomic events that affect the state of the system, and thus the probability of the next event ( Fig. 1B; Fig. S1 ). User-defined epidemiological parameters affect the rate of each event and can vary both across populations and throughout the course of each simulation, allowing the user to model different ecological events or public health interventions (Fig. 1C ). Hosts and vectors may acquire immunity to specific pathogen genome sequences, and may undergo demographic change by natural death and reproduction, as well as migration within complex population structures (Fig. 1D ). 

Using Opqua's evolutionary modeling capabilities, we examined the epidemiological determinants of pathogen stochastic tunneling across fitness valleys (39, 41) . We began by examining whether competition for transmission between pathogens in high transmission environments can inhibit stochastic tunneling. This hypothesis has been proposed as a partial explanation for why antimalarial resistance repeatedly emerges outside of the high-transmission environments in sub-Saharan Africa that contribute the bulk of global malaria cases and treatment (37, 48, 49) . Cárdenas, Corredor, & Santos-Vega, post-review preprint 10 May, 2022 : 10.1101 /2021 p. 6/49

A model system contains pathogen genomes with eight different bi-allelic loci. Fitness follows a valley-like distribution according to a genome's "B" alleles, where the "0-B" wild-type sequence outcompetes all mutant genomes except for the "8-B" sequence, which encodes for higher competitive fitness, lower recovery rates, and drug resistance. (B) Over time, tracking infections in 1000 hosts shows that pathogens in a high-transmission environment are eradicated during a drug administration event. However, drug effectiveness is halved in a low transmission environment, as resistant mutants can evolve. (C) An optimal, low contact rate favors the evolution of the drug resistance trait in this system. (D) By tracking the genetic composition of the pathogen population over time, we see that at high contact rates (β), most mutants remain evolutionarily close to the wild-type, whereas low contact rates lead to the breakthrough and population sweep of distant mutants with high fitness. (E) Even when removing all mutant advantages, the frequency and duration of mutant-only host infections is greater at low contact rates, but is counteracted by low infected host totals. The interplay between these two forces determines stochastic tunneling across the fitness valley. We established a simple model in which mutating pathogens compete with each other for transmission between hosts within a single population. Pathogen fitness follows a valley-like pattern in which an initial fitness peak of "wild-type" genomes is separated by a sequence of mutations from a second, higher fitness peak of "resistant" genomes capable of surviving drug treatment and causing longer infections ( Fig. 2A) . When pathogens with different genomes coexist within the same individual, those with more fit genomes have a greater probability of transmission and mutation, due to their greater share of the intra-host pathogen population.

After letting the system evolve for a certain amount of time, we apply a drug treatment that kills all pathogens, save for those with the resistant genome. We can track the genomic composition of populations over time to observe the evolutionary paths taken in each environment (Fig. 2D ). When transmission rates are high, mutant pathogens are rarely able to survive for long within hosts without wild-type pathogens coinfecting and outcompeting them. In contrast, low transmission environments allow mutants to evolve for longer without interference from wild-type pathogens. This leads to a greater frequency of distant mutants in low transmission environments, which eventually result in the appearance and fixation of resistant pathogens. Thus, low transmission environments generate the cryptic genetic variation that has been shown to underlie evolution to new adaptive peaks in other experimental and theoretical models (50, 51) .

However, by varying the contact rate, we can see that the relationship between transmission and evolution is not linear (Fig. 2C) . Instead, the likelihood of pathogens evolving resistant genotypes is a result of two opposing forces. On one hand, there is the likelihood of having hosts infected by low-fitness mutants for long periods of time without wild-type competitors that interfere with further evolution towards more distant genotypes. This likelihood is greater in low transmission environments where competition is weak, as long as transmission is strong enough to support a stable pathogen population (Fig. 2E , top and middle). On the other hand,

we have the likelihood of survival for strains that reach new fitness peaks. This likelihood is greater in high transmission environments, which avoid stochastic extinction events (Fig. 2E , bottom). The balance between these opposed dynamics determines an optimal transmission level for evolution (Fig. 2C) , dependent on the features of the adaptive landscape. In particular, low transmission regimes can only increase stochastic tunneling through fitness valleys if the mutational distance between peaks and the drop in fitness between them is large enough to require evolution to occur throughout a transmission chain of multiple hosts (Fig.   S2 ). Long chains of transmission facilitate tunneling through long, deep valleys due to multiple bottlenecks and opportunities to minimize pathogen competition. However, shorter valleys not only permit stochastic tunneling to occur in short chains of transmission (or even single hosts), but also imply a more restricted number of mutational paths through the valley. This decreases the likelihood of low transmission regimes leading to stochastic tunneling due to their smaller pathogen populations and thus net mutation rates. In this way, high transmission environments are more effective at taking single evolutionary steps with significant fitness differences, whereas low transmission environments are more effective at taking multiple, successive steps with minor differences in fitness. Our results show these small steps are ultimately more effective at tunneling through long evolutionary distances with low overall fitness.

We next focused on the evolutionary dynamics of pathogens mutating into low fitness regimes. pathogen population, even when the fitness cost of mutations is considerably high (Fig. 3B ).

The model also shows an optimal transmission intensity at which "mutant-only" infections peak for any given mutation fitness cost (Fig. 3C ), similar to the results of the fitness valley model. The optimal transmission level for mutant-only infections decreases as the competitive fitness disadvantage of mutants becomes more prominent. We also varied the infected host fraction while keeping contact rates constant by changing the host recovery rates, which showed essentially the same dynamics of mutant prevalence as seen from the contact rates studied (Fig. S4 ). This shows that the differences in mutant prevalence at different contact rates arise mainly from competition between wild-type and mutant pathogens.

We then constructed a stochastic model using Opqua to study the same type of behavior in a genomic context. This model had epidemiological parameters identical to those in the compartment model, with the only difference of allowing hosts to become infected with pathogens containing complex genomes simulating a peptide sequence of interest with 500 amino acids (Fig. 3D ). Each mutation decreased pathogen fitness by a set factor, resulting in an exponentially-decreasing fitness landscape away from the single peak of the wild-type genome sequence. This descending fitness regime, analogous to an infinitely long valley, can be used to simplify the study of evolution in the initial part of a fitness valley.

By plotting the average of multiple simulation replicates at different contact rates and mutant fitness costs, we reproduced the behaviors observed in the compartment model, with two differences (Fig. 3E, 3F) . First, contact rates that are too low in the genomic Opqua model result in stochastic extinction of all pathogens (counted as a zero fraction of mutants). Second, the genomic model shows greater mutant prevalence at low mutation fitness costs and higher contact rates than the equivalent parameters in the compartment model. This is a consequence of genetic drift: if different genomes have comparable fitness and transmission is not a scarce resource, the wild-type genome becomes just one among a large number of similar, competing genotypes, and is subject to stochastic extinction like its peers.

Furthermore, genomic models allow us to analyze the sequences and distributions of genomes that evolved throughout the simulations and construct a quantitative description of the different kinds of evolutionary space explored. For instance, high transmission regimes always result in a higher number of genomes explored, due to higher numbers of infections providing a greater net mutation frequency (Fig. 3G ). However, even though low contact rates result in fewer mutants, those mutants can evolve further from the wild-type starting point even at high fitness costs to mutation (Fig. 3H) , echoing the results of the fitness valley model. The resulting genomes at the end of the simulation are also more different from each other at low contact rates (Fig. 3I) . Thus, genomic epidemiological models more accurately capture the behavior of systems with high genetic dimensionality and provide a quantitative description of how transmission environment and within-host competition shape pathogen cryptic variation. 

With these metrics of genomic evolution, we examined other kinds of epidemiological determinants of evolution beyond transmission intensity, starting with host population size and inter-population mobility (Fig. 4) .

Increasing population sizes initially leads to a greater fraction of mutants among the pathogen population, since small host populations lead to stochastic extinction events for all pathogens (including mutants, which is counted as zero; Fig. 4A ). At slightly larger population sizes, the pathogen population is stable enough to avoid complete extinction, but small enough to be susceptible to genetic drift and allow slightly unfit mutants to fixate, driving wild-type pathogens extinct. At even larger host population sizes, the pathogen population is less susceptible to genetic drift and the stochastic extinction of more fit wild-type pathogens, leading to a slightly diminished equilibrium fraction of mutants. We see similar dynamics by varying host inter-population migration rates in a metapopulation model with a fixed number of total hosts (Fig. 4B ). Low mobility rates make the metapopulation system behave like a series of small, isolated populations, while high mobility makes the system behave like a single, well-mixed, large population.

However, different behaviors emerge when examining genomes. While larger populations lead to more infections and thus a greater number of unique genomes explored (Fig. 4C) , increased mobility does not increase infected hosts, and therefore unique genomes, beyond a certain point (Fig. 4D ). The stochastic extinction of high-fitness wild-type pathogens in small populations allows for an early peak in genome distance from wild type, before descending to a slightly lower stable value for larger populations (Fig. 4E) . Intriguingly, the most notable increase in the maximum evolutionary distance explored comes from low host mobility systems, provided host mobility is high enough to sustain continued infection ( This may occur, for instance, if a gene fulfills roles in host-pathogen interactions or is subject to different kinds of immune responses within both hosts and vectors, as is common in many arboviruses (57, 58) . It may also be a consequence of antimicrobial use if a resistant allele with fitness costs is selected for in hosts (but not vectors) due to drug treatment (45, 46) .

Unsurprisingly, increasing mutation rates always increase mutant prevalence among pathogens (Fig. 5H) . Consider a regime of host selection only (Fig. 5H) The effect of these processes is replicated symmetrically when considering selection regimes in both hosts and vectors (Fig. S8H) . In this case, the greatest long-distance evolution occurs when a mismatch in inoculum sizes is present (this is to say, a high inoculum from hosts to vectors Additionally, recombination can increase the maximum evolutionary distance from wild-type explored by pathogens. In a host-vector system with selection in hosts only, the effect of recombination on evolutionary distance is largest if recombination occurs within vectors, a life stage where the new genotypes will not be outcompeted by their parents (Fig. 5I ). If selection occurs with separate fitness landscapes in hosts and vectors, the effect becomes symmetrical for both life stages, due to recombination of genomes clustered around each of the two peaks (Fig.   S8I ). Nevertheless, recombination in hosts can also increase the maximum evolutionary distance in host-host models (Fig. S5I) . The magnitude of this effect is dependent on the mean inoculum size (the infection effective population size), as small inoculums remove the within-host variation from which recombination can occur (Fig. S6 ). This shows recombination can also play a relevant role in stochastic tunneling of infections with host-host transmission. 

In this work, we present Opqua, a flexible computational modeling framework capable of simulating genomic epidemiology. We use it to study pathogen competition and evolutionary dynamics in fitness valleys and descending fitness regimes with a quantitative, systematic approach. This allows us to better understand how pathogens acquire new fitness-conferring traits that require multiple, separate epistatic mutations, a process known as stochastic tunneling (39, 41) . We confirm that competition between pathogens in settings with high disease prevalence can pose a significant barrier to evolution in certain adaptive landscapes, as suggested by others (37) . We establish the potential of genomic epidemiological models to provide more complete descriptions of simulated evolution. We then use these descriptions to This can be seen in malaria parasites, which undergo strong population bottlenecks in mosquitoes and within human livers, before intra-host competition in the bloodstream occurs.

These bottlenecks reduce the number of individual pathogens by ten orders of magnitude from its highest point within the human bloodstream (59) , and have been shown to qualitatively alter the pathogen's evolutionary dynamics (60 Our results also have important eco-epidemiological implications. When considering the evolution of distant pathogen genotypes with higher fitness through stochastic tunneling, rural communities consisting of small, semi-isolated groups with relatively low transmission may be more at risk than larger populations with high transmission. It is interesting to consider that Omicron's emergence may well be an example of this kind of stochastic tunneling in small, remote populations with cryptic transmission, as proposed by others (61) . Other explanations such as evolution in individual chronic infections or through reverse zoonosis have found some evidence in their favor as explanations for the emergence of Omicron, and it is doubtless that selective pressure from acquired immunity played a key role (62) . However, it is worth noting tunneling. This does not in any way imply that eradication efforts are pointless-after all, evolution is always a possibility, while disease burden is a certainty. Nevertheless, it might help explain why elimination campaigns are so challenging to complete, despite initial success.

We deliberately chose to focus on intra-host competition, as this is the minimum requirement for blocking stochastic tunneling through competitive exclusion. Therefore, we avoided using Opqua's ability to modify the intrinsic transmissibility of pathogens according to genome sequence, even though this would increase the magnitude of the effects studied. In addition, by choosing to center on competitive dynamics, we opted to keep the effects of acquired immunity beyond the scope of this study, even though Opqua has the capabilities to simulate them.

However, immune selection plays a central role in pathogen evolution by shifting the fitness landscape over time. This has been proposed in the context of the evolution of malaria parasites (37) and is doubtless a crucial factor in the emergence of SARS-CoV-2 variants that evade acquired immune responses (62) . Finally, we chose to explore idealized disease models rather than fit a specific disease both for generalizability and computational feasibility. Future work improving the computational performance of these kinds of modeling approaches may allow for models that more closely fit real-world epidemiological and genetic data, as well as models that simulate within-host dynamics such as drift in a more explicit fashion. Nevertheless, we hope this work helps establish the potential of genomic epidemiological models as a tool to test scenarios, explore hypotheses, and understand the relationship between pathogen evolution and epidemiology. Quantitative descriptions of how biology and environment shape pathogen evolution, such as those presented here, may one day help inform the design of interventions in disease control, ranging from drug and vaccine development to public health policy.

To address the questions posed in this study, we developed Opqua, an epidemiological modeling framework for pathogen population genetics and evolution. Opqua stochastically simulates pathogens with distinct, evolving genotypes that spread through host populations with specific acquired immune profiles. Opqua is available for download through PyPI (pypi.org/project/opqua) with installation, usage instructions, and source code published on GitHub (github.com/pablocarderam/opqua).

Opqua models are composed of populations containing hosts and/or vectors, which themselves may be infected by a number of pathogens with different genomes.

A genome is represented as a string of characters. All genomes must be of the same length (a set number of loci), and each position within the genome can have one of many different characters specified by the user (corresponding to different alleles). Different loci in the genome may have different possible alleles available to them. Genomes may be composed of separate chromosomes or genome segments separated by the "/" character, which is reserved for this purpose.

Each population may have its own unique parameters dictating the events that happen inside of it, including how pathogens are spread between its hosts and vectors.

There are different kinds of events that may occur to hosts and vectors in a population: -contact between an infectious host/vector and another host/vector in the same population (intra-population contact, or "contact") or in a different population Inter-population contacts occur via the same mechanism as intra-population contacts, with the distinction that the two populations must be linked in a compatible way. The recovery of an infected host or vector results in all pathogens being removed from the host/vector. Additionally, the host/vector may optionally gain protection from pathogens that contain specific genome sequences present in the genomes of the pathogens it recovered from, representing immune memory. The user may specify a population parameter delimiting the contiguous loci in the genome that are saved on the recovered host/vector as "protection sequences". Pathogens containing any of the host/vector's protection sequences will not be able to infect the host/vector.

Births result in a new host/vector that may optionally inherit its parent's protection sequences.

Additionally, a parent may optionally infect its offspring at birth following a Poisson sampling process equivalent to the one described for contact events above. Interventions can also include actions that involve specific hosts or vectors in a given population, such as: -adding pathogens with specific genomes to a host/vector -removing all protection sequences from some hosts/vectors in a population -applying a "treatment" in a population that cures some of its hosts/vectors of pathogens -applying a "vaccine" in a population that protects some of its hosts/vectors from pathogen infection For these kinds of interventions involving specific pathogens in a population, the user may choose to apply them to a randomly-sampled fraction of hosts/vectors in a population, or to a specific group of individuals. This is useful when simulating consecutive interventions on the same specific group within a population. A single model may contain multiple groups of individuals and the same individual may be a member of multiple different groups. Individuals remain in the same group even if they migrate away from the population they were chosen in.

When a host/vector is given a "treatment", it removes all pathogens within the host/vector that do not contain a collection of sequence motifs called "resistance sequences". A treatment may have multiple resistance sequences. A pathogen must contain all of these within its genome in order to avoid being removed. On the other hand, applying a vaccine consists of adding a specific protection sequence to hosts/vectors, which behaves as explained above for recovered hosts/vectors when they acquire immune protection, in models that allow it.

Models are simulated using an implementation of the Gillespie algorithm in which the rates of different kinds of events across different populations are computed with each population's parameters and current state, and are then stored in a matrix. In addition, each population has host and vector matrices containing coefficients that represent the contribution of each host and vector, respectively, to the rates in the master model rate matrix (Fig. S1 ). Each coefficient is dependent on the genomes of the pathogens infecting its corresponding vector or host. In addition, the contribution of each pathogen genome is weighted by the share of total competitive fitness it holds within the host or vector. This occurs under the simplifying assumption that in a coinfection, pathogens with more competitive fitness will have higher intra-host (or intra-vector) shares of the pathogen population.

This approach also assumes that on the timescales of most short coinfections, pathogen 10 May, 2022 10 May, : 10.1101 10 May, /2021 hosts and/or vectors may be saved instead of the entire model as a means of reducing memory footprint.

The output of a model can be saved in multiple ways. The model state at each saved time point may be output in a single data frame and saved as a tabular file. Other data output types include counts of pathogen genomes or protection sequences for the model, as well as time of first emergence for each pathogen genome and genome distance matrices for every time point Users can also use the data output formats to make their own custom plots.

We used the Opqua modeling framework to study the evolutionary dynamics of pathogens across low fitness regimes. The code to run all simulations and analyses presented in this work is available in a separate GitHub repository (github.com/pablocarderam/fitness_valleys_opqua).

The simulations were run on a 2019 MacBook Pro laptop with a 2.4 GHz 8-Core Intel Core i9

processor.

To simulate evolution across a fitness valley (Fig. 2) , we constructed a stochastic host-host transmission model using Opqua. The model comprises 1000 hosts in a single population, 500 of which start the simulation infected by "wild-type" pathogens with a genome of "AAAAAAAA". The epidemiological parameters were set to the default values shown for host-host models in Table S1 , with the following modifications: a genome length of 8 loci with two possible alleles ("A" and "B") at each locus, a mean inoculum of 1 pathogen, a recovery rate of 0.005 1/unit time, a mutation rate of 0.02 1/unit time, and no recombination. The recovery rate of "resistant" pathogens was set to be 75% of the original for pathogens with a genome of "BBBBBBBB". The choice of using bi-allelic rather than multi-allelic loci (such as four DNA bases, 20 amino acids, or other kinds of allelic variations) was made for simplicity and to more easily observe stochastic tunneling. In systems with many more possible alleles, the probability of reaching a specific other genotype is exponentially lower, and therefore the number of simulation replicates needed to observe differences in stochastic tunneling rates rises exponentially as well.

Most importantly, the intra-host competitive fitness φ of each genome was made to follow the following function, describing a fitness valley in terms of the number of "B" alleles in the genome b and the total length of the genome g:

This function was chosen as it describes a steep initial decrease in fitness as pathogens mutate away from the wild type, followed by a more gradual increase in fitness to the resistant genotype. It also guarantees that all mutant pathogens have lower competitive fitness than the wild-type, save for the resistant mutant genome, which outcompetes even the wild type. In this way, resistant pathogens fixate in the population more easily once they arise, which facilitates observation of the stochastic tunneling phenomena being studied.

Additionally, a "drug treatment" intervention was added at time 10,000 such that all pathogens with genomes other than the resistant genotype were cleared from the population.

The simulations shown in Fig. 2E aims to illustrate the evolution of strictly less-fit mutants in the system. Therefore, the previous setup was modified to finish simulation at 10000 time units with no drug intervention, and the contact and mutation rates of resistant pathogens were modified to be equal to zero. This effectively makes the resistant genotype into a lethal mutation, allowing easier study of the remainder of less-fit genotypes without interference. To study the distribution of survival times for mutants that are alone before they become coinfected by a wild-type pathogen or are cleared from the host, we extracted the survival times for 100 mutant-only infections in a single simulation run for each contact rate.

To evaluate the effect of different kinds of fitness valleys (Fig. S2) 

To study the competitive dynamics of wild-type pathogens and mutant pathogens with lower fitness, we developed a compartment model describing the number of hosts infected with no pathogens (S), wild-type pathogens only (W), mutant pathogens only (M), or coinfected with both kinds of pathogens (C). This model is presented in Fig. 3A , and shown in more detail in Fig. S3 . The system considers a constant host population N such that N = S + M + W + C. The following system of ordinary differential equations (ODEs) describes the flow between host compartments in terms of host recovery rate , contact rate β, mutation rates from wild-type to mutants µ 1 and vice-versa µ 2 , inoculum size n i , and the probability of wild-type pathogens with higher fitness outcompeting mutants in intra-host competition ψ:

For models in which only a single mutation is possible such as the model above, the relationship between the probability parameter ψ used in the ODE model, the relative intra-host competitive fitness φ used in Opqua models, and the fitness disadvantage of mutation λ shown on Figs. 3 and S4 is given by

Solving the model at equilibrium leads to S* = N / β for all nontrivial solutions, but the remaining three steady-state compartments yield unwieldy expressions that are difficult to analyze. We numerically solved the deterministic behavior of this model using the parameters described in Table S3 from starting conditions of 10 infected individuals, taking the system state after 1000 time units as a result. Mutations that restore the exact wild-type genotype were assumed to be far less likely than mutations away from the genotype (µ 1 <<µ 2 = 0).

To establish the potential of genomic epidemiological models as tools to study evolution, we created a stochastic model using Opqua following and expanding our results from the compartment model (Fig. 3) . The model is composed of 500 hosts in a single population, 250 of which start the simulation infected by "wild-type" pathogens with a specific genome sequence of 500 amino acids. Each simulation lasted for 1000 time units, and the result shown for each condition was obtained as the mean of ten replicates. The epidemiological parameters were set to the default values shown for host-host models in Table S1 , with the following modifications:

a genome length of 500 loci with 20 possible alleles (corresponding to the 20 amino acids abbreviations "ARNDCEQGHILKMFPSTWYV") at each locus, a mean inoculum of 1 pathogen, a mutation rate of 0.05 1/unit time, and no recombination. The full amino acid alphabet was used in this model to showcase the capabilities of Opqua and genomic epidemiological modeling as a way of describing complex diversity. Since this simulation does not aim to show stochastic tunneling to a specific new fitness peak, all evolutionary trajectories away from wild type are considered of interest, allowing for the use of more complex genomes and allele systems.

The intra-host competitive fitness φ of each genome was made to follow the following function, describing an exponential decay in fitness in terms of the relative fitness disadvantage of each successive mutation λ (shown on Fig. 3 ) and the Hamming distance d between a mutant genome to the wildtype genome:

The maximum distance between a mutant genome and wild-type and the mean pairwise distance between genomes at the end of simulation were both computed as Hamming distances.

We then used the same modeling framework to examine the effect of population size and migration between populations on evolution across descending fitness landscapes (Fig. 4) . The parameters and analyses used were the same as the ones used for the previous stochastic model describing a descending fitness landscape, unless otherwise noted on each figure. The host-host contact rate was fixed to β=0.125 1/unit time, and the fitness cost of each successive mutation was set to λ=0. 5 . For the metapopulation model, the system consisted of 10 different interconnected populations of 10 initial hosts each.

Finally, to study the effect of different biological factors on evolution across descending fitness landscapes, we modeled a host-vector system using Opqua (Fig. 5) . The model consisted of 250 hosts and 250 vectors in a single population, running for 1000 time units in 10 replicates for each condition. Model parameters were set to the default values shown on Table S3 , with a few modifications. All simulations had the host-vector contact rate set to 0.125 1/unit time. The fitness cost of each successive mutation was set to λ=0.5 within hosts only, and was set to λ=0

within vectors, to model the differing selection pressures on the genome in each organism.

Simulations which varied recombination rates had a mean number of crossover events in both hosts and vectors set to 5. Other variables were set as shown on Fig. 5 , or the default values on Table S2 if not noted.

An analogous set of simulations was repeated in a host-host transmission system as shown in Figs. S5 and S6. All parameters were identical as in the host-vector system described, except for the host-host contact rate set to 0.125 1/unit time and the host-vector contact rate set to zero, as well as the parameters indicated on each figure.

A third set of simulations was carried out using a host-vector transmission model identical to that used for Fig. 5 , but this time with separate fitness peaks for genomes in hosts and vectors (Figs. S7, S8). Each fitness peak was devised to be three mutations away from the wild-type sequence, and six mutations from each other peak. The three distinguishing mutations for each peak lie on opposite extremes of the 500-locus wild-type genome, in order to maximize likelihood of recombination. This allows clearer visualization of the effects of recombination in this system. All other parameters were identical as in the host-vector system described for Fig.   5 .

A. Mikhail, others, Real-time, portable genome sequencing for Ebola surveillance

post-review preprint 10 May

a Survivor With Virus Persistence in Seminal Fluid for More Than 500 Days

Identity-by-descent analyses for measuring population dynamics and selection in recombining pathogens

Quantifying connectivity between local Plasmodium falciparum malaria parasite populations using identity by descent

Bolotov, others, Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution

Nextflu: real-time tracking of seasonal influenza virus evolution in humans

Communicable Diseases

post-review preprint 10 May

Tracking the international spread of SARS-CoV-2 lineages B.1.1.7 and B.1.351/501Y-V2 with grinch

Spatiotemporal invasion dynamics of SARS-CoV-2 lineage B. 1.1. 7 emergence

A. von Gottberg, S. Walaza, others, Sixteen novel lineages of SARS-CoV-2 in South Africa

Draper, others, Revealing COVID-19 transmission in Australia by SARS-CoV-2 genome sequencing and agent-based modeling

Xie, others, Viral genomes reveal patterns of the SARS-CoV-2 outbreak in Washington State

Castañeda, others, The arrival and spread of SARS-CoV-2 in Colombia

Genomic investigation of the coronavirus disease-2019 outbreak in the Republic of

An Evolutionary Portrait of the Progenitor SARS-CoV-2 and Its Dominant Offshoots in COVID-19 Pandemic

Assessing uncertainty in the rooting of the SARS-CoV-2 phylogeny

Kozlov, others, Phylogenetic analysis of SARS-CoV-2 data is difficult

GISAID: Global initiative on sharing all influenza data-from vision to reality

hmmIBD: software to Cárdenas

infer pairwise identity by descent between haploid genotypes

Nextstrain: real-time tracking of pathogen evolution

Abu-Dahab, B. Taylor, others, Assignment of Epidemiological Lineages in an Emerging Pandemic Using the Pangolin Tool

Mobility restrictions for the control of epidemics: when do they work? Plos One

Optimal, near-optimal, and robust epidemic control

Vaccine nationalism and the dynamics and control of SARS-CoV-2

MANTIS: an R package that simulates multilocus models of pathogen evolution

Modeling the genetic relatedness of Plasmodium falciparum parasites following meiotic recombination and cotransmission

SANTA-SIM: simulating viral sequence evolution dynamics under selection and recombination

SEEDY'(Simulation of Evolutionary and Epidemiological Dynamics): An R Package to Follow Accumulation of Within-Host Mutation in Pathogens

Elucidating relationships between P.falciparum prevalence and measures of genetic diversity with a combined genetic-epidemiological model of malaria

simuPOP: a forward-time population genetics simulation environment

Forward-time simulations of non-random mating populations using simuPOP

FFPopSim: an efficient forward simulation package for the evolution of large populations

Immune selection suppresses the emergence of drug resistance in malaria parasites but facilitates its spread

The roles of mutation, inbreeding, crossbreeding and selection in evolution

post-review preprint 10 May

Stochastic Tunnels in Evolutionary Dynamics

The pace of evolution across fitness valleys

Stochastic tunneling across fitness valleys can give rise to a logarithmic long-term fitness trajectory

Environmental changes bridge evolutionary valleys

SARS-CoV-2 variant prediction and antiviral drug design are enabled by RBD in vitro evolution

Selection Analysis Identifies Clusters of Unusual Mutational Changes in Omicron Lineage BA.1 That Likely Impact Spike Function

Drug resistance, fitness and compensatory mutations in Mycobacterium tuberculosis

The parasitophorous vacuole nutrient channel is critical for drug access in malaria parasites and modulates the artemisinin resistance fitness cost

Opqua -Diccionario muysca -español

The origins and spread of antimalarial drug resistance: Lessons for policy makers

Spread of artemisinin resistance in Plasmodium falciparum malaria

Cryptic genetic variation accelerates evolution by opening access to diverse adaptive peaks

Cryptic genetic variation: evolution's hidden substrate

Single-cell RNA-seq reveals hidden transcriptional variation in malaria parasites

The Malaria Cell Atlas: Single parasite transcriptomes

post-review preprint 10 May

across the complete Plasmodium life cycle

A single-cell atlas of Plasmodium falciparum transmission through the mosquito

Three Members of the 6-cys Protein Family of Plasmodium Play a Role in Gamete Fertility

Malaria transmission-blocking antigen, Pfs230, mediates human red blood cell binding to exflagellating male parasites and oocyst production

RNAi Targeting of West Nile Virus in Mosquito Midguts Promotes Virus Diversification

Effects of Arbovirus Multi-Host Life Cycles on Dinucleotide and Codon Usage Patterns

When Is a Plasmodium-Infected Mosquito an Infectious Mosquito?

Malaria life cycle intensifies both natural selection and random genetic drift

SARS-CoV-2 prolonged infection during advanced HIV disease evolves extensive immune escape

CoVariants: SARS-CoV-2 Mutations and Variants of Interest

post-review preprint 10 May

Research Centre (Centre de recherches pour le développement international) -Grant

The authors would like to gratefully acknowledge Manuela Carrasquilla, Juan Estupiñán, Alejandro Feged Rivadeneira, Felipe González Casabianca, Angélica Knudson, and David Suárez for their valuable insights throughout the development of this project, as well as Christopher V.Turlo and Jelle van der Hilst for feedback on figure design. 

None declared.

All code necessary to install and use the Opqua modeling framework is available at https://github.com/pablocarderam/opqua. All data and code necessary to reproduce the simulations and analyses of this specific study is available at https://github.com/pablocarderam/fitness_valleys_opqua/.

Tables S1 -S3Cárdenas, Corredor, & Santos-Vega, post-review preprint 10 May, 2022 10 May, : 10.1101 10 May, /2021 p. 37 Some events involve sampling an additional population (migration or inter-population contact), host/vector (inter-or intra-population contact), or pathogen (recombination). Once event type, population(s), host(s) and/or vector(s), and pathogen(s) have been chosen in this manner, the state of the model is adjusted according to the event, and the relevant rate changes are propagated upward from within the pathogens affected to the host(s), vector(s), populations, and overall model they are in. Fig. S4 . Duration of infection affects mutant fraction primarily through competition for free hosts, as with contact rate. Using the ordinary differential equation model described, we vary the duration of infection (equivalent to 1/recovery rate) while keeping the contact rate constant such that the range of steady-state infected hosts sampled is equivalent to those in Figure Fig. S6 . Recombination depends on inoculum size to increase evolutionary distance. By keeping the mutation rate (µ) constant and reducing the mean inoculum size (n i ) to 1 in a host-host transmission model with a descending fitness landscape, we can see the effects of recombination on pathogen genome evolution with low inoculums. (A) As before, increasing recombination rate does not increase the mutant fraction of pathogens. However, the lower inoculum greatly reduces the effect of recombination on (B) the number of unique pathogen genomes explored and (C) the maximum Hamming distance from the wild-type pathogen genotype. Neither of these shows increases on the same scale seen for the higher mean inoculum size n i =10 in Fig. S3 . Fig. S7 . Separate fitness landscapes within hosts and vectors can be used to study pathogen evolution in Opqua. Two fitness functions are devised with exponentially decaying fitness around distinct, optimal genome sequences. These optimal genomes constitute separate fitness peaks for pathogens within hosts and vectors. Each fitness peak is at Levenshtein distance of six mutations from the other. The wild-type (WT) genome sequence used to initiate simulations lies halfway between both peaks, three mutations from each. Tables S1 -S3