key: cord-0004449-m6yj1p92 authors: Gibson, Keylie M.; Jair, Kamwing; Castel, Amanda D.; Bendall, Matthew L.; Wilbourn, Brittany; Jordan, Jeanne A.; Crandall, Keith A.; Pérez-Losada, Marcos title: A cross-sectional study to characterize local HIV-1 dynamics in Washington, DC using next-generation sequencing date: 2020-02-06 journal: Sci Rep DOI: 10.1038/s41598-020-58410-y sha: 65f58703255fe6afd5695312803e12281f4da353 doc_id: 4449 cord_uid: m6yj1p92 Washington, DC continues to experience a generalized HIV-1 epidemic. We characterized the local phylodynamics of HIV-1 in DC using next-generation sequencing (NGS) data. Viral samples from 68 participants from 2016 through 2017 were sequenced and paired with epidemiological data. Phylogenetic and network inferences, drug resistant mutations (DRMs), subtypes and HIV-1 diversity estimations were completed. Haplotypes were reconstructed to infer transmission clusters. Phylodynamic inferences based on the HIV-1 polymerase (pol) and envelope genes (env) were compared. Higher HIV-1 diversity (n.s.) was seen in men who have sex with men, heterosexual, and male participants in DC. 54.0% of the participants contained at least one DRM. The 40–49 year-olds showed the highest prevalence of DRMs (22.9%). Phylogenetic analysis of pol and env sequences grouped 31.9–33.8% of the participants into clusters. HIV-TRACE grouped 2.9–12.8% of participants when using consensus sequences and 9.0–64.2% when using haplotypes. NGS allowed us to characterize the local phylodynamics of HIV-1 in DC more broadly and accurately, given a better representation of its diversity and dynamics. Reconstructed haplotypes provided novel and deeper phylodynamic insights, which led to networks linking a higher number of participants. Our understanding of the HIV-1 epidemic was expanded with the powerful coupling of HIV-1 NGS data with epidemiological data. HIV-1 diversity in DC. We performed in-depth phylodynamic profiling of HIV sequences from 171 PCR gene products that passed quality thresholds, including 62 PR/RT, 62 int, and 47 env amplicons. The subtyping analyses showed that all of the participants belonged to subtype B; therefore, subsequent analyses included data from all participants. Our participants were dispersed amongst and showed a star-like pattern with other DC HIV PR/RT sequences (see Supplementary Fig. S1 online) . The env(c) data showed higher nucleotide diversity (π) and Watterson genetic diversity (θ) than pol(c) ( Table 2 ). Males had a higher diversity, though not significant, than females (haplotype diversity: PR/RT: p = 0.2039, int: p = 0.9571, env: p = 0.3404). Participants whose risk factor was IDU (n = 6) had 50% less diversity than those with MSM and HRH risk, though again not significant (haplotype diversity: PR/RT: p = 0.7323, int: p = 0.7861, env: p = 0.6560). Non-Hispanic black participants had a higher genetic diversity for the pol(c) gene than for the env(c) gene ( Table 2 ). The average haplotype diversity when calculated with the number of reconstructed haplotypes by PredictHaplo showed that env had more haplotype diversity compared to PRRT and int ( Table 3 ). The average number of haplotypes per participant was the same for PR/RT and int (2 haplotypes) and slightly higher for env (4) . Four of the six participants that had a higher number of haplotypes reconstructed (7-12 haplotypes in one or more gene regions) also had a higher average haplotype diversity estimate of 0.634 (range: 0.392-0.777), while the other two had a very low average haplotype diversity estimate (0.032, range: 0.027-0.038). Participants with HRH and MSM risk factors were found to have an average of 3 reconstructed haplotypes, with haplotype diversities of 0.338 and 0.335, respectively. www.nature.com/scientificreports www.nature.com/scientificreports/ Transmission clusters. Transmission clusters were assessed by phylogenetic methods and HIV-TRACE, a genetic-distance based clustering method. Phylogenetic methods found support (>70% bootstrap or >0.95 posterior probability) for 33.8% of the sequences associated with six transmission clusters in pol(c) and for 31.9% of the sequences associated with seven clusters in env(c) (highlighted in Fig. 1 ). All of the clusters were comprised of two to three sequences, except one cluster in pol(c) which had twelve sequences. Ten of the twelve sequences in the large pol(c) cluster contained a DRM within RT. Furthermore, 11 of the 12 sequences in this large cluster were included on the same sequencing run; therefore, we were unable to undoubtfully discriminate between laboratory artifacts or batch effects and HIV infection to interpret this transmission cluster. However, there were other samples from that same sequencing run that did not cluster with these twelve sequences and formed a dyad cluster. The most common DRM was T215C, which does not reduce NRTI susceptibility, and was found in eight of the twelve sequences. HIV-TRACE grouped only a few sequences into two clusters for pol(c) (12.8%) and into a single cluster for env(c) (2.9%) (Fig. 2I ). More transmission clusters were estimated with the haplotypes reconstructed from PredictHaplo (Fig. 2II , Table 5 ). For PR, RT, and int (genes also used in past DC HIV studies 28, 29 ), 82.1% of our participants were incorporated into transmission clusters. Unique to our dataset was the use of envelope to predict transmission clusters; 35.8% of our participants were included in V1V2 and V3 transmission clusters. Haplotypes were not concatenated for this analysis, but little overlap of cluster composition was found between gene regions. Often one participant's haplotype would cluster with another participant in one gene region, but it would cluster with a different participant in a different gene region. Thus different gene regions displayed different transmission clusters that would not have been detected if one only analyzes a single gene. Clusters predicted in all gene regions were mainly composed of two participants (42 of the 55 total clusters). Likewise, the majority of the transmission clusters predicted by past DC HIV studies 28, 29 were comprised of two or three participants, regardless of the gene region, clustering estimation method, or haplotype/consensus construction approach. Collectively, our study aimed to characterize the local and recent phylodynamics of a subset of DC Cohort participants from Washington, DC metro area living with HIV. By combining clinical and behavioral data with NGS data, we were able to identify transmission clusters across groups with different demographics and risk behaviors. Additionally, this study reconstructed sequence variants present within a participant (i.e., intra-host) and investigated associations between the reported participant characteristics and transmission clusters. www.nature.com/scientificreports www.nature.com/scientificreports/ Our population genetic estimators indicate that HIV-1 env(c) is more genetically diverse than pol(c). Similar diversity estimates were found in other studies 19, [29] [30] [31] . Our estimate of genetic diversity for pol(c) in DC (θ = 0.079) was lower than those reported for subtype B in Pérez-Losada et al. 29 (θ = 0.084 and 0.090), but higher than those currently reported for the US subtype B sequences in Los Alamos HIV-1 database (θ = 0.075 for int; θ = 0.067 for PR/RT). This same trend was seen in env(c), where our DC HIV-1 genetic diversity (θ = 0.202) was greater than that reported for V1-V3 from Los Alamos HIV-1 database (θ = 0.168) across the US. Haplotype diversity estimated from the PredictHaplo results also found env genes more genetically diverse than PR/RT and int. Notably, there were interesting differences in diversity estimates among risk groups. Participants who were infected through injection drug use (IDU) were found to have about half of the diversity of MSM and HRH participants. This low diversity, potentially associated with low multiplicity of HIV-1 infection, is not unusual within IDU individuals 32, 33 . In both pol(c) and env(c), males also showed higher genetic diversity, which could be attributed to half (51%) of the males in this study being infected through sex with other men (16% of the males had an unknown risk factor). Our measurements of HIV-1 diversity in HRH were also high and similar to those of MSM; however, we did not observe differences in HIV genetic diversity by race or ethnicity. The HIV-1 subtype B epidemic in DC is highly diverse, and our results here agree with previous conclusions suggesting a mature epidemic 29, 34 . High genetic diversity could result from risk groups intermingling and viral strains being exchanged and the transient nature of the DC metro area population. DC is an international stop for some, a temporary residence for others, and home for many. This constant influx of incomers could have a boosting effect on the DC HIV-1 population by consistently introducing new viral strains into the pool. Treatment and vaccine development can be compromised by high genetic diversity. As HIV-1 continues to evolve and, as seen here, high genetic diversity levels are kept constant over time 34 , resistance to vaccines and ART drugs may increase, which could ultimately lead to treatment and prevention failures 35 . A Drug Resistant Mutation (DRM) prevalence of 48.6-54.0% was detected in this subset of the DC Cohort, depending on use of consensus sequence or haplotypes. Lower DRM prevalence rates were previously reported for the DC area 28, 29, 34, 36 (17.3-37 .9% between 1994 and 2016). A much higher rate (66%) was reported in a smaller study of ART treatment-naïve and experienced pediatric patients in Rhode Island 37 . DRM rates over 50% are also seen in large sequence databases in the UK and Switzerland 29 . Our study found fewer codons affected by a DRM and fewer DRMs in our participants than previous studies 28, 29 of the DC epidemic. We only found 32 and 38 codons to be affected when analyzing consensus sequences and haplotypes, respectively, whereas Pérez-Losada et al. 29 found 83 codons affected for subtype B, however that study contained 20 times more sequences than ours. Moreover, our study found three novel DRM sites that were not identified in either previous study: P145PAST (IN Major) and A128APST, Q146QH (IN Accessory). More recently, a study by Kuhnert and colleagues 38 reported on the fitness of fourteen HIV-1 resistance mutations, of which seven were detected in RT in our study. Three of the seven DRMs (codons: 41 L, 67 N, 184 V) were NRTI-related and the other four DRMs (codons: 103 N, 108I, 138 A, 181 C) were NNRTI-related. Six DC Cohort participants contained the 184 V DRM, which was found by Kuhnert et al. 38 to have the highest transmission cost -i.e., the success (low transmission cost) or lack thereof (high transmission cost) of transmission of hosts infected by drug resistant strains. Because of this high cost to the virus, the mutation resulted in very short transmission chains despite evolving frequently under treatment failure 38 . Of the six DC Cohort participants, all were included in a cluster when using the haplotypes, but none of the participants clustered with each other. Additionally, this mutation was found to be persistent in the DC HIV-1 viral population since 2005 34 . Previous studies showed that the most important NNRTI mutation currently is 103 N because of its connection to first-line treatment failure [39] [40] [41] ; a quarter of our DC Cohort participants with one or more DRMs contained this mutation, which was also found to be at a low frequency in the DC HIV-1 viral population since 2005 34 . In treatment-naïve participants, both when using consensus sequences and haplotypes, we estimated a low prevalence rate of DRMs (14.3%). Similarly, low prevalence rates have been seen in the past in the US ( 42 ; 15% between 1999 and 2011) and even lower in treatment-naïve individuals in Europe ( 41, (43) (44) (45) ; 10% between 2001 and 2013). Moreover, we also found fewer DRMs in treatment-naïve participants compared to a recent study of PLWH in DC 28 (22. 5% between 1994 and 2013). Kassaye et al. 28 also observed a downward trend of overall prevalence of DRMs over time in treatment-naïve individuals. Our results suggest a further decrease in overall prevalence of DRMs in the current DC HIV-1 epidemic. A lower prevalence of DRMs in surveillance versus targeted treatment-naïve studies could result from sampling design. A surveillance study of DC would likely provide the more accurate picture of DRM trends in the population. Novel sites that continue to evolve in the DC epidemic and have not become fixed in the population are of serious concern for future drug therapy and conferring resistance to these drugs. More and different codons were predicted to be under positive selection in the haplotype sequences than the consensus sequences. FUBAR predicted sites in V1V2 are likely being impacted by the immune system, which the virus is actively trying to evade (diversifying selection). V3 is associated with co-receptor binding 46 ; codons predicted here could be rising advantageous mutations by HIV-1 to adapt to the host cells' response against the virus. Five codons (PR: 37, RT: 35, and int: 201, 265, 283) were identified by both our study, in both haplotypes and consensus sequences, and Pérez-Losada et al. 29 as sites under selection for subtype B. Since none of these sites corresponded to any known Stanford DRMs, these may be newly evolving resistance mutations in the DC HIV-1 epidemic. Given that all of our participants were on dual-or multiple-drug regimens, these sites may also be indicative of potential escape mechanisms by the virus in response to multiple-drugs treatments. Thus, these amino acid replacements are candidates for fitness testing with and without associated drugs to infer their ability to confer drug resistance, their relative fitness status in different environments, and their transmissibility across individuals. Additionally, identifying transmission clusters is critical to recognizing groups who may be at risk of contracting HIV-1 or who may already be infected but are not yet aware of their diagnosis. Phylogenetic studies suggest that transmission clusters greatly contribute to the spread of HIV-1 within the population 47 ; therefore, identifying Scientific RepoRtS | (2020) 10:1989 | https://doi.org/10.1038/s41598-020-58410-y www.nature.com/scientificreports www.nature.com/scientificreports/ high-risk groups, whether that is based on risk behavior or geographical location 48 , can help public health officials to better target prevention efforts and treatment options. The spread of infection is often associated with early HIV-1 infection 47 , consequently, molecular surveillance of the DC epidemic should continue in order to identify potential areas or clusters of transmission and, thus, help lower the HIV-1 incidence. Towards this goal, we combined NGS sequence data with clinical characteristics to obtain a dynamic picture of the evolution of HIV-1 in the DC metro area. We detected high levels of clustering using haplotypes (32.8-64.2%, not including the V1V2 region), as also seen in other HIV-1 cohorts (24-65%) 29, 35, [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] . Other studies, including one completed with Sanger sequencing in DC 28, 31, 36, 52, 62 , however, have found lower levels of clustering (7-17%), in agreement with those reported here for the consensus sequence phylogenetic clusters (pol(c) and env(c)) and the V1V2 region for haplotypes (9.0%). Therefore, a more comprehensive understanding of HIV-1 transmission events in DC has been achieved when evaluating multiple genes together, rather than primarily focusing on polymerase genes that are typically screened for DRMs in clinical settings or used in investigations at the local health department level. By excluding envelope genes, informative transmission events can be missed, which could hinder community health prevention and intervention efforts. In an ideal setting, using all the genetic information available would be most favorable when investigating local HIV-1 phylodynamics. In agreement with a recent study of HIV-1 transmission clusters in Chicago 59 , we also found association of risk factors within clusters. More HRH participants fell in our haplotype transmission networks compared to MSM, IDU, and participants with unknown risk (HRH = 23, MSM = 18, UNK = 11 each & IDU = 6). A total of 58.3% of the clusters that included an HRH participant also had an MSM participant. Likewise, a US study that included 12 major US cities 63 found transmission clusters that contained overlap between participants who were MSM and HRH. Mixing of risk types in HIV-1 subtype B transmission clusters has also been observed in Switzerland, Iceland, and Nordic European countries 60, 64, 65 . Contrarily, Kouyos et al. 65 found segregation based on location among individuals who were included in a transmission cluster despite having overlapping risk factors. Risk groups may be mixing due to underreporting of risk behaviors or bisexual behavior [65] [66] [67] . This heterogeneity of risk groups in transmission clusters suggests that focusing on individuals within city areas (e.g., wards in Washington, DC) to concentrate resources and information may help in addressing the HIV-1 epidemic. Otherwise, we were unable to determine the mode of transmission for the "unknown modes of transmission" group (16.2% of our sample). Nonetheless, our results suggest that the mode of transmission may not be as important for prevention and intervention efforts as the location where transmission events are occurring. Likewise, Morgan et al. 59 suggested not targeting efforts towards risk groups, but rather age groups, particularly younger people, in Chicago. The average age of the DC participants included in a haplotype transmission cluster was 46.6 years of age. If DC's younger population is being the most affected, as suggested by the new cases identified by the DC DOH in 2016 and 2017 1,3 , taking a spatial dynamic approach to intervention with continued surveillance may help. Through surveillance studies, further adapted location-based prevention efforts can be employed. Notably, our analysis has some limitations. We included only 68 participants of the approximately 10,000 people enrolled in the DC Cohort 3 , whereas past studies conducted in DC included 700 (Kassaye et al.) and 1,500 participants (Pérez-Losada et al.) . However, the demographics in our sample size are similar to PLWH in DC 3, 4 . As a prospective study conducted as part of an ongoing HIV-1 surveillance program associated with the DC Cohort, we capitalized on all the current cases that met our inclusion criteria (see Matierals and Methods: DC cohort). These past studies of HIV-1 diversity in Washington, DC were historical in nature and, therefore, had larger sample sizes available. Our study also applied a powerful next-generation sequencing approach instead of Sanger sequencing (previous DC studies), to characterize the current HIV-1 epidemic. With the implementation of NGS, mapping very diverse short reads to a reference genome poses alignment issues 14 , which can add difficulty to downstream analyses. We circumnavigated this alignment issue by using HAPHPIPE, where the reads Gene www.nature.com/scientificreports www.nature.com/scientificreports/ were mapped against a tailored reference genome, thus resulting in higher alignment rates and fewer errors. Nonetheless, aligning very diverse reads still remains an issue. Although new sites under selection were identified using NGS, their clinical relevance as potential DRMs requires further validation. We also recognize that we used conservative genetic distance cutoff values for determining transmission clusters, which could result in lower www.nature.com/scientificreports www.nature.com/scientificreports/ numbers of transmission clusters 68 ; however, this conservative estimate reduced the number of false positive transmission clusters. Finally, due to the nature of predicting transmission clusters and the potential for missing individuals, we are not able to determine the direction of infection within the transmission clusters. We were also unable to rule out batch effects or laboratory artifacts accounting for any transmission cluster which participants were included in the same sequencing run. This study showed that NGS and epidemiological data can be used to characterize the current phylodynamics of a subset of people living with HIV, enhancing our understanding of the diversity and local dynamics of the HIV-1 epidemic in the DC area. HIV-1 diversity in DC is high and seems to remain stable over time. Furthermore, NGS of the envelope gene provided sufficient coverage to compare transmission cluster inference across HIV-1 gene regions 14, 20 . Additional transmission clusters were identified when using HIV-1 intra-host haplotypes instead of consensus sequences, which led to networks linking a higher number of participants. Moreover, transmission clusters varied across genes, with each gene suggesting a different transmission story. Hence, using multiple HIV-1 genes or whole genomes is recommended to infer more reliable transmission clusters. Inferred clusters should then be linked to locations in DC to target transmission intervention efforts. Additionally, HIV-1 drug resistance was only found when using haplotypes in a single young adult in our cross-sectional sample of the cohort. Future studies should also focus on age groups and geographic regions rather than only risk factors. As the DC area maintains significant rates of HIV-1 infection, integrating present and past molecular data from previous studies conducted in DC in 2017 29 and 2013 28 will help to paint a comprehensive picture of the HIV-1 transmission and evolution of drug resistance in this high prevalence urban U.S. city. Future HIV-1 phylodynamic studies should also include more participants, particularly young adults, and newly diagnosed persons to provide a comprehensive view of DRM prevalence in treatment-naïve individuals in the DC area. Studies revealing the severity of transmitted drug resistance in the DC population may provide physicians and public health workers with additional information to design more effective treatment plans for newly diagnosed individuals and intervention strategies for targeted key populations. University IRB (which serves as the IRB of Record for eight of the participating sites), the DC DOH IRB, and the remaining site IRBs. Informed consent was obtained and documented prior to conducting study procedures. Sample collections from participants were performed in accordance with relevant guidelines and regulations. Dc cohort. Participants from the DC Cohort were recruited for this molecular epidemiology sub-study from January 2016 through May 2017. Eligibility criteria included current DC Cohort enrollment, ≥18 years of age, HIV-1 diagnosis within prior 12 months of enrollment or detectable HIV-1 viral load of ≥1,500 copies/mL, ability to provide written informed consent, and completion of a behavioral survey; a total of 104 participants met the eligibility criteria. Blood samples were collected at the clinical sites and transported to George Washington University for processing, targeted amplification, library preparation and NGS. Sample sequences were paired with clinical and demographic data retrieved from the database from the DC Cohort (Table 1) . Clinical and demographic characteristics collected included age, race/ethnicity, sex at birth, gender, country of birth, state of residence, zip code, HIV-1 risk factor, presence of co-infections (e.g., chlamydia, gonorrhea, syphilis, trichomoniasis, and Hepatitis B and C), duration of infection, CD4 count, viral load, ART exposure, ART regimen type, date of sample, and date of HIV-1 diagnosis. The paired data were de-identified and analyzed using the approaches described below. Sequence analyses. The raw sequence data for each patient were processed through HAPHPIPE (https:// github.com/gwcbi/haphpipe), a HAplotype reconstruction and PHylodynamics PIPEline for genome-wide assembly of viral consensus sequences and haplotypes 69 . Briefly, HAPHPIPE includes modules for quality trimming, error correction, assembly, and haplotype reconstruction. We put the raw sequencing FASTQ files through quality control and quality trimming with Trimmomatic 70 . Error correction of the reads was completed with an earlier version of HAPHPIPE that used BLESS 71 , and the cleaned reads were mapped against the current HIV-1 subtype B reference sequence HXB2 (Genbank accession: K03455) 72 . Through iterative refinement, the cleaned reads were then mapped back to the reference sequence generated in the mapping step with Bowtie2 73 . This iterative refinement step was completed twice, first using only a random subsampling of the reads (25% subsampling) participants within a cluster. d MSM = men who have sex with men; HRH = heterosexuals; IDU = injection drug users; UNK = unknown. e Average age in years. f Percentage of participants within a cluster that contained one or more DRMs. g Overlap was only assessed between the concatenated consensus pol and env genes and between genes (PR, RT, int, V1V2, and V3) with transmission clusters generated with haplotypes. Reported as the percentage of unique participants within the cluster that are found in another cluster in a different gene within the same sequence type (consensus vs haplotype sequences). Overlap was not assessed between sequence types. h One participant had a risk factor of other. www.nature.com/scientificreports www.nature.com/scientificreports/ and the fast-local mapping option to speed up the computational time, and second using all of the sequence reads and the very sensitive mapping option to further refine the individually crafted reference sequence. A consensus sequence was generated from the refined reference sequence, and a final refinement step was concluded with BLAST 74 against this refined consensus sequence. The resulting sequences were filtered to include participants that contained a passing amplicon, defined as having > 95% of the amplicon covered by 10x or greater read coverage. Amplicons that did not pass this filter were removed, and this subset was then used for subsequent phylodynamic analyses (Table 1) . Sequence data for each PCR amplicon (PR/RT, int, and env) were aligned individually using MAFFT with the L-INS-i algorithm 75 in Geneious (ver. 9.1.6) 76 . Protease (PR) and reverse transcriptase (RT) were extracted from the PR/RT amplicon, and env was further divided into the variable regions: gp120 V1V2 (HXB2 coordinates: 6615-6812) and gp120 V3 (HXB2: 6984-7349) . PR, RT, and int were concatenated into the pol(c) gene region, and V1V2 and V3 were concatenated into the env(c) gene region. Concatenated gene regions will be distinguished from whole gene regions by adding "(c)" to the end of the gene name. Each gene (PR, RT, int, V1V2, and V3) was extracted from the amplicon data to fulfill different purposes: (1) remove any nucleotides belonging to other genes, for example the env PCR amplicon contained primer sequences, and therefore nucleotides belonging to the vpu gene region; (2) simulate amplicon sizes that could be covered end to end by paired-end reads to be consistent and comparable to future NGS studies with HIV-1 when using PrimerID or other local haplotype phasing techniques 8, 77 ; and (3) account for differences in PCR performance between and within samples by extracting a common, high-coverage region. Therefore, missing data in this dataset were low, and often only due to amplification failure of an entire amplicon. Concatenating the genes into their respective gene regions (pol and env) retained variants in genes that are often studied, such as protease and reverse transcriptase for drug resistant mutations. It also allowed comparisons to past studies based on Sanger sequencing that used either parts of genes or whole genes. Our overall goal was to keep the integrity of the individual genes while using as much of the NGS data as possible. Identification of subtypes and drug resistant mutations. HIV-1 subtype identification was completed for each concatenated gene region (pol(c) and env(c)) using the REGA subtyping tool (version 3) 78, 79 . A total of 170 subtype reference sequences from the Los Alamos HIV-1 database (LANL; http://www.hiv.lanl.gov/) were included to assign the patient sequences to a particular subtype clade and validate the findings from REGA using phylogenetic methods described below. Drug resistant mutations were identified aligning the consensus concatenated gene nucleotide sequences with reference strains in the Stanford HIV Drug Resistance Database (https://hivdb.stanford.edu) using the HIVdb program 80 . Nucleotide positions under positive selection were identified using Fast Unconstrained Bayesian AppRoximation (FUBAR) 81 in HyPhy 82 . Recombination in our HIV-1 data was accounted for with GARD 83, 84 . phylogenetic analyses. Phylogenetic estimations were completed for each concatenated gene region. The best-fit model of molecular evolution 85 was estimated for each pol(c) and env(c) from the data using jModel-Test2 86 in CIPRES Science Gateway 87 . Amino acid positions corresponding to identified DRMs described above were removed prior to phylogenetic estimations to avoid potential bias due to selection. A maximum likelihood phylogenetic estimate using RAxML 88 was made for each region with the 3 codon-position partitions 89 . The branch support for the RAxML phylogenetic trees was estimated with a bootstrap approach with 1,000 replicates 90 . Bayesian trees were inferred using MrBayes 91 . Four Markov chains (one cold and three heated) were run for 8 × 10 8 generations sampling every 2,000 steps for each gene region, and each run was repeated twice. The output was analyzed in Tracer 92 to assess convergence and mixing of the chains. Subtypes references for subtype D (GenBank accessions: K03454, AY371157, AY253311, U88824) and circulating recombinant forms CRF28,42-BF (GenBank accessions: FJ213781, FJ358521, FJ670529) and CRF10-CD (GenBank accessions: AF289548, AF289549, AF289550) were pulled from LANL and used as proper outgroups for the phylogenetic analyses 93 . Additional RT sequences from DC 29 were included to observe how our data related to other DC sequences. We visualized the epidemiological data on the resulting trees with the Interactive Tree of Life (iTOL). of clinical variables to transmission clusters, it is ideal to characterize within patient viral variation as individual sequence variants (haplotypes) instead of combining all of the individual reads into a single consensus sequence 94, 95 . Therefore, haplotypes for each patient were predicted from the sequence data using HAPHPIPE's haplotype stages. Haplotype reconstruction was performed on each PR/RT, int, and env targeted PCR amplicons using PredictHaplo 96 . Each gene region (PR, RT, V1V2, and V3) was then extracted from the corresponding targeted amplicon and, using the methods described below, transmission clusters were estimated using the predicted haplotypes. No concatenation of the individual genes to form the regions pol(c) and env(c) was done with the haplotypes. Identification of transmission clusters. Transmission clusters were assessed for each pol(c) and env(c), as well as for each of the gene regions with the haplotypes, using phylogenetic methods 89, 91 and the genetic-distance based clustering method HIV-TRACE 97 . Phylogenetic transmission networks were defined as clades with bootstrap proportions ≥70 or posterior probabilities ≥95%. Genetic distance thresholds of 0.01 29, 62 and 0.02 98 substitutions/site were used for pol and env in HIV-TRACE, respectively, to identify potential transmission events. Ambiguities were handled with the HIV-TRACE option "average" to avoid biases and false positives and minimum overlap was 1/genetic distance threshold and adjusted for size of amplicon, as recommended. Default settings were used for the remaining parameters. Transmission clusters were compared between gene regions. District of Columbia Department of Health HIV/AIDS, H., STD and TB Administration (HAHSTA) District of Columbia Department of Health HIV/AIDS, H., STD and TB Administration (HAHSTA) District of Columbia Department of Health HIV/AIDS, H., STD and TB Administration (HAHSTA) District of Columbia Department of Health HIV/AIDS, H., STD and TB Administration (HAHSTA) Viral phylodynamics Microbial sequence typing in the genomic era RNA and DNA Sanger sequencing versus next-generation sequencing for HIV-1 drug resistance testing in treatment-naive patients Recent advances in inferring viral diversity from high-throughput sequencing data Rates of evolutionary change in viruses: patterns and determinants Low-Frequency Drug Resistance in HIV-Infected Ugandans on Antiretroviral Treatment Is Associated with Regimen Failure Low-abundance drug-resistant viral variants in chronically HIV-infected, antiretroviral treatment-naive patients significantly impact treatment outcomes Prevalence and Evolution of Low Frequency HIV Drug Resistance Mutations Detected by Ultra Deep Sequencing in Patients Experiencing First Line Antiretroviral Therapy Failure HIV drug resistance testing by high-multiplex "wide" sequencing on the MiSeq instrument HIV populations are large and accumulate high genetic diversity in a nonlinear fashion Defining HIV-1 transmission clusters based on sequence data HIV evolutionary dynamics within and among hosts Molecular tools for studying HIV transmission in sexual networks Ultrasensitive single-genome sequencing: accurate, targeted, next generation sequencing of HIV-1 RNA Population genomics of intrapatient HIV-1 evolution Molecular footprint of drug-selective pressure in a human immunodeficiency virus transmission chain Proteolytic processing of the human immunodeficiency virus envelope glycoprotein precursor decreases conformational flexibility HIV-1 envelope sequence-based diversity measures for identifying recent infections Analysis of genetic linkage of HIV from couples enrolled in the HIV Prevention Trials Network 052 trial The genealogical population dynamics of HIV-1 in a large transmission chain: bridging within and among host evolutionary rates Phylogenetic Inference of HIV Transmission Clusters Update of the Drug Resistance Mutations in HIV-1 The S230R Integrase Substitution Associated With Virus Load Rebound During Dolutegravir Monotherapy Confers Low-Level Resistance to Integrase Strand-Transfer Inhibitors Transmitted HIV Drug Resistance Is High and Longstanding in Metropolitan Characterization of HIV diversity, phylodynamics and drug resistance in HIV-1 Genetic Variability and Clinical Implications. ISRN Microbiol Phylodynamics of HIV-1 from a phase-III AIDS vaccine trial in North America Low Multiplicity of HIV-1 Infection and No Vaccine Enhancement in VAX003 Injection Drug Users A substantial transmission bottleneck among newly and recently HIV-1-infected injection drug users in St A 28-Year History of HIV-1 Drug Resistance and Transmission in Washington Phylodynamics of HIV-1 from a phase III AIDS vaccine trial in Validation of publicly-available software used in analyzing NGS data for HIV-1 drug resistance mutations and transmission networks in a Antiretroviral treatment failure, drug resistance, and subtype diversity in the only pediatric HIV clinic in Rhode Island Quantifying the fitness cost of HIV-1 drug resistance mutations through phylodynamics HIV-1 drug resistance before initiation or re-initiation of first-line antiretroviral therapy in low-income and middle-income countries: a systematic review and meta-regression analysis Effect of transmitted drug resistance on virological and immunological response to initial combination antiretroviral therapy for HIV (EuroCoord-CHAIN joint project): a European multicohort study Recent trends and patterns in HIV-1 transmitted drug resistance in the United Kingdom Trends in use of genotypic resistance testing and frequency of major drug resistance among antiretroviral-naive persons in the HIV Outpatient Study Estimating trends in the proportion of transmitted and acquired HIV drug resistance in a long term observational cohort in Germany HIV-1 subtype B-infected MSM may have driven the spread of transmitted resistant strains in France in 2007-12: impact on susceptibility to first-line strategies Transmission of HIV Drug Resistance and the Predicted Effect on Current First-line Regimens in Europe HIV: cell binding and entry The impact of transmission clusters on primary drug resistance in newly diagnosed HIV-1 infection Analyzing spatial clustering and the spatiotemporal nature and trends of HIV/AIDS prevalence using GIS: the case of Malawi Transmission networks of drug resistance acquired in primary/early stage HIV infection High rates of forward transmission events after acute/early HIV-1 infection Transmission networks of HIV-1 among men having sex with men in the Netherlands Genetic Analysis of Incident HIV-1 Strains Among Injection Drug Users in Bangkok: Evidence for Multiple Transmission Clusters During a Period of High Incidence Molecular epidemiology of HIV-1 in St Petersburg, Russia: predominance of subtype A, former Soviet Union variant, and identification of intrasubtype subclusters Epidemiological study of phylogenetic transmission clusters in a local HIV-1 epidemic reveals distinct differences between subtype B and non-B infections Impact of highly active antiretroviral therapy on the molecular epidemiology of newly diagnosed HIV infections HIV-1 transmission cluster with T215D revertant mutation among newly diagnosed patients from the Basque Country Transmission of HIV-1 during primary infection: relationship to sexual risk and sexually transmitted infections Analysis of HIV-1 pol sequences from Panama: identification of phylogenetic clusters within subtype B and detection of antiretroviral drug resistance mutations HIV-1 Infection and Transmission Networks of Younger People Molecular epidemiology of HIV-1 in Iceland: Early introductions, transmission dynamics and recent outbreaks among injection drug users Characteristics of HIV-infected USArmy soldiers linked in molecular transmission clusters Social and Genetic Networks of HIV-1 Transmission HIV Among MSM and Heterosexual Women in the United States: An Ecologic Analysis HIV-1 transmission between MSM and heterosexuals, and increasing proportions of circulating recombinant forms in the Nordic Countries. Virus Evol. 2, vew010 Molecular epidemiology reveals long-term changes in HIV type 1 subtype B transmission in Switzerland Enhanced use of phylogenetic data to inform public health approaches to HIV among men who have sex with men Phylogenetic insights into regional HIV transmission HIV-1 clade B pol evolution following primary infection Haplotype reconstruction and realtime phylodynamics for deep sequencing of intra-host viral populations Trimmomatic: a flexible trimmer for Illumina sequence data BLESS 2: accurate, memory-efficient and fast error correction method Numbering Positions in HIV Relative to HXB2CG Fast gapped-read alignment with Bowtie 2 BLAST: a more efficient report with usability improvements MAFFT multiple sequence alignment software version 7: improvements in performance and usability Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data Validates Template Sampling Depth and Greatly Reduces the Error Rate of Next-Generation Sequencing of HIV-1 Genomic RNA Populations An automated genotyping system for analysis of HIV-1 and other microbial sequences A standardized framework for accurate, high-throughput genotyping of recombinant and non-recombinant viral sequences Human immunodeficiency virus reverse transcriptase and protease sequence database FUBAR: a fast, unconstrained bayesian approximation for inferring selection HyPhy: hypothesis testing using phylogenies GARD: a genetic algorithm for recombination detection Automated phylogenetic detection of recombination using a genetic algorithm Selecting models of nucleotide substitution: An application to Human Immuno-deficiency Virus 1 (HIV-1) jModelTest 2: more models, new heuristics and parallel computing Proceedings of the Gateway Computing Environments Workshop (GCE Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies Confidence limits on phylogenies: an approach using the bootstrap MRBAYES: Bayesian inference of phylogenetic trees The evolution of HIV: inferences using phylogenetics A Pan-HIV Strategy for Complete Genome Sequencing Multiplexed next-generation sequencing and de novo assembly to obtain near full-length HIV-1 genome from plasma virus Inference Using a Propagating Dirichlet Process Mixture Model Transmission Cluster Engine): a tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens Identifying Transmission Clusters with Cluster Picker and HIV-TRACE Genetic diversity and molecular epidemiology of HIV transmission DnaSP 6: DNA Sequence Polymorphism Analysis of Large Data Sets DNA Polymorphism Detectable By Restriction Endonucleases R: A language and environment for statistical computing Children's National Medical Center Adolescent Clinic USA. 14 Kaiser Permanente 20 Veterans Affairs Medical Center First and foremost, we thank the participants who are involved with the DC Cohort. DC Cohort data in this manuscript were collected by the DC Cohort Study Group with investigators and research staff located at: Cerner Corporation The authors declare no competing interests. Supplementary information is available for this paper at https://doi.org/10.1038/s41598-020-58410-y.Correspondence and requests for materials should be addressed to K.M.G.Reprints and permissions information is available at www.nature.com/reprints.Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.