Standardized Phylogenetic Classification of Human Respiratory Syncytial Virus Below the Subgroup Level

Human respiratory syncytial virus (HRSV) is a leading cause of acute lower respiratory tract infection in children, older adults, and immunocompromised persons. In 2023, the US Food and Drug Administration and the European Medicines Agency approved the first HRSV vaccines (1,2). Simultaneously, a monoclonal antibody was approved for widespread use in infants and not limited to high-risk and premature children (3). The availability of HRSV immunization highlights the role of molecular epidemiology as a tool to monitor the efficacy of these interventions. Standards for HRSV nomenclature for sharing of viral isolates and sequences in databases have been published (4). Nevertheless, a standardized HRSV phylogenetic classification system has yet to be defined and implemented.

Figure 1. The structure and genome of human respiratory syncytial virus (HRSV). A) Schematic of the HRSV virion structure detailing the location of structural proteins. B) Schematic of the HRSV genome organization…

In 2022, HRSV was designated as the species Orthopneumovirus hominis within the Pneumoviridae family. Below species level are 2 antigenic groups, known as HRSV subgroup A (HRSV-A) and B (HRSV-B), that were previously referred to as subtypes (4–6). Within each subgroup, genotypes were initially defined based on statistically supported phylogenetic clades inferred with the second hypervariable region of the G gene (Figure 1, panels A, B) (7). The G gene, encoding the attachment glycoprotein, exhibits the highest genetic and antigenic variability. Of note, the gene has undergone a duplication of a 72-nt fragment in HRSV-A and of a 60-nt fragment in HRSV-B (Figure 1, panel B) (8,9).

To identify emerging genotypes, researchers have used genetic distances between phylogenetic clades and distinctive genetic features, accompanied by variable nomenclature based on the gene (GA1–GA7 in HRSV-A and GB1–GB4 in HRSV-B), country and subgroup (SAB1–SAB4 for South African genotypes in HRSV-B), or city and province (NA1–NA2 [Niigata] and ON1 [Ontario] in HRSV-A, BA1–BA9 [Buenos Aires] in HRSV-B) (7–16). Since 2020, alternative phylogenetic reclassifications have been proposed; Goya et al. established a hierarchical classification system for HRSV phylogenies, comprising genotypes, subgenotypes, and lineages, using the G gene (17). That framework enabled laboratories without capacity for whole-genome sequencing to conduct molecular epidemiology studies. Independently, Ramaekers et al. (18) proposed reclassifications into lineages and Chen et al. (19) into genotypes using complete HRSV genomes. Those approaches support comprehensive monitoring of viral evolution across all genes, including the F gene encoding the fusion protein, a crucial target for monoclonal antibodies and the foundation of approved HRSV vaccines (Figure 1, panel A). Of note, challenges in HRSV molecular epidemiology persisted within the reclassification-defined categories because of reliance on genetic or patristic distances between tree tips or nodes.

The milestones achieved in HRSV interventions have renewed interest in addressing the challenge of classifying HRSV below the subgroup level. Those advances prompted establishment of the HRSV Genotyping Consensus Consortium (RGCC), formed by HRSV and virus evolution experts aiming to provide standardized criteria for harmonizing global HRSV molecular surveillance efforts. We present a novel framework for HRSV classification below the subgroup level, based on current knowledge of HRSV diversity and evolution, focused on practical implementation for molecular epidemiology.

HRSV Sequences Dataset

We downloaded HRSV complete genomes from the National Center for Biotechnology Information Virus (https://www.ncbi.nlm.nih.gov/labs/virus) and GISAID EpiRSV (https://www.gisaid.org) databases through March 11, 2023, using filters for sequence length >14,000 nt, human host, and availability of the year and country of sample collection (Appendix 1 Figure 1). We reserved sequences containing nucleotide ambiguities, indicating inadequate sequencing depth, for epidemiologic analysis but excluded them from formal lineage definition (Appendix 1).

We aligned sequences with MAFFT version 7.490, and inspected and corrected alignment artifacts with Aliview version 1.28 (https://ormbunkar.se/aliview), mainly in the G gene (20,21). We trimmed alignment ends to encompass complete genomes from the first codon of the first gene (NS1) to the last codon of the last gene (L). We considered partial genomes if the lack of sequence was within 50 nt of the genome ends. We used RSVsurver (https://rsvsurver.bii.a-star.edu.sg) to identify and remove genomes with nucleotide insertions or deletions causing a frameshift in any open reading frame. After alignment trimming, we detected identical sequences and removed redundancy using BBmap (https://jgi.doe.gov/data-and-tools/software-tools/bbtools), resulting in the final set of 1,538 HRSV-A and 1,387 HRSV-B genomes (Appendix 1 Figure 1).
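The inclusion and exclusion rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the `GenomeRecord` type, its field names, and the function names are hypothetical, and exact-match deduplication on raw sequence strings stands in for the BBmap step, which was run on the trimmed alignment. The sketch assumes unaligned sequences, so gap characters would be counted as ambiguities.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

UNAMBIGUOUS_BASES = set("ACGT")


@dataclass(frozen=True)
class GenomeRecord:
    """Hypothetical container for one downloaded genome and its metadata."""
    accession: str
    sequence: str
    host: str
    year: Optional[int]
    country: Optional[str]


def meets_inclusion_criteria(rec: GenomeRecord, min_length: int = 14_000) -> bool:
    """Length >14,000 nt, human host, and known collection year and country."""
    return (
        len(rec.sequence) > min_length
        and rec.host.lower() == "human"
        and rec.year is not None
        and bool(rec.country)
    )


def has_ambiguities(rec: GenomeRecord) -> bool:
    """True if any base is outside A, C, G, T (e.g., IUPAC codes such as N or R)."""
    return any(base not in UNAMBIGUOUS_BASES for base in rec.sequence.upper())


def curate(records: List[GenomeRecord]) -> Tuple[List[GenomeRecord], List[GenomeRecord]]:
    """Return (lineage_definition_set, epidemiology_only_set).

    Genomes with ambiguities are kept for epidemiologic analysis only;
    exact duplicate sequences are dropped (first occurrence wins).
    """
    seen = set()
    definition, epidemiology_only = [], []
    for rec in records:
        if not meets_inclusion_criteria(rec):
            continue
        key = rec.sequence.upper()
        if key in seen:
            continue
        seen.add(key)
        if has_ambiguities(rec):
            epidemiology_only.append(rec)
        else:
            definition.append(rec)
    return definition, epidemiology_only
```

Keeping the ambiguity check separate from the inclusion check mirrors the text: ambiguous genomes are not discarded, only routed away from lineage definition.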

Baseline Agreements on the HRSV Classification Definition

Our proposed classification establishes HRSV lineages for viruses below subgroup level. Studies have shown that HRSV phylogenetic trees constructed with complete genomes exhibit superior resolution (17–19). Therefore, we defined a classification system based on maximum-likelihood phylogenetic trees inferred from complete HRSV genomes. The maximum-likelihood algorithm formulates hypotheses about the evolutionary relationships among sequences; the IQ-TREE implementation handles large datasets, which makes it particularly well suited to inferring HRSV genomic phylogeny, including sequences collected >50 years ago (22). We defined complete HRSV genomes as the nucleotide sequences spanning from the first codon of the first gene (NS1) to the last codon of the last gene (L). We considered genomes almost complete if the sequence information gaps were within a 50-nt window at the genome ends. To define lineages, we only used genomes without nucleotide ambiguities (in accordance with the IUPAC code for nucleotide degeneracy).

Genomic Dataset Used for Lineages Definition

Figure 2. The global HRSV genomics surveillance landscape. HRSV genomes from National Center for Biotechnology Information Virus and GISAID (https://www.gisaid.org) databases through March 11, 2023, that met inclusion criteria used for classification are shown by year of sample collection and subgroup (A) and by country of origin (B). HRSV, human respiratory syncytial virus.

Applying the established baseline agreements, we gathered 1,538 HRSV-A and 1,387 HRSV-B high-quality genomes from public databases. The dataset revealed limited global HRSV genomic surveillance (Figure 2, panel A; Appendix 1 Figure 2). Since 2008, the number of genomes and representation of countries improved; a surge occurred after 2021, probably driven by expansion of viral genomics since the SARS-CoV-2 pandemic and the approval of the HRSV prophylactic treatments (Figure 2, panel A; Appendix 1 Figure 2). Considering delays in genome deposition in public databases, the number of genomes from 2022 may be higher than the number used in this study. Regarding geographic representation, 9 countries (Australia, United Kingdom, New Zealand, United States, Argentina, Kenya, Morocco, Netherlands, and Brazil) submitted >100 genomes; only the United Kingdom achieved uninterrupted surveillance since 2008, but Australia deposited the most genomes globally (Figure 2, panel B).

Accurate Root Placement in HRSV Phylogenetic Trees

We reconstructed maximum-likelihood phylogenetic trees for the HRSV-A and HRSV-B datasets. We used 2 approaches to root the trees: the use of an outgroup, a conventional method for inferring the tree root using sequences known to be evolutionarily distant; and phylodynamic analysis, integrating temporal and phylogenetic patterns in virus evolution (Appendix 1). Both approaches consistently identified the same root for each subgroup cluster (Appendix 1 Figure 3). Phylodynamic analysis also identified 58 outlier sequences for HRSV-A and 2 for HRSV-B that were excluded from lineage designation. The final dataset considered for lineage designation comprised 1,480 HRSV-A and 1,385 HRSV-B genomes (Appendix 2 Table).
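One common way to screen a dated phylogeny for temporal outliers is a root-to-tip regression: genetic divergence from the root is regressed on sampling date, and tips with unusually large residuals are flagged. The sketch below illustrates that general idea only. The study's own phylodynamic procedure is described in Appendix 1 and may differ; the function name, the input layout, and the 3-standard-deviation cutoff are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def flag_temporal_outliers(
    dates: Sequence[float],
    distances: Sequence[float],
    max_sd: float = 3.0,  # assumed cutoff, not taken from the study
) -> List[int]:
    """Return indices of tips whose root-to-tip distance deviates strongly
    from an ordinary least-squares fit of distance on sampling date."""
    if len(dates) != len(distances) or len(dates) < 3:
        raise ValueError("need at least 3 paired observations")

    mean_date, mean_dist = mean(dates), mean(distances)
    sxx = sum((x - mean_date) ** 2 for x in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")

    # Least-squares slope (substitutions/site/year) and intercept.
    slope = sum(
        (x - mean_date) * (y - mean_dist) for x, y in zip(dates, distances)
    ) / sxx
    intercept = mean_dist - slope * mean_date

    residuals = [y - (slope * x + intercept) for x, y in zip(dates, distances)]
    spread = pstdev(residuals)
    if spread == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) > max_sd * spread]
```

Tips flagged this way would typically be inspected for collection-date errors or sequencing problems before being excluded.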

HRSV Lineage Definition

We defined HRSV lineage as a statistically supported monophyletic cluster comprising >10 sequences and characterized by >5 aa substitutions, compared to the parental lineage. The lineage-defining amino acids, present in >90% of the sequences within the clade, may be found in any of the viral proteins.

Phylogenetic classifications vary among viral species; each aims to define clusters reflecting the heterogeneity of the viral population, considering each virus's unique evolutionary characteristics and using arbitrary thresholds for long-term applicability (27–29). Inherent bias exists in any classification system because of the availability and spatiotemporal representation of sequences. Therefore, our HRSV lineage definition did not include criteria requiring sequences from different outbreaks or countries, to enable early detection of novel lineages. However, we propose establishing a threshold of >10 genomes for defining a lineage to monitor HRSV strains circulating within communities.

We observed that the presence of distinctive signature amino acids shared by sequences of a phylogenetic clade in comparison to the parental lineage is a simple method to identify a new lineage. Distance-based methods (i.e., average nucleotide genetic distances, average patristic distances, or patristic distances between nodes) need phylogenies with complete datasets to define new categories and become complex as the number of available sequences rapidly increases (16–19). In our proposal, we initially screened different amino acid thresholds in an automated manner, ranging from 1–10 lineage-defining amino acids (Appendix 1). The number of small lineages decreased as the number of lineage-defining amino acids increased, and 5 amino acids resulted in an intermediate complexity of lineages defined for both HRSV subgroups. Furthermore, we proposed that the lineage-defining amino acids should be conserved in >90% of the genomes within a clade, considering the potential reversion in some of the genomes within highly mutated hotspot sites. We acknowledged that other thresholds for the number of genomes or amino acids could be useful, but we emphasized that the key to establishing a global consensus is clear operational guidelines and a robust classification, 2 aspects that our proposal fulfills.
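The lineage criteria can be expressed as a short check over aligned protein sequences. The following Python sketch is illustrative only and is not the consortium's tooling: the function names and the input layout (one concatenated, aligned amino acid string per genome) are hypothetical. The comparisons are treated as inclusive, which is an assumption, because the text prints the thresholds as >10, >5, and >90% while also describing a 5-amino-acid threshold.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def lineage_defining_substitutions(
    clade_proteins: Sequence[str],
    parent_protein: str,
    min_prevalence: float = 0.90,
) -> List[Tuple[int, str, str]]:
    """Return (1-based position, parent residue, clade residue) for each site
    where the dominant clade residue differs from the parent and is present
    in at least `min_prevalence` of the clade sequences."""
    if any(len(seq) != len(parent_protein) for seq in clade_proteins):
        raise ValueError("all sequences must be aligned to the parent")

    n = len(clade_proteins)
    substitutions = []
    for i, parent_residue in enumerate(parent_protein):
        residue, count = Counter(seq[i] for seq in clade_proteins).most_common(1)[0]
        if residue != parent_residue and count / n >= min_prevalence:
            substitutions.append((i + 1, parent_residue, residue))
    return substitutions


def qualifies_as_lineage(
    clade_proteins: Sequence[str],
    parent_protein: str,
    supported_monophyletic: bool,
    min_sequences: int = 10,      # assumed inclusive
    min_substitutions: int = 5,   # assumed inclusive
    min_prevalence: float = 0.90,
) -> bool:
    """Apply the three criteria: supported monophyly, clade size, and number
    of lineage-defining amino acid substitutions versus the parent."""
    if not supported_monophyletic or len(clade_proteins) < min_sequences:
        return False
    found = lineage_defining_substitutions(
        clade_proteins, parent_protein, min_prevalence
    )
    return len(found) >= min_substitutions
```

Calling `qualifies_as_lineage` in a loop with `min_substitutions` set from 1 to 10 would reproduce the kind of threshold screening described above, under the same assumptions.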

HRSV Lineage Nomenclature

Figure 3. Human respiratory syncytial virus A lineage classification and seasonality. A) HRSV-A maximum-likelihood phylogenetic tree (1,480 sequences), colored by lineage classification. Black star indicates A.D lineage, defined by the 72-nt duplication in the G gene. Scale bar indicates substitutions per site. B) Simplified scheme of the lineage designation to highlight the presence of nested lineages. The amino acid changes in the F glycoprotein are listed next to lineage name and colored according to their location in the fusion protein.

Figure 4. Human respiratory syncytial virus B lineage classification and seasonality. A) HRSV-B maximum-likelihood phylogenetic tree (1,385 sequences), colored according to lineage classification. Black star indicates B.D lineage, defined by the 60-nt duplication in the G gene. Scale bar indicates substitutions per site. B) Simplified scheme of the lineage designation to highlight the presence of nested lineages. The amino acid changes in the F glycoprotein are listed next to lineage name and colored according to their location in the fusion protein.

We defined the lineage nomenclature by integrating the HRSV subgroup letter and ascending ordinal numbers, separated by dots to represent nested lineages (Figure 3, panels A, B; Figure 4, panels A, B). Furthermore, we assigned a distinct nomenclature to the 72-nt (24-aa) G-gene duplication within HRSV-A and 60-nt (20-aa) G-gene duplication within HRSV-B. Those genetic events are epidemiologically relevant, because only viruses with G-gene duplication have been detected since 2017 (30–33). To track those viruses, we used the alias D, specifically A.D (historically, ON1 genotype) for HRSV-A and B.D (historically, BA genotype) for HRSV-B, with nested lineages designated by increasing ordinal numbers. In summary, letters A and B indicate the HRSV subgroup at the beginning of the lineage name, C is unused, and D serves as an alias for the 72-nt and 60-nt duplications within the G gene. In addition, aliases starting from E are limited to 3 numerical levels of nested lineages, preventing indefinite accumulation of numbers. For example, B.D.4.1.1 lineage has descendant lineages named B.D.E.1–B.D.E.4 instead of B.D.4.1.1.1–B.D.4.1.1.4, where E represents 4.1.1 (Figure 4, panels A, B). The nomenclature is based on the tree topology, reflecting the order of the nodes from the root to the tips, but it is unrelated to the sequence collection date or date of the most recent common ancestor of the lineage.
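Because aliases shorten names, software that compares lineages needs to expand them before testing ancestry. The Python sketch below shows one way to do that. The alias table contains only the single mapping given in the text (E standing for 4.1.1 under B.D); the table layout and function names are hypothetical, and the maintained alias list would come from the designation repositories.

```python
from typing import Dict

# Only alias stated in the text: B.D.E is shorthand for B.D.4.1.1.
# A real table would be loaded from the lineage-designation repositories.
ALIASES: Dict[str, str] = {"B.D.E": "B.D.4.1.1"}


def expand_lineage(name: str, aliases: Dict[str, str] = ALIASES) -> str:
    """Replace aliased prefixes until the name is fully expanded."""
    parts = name.split(".")
    replaced = True
    while replaced:
        replaced = False
        # Try the longest prefix first so nested aliases resolve correctly.
        for end in range(len(parts), 0, -1):
            prefix = ".".join(parts[:end])
            if prefix in aliases:
                parts = aliases[prefix].split(".") + parts[end:]
                replaced = True
                break
    return ".".join(parts)


def is_descendant(child: str, ancestor: str, aliases: Dict[str, str] = ALIASES) -> bool:
    """True if `child` is nested strictly below `ancestor` in the hierarchy."""
    child_parts = expand_lineage(child, aliases).split(".")
    ancestor_parts = expand_lineage(ancestor, aliases).split(".")
    return (
        len(child_parts) > len(ancestor_parts)
        and child_parts[: len(ancestor_parts)] == ancestor_parts
    )
```

For example, `expand_lineage("B.D.E.1")` yields `B.D.4.1.1.1`, so `is_descendant("B.D.E.1", "B.D.4")` is true even though the short name does not contain the digit 4.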

To remain functional, a nomenclature system requires periodic updates as new lineages emerge. Therefore, we have established 2 open repositories on GitHub containing definitions of each lineage, signature mutations, and representative sequences. The repositories are available at https://github.com/HRSV-lineages/lineage-designation-A and https://github.com/HRSV-lineages/lineage-designation-B; they are intended to provide up-to-date definitions and serve as a platform for discussion and designation of novel lineages.

Lineages within the HRSV-A and HRSV-B Rooted Trees

We reconstructed ancestral sequences at the root of the phylogenetic trees. Although the sequences are not biologically real, they served as surrogate parental lineages during initial classification. Identifying monophyletic clusters with >10 sequences and >5 aa changes compared with the reconstructed root sequence, we defined 3 HRSV-A lineages (A.1–A.3) and 4 HRSV-B lineages (B.1–B.4). We were unable to classify 2 sequences, EPI-ISL-15771600_USA_1956 (GISAID) and MG642074_USA_1980 (GenBank), perhaps because they belong to underrepresented extinct lineages.

We further analyzed the first lineages in an iterative manner to identify nested lineages; as a result, we identified a total of 24 lineages within HRSV-A and 16 within HRSV-B (Figure 3; Figure 4). Close to the root of the HRSV-B tree, extinct lineages characterized by 5 distinct amino acids (B.1, B.3, B.4) were underrepresented. Despite the low number of sequences, we included them as lineages to trace evolutionary branches that gave rise to currently circulating lineages. In addition, A.D.2 is slightly below the sequence threshold; nonetheless, we kept the lineage category to emphasize the common ancestor of A.D.2.1 and A.D.2.2.

We scrutinized the presence and absence of the duplication in the G gene across each tree. Although patterns were mostly as expected for a single historical duplication event, some genomes within the duplication-carrying clade lacked the duplication. Those sequences were dispersed across the phylogenetic tree rather than forming the monophyletic cluster expected from a true loss event, which suggests the virus did not lose the nucleotide duplication (Appendix 1 Figure 4). Instead, certain short-read next-generation sequencing technologies produce reads of a length similar to that of the duplicated region, which potentially masked the duplication when consensus genomes were assembled against reference sequences that do not possess it. Therefore, we recommend using quality-filtered reads of length >150 nt when working with such data to avoid this problem.

Lineage-defining amino acids were present in all HRSV proteins, primarily identified within the G protein (Tables 1, 2). Also, the lineage-defining amino acids at polymerase L protein were noteworthy, contributing to the distinction of 21 of 24 HRSV-A lineages and 15 of 16 HRSV-B lineages (Tables 1, 2). Of interest, the F protein contributed to define 14 lineages in HRSV-A and 13 in HRSV-B (Figure 3, panel B; Figure 4, panel B). The G and F surface glycoproteins are likely under selection pressure from antibody-mediated immunity and exhibit a robust phylogenetic signal (18,31). Whereas the G protein displays substantial nucleotide and amino acid sequence plasticity, the F protein experiences strong negative selection, likely attributed to functional or structural constraints (34). For instance, the fusion peptide is the only region in F without lineage-defining amino acids (Figure 3, panel B; Figure 4, panel B). Although the low diversity of the F protein is promising for HRSV interventions, monitoring the F protein during global implementation is essential to estimate the antigenic impact of amino acid substitutions.

Using G and F Sequences with the HRSV Lineage Classification System

The main challenge for global expansion of HRSV genomics is the absence of a cost-effective, globally standardized and validated methodology for sequencing, in contrast to SARS-CoV-2 or influenza virus (35,36). In addition, limited funding and infrastructure cause some laboratories to prefer sequencing the G gene only (37–39). Although we highly recommend using complete genomes for HRSV lineage assignment to ensure the maximum accuracy of the classification and monitor the amino acid changes in all viral proteins, partial genomes covering the G and F genes can be used because overall they reproduce the topology of the HRSV tree (17,18). We do not recommend the use of smaller G gene regions such as the second hypervariable region (250-nt length at the 3′ gene end) (Figure 1) that was used historically for molecular epidemiology because previous reports showed a decreased phylogenetic signal (17). The use of G, F, or both genes for lineage classification should rely on phylogenetic associations with reference sequences. Of note, using only G and F genes is inadequate for defining novel lineages because of the inability to detect lineage-defining amino acids across all viral proteins. Our analysis showed minimal misclassification (1.2%) in HRSV-A and none in HRSV-B when using only the G gene (Appendix 1 Figure 5). However, the G ectodomain alone resulted in an 18.86% misclassification rate for HRSV-A and none for HRSV-B. The F gene alone had misclassification rates of 38.18% for HRSV-A and 1.23% for HRSV-B because of polytomies affecting lineage assignments within A.D.1 and A.D.5. Combining G and F gene fragments reduced misclassification to 0.07% for HRSV-A and none for HRSV-B, indicating that this approach provides optimal resolution for both subgroups (Appendix 1 Figure 5).
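A misclassification rate of the kind reported above can be computed by comparing, genome by genome, the lineage assigned from a partial region with the lineage assigned from the complete genome. The sketch below assumes that simple definition; how the study computed its percentages is described in Appendix 1 and may differ. The function name and the dictionary layout (accession mapped to lineage name) are hypothetical.

```python
from typing import Dict


def misclassification_rate(
    whole_genome_calls: Dict[str, str],
    partial_region_calls: Dict[str, str],
) -> float:
    """Percentage of shared accessions whose partial-region lineage call
    differs from the whole-genome call (treated here as the reference)."""
    shared = whole_genome_calls.keys() & partial_region_calls.keys()
    if not shared:
        raise ValueError("no accessions in common")
    mismatches = sum(
        1 for acc in shared if whole_genome_calls[acc] != partial_region_calls[acc]
    )
    return 100.0 * mismatches / len(shared)
```

Running the same comparison separately for G-only, F-only, and combined G and F assignments would give one rate per region, which is how the figures above are organized.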

Prospective HRSV Lineage Assignment and Definition

Assigning sequences to the existing lineages can be automated using online tools such as NextClade (https://clades.nextstrain.org) (40), ReSVidex (https://cacciabue.shinyapps.io/resvidex_wg), INSaFLU (https://insaflu.insa.pt) (41), or UShER (https://usher.bio) (42). However, to define a novel lineage, we encourage users to follow our guidelines (Appendix 1), available on GitHub (https://github.com/orgs/rsv-lineages/repositories). We anticipate new lineages of HRSV-A/B will continue to emerge, and we envision updating our proposed nomenclature to incorporate new lineages. We encourage reporting of new HRSV lineages at the RGCC GitHub page as an issue within the corresponding repository for HRSV-A/B. The RGCC study group will evaluate the newly proposed lineage and update reference alignments if confirmed.

Importantly, assigning the lineage of a query sequence does not require the use of complete genomes or the absence of nucleotide ambiguities; rather, it requires a supported association within a phylogenetic clade. However, defining a new lineage requires the use of complete genomes without ambiguities, because amino acid characterization of all viral proteins is essential.

Molecular Epidemiology of HRSV with Proposed Classification

Figure 5. Temporal distribution of HRSV-A and HRSV-B lineages. A total of 2,744 HRSV-A genomes and 2,443 HRSV-B genomes available in public databases through March 2023 were included. HRSV, human respiratory syncytial virus.

We described the HRSV molecular epidemiology including all available genomes, even those previously discarded during the dataset curation. We analyzed the seasonality of lineages using a dataset comprising 2,277 HRSV-A and 2,058 HRSV-B genomes, revealing notable co-circulation and lineage replacement over time (Figure 5). In HRSV-A, A.1 and A.2 lineages are extinct: the last detected sequences of A.1 were collected in 1995 and of A.2 in 2015. Since 2011, A.D and nested lineages have continued to circulate; A.D.2.2 and A.D.4 were detected in 2013, indicating rapid divergence of the HRSV-A viruses with the 72-nt duplication in the G gene. In HRSV-B, lineages B.1, B.2, B.3, and B.4 exhibited strong lineage replacement (Figure 5). Although viruses with the 60-nt duplication in the G gene (B.D lineage) were detected in 1999, complete genomes became available in 2005 (8). By 2009, only B.D and nested lineages were detected, and since 2017, only B.D.4 and nested lineages have been observed.

HRSV lineages may have been underrepresented before the COVID-19 pandemic because of limited genomic surveillance. However, our classification system allows for updates if prepandemic genomes meeting lineage criteria are shared. Some lineages, such as A.D.3.1, A.D.5.2, and A.D.5.3 in HRSV-A and B.D.E.1 and B.D.E.3 in HRSV-B, appear to be exclusive to the postpandemic period, although most of their lineage-defining amino acids were present in parental prepandemic strains. For instance, A.D.5.2 was recognized as a distinct lineage with the emergence of the C26Y substitution in M2–2 whereas other signature amino acids were present in a 2019 parental lineage genome (GenBank accession no. MZ515825). Detection of postpandemic lineages does not contradict studies reporting no new postpandemic genotypes because those studies relied on earlier classification systems (43–46). Whether these new lineages circulated before the pandemic can be established only if additional prepandemic genomes are deposited.

Some of the lineages were detected in specific countries (Appendix 1 Figure 6). For example, A.D.1 descendant lineages, A.D.5.3, and most B.D.E.4 cases were identified in Australia or New Zealand. Contemporary lineages such as B.D.4.1.1 and its descendants B.D.E.1 and B.D.E.3 predominantly consisted of sequences from the United Kingdom. Global genomic surveillance bias presents a major confounding factor in lineage geodetection; for instance, most of the earliest lineages were detected in the United States, the principal contributor of HRSV genomes until 2007 (Appendix 1 Figures 2, 6).

Consensus classification of HRSV below the subgroup level has been a challenge for multiple decades. Collaboratively, the HRSV molecular evolution research community, along with experts in the evolution of other respiratory viruses, has worked toward establishing a unified global classification system through the HRSV Genotyping Consensus Consortium (RGCC) initiative. Our proposal categorizes HRSV-A/B sequences into lineages based on phylogenetic associations and amino acid markers, relying on complete genomes. Partial or low-quality genomes can be assigned to the existing lineages, emphasizing the robustness of this system. We developed standard guidelines for lineage definition and assignment and created online resources for updates, ensuring long-term utility. Defining a viral category below species through a phylogenetic-based classification is challenging; the system must exhibit reproducibility, balance complexity, and be updatable to capture the level of heterogeneity useful for viral surveillance. Our proposal addresses those requirements comprehensively.

HRSV is not an emerging virus; it generates annual outbreaks with co-circulation and replacement in the prevalence of its antigenic subgroups. Although some RSV genomes were collected from clinical samples >50 years ago, the largest increase in the number of genomes has occurred since 2021. A limitation of our definition is the uncertainty of the antigenic effect of individual amino acid substitutions on lineages. Hence, whole-genome surveillance together with the study of lineage-phenotype association are essential, as observed in genetic and antigenic characterization in influenza to estimate the effectiveness of immunization (47). In 2023, recombinant F protein vaccines were approved; as their implementation progresses, we will learn how the vaccines affect viral evolution. We expect our unification proposal for the phylogenetic classification of HRSV to support spatiotemporal comparative lineage surveillance and detection of emerging lineages. In addition, we anticipate studies of association between lineages and the severity of HRSV disease, as well as associations of particular lineages with patients’ demographic characteristics.
