SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes

Summary Despite its overwhelming clinical importance, the SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology. Here, we use comparative genomics to provide a high-confidence protein-coding gene set, characterize protein-level and nucleotide-level evolutionary constraint, and prioritize functional mutations from the ongoing COVID-19 pandemic. We select 44 complete Sarbecovirus genomes at evolutionary distances ideally-suited for protein-coding and non-coding element identification, create whole-genome alignments, and quantify protein-coding evolutionary signatures and overlapping constraint. We find strong protein-coding signatures for all named genes and for 3a, 6, 7a, 7b, 8, 9b, and also ORF3c, a novel alternate-frame gene. By contrast, ORF10, and overlapping-ORFs 9c, 3b, and 3d lack protein-coding signatures or convincing experimental evidence and are not protein-coding. Furthermore, we show no other protein-coding genes remain to be discovered. Cross-strain and within-strain evolutionary pressures largely agree at the gene, amino-acid, and nucleotide levels, with some notable exceptions, including fewer-than-expected mutations in nsp3 and Spike subunit S1, and more-than-expected mutations in Nucleocapsid. The latter also shows a cluster of amino-acid-changing variants in otherwise-conserved residues in a predicted B-cell epitope, which may indicate positive selection for immune avoidance. Several Spike-protein mutations, including D614G, which has been associated with increased transmission, disrupt otherwise-perfectly-conserved amino acids, and could be novel adaptations to human hosts. The resulting high-confidence gene set and evolutionary-history annotations provide valuable resources and insights on COVID-19 biology, mutations, and evolution.


Supplementary Figures
Supplementary Figure S1. Alignment of nsp11 and frameshift site.
Alignment of nonstructural protein nsp11 and subsequent 5 codons in 44 Sarbecoviruses (top) and 52 coronaviridae (bottom, green dots for strains included in both). Programmed frameshift slippery site (red rectangle) is perfectly conserved in all genomes. Pol shares the 5' nine codons of nsp11 but then continues 3' of the slippery site in a different reading frame. The four codons 3' of the slippery site are perfectly conserved in Sarbecoviruses, consistent with a dual coding region, but nsp11 is only 13 codons long and its stop codon is poorly conserved in coronaviridae (cyan, magenta, and yellow stop codons), suggesting that it does not produce a functional protein.  NC_019843_Middle_East_respiratory_syndrome_CoV CAA TCT AAA GAT TCC AAT TTT TTA AAC GAG TCC GGG GTT CTA TTGTAA ATG CCC GAA TAG  NC_038294_BetaCoV_England_1 CAA TCT AAA GAT TCC AAT TTT TTA AAC GAG TCC GGG GTT CTA TTGTAA ATG CCC GAA TAG  NC_034440_Bat_CoV_PREDICT_PDF_2180 CAG TCC AAG GAT TCT AAT TTT TTA AAC GAG TCC GGG GTT CTA TTGTAA ATG CCC GAA TAG  NC_009019_Bat_CoV_HKU4_1 CAA TCC AAA GAT ACT AAT TTT TTA AAC GAG TCC GGG GTT CTA GTGTAA ATG CCC GAC TAG  NC_009020_Bat_CoV_HKU5_1 CAG TCC AAA GAT TCC AAC TTT TTA AAC GAG TCC GGG GTT CTA TTGTAA ATG CCC GAA TAG  NC_039207_BetaCoV_Erinaceus_VMC_DEU_174_GER_2012 CAT TCC AAA GAC ACT AAT TTT TTA AAC GAG TCA GGG GTT CTA TTGTAG ATG CCC GAA TAG  NC_001846_Mouse_hepatitis_MHV_A59_C12_mutant CAG TCA AAA GAC ACG AAC TTT TTA AAC GGG TTC GGG GTA CAA GTGTAA ATG CCC GTC TTG  NC_012936_Rat_CoV_Parker CAG TCA AAA GAC ACG AAC TTT TTA AAC GGG TTC GGG GTA CAA GTGTAA ATG CCC GTC TTG  NC_003045_Bovine_CoV CAG TCA AAA GAT ACT AAT TTT TTA AAC GGG TTC GGG GTA CGA GTGTAG ATG CCC GTC TCG  NC_006213_Human_CoV_OC43_ATCC_VR_759 CAA TCA AAA GAT ACT AAT TTT TTA AAC GGG TTC GGG GTA CGA GTGTAG ATG CCC GTC TCG  NC_017083_Rabbit_CoV_HKU14 CAG TCA AAA GAC ACT AAT TTT TTA AAC GGG TTC GGG GTG CGA GTGTAG ATG CCC GTC TCG  NC_026011_BetaCoV_HKU24_HKU24_R05005I CAG TCA AAG GAT ACG AAC TTT TTA AAC GGG TTC GGG GTG  a. Phylogenetic tree computed from ORF8 alignment differs from genome-wide tree in Extended Data Fig. 2, indicating there has been recombination. b. Alignment of ORF8 with strains ordered according to the ORF8 tree shows radical amino acid substitutions (red) disrupting almost every residue. Although the apparent high rate of nucleotide-level evolution in ORF8 when estimated using the whole genome tree (3.9 substitutions per site, 6.2 per four-fold degenerate site) is partly an artifact of estimation along a tree topology that does not match the history of ORF8, it remains very high (2.1 and 3.7 respectively, Supplementary Table S2) when estimated using the ORF8-specific phylogeny, substantially higher than other ORFs (e.g. 1.1 and 2.8 respectively for ORF1ab when restricting to the 42 strains aligned in ORF8), and thus is not entirely an artifact of recombination.  KF367457_Bat_SARS_like_CoV_WIV1 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACT TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT TAC  KY417147_Bat_SARS_like_CoV_Rs4237 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACT TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT TAC  KY417151_Bat_SARS_like_CoV_Rs7327 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA TCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT Figure S5. SNV-depleted regions.
UCSC Genome Browser images of regions in nsp3 (a) and S2 (b). Most amino acids in these regions are conserved (red rectangles in Conserved AAs track), but the only missense variants (light red rectangles in SNVs track) disrupt non-conserved amino acids (variants disrupting conserved amino acids would be bright red if present). The lack of missense variants in such a large set of conserved amino acid residues could indicate that constraint in the Sarbecovirus clade has continued particularly strongly in the SARS-CoV-2 population. However, although these are the most depleted regions in the genome for missense variants in conserved amino acid residues, neither depletion is statistically significant, even without any correction for multiple region lengths searched (nominal p=0.072 and p=0.093, respectively).

Supplementary Text S2 Evidence that ORFs 3d and 3b are not protein-coding
ORF3b is a 22 amino acid ORF having coordinates 25814-25882, orthologous to the 5' end of SARS-CoV ORF3b, which is a 154 amino acid ORF whose various Sarbecovirus orthologs are truncated by numerous in-frame stop codons. Its start codon is conserved in all but one of our 44 Sarbecovirus strains, but its stop codon is only present in SARS-CoV-2 and its three closest relatives, and the ORF length is highly variable, so the SARS-CoV-2 form is not (Extended Data Fig. 6). Overexpression in a human cell line of the SARS-CoV-2 ORF was found to have anti-IFN-I activity 13 . However, the PhyloCSF score per codon of this truncated ORF is strongly negative (-18.0), it does not overlap any SCEs (Fig. 2b), and even among the four closely related strains sharing this stop codon all six substitutions are radical amino acid changes, providing no evidence that this amino acid sequence has been under purifying selection. There is no TRS in the 5' neighborhood of the ORF3b start codon, and in order for ORF3b to be translated by leaky scanning from the subgenomic RNA for ORF3a, the ribosome would have to bypass eight AUG codons, including several with moderate or strong Kozak context. Finally, ribosome profiling and transcription studies did not find translation of ORF3b or substantial transcription of a subgenomic RNA from which it could be translated 10,15-17 .
A different ORF, the 57 amino acid ORF with coordinates 25524-25697, has also been referred to as ORF3b 6,7 . We refer to it as ORF3d to avoid confusion with the 22 amino acid ORF. ORF3d has premature stop codons in all of the other strains of our Sarbecovirus alignment, indicating unambiguously that it does not encode a conserved protein (Extended Data Fig. 7), but some authors have suggested it is a de novo protein in SARS-CoV-2 based on a sequence analysis method 12 and a statistical test for unexpectedly long overlapping ORFs 11 . On the other hand, analysis of 2,784 SARS-CoV-2 sequences from the GISAID database (Elbe and Buckland-Merrett 2017) found that 17.6% of isolates contain a G to T mutation in position 25563, which introduces a stop codon into the reading frame of ORF3d 7 , militating against de novo functional translation in SARS-CoV-2. Nor is there convincing experimental evidence that ORF3d is translated. A ribosome profiling study found no evidence of initiation at the start codon of ORF3d in samples collected five hours post infection 10 . Footprints were found at the ORF3d start codon in a reanalysis of footprints from the same study collected 24 hours post infection in samples treated with harringtonine and LTM, which typically would indicate initiating ribosomes 11 ; however, the majority of footprints from the 24-hour sample are thought not to be generated by ribosome protection, possibly instead originating from protection by the N protein, and so are not an indication of translation 10 . A subset of ORF3d in the same reading frame, 33 amino acid 3a.iORF2, that begins at a downstream AUG and does not include the nonsense mutation, was one of several overlapping ORF candidates predicted to be translated using ribosome profiling 10 ; however, this ORF is not conserved in any other strain and it is unclear how it could be translated: translation by leaky scanning from the ORF3a subgenomic RNA would require the ribosome to bypass 4 earlier AUG codons, and there is no TRS near its start codon that might give rise to a subgenomic RNA specialized for this ORF.

Supplementary Text S3 Search for other novel protein-coding genes
As discussed in the main text, we computed PhyloCSF scores for all 67 hypothetical non-NCBI-annotated AUG-to-stop locally-maximal SARS-CoV-2 ORFs ≥25 codons long and found that the top scoring candidate, ORF3c, is protein-coding.
Continuing down the sorted list in order of decreasing PhyloCSF score per codon, we found the candidate ORF with the next best score (-2.74) is a 32-codon ORF (26183-26278) that overlaps the 3' end of ORF3a and the 5' end of E (Supplementary Fig.  S4a) and is orthologous to the 5' end of SARS-CoV ORF3b. Two strains have a frameshifting 1-base deletion within the ORF, and two others have premature stop codons. None of the substitutions are synonymous. There is high nucleotide-level constraint, but it continues on both sides of the ORF, suggesting it results from something other than translation of the ORF. Overall, this ORF does not show the evolutionary signature of a functional coding sequence. Next in the list is ORF9b which we have discussed elsewhere. Fourth is a 31-codon ORF (3207-3299) overlapping ORF1a, having PhyloCSF score -7.77 (Supplementary Fig. S4b). Most of this ORF consists of a 75-nt insertion that is only present in SARS-CoV-2, RaTG13, and CoVZC45, and the start and stop codons are missing in CoVZC45, so this is not a conserved coding sequence. Finally, the fifth-ranked candidate is ORF9c, which we have discussed elsewhere.
The relatively high scores of ORFs 9b and 9c among these 67 hypothetical ORFs are, in part, an artifact of the low density of substitutions throughout N, which they both overlap. This low density, which is found even in the parts of N that are not in ORFs 9b or 9c, decreases the number of substitutions available to PhyloCSF for distinguishing its coding and noncoding evolutionary models, which compresses the PhyloCSF score towards 0, resulting in a better rank among the negative scores. If we compensate for this by dividing by the average number of substitutions per site in the ORF, ORFs 9b and 9c, while still in the top half, move down to the 92nd and 83rd percentile among the 67 ORFs considered, whereas ORF3c remains the best scoring-candidate (Extended Data Fig. 8).
To search for additional novel protein-coding regions, we relaxed our criteria to include ORFs with at least 10 codons, allow near-cognate start codons, and allow ORFs contained within another ORF in the same frame. The most promising were two noncanonical ORFs in the 5'-UTR with slightly positive PhyloCSF score, 14-codon CUGinitiated NC_045512.2:92-133 and 31-codon AGG initiated NC_045512.2:158-250. However, neither are among the translated ORF candidates identified by ribosome profiling 10 , and the evolutionary evidence is not strong enough to consider it likely that such short non-canonical ORFs generate proteins. Because it has been conjectured that translation might occur on the large number of negative-strand genomic and subgenomic RNAs that are intermediates in viral gene expression and replication in positive-strand RNA viruses 56,57 , we also scored ORFs on the negative strand, but again found no convincing candidates. Supplementary Table S4 contains the complete list of ORFs, with scores and other pertinent information.
2019 (Wu et al.), thus requiring relatively fewer human-adaptive mutations compared to other genes whose biological functions would adapt to human hosts only later (noting however, that only a subset of mutations in the current pandemic are likely adaptive). • The frequent recombination observed in S1 suggests an alternative explanation.
It is possible that recombination events into our selected 44-strain phylogeny from more distant relatives could have increased the number of inter-strain differences in these genes. In that case, the paucity of mutations in S1 and nsp3 relative to their inter-strain differences might be due to inflation of the latter rather than deflation of the former. However, we note that the amount of distant recombination needed to account for the large discrepancy observed might be implausibly large. • Lastly, inter-strain differences reflect selective pressures that have acted over evolutionary time scales in which even mildly deleterious mutations are excluded, while within-strain differences reflect smaller evolutionary time scales over which only strongly deleterious mutations are excluded. Thus, the discrepancy observed could also result if the fraction of all possible deleterious amino acid changes of S1 and nsp3 that are strongly deleterious rather than mildly deleterious is sufficiently larger than that fraction for other proteins.

Supplementary Text S7 Enriched and depleted clusters of missense SNVs
Other than the Nucleocapsid region that we have already discussed, there are no regions in the genome in which variants disrupting conserved amino acid residues are significantly denser than would be expected by chance, given the total number of such variants in each gene. Nor are there any regions that are significantly depleted for such variants, which would have indicated regions in which constraint in the Sarbecovirus clade has continued particularly strongly in the SARS-CoV-2 population. The regions that are most depleted (though not statistically significantly) have coordinates 7400-7840 in nsp3 with no missense variants among 103 conserved amino acids and 24437-24748 in S2 with no missense variants among 99 conserved amino acids (p=0.072 and p=0.093, respectively, without any correction for multiple region lengths searched) (Supplementary Fig. S5).