Sarbecovirus comparative genomics

Despite its overwhelming clinical importance for understanding and mitigating the COVID-19 pandemic, the protein-coding gene content of the SARS-CoV-2 genome remains unresolved, with the function and even protein-coding status of many hypothetical proteins unknown and often conflicting among different annotations, thus hindering efforts for systematic dissection of its biology and the impact of recent mutations. Comparative genomics is a powerful approach for distinguishing protein-coding versus non-coding functional elements, based on their characteristic patterns of change, which we previously used to annotate protein-coding genes in human, fly, and other species. Here, we use comparative genomics to provide a high-confidence set of SARSCoV-2 protein-coding genes, to characterize their protein-level and nucleotide-level evolutionary constraint, and to interpret the functional implications for SARS-CoV-2 mutations acquired during the current pandemic. We select 44 complete Sarbecovirus genomes at evolutionary distances well-suited for protein-coding and non-coding element identification, create whole-genome alignments spanning all named and putative genes, and quantify their protein-coding evolutionary signatures using PhyloCSF and their overlapping constraint using FRESCo. We find strong protein-coding signatures for all named genes and for hypothetical ORFs 3a, 6, 7a, 7b, and 8, indicating protein-coding roles, and provide strong evidence of protein-coding status for a recently-proposed alternate-frame novel ORF within 3a. By contrast, ORF10 shows no protein-coding signatures but shows unusually-high nucleotide-level constraint, indicating it has important but non-coding functions, and ORF14 and SARS-CoV-1 ORF3b, which overlap other genes, lack evolutionary signatures expected for dual-coding regions, indicating they do not produce functional proteins. ORF9b has ambiguous protein-coding signatures, preventing us from resolving its protein-coding status. ORF8 shows extremely fast nucleotide-level evolution, lacks a known function, and was deactivated in SARS-CoV-1, but shows clear signatures indicating protein-coding function worthy of further investigation given its rapid evolution and potential role in replication. SARS-CoV-2 mutations are preferentially excluded from evolutionarily-constrained amino acid residues and synonymously-constrained nucleotides, indicating purifying constraint acting at both coding and non-coding levels. In contrast, we find a conserved region in the nucleocapsid that is enriched for recent mutations, which could indicate a selective signal, and find that several spike-protein mutations previously identified as candidates for increased transmission and several mutations in isolates found to generate higher viral load in-vitro disrupt otherwise-perfectly-conserved amino-acids, consistent with adaptations for human-to-human transmission. Introduction Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for the COVID-19 pandemic (Wu et al. 2020), is a betacoronavirus in the subgenus Sarbecovirus, which also includes SARSCoV-1 (also known as SARS-CoV), the strain responsible for the 2003 SARS outbreak. The large and complex positive-strand RNA genome of SARS-CoV-2 consists of approximately 30,000 nucleotides, and encodes approximately 30 known or hypothetical mature proteins (Fig. 1A, Fig. 3). Despite its extreme medical importance, its gene content remains surprisingly unresolved, with several hypothetical open reading frames (ORFs) whose function or even protein-coding status is unknown. More than two-thirds of the SARS-CoV-2 genome is spanned by a large open reading frame (ORF1ab), which includes an internal programmed translational frameshift triggered by a translation-slippery sequence UUUAAAC and a downstream RNA pseudoknot structure that is generally conserved among coronaviruses


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for the COVID-19 pandemic (Wu et al. 2020), is a betacoronavirus in the subgenus Sarbecovirus, which also includes SARS-CoV-1 (also known as SARS-CoV), the strain responsible for the 2003 SARS outbreak. The large and complex positive-strand RNA genome of SARS-CoV-2 consists of approximately 30,000 nucleotides, and encodes approximately 30 known or hypothetical mature proteins (Fig. 1A, Fig. 3). Despite its extreme medical importance, its gene content remains surprisingly unresolved, with several hypothetical open reading frames (ORFs) whose function or even protein-coding status is unknown.
More than two-thirds of the SARS-CoV-2 genome is spanned by a large open reading frame (ORF1ab), which includes an internal programmed translational frameshift triggered by a translation-slippery sequence UUUAAAC and a downstream RNA pseudoknot structure that is generally conserved among coronaviruses   TAC TGA GTT AAG TTT TCT CCT--TAG AGC CTC CTA AGG TTT TTT  Dog GAA ACG TAC TTA AAG AAG AAG CGT GTA TAC TGA GCT AAG TTC TCT CCA--TAT AAT CTC ATA AGG TTT TTT  Microbat GAG ACC TAC TTA AAA AAG AAG CGT GTG TAC TGA GCT AAG TTC TCT CCTTA TAT AAC CTC CTA TGG CTC TTT  Megabat GAG ACC TAC TTA AAA AAG AAG CGA GTG TAC TGA GCT AAG TTC TTT CCTTA TAT AAC CTC ATA AGG TTC TTT  Hedgehog GAA ACC TAC TTC AAA AAG AAG CGA GTA TAC TGA GTT CAG TTC TCT CCT--CCT AAC TTC CCA TGG TTC TCT  Elephant GAA ACC TAC TTA AAG AAG AAG CGA GTG TAC TGA GTG AAG CTC T--CCC--TCT AAC CTC ATG TGG TCC   We select 44 complete and closely-related coronavirus genomes at ideally-suited evolutionary distances (approximately that of mammalian species), generate whole-genome alignments spanning all known genes and hypothetical ORFs, and use them to evaluate protein-coding constraint in all reading frames, and nucleotide-level constraint in synonymous codon positions. We find that five hypothetical ORFs are not conserved protein-coding genes, namely ORFs 10 and 14, and SARS-CoV-1 ORFs 3b, 8a, and 8b, and we confirm protein-coding evolutionary signatures for other hypothetical ORFs (3a, 6, 7a, 7b, 8) and a recentlyproposed alternate-frame ORF within 3a; this includes ORF8 despite its yet-unknown function and extremelyrapid evolutionary rate. We also annotate 1394 synonymously-constrained codons within protein-coding regions, which are indicative of overlapping constrained elements that might include dual coding regions, binding sites for RNA-binding-proteins, and RNA structures known to help regulate coronavirus replication, transcription, and translation. We use these protein-level and codon-level annotations to classify 1800 singlenucleotide variants across 17,000 isolates from the current pandemic, yielding insights into mutations that are likely benign vs. those that disrupt evolutionarily-conserved protein-coding or non-coding functions. In particular, we find that several spike-protein variants recently-associated with increased transmission disrupt perfectly-conserved amino-acids, possibly representing novel adaptations to human hosts. These comparative genomics annotations provide a general resource for prioritizing functional variants and strains, for vaccine development and specialization, and for untangling the molecular biology of SARS-CoV-2.

Species selection and alignment of 44 Sarbecovirus genomes
We selected 44 complete Sarbecovirus genomes at an evolutionary distance well-suited for identifying proteincoding genes and non-coding selection within them, consisting of SARS-CoV-2, SARS-CoV-1, and 42 bat coronavirus genomes (Fig. 2, Supplemental Table S1). Betacoronavirus genomes outside the Sarbecovirus clade, such as MERS-CoV, are too different from SARS-CoV-2 to be usable for this purpose, and even the closest relative, Hibecovirus Bat Hp-betacoronavirus/Zhejiang2013, shows no detectable homology across ORFs 6, 7a, 7b, and 8. Conversely, among different isolates of the SARS-CoV-2, SARS-CoV-1, and some bat strains, evolutionary distances are too small for reliable evolutionary signatures.  Figure 2. Phylogenetic tree of 44 Sarbecovirus genomes. Left: Phylogenetic tree of a selection of Coronaviridae genomes, including the seven that infect humans (red asterisks). Right: Phylogenetic tree of the 44 Sarbecovirus genomes used in this study. Trees are based on whole-genome alignments and might be different from the history at particular loci, due to recombination.
We created a genome-wide alignment of these 44 genomes, spanning a total phylogenetic branch length of about 3 substitutions per 4-fold degenerate site, comparable to the 29-mammals and 12-flies alignments we previously used for protein-coding gene identification. This rate varies across the SARS-CoV-2 genome from 0.2 in ORF10, 1.2 in E, and 6.5 in S (Supplemental Table S2), though the latter may be inflated due to genomic segments whose histories do not match the whole-genome tree.

Scoring of protein-coding and synonymous constraint
To detect protein-coding evolutionary signatures and distinguish regions that evolved under protein-coding constraint, we previously developed PhyloCSF (Lin et al. 2011a), which compares codon substitutions and frequencies in alignments of closely-related genomes to coding and non-coding evolutionary models trained on whole genome data (Fig. 1B), and CodAlignView (I Jungreis, MF Lin, CS Chan, M Kellis 2016) to enable manual curation by visual exploration of the corresponding alignment for substitutions, stop codons, and insertions or deletions indicative of the protein-coding status of a region. These tools have been widely used to identify novel protein-coding regions in human (Lindblad-Toh et al. 2011;Mudge et al. 2019), fly (Lin et al. 2007), and yeast (Lin et al. 2011a), to discover stop-codon readthrough Lindblad-Toh et al. 2011;Jungreis et al. 2011;Lin et al. 2007;Loughran et al. 2014), and to distinguish protein-coding vs. noncoding genes in human (McCorkindale et al. 2019;Frankish et al. 2019).
We computed PhyloCSF protein-coding scores for every three-nucleotide interval, in all three reading frames of SARS-CoV-2, smoothed using a hidden Markov model, and created tracks for the UCSC Genome Browser quantifying protein-coding evolutionary signatures along the genome, as we previously did for the human genome (Mudge et al. 2019). We also computed an overall PhyloCSF score for each known protein and hypothetical ORF (Supplemental Table S2). We used CodAlignView to create visualizations of the alignment of each ORF, highlighting two signatures of mutations that are tolerated in protein-coding genes across evolutionary time: first, a preference for synonymous substitutions typical of third codon positions and conservative amino acid changes that preserve biophysical properties; second, avoidance of insertions and deletions that are not multiples of 3, as they would disrupt the reading frame of translation, whereas gaps that remove complete codons and preserve the reading frame are more tolerated. We have provided CodAlignView images (Supplemental Materials) and links for manual exploration (Supplemental Table S2) for each annotated ORF and mature protein.
We also previously developed FRESCo and other software tools for detecting overlapping nucleotide-level constraint within protein-coding regions (Lin et al. 2011b;Sealfon et al. 2015), evidenced by fewer synonymous substitutions, and reflective of overlapping functional elements. Such elements can include dual-coding regions that encode multiple proteins in different reading frames, which are common in viruses with compact genomes (Firth 2014) but also found in other species including human (Lin et al. 2011b;Khan et al. 2020); RNA structures encoded through complementary nucleotide stretches, which are known to play important roles in subgenomic RNA generation and other coronavirus functions; and binding sites for RNA-binding proteins, which can be recognized by virus-encoded proteins or host-encoded proteins, to regulate transcription, processing, and translation of viral mRNAs. FRESCo has been applied to diverse virus species (Sealfon et al. 2015) as well as humans (Khan et al. 2020).
We used FRESCo to calculate the rate of synonymous substitutions in each codon of our alignment, and to recognize synonymous constraint elements (SCEs) within each NCBI-annotated SARS-CoV-2 gene, based on significantly-decreased synonymous rate in 9-codon windows relative to the gene average (Fig. 3).

Comparative evidence of protein-coding constraint for nsp proteins and named genes
We found a clear PhyloCSF signal for nsp1-nsp10 and for nsp12-nsp16 (Supplemental Table S2), with a change in the indicated translation reading frame at the known programmed frameshift site (Fig. 3A). For the 13-codon nsp11 ORF, the first 9 codons are in the same reading frame as nsp12 (Pol), and the remaining 4 codons are perfectly conserved in Sarbecovirus (Supplemental Fig. S1A), but poorly conserved in p 6 betacoronaviruses beyond Sarbecovirus (Supplemental Fig. S1B), suggesting that the 13-amino-acid peptide is not performing a conserved function.
The named genes E, M, and N are well-conserved across the 44 Sarbecovirus genomes, with strong overall alignment and very strong PhyloCSF scores, as expected. The S protein shows an unusual evolutionary signal that indicates a history of extremely-rapid evolution, subject to frequent substitutions and recombinations across its phylogeny, resulting in near-zero nucleotide-level conservation scores as measured by phyloP (Pollard et al. 2010) and phastCons (Siepel et al. 2005) over its first half (S1), while the second half (S2) is well-conserved (Fig. 3a). However, for protein-coding constraint, both S1 and S2 show very strong PhyloCSF scores, indicating that despite its rapid evolution, S remains strongly selected to preserve a protein-coding function, and highlighting the power of PhyloCSF to recognize protein-coding constraint despite rapid nucleotide evolution.  shows the polyprotein at the 5' end of part A. (A) There is a strong PhyloCSF signal in the correct frame for each of the named genes 1ab (polyprotein, red), S (spike, green), E (envelope, blue), M (membrane, orange), and N (nucleocapsid, purple), confirming that they have been under protein-coding constraint in the Sarbecovirus clade. The signal changes frame as expected at the programmed frameshift site in 1ab. There is a strong PhyloCSF signal throughout S despite the lack of nucleotide conservation at the 5' end. (B) There is a clear PhyloCSF signal in the correct frame for unnamed ORFs 3a (dark purple), 7a (yellow), 7b (blue), and 8 (light purple), despite the complete lack of nucleotide conservation in 8. There is a clear signal in the 5' half and 3' quarter of 6 (cyan), but it is weaker in the third quarter of the protein, indicating that this portion has been less constrained. There is no signal for 10 (gray), indicating that it is not a conserved protein-coding region despite high nucleotide conservation. ORFs 9b (dashed orange) and 14 (dashed green) overlap the nucleocapsid phosphoprotein N in an alternate reading frame. If these were conserved coding regions, we would expect to find synonymous constraint through most of the ORF and, possibly, some PhyloCSF signal in the alternate frame, but there is no such PhyloCSF signal in either, and there are no synonymous constraint elements in ORF14. There are two small synonymous constraint elements for 9b, leaving its status as a functional ORF ambiguous. The SARS-CoV-1 annotations also include 3b (dotted red) overlapping 3a in another frame, but most of the ORF is not synonymously constrained and it contains numerous stop codons in other strains so it cannot be a conserved coding region. A frameshifting deletion in 8 occurred in SARS-CoV-1 during the SARS outbreak, creating fragments 8a and 8b (red oval) in some isolates, but there was insufficient evolutionary time for our methods to distinguish if the fragments were still protein-coding. (C) The polyprotein, 1ab, is processed into 16 mature peptides. The PhyloCSF signal shows that all are functional proteins except possibly nsp11 (red circle), which is only 13-amino-acids long and shares its first eight codons with Pol before the latter shifts to a different reading frame.
ORFs 3a, 6, 7a, 7b, and 8 are protein-coding, but ORF10 is a non-coding functional element. Among the six unnamed ORFs annotated by NCBI (Supplemental Table S2), we found clear positive PhyloCSF scores for 3a, 7a, 7b, and 8, indicating conserved protein-coding regions, functional at the aminoacid level (Fig. 3). For ORF6, the first half and last quarter shows strong PhyloCSF signal (Fig. 3B), indicating that it encodes a conserved, functional protein, despite a less-constrained intermediate portion, and an overall near-zero average score per codon (-0.3, Fig. 1C).
ORF8 shows near-zero nucleotide-level conservation scores as measured by phyloP and phasCons, and it has no well-established function, suggesting at first glance that it might be non-functional. However, PhyloCSF shows a positive protein-coding signal (average score 4.61 per codon), and long stretches of strong proteincoding conservation, indicating that it produces a functional protein despite its rapid nucleotide-level evolution. The apparent high rate of nucleotide-level evolution in ORF8 (3.9 substitutions per site, 6.2 per 4-fold degenerate site) is in part an artifact of its history of recombination events that result in a different tree from the rest of the genome (Supplemental Fig. S2), but even after computing rates in a tree representing the history of ORF8, its rate continues to be very high (2.1 and 3.7, respectively) compared to other ORFs (e.g. 1.1 and 2.8 respectively for ORF1ab when using the whole-genome tree excluding two strains that have no alignment in ORF8, to ensure an apples-to-apples comparison, Supplemental Table S2). wuhCor1 26,000 26,500 27,000 27,500 28,000 28,500 29,000 29,500 N 10 8 6 7a 7b 9b 14 p 10 synonymous), was not statistically significant (nominal p-value>0.18, even without the needed multiple hypothesis correction), only used five closely-related genomes, and excluded a sixth genome that contained a frameshifting indel that would provide strong evidence against protein-coding function if it is not a sequencing error (Supplemental Fig. S3B). Overall, the prior evidence is insufficient to argue for protein-coding function for ORF10, and thus we conclude that ORF10 is not protein-coding, given our strong comparative genomics evidence against protein-coding constraint.

ORF14 is not a conserved coding region, and ORF9b is ambiguous
We next investigated the coding potential of two additional hypothetical ORFs annotated by UniProt, ORF 9b (97 amino acids) and ORF14 (73 amino acids), which overlap the nucleocapsid phosphoprotein (N) in a different reading frame. In neither case is there a PhyloCSF signal in the alternate frame (Fig. 3B, Supplemental Table S2). While dual coding regions often contain segments having a PhyloCSF signal in the alternate frame, such as those in human POLG (Khan et al. 2020), the lack of such signals does not provide a definite negative answer because coding constraint in the main frame alters the pattern of substitutions in the alternate frame, which can depress the PhyloCSF score. Instead, we examined the rate of synonymous substitutions in the reading frame of the nucleocapsid protein, since coding in the alternate frame would be expected to impose overlapping constraint that suppresses synonymous substitutions in the main frame. We find no significant SCEs within ORF14 (Fig. 5). Furthermore, its start codon is lost in one strain, and most strains have a stop codon three codons before the ORF14 stop (Supplemental Fig. S4). Nor were the subgenomic RNA fragments needed to express ORF14 found in the above-mentioned direct-RNA sequencing experiments (Kim et al. 2020;Taiaroa et al.). We conclude that ORF14 does not encode a functional protein.
The evolutionary evidence for ORF9b is more ambiguous. On the one hand, we do not find significant synonymous constraint in most of the portion of the main frame overlapping 9b, and some segments have synonymous level well above the gene-wide average of the nucleocapsid protein (Fig. 5); on the other hand, this region does contain two small SCEs. We note that FRESCo calculates synonymous constraint relative to the gene-wide average, and N has fewer synonymous substitutions per 4-way synonymous site than most of the rest of the genome (Supplemental Table 2), making it more difficult for a constrained region to achieve significance. Both the start and stop codons of 9b are perfectly conserved among our 44 Sarbecovirus strains (Supplemental Fig. S5), but conservation of these codons could be due primarily to constraint on the amino acid sequence of the overlapping nucleocapsid protein rather than any constraint on ORF9b itself; indeed, there is only one hypothetical single nucleotide change to the start codon of 9b, ATG->ACG, that would preserve the amino acid sequence of N, and no such changes to its stop codon. There are no premature stop codons in any of the other strains, though again that provides only weak evidence that 9b is coding because 9b is short enough that this could occur by chance. Finally, the start codon of 9b has a strong Kozak context, with A in position -3 and G in position +4, which are believed to be the optimal nucleotides at these positions for ribosomal recognition of the start codon. In contrast, the start codon of the nucleocapsid phosphoprotein, which is only 10 nt 5' of the start codon of 9b, has a weaker Kozak context, with A in position -3 and U in position +4. This leaves open the possibility that the ribosome might initiate translation of 9b from the same subgenomic RNA as N via leaky scanning, making these two proteins in a fixed ratio. If 9b does encode a protein, the high synonymous rate in some overlapping segments of the main frame would indicate that the amino acid sequence encoded by the corresponding segments in 9b are poorly constrained. Proteomics experiments have detected the hypothetical protein product of ORF9b (Davidson et al. 2020;Bojkova et al. 2020), and there is experimental evidence that ORF9b in SARS-CoV-1 localizes to mitochondria and interferes with host cell antiviral response (Shi et al. 2014), so without clear evolutionary evidence one way or the other it seems likely that ORF9b does produce a functional protein, though one with a poorly-conserved amino acid sequence. Rate of synonymous substitutions in 9-codon windows within the nucleocapsid phosphoprotein (N), normalized to make the gene-wide average 1. ORFs 9b and 14 (bottom gray rectangles) are hypothetical protein-coding regions within N in a different reading frame. We expect dual coding regions to be synonymously constrained, but there are no significant synonymous constraint elements in ORF14, and only two small ones in ORF9b (red), making it unlikely that ORF14 is a true protein-coding region, and leaving the status of ORF9b ambiguous.

A novel alternate frame protein-coding ORF within ORF3a
We next searched for novel conserved protein-coding regions by scoring all 67 hypothetical AUG-to-stop SARS-CoV-2 ORFs of at least 25 codons that do not overlap an NCBI-annotated ORF in the same frame and that are not contained in a longer ORF in the same frame. None of these had a positive PhyloCSF score, but we investigated the top candidates with the least negative score based on conservation of the start and stop codons, absence of in-frame stop codons and frameshifting indels, and evidence of synonymous constraint in the overlapping coding region. The candidate with the highest PhyloCSF score per codon (-1.21) is a 41-codon ORF (positions 25457-25579), that overlaps ORF3a in an alternate frame near its 5' end ( Fig. 6). Although the score is negative, it is 2.57 standard deviations higher than the average over hypothetical non-coding ORFs (mean: -17.9, stdev: 6.5, p = 0.005 under normal approximation, Fig. 1C), but closer to the distribution of protein-coding ORFs (mean: 8.03, stdev: 5.55, deviation: -1.67 standard deviations). As this ORF overlaps a known coding gene in an alternate frame, constraint on the known amino acid sequence suppresses synonymous substitutions in the alternate frame, which lowers the PhyloCSF score, so we would expect a lower PhyloCSF score than for nonoverlapping protein-coding regions that are subject to the same level of protein-coding constraint. Moreover, the AUG start codon is perfectly conserved except in one strain that has the near-cognate GUG instead, and the stop codon is conserved but with a one-codon extension in SARS-CoV-2 and RaTG13. There are also no in-frame stop codons or indels. Strikingly, this alternate-frame ORF has many synonymous substitutions that are non-synonymous in ORF3a, indicating that this new ORF may be the primary constraint acting in this region, over the corresponding segment of ORF3a. Lastly, 40 of the 41 codons are covered by synonymous constraint elements, and this constraint ends nearly perfectly at the boundaries of the overlapping ORF (Fig.  6). Together, these lines of evidence allow us to conclude that this overlapping ORF encodes a conserved, functional protein.
Two previous studies proposed that this new ORF may be protein-coding on the basis of increased synonymous constraint on the overlapping region of ORF3a, initially across 6 very closely-related strains (Cagliani et al. 2020), and subsequently across a broad set of Sarbecovirus strains (Firth 2020), naming it ORF3h (for "Hypothetical") and ORF3a*, respectively. It was predicted to contain a transmembrane domain suggestive of a viroporin (Cagliani et al. 2020), and to be translated from the ORF3a subgenomic RNA via leaky scanning (Firth 2020). However, increased synonymous constraint does not uniquely argue for proteincoding constraint, and could stem from other types of overlapping functional elements. A third study used ribosome footprints to argue this alternate reading frame of ORF3a is translated (Finkel et al. 2020), but a ribosome profiling signal can be due to incidental, non-functional translation; in fact, out of nine candidate novel protein-coding ORFs predicted by this study, eight lack any conservation. By contrast, the well-powered PhyloCSF evidence presented here shows that this ORF has a conserved protein-coding function specifically selected for its amino-acid translation. Given the clear evidence for conserved protein-coding function across Sarbecovirus genomes, including SARS-CoV-2 and SARS-CoV-1, we propose a standard name for this ORF, namely ORF3c, as neither ORF3h, which indicates "hypothetical", nor ORF3a*, which is non-standard, seems appropriate for a deeply-conserved ORF with clear protein-coding function.
The candidate with the next best score (-2.74) is a 32-codon ORF (26183-26278) that overlaps the 3' end of ORF3a and the 5' end of E (Supplemental Fig. S6A). Two strains have a frame-shifting 1-base deletion within the ORF, and two others have premature stop codons. None of the substitutions are synonymous. There is high nucleotide-level constraint, but it continues on both sides of the ORF, suggesting it results from something other than translation of the ORF. Overall, this ORF does not show the evolutionary signature of a functional coding sequence. Next in the list is ORF9b which we have already discussed. Fourth is a 31-codon ORF (3207-3299) overlapping ORF1a, having PhyloCSF score -7.77 (Supplemental Fig. S6B). Most of the ORF consists of a 75-nt insertion that is only present in SARS-CoV-2, RaTG13, and CoVZC45, and the start and stop codons are missing in CoVZC45, so this is not a conserved coding sequence. Finally, the fifth-ranked candidate is ORF14, which we have already discussed.
The relatively high scores of ORFs 9b and 14 among these 67 hypothetical ORFs are in part an artifact of the low density of substitutions throughout N, which they both overlap. This low density, which is found even in the parts of N that are not in ORFs 9b or 14, decreases the statistical power available to PhyloCSF for distinguishing its coding and noncoding evolutionary models, which compresses the PhyloCSF score towards 0, resulting in a better rank among the negative scores. If we compensate for this by dividing by the maximumlikelihood branch length scale factor computed by PhyloCSF for its coding and non-coding models, ORFs 9b and 14, while still in the top half, move down to the 89th and 79th percentile among the 67 ORFs considered, whereas ORF3c remains the best scoring-candidate (Supplemental Fig. S7).
To search for additional novel protein-coding regions, we relaxed our criteria to include ORFs with at least 10 Gene content of SARS-CoV-2 p 13 codons, allow non-cognate start codons, and allow ORFs contained within another ORF in the same frame, but we found no additional convincing candidates for conserved protein-coding regions. Because it has been conjectured that translation might occur on the large number of negative-strand genomic and subgenomic RNAs that are intermediates in viral gene expression and replication in positive-strand RNA viruses (Dinan et al. 2020;DeRisi et al. 2019), we also scored ORFs on the negative strand, but again found no convincing candidates. Supplemental Table S4 contains the complete list of ORFs, with scores and other pertinent information. Figure 6. Novel ORF3c. Phylogenetic evidence for an unannotated protein-coding ORF near the 5' end of ORF3a. (A) UCSC Genome Browser image shows ORF3c overlapping ORF3a in a different reading frame. A pair of synonymous constraint elements closely match ORF3c as is expected for a dual coding region. The PhyloCSF signal in the reading frame of ORF3c (frame 2), which is negative for most of ORF3a, is close to 0 within ORF3c, indicating some PhyloCSF signal despite the downward pressure on the PhyloCSF score from constraint in the reading frame of ORF3a. ORF3c might be translated from the ORF3a subgenomic RNA via leaky scanning. (B and C) CodAlignView color-coding of the alignment of the region from 3 nt 5' of the start codon of ORF3a until the stop codon of ORF3c in the reading frame of ORF3c (B) and of ORF3a (C) show that a substantial fraction of substitutions are synonymous (light green) and conservative amino acid changes (dark green) in each frame, as expected in coding regions. The start codon of ORF3c is conserved except in one strain that has the near cognate GUG, and its stop codon is conserved except for a one-codon insertion in two strains, with no intermediate stop codons in any strain.   SARS-CoV-1 3b, 8a, and 8b are not conserved coding genes We then turned our attention to the three annotated ORFs in SARS-CoV-1 that do not have orthologous ORFs in SARS-CoV-2, namely ORFs 3b, 8a, and 8b. We found that ORF3b, which overlaps 3a, shows poor PhyloCSF protein-coding constraint (score per codon -13.2), and contains numerous stop codons in other strains including SARS-CoV-2, indicating it doesn't have a conserved protein-coding function (Supplemental Fig. S8). Finally ORFs 8a and 8b, two fragments of ORF8 that were separated into distinct ORFs during the 2003 SARS outbreak by a 29-nt deletion (Supplemental Fig. S9), do not exist as ORFs elsewhere in the Sabercovirus phylogeny, indicating they do not give rise to conserved proteins, and that their previouslyreported effect on viral replication (Muth et al. 2018) is likely due to ORF8 loss-of-function rather than 8a/8b gain-of-function.

Sarbecovirus conservation informs analysis of SARS-CoV-2 variants
Finally, we investigated how conservation within the Sarbecovirus clade can help inform our understanding of variation between different isolates of SARS-CoV-2. Since the outbreak of the COVID-19 pandemic, over 1800 single-nucleotide variants (SNVs) have been identified in SARS-CoV-2 isolates. We would expect variants in amino acids or nucleotides that have been highly conserved in the larger clade to be more likely to have a phenotypic effect, so we classified SNVs into five categories according to whether they were intergenic, missense (amino acid changing) in conserved amino acid positions, missense in non-conserved amino acid positions, synonymous in synonymously-constrained codons, or synonymous in synonymously-unconstrained codons (Supplemental Table S3). We defined "conserved" amino acids to be those for which there were no amino acid-changing substitutions in the Sarbecovirus alignment of that codon. We defined codons to be synonymously constrained if they have a low synonymous substitution rate.
To determine if conservation within the Sarbecovirus clade correlates with purifying selection within the SARS-CoV-2 population, we examined the densities of SNVs in conserved and non-conserved positions (Fig. 7).
We first calculated the fraction of amino acid positions that were conserved by our definition in each of the mature proteins and hypothetical ORFs (Fig. 7A). We observe that more than 83% of amino acids are perfectly conserved in nsp5 (3CL-PRO), nsp7, nsp8, nsp9, nsp10, nsp12 (Pol), nsp13 (Hel), and nsp14 (ExoN), whereas a much lower fraction of amino acids are conserved in nsp1, nsp2, and nsp3. Amino acid conservation in S is high 3' of, and low 5' of, the cleavage site. Amino acid conservation is lower in the unnamed ORFs than the named ones, particularly ORFs 6 and 8. Note that our definition of amino acid conservation does not depend on the phylogenetic tree, so these results are robust even if the tree varies along the genome due to recombination events.
We conclude that conservation in the Sarbecovirus clade at both the amino acid and nucleotide level is associated with purifying selection on SNVs in the SARS-CoV-2 population.
Since SNVs are most likely to have a phenotypic effect if they change a conserved amino acid, we searched for clusters of such SNVs. We found that the region of the nucleocapsid protein encoded by genomic locations 28826 through 28885 is significantly enriched for missense SNVs among its 14 conserved amino acids (14 SNVs, p < 0.012 after conservative multiple-hypothesis correction), suggesting that the region has been under positive selection or relaxed purifying selection in SARS-CoV-2 (Fig. 7D, Supplemental Fig. S10). There are no other such clusters in the genome that are significantly denser than would be expected by chance. Nor are there any regions that are significantly depleted for missense SNVs in conserved amino acids, which would have indicated regions in which constraint in the Sarbecovirus clade has continued particularly strongly in the p 16 SARS-CoV-2 population; the most depleted regions are 7400-7840 in nsp3 with no missense SNVs among 103 conserved amino acids and 24437-24748 in S2 with no missense SNVs among 99 conserved amino acids (p = 0.072 and p = 0.093, respectively, without any correction for multiple region lengths searched) (Supplemental Fig. S11).
To aid researchers in using our classification of variants, we have created a track hub for the UCSC Genome Browser with each SNV color-coded according to our five categories. The details page for each SNV includes a link to view the alignment of a neighborhood of the SNV using CodAlignView. The track hub also includes tracks showing which codons are conserved at the amino acid and synonymous levels to aid other researchers in classifying SNVs as they are discovered (Fig. 7D).
As examples, we analyzed two sets of variants that have been proposed as possibly affecting the viral phenotype. First, we investigated Sarbecovirus conservation of 14 amino acids in the spike protein in which mutations appear to be accumulating in the SARS-CoV-2 population , namely D614G, L5F, L8V/W, H49Y, Y145H, Q239K, V367F, G476S, V483A, V615I/F, A831V, D839Y/N/E, S943P, P1263L. These are included in the "KorberMutation" column of Supplemental Table S3, with hyperlinks to view the alignment near each of these mutations. Of particular interest is D614G, which has risen in frequency in multiple geographic locations, suggesting that it increases transmissibility. This radical amino-acid change is near the middle of a string of 11 amino acids that are perfectly conserved among our Sarbecovirus genomes (Fig. 7E), implying that it would have been deleterious in most of the Sarbecovirus clade; since, to the contrary, it appears to be increasing in the human population, this suggests that it is an adaptation to the human host. Likewise, two others, V615I/F and P1263L are mutations of perfectly conserved amino acids, while A831V is in a highly-conserved region of the protein and its amino acid is conserved in all but the two most distantly-related strains. In contrast, L5F, L8V/W, H49Y, Y145H, Q239K, G476S, and V483A are in amino acids that are not conserved and are in poorly-conserved regions of the protein, so they are less likely to be required for a conserved function. The remaining three are in moderately-conserved contexts with ambiguous interpretation.
Second, we looked at variants from 11 isolates (referred to as ZJU-1 through ZJU-11) that were functionally characterized and found to have different temporal patterns of viral load in-vitro (Yao et al. 2020). Among the 25 loci where at least one of these isolates differs from the reference genome, T27772A is a nonsense mutation that disrupts ORF7b but is present in 7% of the viral RNA in ZJU-11, suggesting that this ORF is not essential for replication. We classified the other 24 according to the evolutionary evidence and found that five are likely to be highly disruptive, another five are somewhat disruptive, four are missense mutations in residues that have been evolutionarily permissive of amino acid changes, and the remaining nine are synonymous in non-synonymously-constrained contexts (Supplemental Table S3). One of the somewhat disruptive mutations is a synonymous change in a 41-codon SCE at the C-terminus of the spike protein, and the other disruptive mutations are missense. Interestingly, one of the highly disruptive mutations, G23607A, is a radical R->Q amino acid change in the polybasic cleavage site of S, which is only present in SARS-CoV-2 (Andersen et al. 2020); it is present in all viral RNA in ZJU-1, whose viral load was near the mean, suggesting that this residue might have little effect on the ability of the virus to gain access to and replicate in cells. The two outliers with unusually high viral load after 24 hours, ZJU-10 and ZJU-11, each have exactly one mutation that we classified as highly disruptive, namely C16114T in ZJU-10 and a trimer substitution TTG->CGA at 27775-27777 in ZJU-11, suggesting that these are the mutations most likely to be responsible for the higher viral load.   Lower synonymous SNV density in synonymously constrainted codons D614G: a radical amino acid change amid 11 perfectly conserved amino acids but not all, the vast majority of amino acids are perfectly conserved; that the second part of the S protein is much more conserved than the first part; and that the named proteins are better conserved than the unnamed and hypothetical proteins. (B) Density of amino acidchanging single nucleotide variants (SNVs) among conserved (dark red) amino acid positions is significantly lower than in nonconserved (light red) positions. Both densities are higher near the 3' end of the genome, indicating higher mutation rate or less purifying selection even among amino acids that are perfectly conserved in Sarbecovirus. (C) Density of synonymous SNVs in synonymously-constrained codons (dark green) is significantly lower than among synonymously-unconstrained codons (light green). These results show that conservation in the Sarbecovirus clade at both the amino acid level and nucleotide level is associated with purifying selection on SNVs in the SARS-CoV-2 population. (D) Region in N enriched for missense SNVs in conserved amino acids. UCSC Genome Browser image of our SARS-CoV-2 conservation track hub, which provides information helpful in determining which SNVs are most likely to have a phenotypic effect. SNV conservation track indicates whether SNV is missense in a conserved amino acid (bright red), missense in a non-conserved amino acid (light red), synonymous in a synonymously-constrained codon (bright green), synonymous in a synonymously-unconstrained codon (light green), or noncoding (black, not shown). Other tracks indicate all constrained amino acids and synonymously-constrained codons. This 20 amino acid region of the nucleocapsid protein is significantly enriched for missense SNVs in amino acids that are perfectly conserved among our 44 Sarbecovirus genomes, suggesting positive selection or relaxed purifying selection in SARS-CoV-2. (E) Example of a mutation in a deeply conserved segment of the spike protein.
Sarbecovirus alignment near an A to G nucleotide substitution at genomic location 23403 that gives rise to the amino acid change D614G in the spike protein, which has risen in frequency in multiple geographic locations, suggesting that it increases transmissibility. This mutation is near the middle of a string of 11 amino acids that are perfectly conserved among our Sarbecovirus genomes (all substitutions are light green, designating synonymous substitutions), which suggests that the mutation could be an adaptation to the human host.

Discussion
We used comparative genomic methods to determine which of the unnamed ORFs in SARS-CoV-2 and SARS-CoV-1 show the evolutionary signature of conserved, functional, protein-coding regions. We found that SARS-CoV-2 ORFs 3a, 6, 7a, 7b, and 8 have this signature, whereas ORFs 10, and 14 do not, and 9b is ambiguous. We also independently rediscovered a recently proposed novel dual coding region within ORF3a, ORF3c, using different methods, and provide strong evolutionary evidence for its coding potential. In SARS-CoV-1, ORFs 3a, 6, 7a, and 7b have this evolutionary signature, but 3b does not and 9b is again ambiguous, while 8a and 8b are too recent to determine their functional status from evolutionary signatures. We have also classified single nucleotide variants according to their evolutionary constraint, and used this approach to help interpret variants from two studies. These techniques should be applicable to other sets of variants as researchers try to untangle the connection between viral genotype and disease phenotype. Correct proteincoding annotations are essential not only for understanding viral biology, but also for predicting the phenotypic effect of variants, because determining how each variant affects protein sequence is the first step in any such analysis. As an example of the importance of correct annotations, we note that seven variants within ORF3c (T25473C, T25476C, G25494T, G25500A, G25500T, C25539T, C25572T) were classified by nextstrain as synonymous based on their predicted effect on ORF3a, but in fact cause amino acid changes in the ORF3c protein.
Our comparative genomics methods complement experimental approaches by providing a more comprehensive view of conserved function, with the caveat that in some cases, the evolutionarily-conserved function selected over the vast majority of the evolutionary interval studied may have recently changed, and thus evolutionary history may not reflect present state, which is better captured by experimental methods. However, experimental approaches only detect what is present under the specific conditions tested, whereas the comparative genomics approach used here can distinguish functional vs. non-functional regions based on their characteristic patterns of change, or evolutionary signatures, reflecting mutational perturbation experiments over millions of generations that survey conditions experienced by the virus in all hosts throughout the evolutionary history spanned by the genomes compared.
The stark differences between nucleotide-level (phyloP/phastCons) and protein-level (PhyloCSF) constraint in ORF8 and ORF10 highlight the importance of protein-coding evolutionary signatures vs. nucleotide-level constraint. While phyloP and phastConst rely on the number of substitutions, PhyloCSF instead relies on the type of substitutions, distinguishing those typical of coding vs. non-coding regions, regardless of the total number of substitutions.
Our analyses used a single genome-wide phylogenetic tree, but it is known that there is substantial recombination in Sarbecoviruses, leading to different evolutionary histories for different genomic segments, and segment boundaries have been identified within the S gene and the polyprotein (Wu et al. 2020;Sun et al. 2020;Andersen et al. 2020). PhyloCSF is relatively insensitive to the tree, and in fact an earlier version, CSF, did not make use of the tree. On the other hand, FRESCo relies more heavily on the tree, but it normalizes scores within a gene to the gene-wide average, which limits the effect of an incorrect tree provided that all of the gene has the same evolutionary history. We are not aware of any known recombination points within N or ORF3a, so our conclusions about overlapping reading frames in those genes are unlikely to be affected by this concern.
We identified a 20-amino acid region in the nucleocapsid protein that is significantly enriched for amino acidchanging variants in amino acids that have been conserved throughout the Sarbecovirus clade. Investigation of the effects of these variants on protein structure could yield insights into human adaptation.
Further experimental work will be needed to determine the functions of the unnamed genes and the effects of SARS-CoV-2 variants, which might lead to the identification of weaknesses of the virus. We hope that our conclusions and that the resources we have provided will help guide experimenters to the most fruitful investigations.

Genomes and Alignments
Genome sequences were obtained from https://www.ncbi.nlm.nih.gov/. The genomes and NCBI annotations for SARS-CoV-2 and SARS-CoV-1 were obtained from the records for accessions NC_045512.2 and NC_004718.3, respectively. The UniProt annotations for SARS-CoV-2 were obtained from the UCSC Genome Browser (Haeussler et al. 2019).
The 44 Sarbecovirus genomes used in this study were selected starting from all betacoronavirus and unclassified coronavirus full genomes listed on ncbi via searches https://www.ncbi.nlm.nih.gov/nuccore/?term=txid694002[Organism:exp] and the same with txid1986197 and txid2664420 on 5-Mar-2020, excluding any that differed from NC_045512.2 in more than 10,000 positions in a pairwise alignment computed using NW-align (Lab 2-Apr-2012), that cutoff being chosen so as to distinguish Sarbecovirus genomes among those that were classified, and removing near duplicates, including all SARS-CoV-1 and SARS-CoV-2 genomes other than the reference. Coronavirus genomes in the left half of Fig. 2 were those listed by https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=11118 on 11-Feb-2020.
The genomes were aligned using clustalo (Sievers and Higgins 2018) with the default parameters. The Phylogenetic tree was calculated using RAxML (Stamatakis 2014) using the GTRCATX model.

PhyloCSF, FRESCo, and other conservation metrics
PhyloCSF (Lin et al. 2011a) was run using the 29mammals empirical codon matrices but with the Sarbecovirus tree substituted for the mammals tree. Input alignments were extracted from the whole-genome alignment and columns containing a gap in the reference sequence were removed. Browser tracks were created as described previously (Mudge et al. 2019). Scores listed in Supplemental Table S2 were calculated on the local alignment for each ORF or mature protein, excluding the final stop codon, using the default PhyloCSF parameters, including --strategy=mle, plus --debug in order to get the maximum-likelihood branch length scale factors for the coding and non-coding models. The mean and standard deviation of PhyloCSF-per-codon scores of protein-coding ORFs were calculated using the scores of the NCBI ORFs, excluding ORF1a because it is redundant with ORF1ab and excluding ORF10 because we had already determined it is not protein-coding; the mean and standard deviation for non-coding ORFs were calculated from those in Supplementary Table 4 in the initial subset, excluding ORFs 3h, 9b, and 14, since those were the ones under investigation.protein-coding ORFs were the NCBI ORFs excluding ORF FRESCo (Sealfon et al. 2015) was run on 9-codon windows in each of the NCBI annotated ORFs. Alignments were extracted for the ORF excluding the final stop codon, and gaps in the reference sequence were removed.
SCEs were found by taking all windows having synonymous rate less than 1 and nominal p-value<10 -5 , and combining overlapping or adjacent windows. For the variant analysis, FRESCo was also run on 1-codon windows using codon alignments as described previously (Sealfon et al. 2015).
Substitutions per site and per neutral site for each annotated ORF and mature protein were calculated by extracting the alignment column for each site or, respectively, 4-fold degenerate site, from the whole-genome alignment and determining the parsimonious number of substitutions using the whole-genome phylogenetic tree. For columns in which some genomes did not have an aligned nucleotide, the number of substitutions was scaled up by the branch length of the entire tree divided by the branch length of the tree of genomes having an aligned nucleotide in that column PhastCons and phyloP tracks shown in Fig. 2 are the Comparative Genomics tracks from the UCSC Genome Browser, which were constructed from a multiz (Blanchette et al. 2004) alignment of the list of 44 Sarbecovirus genomes that we supplied to UCSC.

Analysis of Single Nucleotide Variants
Single nucleotide variants were downloaded from the "Nextstrain Vars" track in the UCSC Table Browser on 2020-04-18 at 11:46 AM EDT. We defined an amino acid to be "conserved" if there were no amino acidchanging substitutions in the Sarbecovirus alignment of its codon. We defined codons to be "synonymously constrained" if the p-value for the synonymous rate at that codon calculated by FRESCo using 1-codon windows was less than 0.034, which corresponds to a false discovery rate of 0.125.
To find regions that were significantly enriched for missense SNVs in conserved amino acids, we first defined a null model as follows. For each mature protein, we counted the number of missense SNVs and the number of conserved amino acids and randomly assigned each SNV to a conserved amino acid in the same mature protein, allowing multiplicity. For any positive integer n, we found the largest number of SNVs that had been assigned to any set of n consecutive conserved amino acids within the same mature protein across the whole genome. Doing this 100,000 times gave us a distribution of the number of missense SNVs in the most enriched set of n consecutive conserved amino acids in the genome. Comparing the number of actual missense SNVs in any particular set of n consecutive conserved amino acids to this distribution gave us a nominal p-value for that n. We applied this procedure for each n from 1 to 100 and multiplied the resulting p-values by a Bonferroni correction of 100 to calculate a corrected p-value for a particular region to be significantly enriched. We note that these 100 hypotheses are correlated because enriched regions of different lengths can overlap, so a Bonferroni correction is overly conservative and our reported p-value of 0.012 understates the level of statistical significance. To find significantly depleted regions we applied a similar procedure with every n from 1 to 1000, but did not find any depleted regions with nominal p-value less than 0.05 even without any multiple hypothesis correction.

Supplemental Materials:
• Supplemental Figures S1-S11 • Supplemental Table S1. Tab-separated table with  These consisted of all SARS-CoV-2 ORFs at least 10 codons long, on either strand, beginning with AUG or a near-cognate codon, that do not overlap an NCBI-annotated gene in the same reading frame or the antisense frame (the frame on the opposite strand that shares the 3rd codon position; antisense regions gets artifactually high PhyloCSF scores). Our initial subset consisted of those on the '+' strand, with a canonical (AUG) start codon, at least 25 codons long, that are maximal (i.e, not contained in a longer AUGinitiated ORF in the same frame). ORFs in our initial subset are listed first, in order of decreasing PhyloCSF score per codon, followed by all other ORFs, also in order of decreasing PhyloCSF score per codon. Spreadsheet fields include general information about the ORF, links to view the alignment in CodAlignView with 10 codons on each side for context, links to view the region in the UCSC Genome browser, PhyloCSF score per codon, branch length of strains present in the local alignment as a fraction of total branch length (RelBL), PhyloCSF's branch length scale factors for its coding and noncoding models (RhoC and RhoN), adjusted score consisting of PhyloCSF score per codon divided by the average of RhoC and RhoN, relative branch length of strains conserving the start codon/ stop codon/reading frame, GC content, fraction of the ORF that overlaps Synonymous Constraint Elements, and whether the ORF was reported as translated in the Finkel et al. ribosome profiling experiments. • Pdf files containing the alignment of each UniProt-annotated SARS-CoV-2 ORF and mature protein with 5 codons of the neighbor on each side, color-coded by CodAlignView for protein-coding evolutionary features. • Whole-genome alignment of 44-Sarbecovirus genomes in Fasta format.
• Whole-genome phylogenetic tree in Newick format. • Nextstrain_ncov_global_metadata.tsv: List of authors who contributed genomes to GISAID that were used by nextstrain and UCSC to produce the list of SNVs.

Data Access
The PhyloCSF tracks and FRESCo synonymous constraint elements are available for the SARS-CoV-2/wuhCor1 assembly in the UCSC Genome Browser (Haeussler et al. 2019) using the "PhyloCSF" and "Synonymous Constraint" public track hubs. The alignments and phylogenetic tree are included as supplemental materials. The alignments may be viewed, color coded to indicate protein-coding signatures, using CodAlignView (https://data.broadinstitute.org/compbio1/cav.php) with alignment set wuhCor1_c and chromosome name NC_045512v2.
SARS-CoV-2 single nucleotide variants, color coded by whether they are non-coding, synonymous, or amino acid-changing, and whether they are in conserved codons, as well as indications of which codons are conserved at the amino acid or synonymous level, may be viewed in the UCSC Genome Browser using the track hub at https://data.broadinstitute.org/compbio1/SARS-CoV-2conservation/trackHub/hub.txt. The details page for each SNV includes information about Sarbecovirus conservation and a link to view the alignment of a neighborhood of the SNV in CodAlignView. It is our intention to update this track hub as the list of variants in the UCSC Table Browser is updated.
In this resource, we have augmented variant data made available by UCSC with our own annotations. UCSC data came from nextstrain.org (Hadfield et al. 2018), which was derived from genome sequences deposited in GISAID (Elbe and Buckland-Merrett 2017). Right of use and publication of the underlying sequences is entirely controlled by the authors of the original resource and the contributors of individual sequences, who are acknowledged in the nextstrain metadata file included with supplemental materials. Our analysis provides an additional layer of annotation on their work rather than replicating or replacing it.
Original data usage policy as provided by UCSC: p 22 The data presented here is intended to rapidly disseminate analysis of important pathogens. Unpublished data is included with permission of the data generators, and does not impact their right to publish. Please contact the respective authors (available via the Nextstrain metadata.tsv file) if you intend to carry out further research using their data. Derived data, such as phylogenies, can be downloaded from nextstrain.org (see Yao H-P, Lu X, Chen Q, Xu K, Chen Y, Cheng L, Liu F, Wu Z, Wu H, Jin C, et al. 2020. Patient-Derived Mutations Impact Pathogenicity of SARS-CoV-2. https://papers.ssrn.com/abstract=3578153 (Accessed May 9, 2020).

Supplemental Figures
Supplemental Figure S1. Alignment of nsp11 and frameshift site. Alignment of nonstructural protein nsp11 and the subsequent 5 Slippery Site codons in 44 Sarbecoviruses (top) and 52 coronaviridae (bottom). Sarbecoviruses included in the coronaviridae alignment are indicated by green dots. The slippery site of the programmed frameshift (red rectangle) is perfectly conserved in all genomes. The polymerase, Pol, shares the 5' nine codons of nsp11 but then continues 3' of the slippery site in a different reading frame. The four codons 3' of the slippery site are perfectly conserved in Sarbecoviruses, which is consistent with a dual coding region. However, the stop codon of the un-framshifted polyprotein, pp1a, which marks the 3' end of nsp11, is poorly conserved in coronaviridae (cyan, magenta, and yellow stop codons). This, and the fact that nsap11 is only 13 codons long, suggest that it does not produce a functional protein.
0.0 0.5 1.0 1.5 2.0 2.5 Figure S2. ORF8 Phylogeny. (A) Phylogenetic tree computed from the alignment of ORF8, which is quite different from the genome-wide tree in Fig. 2, indicating that there has been a recombination event near this region. (B) Alignment of ORF8 with strains ordered according to the ORF8 tree shows a large number of substitutions, demonstrating that the apparently high evolutionary rate in ORF 8 is not an artifact of recombination.  KF367457_Bat_SARS_like_CoV_WIV1 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACT TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT TAC  KY417147_Bat_SARS_like_CoV_Rs4237 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACT TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT TAC  KY417151_Bat_SARS_like_CoV_Rs7327 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA TCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT TAC  MK211377_CoV_BtRs_BetaCoV_YN2018C ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT TAC  KT444582_SARS_like_CoV_WIV16 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CCT GAA GAC CCA TGT CCA ATA CAT TAT  KY417146_Bat_SARS_like_CoV_Rs4231 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CCT GAA GAC CCA TGT CCA ATA CAT TAT  KY417152_Bat_SARS_like_CoV_Rs9401 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC GTG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CCT GAA GAC CCA TGT CCA ATA CAT TAT  KY417148_Bat_SARS_like_CoV_Rs4247 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CCT GAA GAC CCA TGT CCA ATA CAT TAT  KY417143_Bat_SARS_like_CoV_Rs4081 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT TAT  KY417149_Bat_SARS_like_CoV_Rs4255 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT TAT  KY417142_Bat_SARS_like_CoV_As6526 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT TAT  MK211378_CoV_BtRs_BetaCoV_YN2018D ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTG TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CCT GAA GAC CCA TGT CCA ATA CAT TAC  FJ588686_Bat_SARS_CoV_Rs672_2006 ATG AAA CTT CTC ATT GTT TTA GGA CTC TTA ACA TCA GTA TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAC CAA CCA TTC CAA CTT GAA GAT CCA TGT CCA ATA CAT TAC  MK211375_CoV_BtRs_BetaCoV_YN2018A ATG AAA CTT TTC GTT GTT TTA GGA CTC TTA ACA TCA GTA TAT TGC ATG CAC AAA GAA TGC AGT-ATA CAA GAA TGT TGT GAA AAT CAA CCA TTC CAA CTT GAA GAC CCA TGT CCA ATA CAT TAC  MK211374_CoV_BtRl_BetaCoV_SC2018 ATG AAA CTT CTC ATT GTT TTT GGA CTC TTG ACA TCA GTA TAT TGC ATG CAT AAA GAA TGC AGT-ATA CAA GAA TGT TGT  Nearly all footprints within ORF10 are in either the predicted uORF overlapping the ORF10 start codon, or the predicted downstream ORF beginning at an interior AUG, which would create a peptide of only 18 amino acids (only 5 amino acids in all but the four closest strains, since the others have an early stop codon), and the density of footprints in the unique portion of ORF10 (dashed black rectangle) appears to be no greater than the density beyond its stop codon (dashed red rectangle), suggesting they are not due to translation of ORF10. (B) Alignment of ORF10 in SARS-CoV-2, three bat viruses, and two pangolin viruses, used by Cagliani et al. to infer that the ORF is protein-coding and under positive selection due to a high dN/dS ratio (Cagliani et al. 2020). However, the alignment includes only nine substitutions, one of them synonymous. In the hypothetical translation of a non-coding region evolving neutrally, we would expect between two and three of nine substitutions to be synonymous, and the depletion to only one is not statistically significant (p > 0.18 even without the necessary multiple hypothesis correction). Furthermore, one sequence was excluded due to a 1-base frameshifting deletion (orange and grey), which, if it is not a sequencing error, would be strong evidence against conserved proteincoding function.  ,550 29,560 29,570 29,580 29,590 29,600 29,610 29,620 29,630 29,640 29,650 29,660 29,670 29,680 29,690 N NC_045512_SARS_CoV_2_Wuhan_Hu_1 ATG CTG CAA TCG TGC TAC AAC TTC CTC AAG GAA CAA CAT TGC CAA AAG GCT TCT ACG CAG AAG GGA GCA GAG GCG GCA GTC AAG CCT CTT CTC GTT CCT CAT CAC GTA GTC GCA ACA GTT CAA GAA ATT CAA CTC CAG GCA GCA GTA GGG GAA CTT CTC CTG CTA GAA TGG CTG GCA ATG GCG GTG ATG CTG CTC TTG CTT TGC TGC TGC TTG ACA GAT TGA  MN996532_Bat_CoV_RaTG13 ATG CTG CAA TCG TGC TAC AAC TTC CTC AAG GAA CAA CAT TGC CAA AAG GCT TCT ACG CAG AAG GGA GCA GAG GTG GCA GTC AAG CTT CTT CTC GCT CTT CAT CAC GTA GTC GCA ACA GTT CAA GAA ACT CAA CTC CAG GCA GCA GTA GGG GAA CTT CCC CTG CTA GGA TGG CTG GCA ATG GCA GTG ATG CTG CTC TTG CTT TGC TGC TGC TTG ACA GAT TGA  MG772933_Bat_SARS_like_CoV_bat_SL_CoVZC45 ATG CTG CAA TCG TGC TAC AAC TTC CTC AAG GAA CAA CAT TGC CAA AAG GCT TCT ACG CAG AAG GGA GCA GAG GCG GCA GTC AAG CTT CTT CAC GCT CCT CAT CAC GTA GTC GCA ACA GTT CAA GAA ACT CAA CTC CAG GCA GCA GTA GGG GAA CTT CTC CTG CTA GAA TGG CTG GCA ATG GCG GTG ACA CTG CTC TTG CTT TGC TGC TGC TAG ATA GGT TGA  MG772934_Bat_SARS_like_CoV_bat_SL_CoVZXC21 ATG CTG CAA TCG TGC TAC AAC TTC CTC AAG GAA CAA CAT TGC CAA AAG GCT TCT ACG CAG AAG GGA GCA GAG GCG GCA GTC AAG CTT CTT CAC GCT CCT CAT CAC GTA GTC GCA ACA GTT CAA GAA ACT CAA CTC CAG GCA GCA GTA GGG GAA CTT CTC CTG CTA GAA TGG CTG GCA ATG GCG GTG ACA CTG CTC TTG CTT TGC TGC TGC TAG ATA GGT TGA  NC_004718_SARS_CoV ATG CTG CCA CCG TGC TAC AAC TTC CTC AAG GAA CAA CAT TGC CAA AAG GCT TCT ACG CAG AGG GAA GCA GAG GCG GCA GTC AAG CCT CTT CTC GCT CCT CAT CAC GTA GTC GCG GTA ATT CAA GAA ATT CAA CTC CTG GCA GCA GTA GGG GAA ATT CTC CTG CTC GAA TGG CTA GCG GAG GTG GTG AAA CTG CCC TCG CGC TAT TGC TGC TAG ACA GAT TGA  KT444582_SARS_like_CoV_WIV16 ATG CTG CCA CCG TGC TAC AAC TTC CTC AAG GAA CAA CAT TGC CAA AAG GCT TCT ACG CAG AGG GGA GCA GAG GCG GCA GTC AAG CCT CTT CTC GCT CTT CGT CAC GTA GTC GCG GTA ATT CAA GAA ATT CAA CTC CTG GCA GCA GTA GGG GAA ATT CTC CTG CTC GAA TGG CTA GCG GAG GTG GTG AAA CTG CCC TCG CGC TAT TGC TGC TAG ACA GAT TGA  KY417146_Bat_SARS_like_CoV_Rs4231 ATG CTG CCA CCG TGC TAC AAC TTC   , respectively, are optimal for ribosomal start codon recognition. The start codon of N (purple box) is 10 nt 5' of start codon of 9b, with less-optimal bases A and T in positions -3 and +4 of this start codon (orange boxes), suggesting that ORF9b could be translated from the same subgenomic RNA as N by leaky scanning. Also shown is the same region in the frame of the overlapping nucleocapsid protein (bottom), in which most substitutions are synonymous (light green). Evolutionary evidence is ambiguous regarding whether 9b is coding.  A CTA AAA TGT CTG ATA ATG GAC CCC A---AA ATC AGC GAA ATG CAC CCC GCA TTA CGT TTG GTG GAC CCT CAG ATT CAA CTG GCA GTA ACC AGA ATG GAG AAC GCA GTG GGG CGC GAT CAA AAC AAC GTC GGC CCC AAG GTT TAC CCA ATA ATA CTG CGT CTT GGT TCA CCG CTC TCA CTC AAC ATG GCA AGG AAG ACC TTA AAT TCC CTC GAG GAC AAG GCG TTC CAA TTA ACA CCA ATA GCA GTC CAG ATG ACC AAA TTG GCT ACT ACC GAA GAG CTA CCA GAC GAA TTC GTG GTG GTG ACG GTA AAA TGA  MN996532_Bat_CoV_RaTG13 A CTA AAA TGT CTG   9b start codon N start codon

ORF9b
Same region in main reading frame  Figure S7. Adjusted scores of all ORFs. PhyloCSF score per codon divided by the maximum-likelihood branch length scale factor computed by PhyloCSF for its coding and non-coding models, for all the ORFs in Fig. 1C, namely all annotated and hypothetical AUG-initiated ORFs on the positive strand at least 25 codons long that do not overlap a longer ORF in the same frame. Dividing by this scale factor adjusts for the fact that in regions with a low frequency of substitutions, such as throughout ORFs N and 10, PhyloCSF has less statistical power to distinguish its coding and noncoding evolutionary models, which compresses the PhyloCSF score towards 0, resulting in a better rank among the negative scores. The lower scores of ORFs 10, 14, and 9b with this adjustment show that their relatively high negative scores in Fig. 1C are at least in part an artifact of the low frequency of substitutions in these genomic regions.  Figure S11. SNV-depleted regions. UCSC Genome Browser images of regions in nsp3 (A) and S2 (B). Most of the amino acids in these regions are conserved (red rectangles in Conserved AAs track), but the only missense SNVs in these regions (light red rectangles in SNVs track) are in non-conserved amino acids (missense SNVS in conserved amino acids would be bright red if present). The lack of missense SNVs in such a large set of conserved amino acid residues could indicate that constraint in the Sarbecovirus clade has continued particularly strongly in the SARS-CoV-2 population. However, although these are the most depleted regions in the genome for missense SNVs in conserved amino acid residues, neither depletion is statistically significant (p = 0.072 and p = 0.093, respectively, without any correction for multiple region lengths searched).