MicroRNAs: Principles of Target Recognition and Developmental Roles

by

Vikram Agarwal

B.S. Biology (2009)
University of Texas at Austin

SUBMITTED TO THE COMPUTATIONAL AND SYSTEMS BIOLOGY GRADUATE PROGRAM IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

AT THE

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2015

© 2015 Vikram Agarwal. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of author: Vikram Agarwal, Computational and Systems Biology Program, August 28, 2015

Certified by: David P. Bartel, Professor of Biology, Thesis Supervisor

Accepted by: Christopher Burge, Professor of Biology and Biological Engineering, Director, Computational and Systems Biology Graduate Program

MicroRNAs: Principles of Target Recognition and Developmental Roles

by

Vikram Agarwal

Submitted to the Computational and Systems Biology Program on August 28, 2015, In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Abstract

MicroRNAs (miRNAs) are ~21–24 nt non-coding RNAs that mediate the degradation and translational repression of target mRNAs. The genomes of vertebrate organisms encode hundreds of miRNAs, each of which may regulate hundreds of mRNA targets. Thus, miRNAs are crucial post-transcriptional regulators engaged in vast regulatory networks. To date, the characteristics of these networks remain mysterious due to the difficulty of identifying miRNA targets through either experimental or computational means. To understand the physiological roles of miRNAs in animal species, it is of fundamental importance to elucidate the structure of the targeting networks in which they participate.
The recognition of a miRNA target is guided largely by perfect Watson-Crick base pairing interactions between nucleotides 2–7 from the 5′ end of the miRNA (i.e., the “seed” region) and complementary motifs embedded in the 3′ UTRs of the target mRNAs. The prevalence of these motifs throughout the transcriptome poses a challenge to our understanding of how specificity emerges: since the presence of a motif is not sufficient to mediate target repression, what contextual features discriminate effective target sites from ineffective ones? Further complicating this question is the proposition that “non-canonical” sites lacking perfect seed pairing might mediate repression, which would expand the potential number of functional target sites by orders of magnitude. In the second chapter of this work, we define the features that predict effective miRNA target sites, incorporating their relative influence into a quantitative model which can outperform existing computational models and experimental approaches in target identification.

Though the molecular roles of miRNAs in gene regulation have long been appreciated, the functions of most miRNAs in living organisms have remained elusive. In the third chapter of this work, we discuss the consequences of genetic ablation of miR-196, a deeply conserved miRNA that is predicted to simultaneously repress many HOX genes, in the mouse. We propose a role for miR-196 in the spatial patterning of the vertebrate axial skeleton. Isolating the cell populations that express the miRNA during early mammalian development, we attempt to characterize the direct in vivo targets of miR-196 and dissect the molecular underpinnings of the phenotypes observed.

Thesis Advisor: David P. Bartel
Title: Professor of Biology

Acknowledgments

I am indebted to my professor, David Bartel, for being an outstanding mentor and role model during the course of my graduate work.
His level of scientific rigor, enduring patience, attention to detail, and ready willingness to offer his help and extensive feedback have made a lasting impression on me and will undoubtedly influence my style of scientific inquiry throughout the course of my life.

I thank my thesis committee members, Phil Sharp and Chris Burge, for providing me with extensive feedback on my work throughout the years, and for their helpful advice on career opportunities. I also thank Gary Ruvkun for serving as my outside committee member. There are also many professors at MIT who taught their courses with great passion, and their inspiring methods of teaching have greatly impacted my interests in biology and computer science.

I am grateful to the graduate students and postdocs who mentored me throughout these years. Robin Friedman in particular was instrumental in patiently explaining the statistical methods in phylogenomics that he developed. I have also had countless discussions with David Garcia, Jin-Wu Nam, Alex Subtelny, Igor Ulitsky, Olivia Rissland, and Junjie Guo that have broadened the scope of my thinking and heavily impacted the work presented in this thesis.

I thank my scientific collaborators over the years, particularly Rémy Denzler and Markus Stoffel, with whom I had the opportunity to explore interesting questions concerning physiology. Rémy has also been a great friend and I have been lucky to have great fun in our travels together. My work with Eddy McGlinn reignited my interests in exploring developmental questions, and I thank her for giving me the opportunity to work with her group and for helping me understand the biology and improve my communication of the work in presentations.

The Bartel lab has been an incredible place to work and I couldn’t have asked for a more welcoming home. Beyond colleagues, the people in the lab have been close friends and I appreciate every member of the past and present.
Inside the lab, they’ve made it a great environment to discuss ideas openly together, and outside the lab, they’ve made Cambridge and Boston a great town to explore together.

I thank everyone in the Computational and Systems Biology (CSB) class of 2009 (Chris, Anna, Adrian, Zi, and Xuebing) for their continued friendship, as well as friends in the Microbiology (Mark, Nicole, and Chris) and Biology (Josh and Brian) programs for making my experiences in Cambridge tremendously enjoyable and memorable. I also thank Bonnielee Whang and Jacquie Carota for their support of the CSB program.

Lastly, I thank my family for supporting me throughout my life and inspiring an interest in exploring scientific questions early on. My brother has been greatly influential in my work and it has been a pleasure learning from his life experiences, many of which I’ve paralleled in mine.

Table of Contents

Abstract
Acknowledgements
Chapter 1. Introduction
  The many layers of gene regulation
  MicroRNAs: Discovery and biological roles
  Biogenesis of microRNAs and mechanisms of targeting
  Computational approaches to microRNA target prediction
  Experimental approaches to microRNA target identification
  References
Chapter 2. Predicting effective microRNA target sites in mammalian mRNAs
  Abstract
  Introduction
  Results
    Inefficacy of recently reported non-canonical binding sites
    Confirmation that miRNAs bind to non-canonical sites despite their inefficacy
    Improving dataset quality for model development
    Selecting features and building a regression model for target prediction
    Improvement over previous methods
    Similar response of targets predicted from the model and the most informative CLIP experiments
    The TargetScan database (v7.0)
  Discussion
  Materials and Methods
    Microarray, RNA-seq, and RPF dataset processing
    Crosslinking and other interactome datasets
    Motif discovery for non-canonical binding sites
    Microarray dataset normalization
    RNA structure prediction
    Calculation of PCT scores
    Selection of mRNAs for regression modeling
    Scaling the scores of each feature
    Stepwise regression and multiple linear regression models
    Collection and processing of previous predictions
    3′-UTR profiles for TargetScan7 predictions
    MicroRNA sets for TargetScan7
    TargetScan7 predictions
  Acknowledgements
  References
  Figures and figure legends
  Tables
Chapter 3. Independent regulation of vertebral number and vertebral identity by microRNA-196 paralogs
  Abstract
  Introduction
  Results
    Differential transcription of miR-196a1 and miR-196a2 in the developing embryo
    Genetic deletion of miR-196 leads to altered vertebral identity
    Genetic deletion of miR-196 leads to an increase in vertebral number
    Transcriptome alterations are detected following allelic removal of miR-196 activity
    Hox cluster expression dynamics are altered in miR-196 mutant embryos
    Identification of additional direct targets of miR-196
    miR-196 activity is required for signaling pathways associated with axis elongation, segmentation and the trunk-to-tail transition
    miR-196 has the potential to modulate Wnt signaling by multiple mechanisms
  Discussion
    miR-196 activity is essential for vertebral identity
    miR-196 activity constrains total vertebral number
  Materials and Methods
    miR-196a1GFP and miR-196a2GFP knock-in construction
    miR-196a1–/– and miR-196a2–/– and miR-196b–/– generation
    Mouse skeletal preparation and analysis
    In situ hybridization
    FACS sorting and RNA-seq sample preparation
    RNA-seq and category enrichment analysis
    miRNA target analysis
    Permutation test for significance testing
    In vitro luciferase assay
    Chick electroporation and in vivo BatLuc reporter analysis
  Acknowledgements
  References
  Figures and figure legends
  Tables
Chapter 4. Future Directions
  Quantitative models of miRNA targeting in Drosophila
  Conservation of miRNA targeting networks among bilaterians
  References
Appendix 1. Global analysis of the effect of different cellular contexts on microRNA targeting
Appendix 2. Assessing the ceRNA hypothesis with quantitative measurements of miRNA and target abundance
Appendix 3. Expanded identification and characterization of mammalian circular RNAs
Curriculum Vitae

Chapter 1. Introduction

The many layers of gene regulation

It is a remarkable experience to marvel at the diversity of forms among the organisms inhabiting our planet. Plants and animals exhibit a wide range of shapes, sizes, and behaviors; they have adapted to most habitats, conquering the seas, lands, and skies.
It is likely that the morphological diversity that is observed throughout life is largely a result of two evolutionary processes: the birth of genes and acquisition of novel gene function (Kaessmann, 2010; Tautz and Domazet-Loso, 2011; Carvunis et al., 2012) as well as gene regulatory innovation (Wray, 2007; Carroll, 2008). While gene innovation may have played a greater role early in evolutionary time (i.e., between 3 and 3.5 billion years ago) (David and Alm, 2011), organismal complexity in higher eukaryotes may have instead arisen from the sophisticated regulation of gene expression (Levine and Tjian, 2003).

The central dogma of molecular biology details the predominant mode of information flow in cells: genes are encoded in DNA, transcribed into messenger RNAs (mRNAs), and these mRNAs are translated into proteins (Crick, 1970). A large body of evidence suggests that every step of this process is intricately regulated, and the cell has exploited a variety of modes of regulation to expand the range of cellular behaviors possible with a limited set of protein-coding genes. It has become a paradigm in molecular biology that the genome does not just passively encode genes, but rather that it carries a set of instructions to coordinate the expression of those genes in time and space (Jacob and Monod, 1961). With stunning foresight, Jacob and Monod postulated that proteins may recognize cis-regulatory DNA or RNA sequences and thereby modulate the expression or translation of an mRNA (1961). Subsequent work has reinforced this model of transcriptional control by unraveling the genome-wide architecture of protein binding events to cis-regulatory DNA elements (Ren et al., 2000; Harbison et al., 2004). Similarly, it has been demonstrated that cis-regulatory sequences within mRNA can orchestrate mRNA splicing, export from the nucleus to the cytoplasm, localization, translation rate, and degradation rate (Glisovic et al., 2008).
Global measurements of transcription rate, mRNA degradation rate, translation rate, and protein degradation rate among mRNAs confirm that each process is amenable to regulation. The variability in the distributions of these rates cannot be accounted for as a trivial result of measurement error, and in many scenarios the precise molecular mechanisms explaining a proportion of the variability are known. Recent studies have attempted to dissect the relative contributions of each form of regulation in explaining steady-state protein abundance. While initial estimates arrived at the conclusion that variability in translational regulation was the predominant force determining protein levels (Schwanhausser et al., 2011), revised estimates propose a predominant role of transcriptional regulation, with about 73% contribution, relative to an 11% contribution of mRNA decay, 8% contribution of translation rate, and 8% contribution of protein decay (Li et al., 2014). However, these estimates ignore the fact that throughout development, protein abundances are not at steady state, but rather change dynamically with time in response to environmental and cellular signals. So far, it appears that changes in mRNA levels (i.e., a combination of mRNA synthesis and degradation rates) also explain ~90% of protein fold changes in a dynamic response to an environmental cue, although protein translation and degradation rates together explain ~60% of absolute protein changes in this context (Jovanovic et al., 2015). Taken together, these studies highlight the crucial importance of understanding each node of gene regulation in order to acquire a comprehensive portrait of the gene regulatory networks that govern cellular behavior and organismal development.

MicroRNAs: Discovery and biological roles

MicroRNAs (miRNAs) are ~21–24 nt non-coding RNAs involved in post-transcriptional gene regulation (Bartel, 2004).
The first known miRNA was discovered as a non-protein-coding product of the gene lin-4, a regulator of developmental timing in C. elegans (Lee et al., 1993). Interestingly, it was found that the miRNA possesses a sequence with antisense complementarity to multiple stretches of nucleotides in the lin-14 3′ UTR (Lee et al., 1993; Wightman et al., 1993), a region known to govern the regulation of LIN-14 protein production and consequently the timing of larval development (Wightman et al., 1991). This observation strongly implied that miRNAs could associate directly with target mRNAs and repress the production of the encoded protein. It soon became clear that this phenomenon was not a peculiarity of the worm, but rather that there also exist other miRNAs such as let-7 (Reinhart et al., 2000) that are deeply conserved across animal life (Pasquinelli et al., 2000). Moreover, these two miRNA genes were not sporadic examples, but rather belong to an abundant and diverse class of small regulatory RNAs (Lagos-Quintana et al., 2001; Lau et al., 2001; Lee and Ambros, 2001), prevalent across multiple kingdoms of eukaryotic life, including both plants and animals (Reinhart et al., 2002; Lim et al., 2003).

The miRNAs of these species repress a diverse suite of targets, although the mechanisms of targeting differ between plants and animals. While targeting in plants requires extensive complementarity to most of the miRNA (Rhoades et al., 2002), pairing between a 7 nt region in the miRNA and a complementary motif in the target mRNA is necessary to mediate target repression in animals (Doench and Sharp, 2004) and sufficient to predict conserved mRNA targets above the noise of false-positive predictions (Brennecke et al., 2005; Krek et al., 2005; Lewis et al., 2005).
Animal miRNAs can thereby be broadly classified into miRNA families depending on the identity of their 7 nt region sequence, as different miRNAs sharing this sequence tend to share a similar repertoire of targets (Lewis et al., 2005; Anderson et al., 2008).

An interesting finding that arose from the widespread characterization of miRNAs in many animal lineages was that many of them have ancient origins and are deeply conserved across the animal phylogeny (Grimson et al., 2008; Wheeler et al., 2009). Animal miRNAs appear to have evolved concomitantly with the beginnings of multicellularity, as they cannot be detected in choanoflagellates, single-celled organisms considered to be an outgroup to the metazoans (Grimson et al., 2008). Following the classification of animal miRNAs into families, it was soon realized that many miRNA families have persisted for ~580–670 million years since the emergence of bilaterian and metazoan life (Figure 1). Even more miRNA families have emerged among the vertebrate and invertebrate clades more recently in evolutionary time (Figure 1), with thousands of additional species-specific miRNAs being continually annotated in databases (Griffiths-Jones et al., 2008; Kozomara and Griffiths-Jones, 2014).

Given the preponderance of miRNAs among animal species and their deep conservation, it is only natural to wonder about their biological functions. The generation of miRNA knockouts in animals has provided a powerful framework by which to evaluate the in vivo functions of individual miRNAs. While lin-4 and let-7 were discovered to play roles in cell-fate decisions during early C. elegans larval development (Lee et al., 1993; Wightman et al., 1993; Reinhart et al., 2000), little was known about the functions of other worm miRNAs.
Through the genetic dissection of components governing the left-right asymmetry of chemoreception, the lsy-6 miRNA was discovered to repress cog-1, a transcription factor that mediates this cell-fate decision in the worm (Johnston and Hobert, 2003). Strikingly, a systematic knockout of ~90 additional worm miRNAs revealed that most miRNAs and their families are essential neither for development nor viability (Miska et al., 2007; Alvarez-Saavedra and Horvitz, 2010), making it challenging to address why they have been conserved. Potentially explaining this is the finding that many phenotypes can be observed when such knockouts are instead profiled in genetically sensitized backgrounds (Brenner et al., 2010). Furthermore, a large-scale miRNA knockout study in the fly revealed that nearly 80% of miRNAs exhibit a phenotype, often related to survival and lifespan (Chen et al., 2014).

Figure 1. Deep conservation of miRNA families across animal life. Phylogeny of animal life, with single-celled choanoflagellates serving as an outgroup species. Each number on a node of the tree represents the number of shared miRNA families that likely existed in the common ancestor of all of the extant species branching from it. Numbers are derived from the latest annotation of conserved miRNA families released in targetscan.org.

Parallel work has revealed functions for miRNAs in vertebrate species. MicroRNAs are collectively crucial for early vertebrate development. Losing the ability to produce miRNAs in the mouse results in severe abnormalities during day 7.5 of embryonic development (E7.5), ultimately resulting in lethality (Bernstein et al., 2003). Similarly, in early zebrafish development, loss of miRNAs compromises brain morphogenesis (Giraldez et al., 2005), potentially due to the role of miR-430 in the timely clearance of maternally deposited mRNA, a process that is crucial during the maternal-to-zygotic transition (Giraldez et al., 2006).
MicroRNAs also have diverse physiological roles in mammals, impacting limb and axial skeletal development [e.g., miR-196 (Hornstein et al., 2005; McGlinn et al., 2009)], muscle development and cardiac function [e.g., miR-1 (Zhao et al., 2007) and miR-208 (van Rooij et al., 2007)], immune system T cell and B cell development [e.g., miR-150 (Xiao et al., 2007) and miR-155 (Rodriguez et al., 2007; Thai et al., 2007)], immune system granulocyte differentiation [e.g., miR-223 (Johnnidis et al., 2008)], the control of the cell cycle and cancer [e.g., miR-17~92 cluster (Ventura et al., 2008)], and skeletal system osteoclast growth [e.g., miR-34 (Krzeszinski et al., 2014)]. Mirroring the initial findings in the worm, a systematic knockout of ~50 conserved miRNAs in the mouse identified few miRNAs that impact viability (Park et al., 2012), potentially due to functional redundancy among different members of the same miRNA family.

Given the deep conservation of miRNAs and the complex spatiotemporal expression dynamics that they exhibit during animal development, it is counterintuitive to observe that the loss of many individual miRNAs generates only subtle phenotypes. Collectively, these findings have led many to suggest that most animal miRNAs may have evolved to tune the expression of their targets (Bartel and Chen, 2004), preferentially targeting lowly abundant mRNAs (Farh et al., 2005) to reduce their expression noise (i.e., in conjunction with increased transcription), thereby enhancing the precision of protein output during development (Schmiedel et al., 2015).

Biogenesis of microRNAs and mechanisms of targeting

Encoded by genes, animal miRNAs arise as the final product of a multistage biogenesis pathway (Figure 2). Like mRNAs, most miRNAs are transcribed by RNA Polymerase II, as constituents of precursors termed primary miRNAs (or “pri-miRNAs”) (Lee et al., 2004).
A unique property of pri-miRNAs is that they contain a ~60–70 nt region that folds into a hairpin RNA secondary structure, which is found in either introns of protein-coding genes or the exons or introns of non-coding RNA transcripts. The hairpin is recognized by the RNase III enzyme Drosha (Lee et al., 2003), which, along with the co-factor DGCR8 (Denli et al., 2004; Gregory et al., 2004), cleaves the pri-miRNA about 11 nt from the base of the hairpin, thus liberating a miRNA precursor (or “pre-miRNA”). While many hairpins exist in the genome, those recognized by Drosha tend to have distinguishing structural features (Lim et al., 2003), and frequently possess primary sequence motifs, such as a CNNC motif downstream of the basal stem, a UG at the base of the stem, and a UGUG motif in the apical loop (Auyeung et al., 2013). After the pre-miRNA is exported to the cytoplasm by Exportin-5 (Yi et al., 2003; Lund et al., 2004), the RNase III enzyme Dicer cleaves it to yield a double-stranded duplex (Grishok et al., 2001; Hutvagner et al., 2001; Ketting et al., 2001; Knight and Bass, 2001). Finally, one strand of this duplex, termed the mature miRNA, is loaded into Argonaute based upon the thermodynamic asymmetry of the duplex (Khvorova et al., 2003; Schwarz et al., 2003).

Argonaute (Ago) proteins are the effectors of miRNA-mediated repression and are central to the mechanism of target recognition (Grishok et al., 2001; Meister et al., 2004; Vaucheret et al., 2004). It is within the context of this ribonucleoprotein complex that Argonaute provides a molecular scaffold for the miRNA to nucleate pairing to a target RNA through its 5′ end (Schirle et al., 2014). It is common for plant miRNAs to have near-perfect complementarity to their targets, resulting in the Ago-mediated cleavage and ultimate degradation of the target (Rhoades et al., 2002). Although this mechanism also exists in animals, in practice it is rare among most species, with HOXB8 being one of the few known cleavage targets of an animal miRNA (Yekta et al., 2004).

Figure 2. Biogenesis of animal miRNAs and targeting mechanisms. Transcriptional activity within the genome gives rise to a hairpin-forming primary transcript. Drosha recognizes and cleaves this substrate, which is exported to the cytosol and cleaved by Dicer into an RNA duplex. One strand of this duplex is loaded into Argonaute, and upon recognition of a target mRNA through interactions in the miRNA seed region this mature miRNA is competent to modestly repress translation (grey lines) or destabilize the target mRNA (black lines). As an alternate mechanism, if the miRNA pairs very extensively to a target, it can mediate mRNA cleavage.

Instead, the recognition of animal miRNA targets is thought to be guided predominantly by perfect Watson-Crick base pairing interactions between nucleotides 2–7 from the 5′ end of the miRNA (i.e., the “seed” region) and complementary motifs embedded in the 3′ UTRs of the target mRNAs (Lewis et al., 2003; Doench and Sharp, 2004; Brennecke et al., 2005; Lewis et al., 2005; Lim et al., 2005; Bartel, 2009). Numerous studies have reported that functional regions that pair to the seed (i.e., seed matches) are enriched in the 3′ UTRs of transcripts relative to the 5′ UTRs and ORFs (Lewis et al., 2005; Grimson et al., 2007; Baek et al., 2008). This effect has been attributed to the fact that both 5′ UTR and ORF sites exist in the path of actively scanning and translating ribosomes, respectively (Grimson et al., 2007; Gu et al., 2009). Rather than directing cleavage, Ago binding to a seed match results in the recruitment of the CCR4-NOT deadenylase complex through an intermediate scaffold protein known as GW182 (Behm-Ansmant et al., 2006; Eulalio et al., 2008; Braun et al., 2011; Chekulaeva et al., 2011; Fabian et al., 2011).
Although deadenylation leads to a brief period of translational repression (Bazzini et al., 2012; Eichhorn et al., 2014), the predominant effect of a miRNA is to orchestrate the degradation of a target mRNA (Baek et al., 2008; Guo et al., 2010; Eichhorn et al., 2014).

Computational approaches to microRNA target prediction

Many of the principles of miRNA target recognition were discovered through computational means, either through evolutionary analyses investigating the signals of selection, or through analyses of miRNA perturbation datasets to uncover determinants of targeting. Although the high complementarity of plant miRNA targets made it straightforward to derive simple rules to predict such targets (Rhoades et al., 2002; Jones-Rhoades and Bartel, 2004; Allen et al., 2005), it was quickly realized that the prediction of animal miRNA targets was more challenging. An analysis of preferentially conserved miRNA-pairing motifs among three mammalian genomes revealed a signature of enriched pairing to the miRNA 5′ end relative to the sequences of shuffled miRNAs (Lewis et al., 2003). Soon thereafter, it was realized that animal miRNAs recognize several classes of target sites (also known as “miRNA recognition elements”) that typically range from 6 to 8 nt in length (Figure 3A). These are called “canonical site types” because they each maintain perfect Watson–Crick pairing to the seed region of the miRNA (Bartel, 2009). The five canonical site types, each having a signature of conservation among vertebrate genomes, are the 8mer site [match to miRNA positions 2–8 with an A opposite position 1 (Lewis et al., 2005)], 7mer-m8 site [position 2–8 match (Lewis et al., 2003; Brennecke et al., 2005; Krek et al., 2005; Lewis et al., 2005)], 7mer-A1 site [position 2–7 match with an A opposite position 1 (Lewis et al., 2005)], 6mer site [position 2–7 match (Lewis et al., 2005)], and offset 6mer site [position 3–8 match (Friedman et al., 2009)].
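The five canonical site definitions can be illustrated with a short Python sketch that scans a 3′ UTR and labels each match with its most specific site type. The function names, the 0-based coordinate convention, and the example UTR fragments are assumptions made for illustration and are not taken from any published tool; the miRNA sequence is intended to resemble let-7a and serves only as sample input.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_canonical_sites(mirna, utr):
    """List (index, site_type) for canonical sites in `utr`.

    `index` is the 0-based start of the 6-nt core match in the UTR.
    Each site is reported once, under its most specific label.
    Assumes the miRNA is at least 8 nt long.
    """
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    seed_match = reverse_complement(mirna[1:7])    # pairs to miRNA positions 2-7
    offset_match = reverse_complement(mirna[2:8])  # pairs to miRNA positions 3-8
    pos8_partner = COMPLEMENT[mirna[7]]            # base that pairs to position 8

    sites = []
    for i in range(len(utr) - 5):
        core = utr[i:i + 6]
        if core == seed_match:
            # Position 8 pairs one nt upstream (5') of the core in the mRNA;
            # the base opposite position 1 lies one nt downstream (3').
            m8 = i > 0 and utr[i - 1] == pos8_partner
            a1 = i + 6 < len(utr) and utr[i + 6] == "A"
            if m8 and a1:
                site_type = "8mer"
            elif m8:
                site_type = "7mer-m8"
            elif a1:
                site_type = "7mer-A1"
            else:
                site_type = "6mer"
            sites.append((i, site_type))
        elif core == offset_match and utr[i + 1:i + 7] != seed_match:
            # Positions 3-8 pair but position 2 does not.
            sites.append((i, "offset 6mer"))
    return sites


# Illustrative input only: a let-7a-like miRNA and a made-up UTR fragment.
MIRNA = "UGAGGUAGUAGGUUGUAUAGUU"
print(find_canonical_sites(MIRNA, "GGCUACCUCAGG"))  # [(3, '8mer')]
```

Reporting each site under a single label avoids double counting, since every 8mer also contains a 7mer-m8, a 7mer-A1, a 6mer, and an offset 6mer.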
It was discovered that the preference for the conservation of an adenosine opposite position 1 is independent of the identity of the miRNA nucleotide at position 1 (Lewis et al., 2005). Collectively, these rules of pairing have been among the most sensitive signals in detecting animal miRNA targets, and many algorithms search for canonical sites in 3′ UTRs as an initial step towards the identification of miRNA targets (Lewis et al., 2003; Lewis et al., 2005; Gaidatzis et al., 2007; Grimson et al., 2007; Nielsen et al., 2007; Wang and El Naqa, 2008; Garcia et al., 2011; Anders et al., 2012; Reczko et al., 2012). Despite extensive efforts, other site types have not been identified that exhibit a genome-wide signal for preferential conservation, including those possessing only a single mismatch or G:U wobble position to the seed region (Friedman et al., 2009). However, these findings do not preclude the possibility that such functional binding sites exist, or even that some are truly conserved. Indeed, there are a few confirmed instances in which effective sites have been observed to lack canonical seed pairing (i.e., “non-canonical” sites; Figure 3B). For example, very extensive pairing to the 3′ region of the miRNA can compensate for a wobble or mismatch to one of the seed positions (Brennecke et al., 2005; Bartel, 2009), as exemplified by the two let-7 sites within the 3′ UTR of C. elegans lin-41 (Reinhart et al., 2000). These 3′-supplementary sites are exceedingly rare, with conserved miRNA families in mammals and nematodes each averaging <1 preferentially conserved 3′-supplementary site (Friedman et al., 2009; Jan et al., 2011). Other relatively rare, yet effective sites include centered sites, which have 11–12 contiguous Watson–Crick pairs to the center of the miRNA (Shin et al., 2010). Many computational techniques have attempted to identify additional non-canonical sites (Miranda et al., 2006; Kertesz et al., 2007; Griffiths-Jones et al., 2008; Betel et al., 2010; Liu et al., 2010; Sturm et al., 2010; Wen et al., 2011; Vejnar and Zdobnov, 2012; Marin et al., 2013; Bandyopadhyay et al., 2015; Gumienny and Zavolan, 2015), though the utility of these predictions remains unclear given that these sites show no evidence for preferential conservation.

Figure 3. Site types recognized by a miRNA. A) Five canonical site types, often located in 3′ UTRs, which can be recognized by miRNAs. The sites pair perfectly to the miRNA seed region through Watson-Crick base pairing interactions (vertical black lines), aside from an unpaired adenosine that is favored in 8mer or 7mer-A1 sites. B) Two non-canonical site types, characterized by a mismatch or bulge in the seed–target interface, which can be recognized by miRNAs.

The length and information content of the motifs that miRNAs recognize influence the frequency of finding such motifs in genomic sequences. Because plant miRNAs require extensive complementarity to repress targets, they tend to have a small number of targets, which are often important developmental regulators such as transcription factors and hormone signaling proteins (Rhoades et al., 2002; Jones-Rhoades and Bartel, 2004; Allen et al., 2005). In contrast, the small size of animal miRNA target sites means that they occur frequently in 3′ UTRs, which opens the possibility that the network of miRNA targets is much larger in animals. One method of assessing the scope of miRNA targeting in animals has been to quantify the signal for enrichment of predicted miRNA target sites relative to control k-mer sequences with the same length and similar nucleotide composition.
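The idea of comparing a site to composition-matched control k-mers can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions: it counts raw occurrences rather than conserved instances, it uses every rearrangement of the motif as a control (same length, identical composition), and the function names and toy sequences are hypothetical. The published analyses select controls and score conservation far more carefully.

```python
from itertools import permutations


def count_occurrences(kmer, sequences):
    """Count (possibly overlapping) occurrences of `kmer` across sequences."""
    k = len(kmer)
    return sum(
        1
        for seq in sequences
        for i in range(len(seq) - k + 1)
        if seq[i:i + k] == kmer
    )


def composition_matched_controls(kmer):
    """All distinct rearrangements of `kmer`, excluding `kmer` itself."""
    return sorted({"".join(p) for p in permutations(kmer)} - {kmer})


def signal_and_background(kmer, sequences):
    """Return (signal, background): the motif count and the mean control count."""
    controls = composition_matched_controls(kmer)
    if not controls:
        raise ValueError("motif has no distinct rearrangements to use as controls")
    signal = count_occurrences(kmer, sequences)
    background = sum(count_occurrences(c, sequences) for c in controls) / len(controls)
    return signal, background


# Toy sequences, for illustration only.
signal, background = signal_and_background("AUG", ["AUGAUG", "GUAAUG"])
print(signal, background)  # 3 0.6
```

A signal well above the background suggests that the motif occurs more often than expected from its length and composition alone.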
These estimates of the scope of targeting have evolved with time depending upon: i) the availability of sequenced genomes for comparative analysis among species, ii) the quality of genome-wide multiple sequence alignments, and iii) the sophistication of evolutionary genomic techniques to detect signals for selection. The first attempts to estimate this number suggested that miRNAs conserved among vertebrates target at least 400 mRNAs, or 1–2% of mRNAs (Lewis et al., 2003). As more mammalian genomes became available, this estimate expanded to 20–30% of mRNAs (Lewis et al., 2005; Xie et al., 2005). Finally, a method that accounted for the relatedness of species among a phylogeny and controlled for both dinucleotide and 3′ UTR conservation rates significantly expanded this estimate, implicating >60% of mRNAs as having undergone selective pressure to maintain pairing to miRNAs (Friedman et al., 2009). This finding illustrates the widespread connectivity of the miRNA targeting network (Figure 4), with >400 conserved targeting interactions on average per conserved miRNA family, and 4–5 conserved sites on average per mRNA (Friedman et al., 2009). In reality, the number of functional miRNA target sites is likely much higher as most sites are non-conserved yet can still function to reduce mRNA levels and protein output (Farh et al., 2005; Krutzfeldt et al., 2005; Lim et al., 2005; Grimson et al., 2007; Baek et al., 2008; Selbach et al., 2008).

Figure 4. Widespread connectivity of miRNAs in gene-regulatory networks. Graph of the proposed connectivity structure of a typical vertebrate-conserved miRNA in its network (above). The length of each edge emerging from the miRNA represents the amount of repression conferred upon that individual target. Most edges are short, indicating a large number of targets are weakly repressed. View from the perspective of a conserved mRNA (below). On average, each mRNA has 4–5 conserved sites in its 3′ UTR for different miRNAs, and more non-conserved sites (not shown).

Despite the immensity of the miRNA–target regulatory network, the vast majority of target sites confer little to no repression (Figure 4). This implies that the mere presence of a target site is not always sufficient to mediate repression, and that other determinants can influence site efficacy. Over the years, computational work re-analyzing transcriptome-wide data in the context of a miRNA perturbation has revealed a number of such determinants. The earliest determinants were discovered as features that simply display a correlation to increased site efficacy, and could thereby be utilized to generate predictive models of target site efficacy. Factors that have been shown to influence site efficacy include A/U composition in the site’s 3′ UTR (Robins and Press, 2005; Hausser et al., 2009), site conservation (Nielsen et al., 2007; Friedman et al., 2009), A/U composition in the vicinity of the target site (Grimson et al., 2007; Nielsen et al., 2007), proximity of the site to the stop codon or poly(A) tail (Grimson et al., 2007), 3′ UTR length (Hausser et al., 2009), target sites in the ORF (Grimson et al., 2007; Reczko et al., 2012), RNA secondary structure in the vicinity of the target site (Kertesz et al., 2007), thermodynamic stability of base-pairing (Garcia et al., 2011), and target site abundance in the transcriptome (Arvey et al., 2010; Garcia et al., 2011). The very best targets of a miRNA often have multiple 3′ UTR binding sites, as these sites typically behave either independently (Grimson et al., 2007; Nielsen et al., 2007) or cooperatively (Grimson et al., 2007) depending upon their distance from each other. In building quantitative models of miRNA target prediction, different groups have each evaluated only a subset of these features.
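As one concrete example of such a feature, the A/U composition in the vicinity of a target site can be computed directly from the 3′ UTR sequence. The Python sketch below is a minimal, unweighted version: the 30-nt flank size, the function name, and the choice to exclude the site itself are assumptions made for illustration, and the position-dependent weighting schemes used in the cited studies are not reproduced here.

```python
def local_au_content(utr, site_start, site_end, flank=30):
    """Fraction of A/U nucleotides flanking a site.

    `site_start` and `site_end` are 0-based, end-exclusive UTR coordinates.
    Up to `flank` nt on each side are considered; the site itself is excluded.
    Returns NaN when no flanking sequence is available.
    """
    utr = utr.upper().replace("T", "U")
    upstream = utr[max(site_start - flank, 0):site_start]
    downstream = utr[site_end:site_end + flank]
    context = upstream + downstream
    if not context:
        return float("nan")
    return sum(base in "AU" for base in context) / len(context)


# Toy sequence, for illustration only: a 6-nt site flanked by 4 nt on each side.
print(local_au_content("AAAAGGGGGGCCUU", site_start=4, site_end=10, flank=4))  # 0.75
```

A feature of this kind would be computed for every site and then supplied, together with other features, to whatever model is being trained.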
Early work used hand-selected features and trained the parameters of a regression model on experimental data (Grimson et al., 2007; Kertesz et al., 2007; Nielsen et al., 2007; Garcia et al., 2011). In principle, a better approach would be to automate the selection of features using techniques from machine learning, which would avoid the potential pitfalls of having preconceptions of which features are useful. Many algorithms have attempted this (Wang and El Naqa, 2008; Betel et al., 2010; Liu et al., 2010; Wen et al., 2011; Reczko et al., 2012; Vejnar and Zdobnov, 2012), but in practice their empirical performance remains unclear because there have been few comprehensive comparisons to evaluate their predictive accuracy. Furthermore, the quality of such a model depends heavily upon the nature of the training set used. The prediction of miRNA targets is crucial in assessing our understanding of the features influencing miRNA targeting, generating predictions of targets that the experimental community can prioritize as candidates of interest, and interpreting the functions of individual miRNAs in the context of the gene-regulatory networks to which they belong.

Experimental approaches to microRNA target identification

In the quest to characterize miRNA targets, experimentation has proven crucial both for assessing site efficacy and for directly probing miRNA–target interactions. The earliest experiments were low-throughput, validating the effects of single miRNA–target interactions using luciferase reporter assays (Doench and Sharp, 2004). In such an experiment, a miRNA was transfected into cultured cells and relative luciferase activity was measured for reporters fused to a 3′ UTR harboring a wild-type or mutated target site for the miRNA.
To parallelize such measurements and quantify the effects of a miRNA on endogenous genes, improved high-throughput methods were developed (Figure 5A), using microarrays to assess the effects of a transfected miRNA on the entire transcriptome (Lim et al., 2005). This approach, which obtains global mRNA fold change information in the context of a miRNA transfection, has provided a valuable resource for inquiry into the determinants that influence mRNA repression (Lim et al., 2005; Birmingham et al., 2006; Jackson et al., 2006a; Jackson et al., 2006b; Schwarz et al., 2006; Grimson et al., 2007; Linsley et al., 2007; Anderson et al., 2008). It has uncovered the key principle that the site type is the most important determinant in predicting the amount of repression an mRNA will experience (Grimson et al., 2007; Nielsen et al., 2007) (Figure 5B). It has also confirmed that miRNAs can regulate the expression levels of hundreds of mRNAs simultaneously (Lim et al., 2005), reinforcing the enormity of animal miRNA–target regulatory networks (Figure 4).

The crosslinking and immunoprecipitation (CLIP) approach has emerged as a powerful means of interrogating RNA–protein interactions (Ule et al., 2003). Such an approach depends upon the property that ultraviolet light can induce covalent crosslinks between amino acids and nucleic acids within short distances. Immunoprecipitation of the RNA-binding protein of interest following this crosslinking step can help isolate the fragments of RNA that it interacts with (Ule et al., 2003). CLIP combined with high-throughput sequencing [i.e., “HITS-CLIP” (Chi et al., 2009; Loeb et al., 2012)] and photoactivatable-ribonucleoside-enhanced variants of the technique [i.e., “PAR-CLIP” (Hafner et al., 2010; Lipchina et al., 2011)] have thus become important orthogonal approaches to identify regions of RNA bound by Argonaute in vivo. These approaches all observe significant enrichment for seed-matched sites that are cognate to highly abundant miRNAs in the vicinity of the crosslinks, validating their ability to detect authentic sites. However, they also suffer from the possibility of identifying many false positives, due in part to non-specificity of the IP (Friedersdorf and Keene, 2014), cross-linking bias (Lambert et al., 2014), and the difficulty of controlling for spurious background signals arising from highly abundant mRNAs (Jaskiewicz et al., 2012). The interpretation of CLIP datasets is further complicated by the fact that cells express a diversity of miRNAs, and information regarding which footprint corresponds to which miRNA has been lost. Therefore, a great deal of effort has been devoted to inferring the specific miRNA associated with each region sequenced (Chi et al., 2009; Hafner et al., 2010; Kishore et al., 2011; Jaskiewicz et al., 2012; Khorshid et al., 2013; Majoros et al., 2013).

Figure 5. Measuring the effect of miRNAs on the transcriptome. A) Outline of a typical miRNA transfection experiment performed in mammalian cell culture, in which relative mRNA expression levels are measured using microarrays and compared between miRNA transfection (the experimental group) and mock transfection (the control group) conditions. B) A plot of cumulative distributions of mRNA fold changes from the experiment outlined in part (A), comparing mRNAs lacking a canonical site in their 3′ UTR (black line) to those possessing a single instance of the indicated canonical site type in their 3′ UTR (colored lines). Each point of the plot represents the proportion of mRNAs with fold changes less than or equal to the corresponding fold change value on the x-axis. The distribution of fold changes is left-shifted in mRNAs possessing sites, indicating a global pattern of down-regulation of these mRNAs. Furthermore, the magnitude of this shift is indicative of the relative strength of the site type. Figure 5B is reproduced from Friedman et al. (2009).
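The cumulative distributions described for Figure 5B can be computed with a few lines of Python. The sketch below builds an empirical cumulative distribution function for two sets of fold changes; the function name and the log2 fold-change values are made up for illustration and do not come from any dataset discussed here.

```python
from bisect import bisect_right


def make_ecdf(fold_changes):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(fold_changes)
    if not ordered:
        raise ValueError("at least one fold change is required")

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / len(ordered)

    return ecdf


# Made-up log2 fold changes, for illustration only.
no_site = [-0.2, 0.0, 0.1, 0.3]
with_site = [-0.8, -0.5, -0.1, 0.1]

f_no_site = make_ecdf(no_site)
f_with_site = make_ecdf(with_site)
print(f_no_site(0.0), f_with_site(0.0))  # 0.5 0.75
```

A left-shifted distribution appears as a larger cumulative fraction at the same fold change, as for the site-containing set in this toy example.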
In attempts to circumvent these limitations of CLIP, other biochemical strategies have been devised. One such technique is called IMPACT-seq (identification of miRNA-responsive elements by pull-down and alignment of captive transcripts–sequencing), which sequences mRNA fragments that co-purify with a biotinylated miRNA without the need for crosslinking (Tan et al., 2014). Another is called CLASH (crosslinking, ligation, and sequencing of hybrids), a high-throughput technique that generates miRNA–mRNA chimeras, each of which identifies a miRNA and the mRNA region that it binds (Helwak et al., 2013). A re-analysis found that many miRNA–mRNA chimeras also exist in Ago CLIP datasets, likely due to the activity of an endogenous RNA ligase (Grosswendt et al., 2014). Although chimeras unambiguously identify miRNA–mRNA interactions, they too have limitations in that: i) chimeras are rare and comprise only a small subpopulation of the sequencing data, and ii) certain types of interactions may be favored in the miRNA–mRNA ligation over others, giving a potentially biased representation of targeting interactions (Helwak et al., 2013; Grosswendt et al., 2014).

Many of the RNA–protein interactions recovered from CLIP do not contain canonical sites that are cognate to any miRNA known to be expressed in the corresponding cell line (Hafner et al., 2010; Chi et al., 2012). This observation has led to the proposition that novel types of non-canonical sites might explain the detection of these “orphan clusters”. The first novel non-canonical site to be discovered as enriched in CLIP data was the “nucleation-bulge” site, which is characterized by a single-nucleotide bulge in the target site between nucleotides 6–7 of the miRNA (Chi et al., 2012). Another study identified ~30 non-canonical miR-155 sites, each with heterogeneous styles of pairing in the seed region, in wild-type but not miR-155 knockout T cells (Loeb et al., 2012).
Data derived from CLASH further extended the types of non-canonical sites, implicating sites with stronger pairing in the center or 3′ end of a miRNA as governing binding (Helwak et al., 2013). Chimeras detected in CLIP, in contrast, have suggested that a variety of non-canonical sites exist in worms and mammals, although these sites tend to maintain pairing to the seed region of the miRNA (Grosswendt et al., 2014). Finally, methods not relying on crosslinking have proposed that weak pairing in the 5′ or 3′ end of the miRNA is sufficient for binding (Tan et al., 2014). Because each study converges on different styles of non-canonical pairing, there does not appear to exist a unifying theme to explain the types of non-canonical sites that have been observed. The studies only agree that novel non-canonical sites that can mediate mRNA repression are more widespread than previously imagined. They all propose to expand the definition of functional target sites to incorporate non-canonical sites, a move poised to at least double the number of functional sites. Collectively, experiments focused on collecting transcriptome-wide data have provided foundational information for unraveling the structure of miRNA regulatory networks.

References

Allen, E., Xie, Z., Gustafson, A.M., and Carrington, J.C. (2005). microRNA-directed phasing during trans-acting siRNA biogenesis in plants. Cell 121, 207-221.
Alvarez-Saavedra, E., and Horvitz, H.R. (2010). Many Families of C. elegans MicroRNAs Are Not Essential for Development or Viability. Current Biology 20, 367-373.
Anders, G., Mackowiak, S.D., Jens, M., Maaskola, J., Kuntzagk, A., Rajewsky, N., Landthaler, M., and Dieterich, C. (2012). doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res 40, D180-D186.
Anderson, E.M., Birmingham, A., Baskerville, S., Reynolds, A., Maksimova, E., Leake, D., Fedorov, Y., Karpilow, J., and Khvorova, A. (2008). Experimental validation of the importance of seed complement frequency to siRNA specificity. RNA 14, 853-861.
Arvey, A., Larsson, E., Sander, C., Leslie, C.S., and Marks, D.S. (2010). Target mRNA abundance dilutes microRNA and siRNA activity. Mol Syst Biol 6, 363.
Auyeung, V.C., Ulitsky, I., McGeary, S.E., and Bartel, D.P. (2013). Beyond secondary structure: primary-sequence determinants license pri-miRNA hairpins for processing. Cell 152, 844-858.
Baek, D., Villen, J., Shin, C., Camargo, F.D., Gygi, S.P., and Bartel, D.P. (2008). The impact of microRNAs on protein output. Nature 455, 64-71.
Bandyopadhyay, S., Ghosh, D., Mitra, R., and Zhao, Z. (2015). MBSTAR: multiple instance learning for predicting specific functional binding sites in microRNA targets. Sci Rep 5, 8004.
Bartel, D.P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281-297.
Bartel, D.P. (2009). MicroRNAs: target recognition and regulatory functions. Cell 136, 215-233.
Bartel, D.P., and Chen, C.Z. (2004). Micromanagers of gene expression: the potentially widespread influence of metazoan microRNAs. Nature Reviews Genetics 5, 396-400.
Bazzini, A.A., Lee, M.T., and Giraldez, A.J. (2012). Ribosome Profiling Shows That miR-430 Reduces Translation Before Causing mRNA Decay in Zebrafish. Science 336, 233-237.
Behm-Ansmant, I., Rehwinkel, J., Doerks, T., Stark, A., Bork, P., and Izaurralde, E. (2006). mRNA degradation by miRNAs and GW182 requires both CCR4:NOT deadenylase and DCP1:DCP2 decapping complexes. Genes & Development 20, 1885-1898.
Bernstein, E., Kim, S.Y., Carmell, M.A., Murchison, E.P., Alcorn, H., Li, M.Z., Mills, A.A., Elledge, S.J., Anderson, K.V., and Hannon, G.J. (2003). Dicer is essential for mouse development. Nature Genetics 35, 215-217.
Betel, D., Koppal, A., Agius, P., Sander, C., and Leslie, C. (2010). Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites. Genome Biol 11, R90.
Birmingham, A., Anderson, E.M., Reynolds, A., Ilsley-Tyree, D., Leake, D., Fedorov, Y., Baskerville, S., Maksimova, E., Robinson, K., Karpilow, J., et al. (2006). 3' UTR seed matches, but not overall identity, are associated with RNAi off-targets. Nat Methods 3, 199-204.
Braun, J.E., Huntzinger, E., Fauser, M., and Izaurralde, E. (2011). GW182 proteins directly recruit cytoplasmic deadenylase complexes to miRNA targets. Mol Cell 44, 120-133.
Brennecke, J., Stark, A., Russell, R.B., and Cohen, S.M. (2005). Principles of microRNA-target recognition. PLoS Biol 3, e85.
Brenner, J.L., Jasiewicz, K.L., Fahley, A.F., Kemp, B.J., and Abbott, A.L. (2010). Loss of individual microRNAs causes mutant phenotypes in sensitized genetic backgrounds in C. elegans. Curr Biol 20, 1321-1325.
Carroll, S.B. (2008). Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution. Cell 134, 25-36.
Carvunis, A.R., Rolland, T., Wapinski, I., Calderwood, M.A., Yildirim, M.A., Simonis, N., Charloteaux, B., Hidalgo, C.A., Barbette, J., Santhanam, B., et al. (2012). Proto-genes and de novo gene birth. Nature 487, 370-374.
Chekulaeva, M., Mathys, H., Zipprich, J.T., Attig, J., Colic, M., Parker, R., and Filipowicz, W. (2011). miRNA repression involves GW182-mediated recruitment of CCR4-NOT through conserved W-containing motifs. Nat Struct Mol Biol 18, 1218-1226.
Chen, Y.W., Song, S.L., Weng, R.F., Verma, P., Kugler, J.M., Buescher, M., Rouam, S., and Cohen, S.M. (2014). Systematic Study of Drosophila MicroRNA Functions Using a Collection of Targeted Knockout Mutations. Developmental Cell 31, 784-800.
Chi, S.W., Hannon, G.J., and Darnell, R.B. (2012). An alternative mode of microRNA target recognition. Nat Struct Mol Biol 19, 321-327.
Chi, S.W., Zang, J.B., Mele, A., and Darnell, R.B. (2009). Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature 460, 479-486.
Crick, F. (1970). Central dogma of molecular biology. Nature 227, 561-563.
David, L.A., and Alm, E.J. (2011). Rapid evolutionary innovation during an Archaean genetic expansion. Nature 469, 93-96.
Denli, A.M., Tops, B.B.J., Plasterk, R.H.A., Ketting, R.F., and Hannon, G.J. (2004). Processing of primary microRNAs by the Microprocessor complex. Nature 432, 231-235.
Doench, J.G., and Sharp, P.A. (2004). Specificity of microRNA target selection in translational repression. Genes Dev 18, 504-511.
Eichhorn, S.W., Guo, H.L., McGeary, S.E., Rodriguez-Mias, R.A., Shin, C., Baek, D., Hsu, S.H., Ghoshal, K., Villen, J., and Bartel, D.P. (2014). mRNA Destabilization Is the Dominant Effect of Mammalian MicroRNAs by the Time Substantial Repression Ensues. Molecular Cell 56, 104-115.
Eulalio, A., Huntzinger, E., and Izaurralde, E. (2008). GW182 interaction with Argonaute is essential for miRNA-mediated translational repression and mRNA decay. Nat Struct Mol Biol 15, 346-353.
Fabian, M.R., Cieplak, M.K., Frank, F., Morita, M., Green, J., Srikumar, T., Nagar, B., Yamamoto, T., Raught, B., Duchaine, T.F., et al. (2011). miRNA-mediated deadenylation is orchestrated by GW182 through two conserved motifs that interact with CCR4-NOT. Nature Structural & Molecular Biology 18, 1211-U1252.
Farh, K.K., Grimson, A., Jan, C., Lewis, B.P., Johnston, W.K., Lim, L.P., Burge, C.B., and Bartel, D.P. (2005). The widespread impact of mammalian MicroRNAs on mRNA repression and evolution. Science 310, 1817-1821.
Friedersdorf, M.B., and Keene, J.D. (2014). Advancing the functional utility of PAR-CLIP by quantifying background binding to mRNAs and lncRNAs. Genome Biol 15, R2.
Friedman, R.C., Farh, K.K., Burge, C.B., and Bartel, D.P. (2009). Most mammalian mRNAs are conserved targets of microRNAs. Genome Research 19, 92-105.
Gaidatzis, D., Nimwegen, E., Hausser, J., and Zavolan, M. (2007). Inference of miRNA targets using evolutionary conservation and pathway analysis. BMC Bioinformatics 8, 248.
Garcia, D.M., Baek, D., Shin, C., Bell, G.W., Grimson, A., and Bartel, D.P. (2011). Weak seed-pairing stability and high target-site abundance decrease the proficiency of lsy-6 and other microRNAs. Nat Struct Mol Biol 18, 1139-1146.
Giraldez, A.J., Cinalli, R.M., Glasner, M.E., Enright, A.J., Thomson, J.M., Baskerville, S., Hammond, S.M., Bartel, D.P., and Schier, A.F. (2005). MicroRNAs regulate brain morphogenesis in zebrafish. Science 308, 833-838.
Giraldez, A.J., Mishima, Y., Rihel, J., Grocock, R.J., Van Dongen, S., Inoue, K., Enright, A.J., and Schier, A.F. (2006). Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Science 312, 75-79.
Glisovic, T., Bachorik, J.L., Yong, J., and Dreyfuss, G. (2008). RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett 582, 1977-1986.
Gregory, R.I., Yan, K.P., Amuthan, G., Chendrimada, T., Doratotaj, B., Cooch, N., and Shiekhattar, R. (2004). The Microprocessor complex mediates the genesis of microRNAs. Nature 432, 235-240.
Griffiths-Jones, S., Saini, H.K., van Dongen, S., and Enright, A.J. (2008). miRBase: tools for microRNA genomics. Nucleic Acids Res 36, D154-158.
Grimson, A., Farh, K.K., Johnston, W.K., Garrett-Engele, P., Lim, L.P., and Bartel, D.P. (2007). MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Molecular Cell 27, 91-105.
Grimson, A., Srivastava, M., Fahey, B., Woodcroft, B.J., Chiang, H.R., King, N., Degnan, B.M., Rokhsar, D.S., and Bartel, D.P. (2008). Early origins and evolution of microRNAs and Piwi-interacting RNAs in animals. Nature 455, 1193-1197.
Grishok, A., Pasquinelli, A.E., Conte, D., Li, N., Parrish, S., Ha, I., Baillie, D.L., Fire, A., Ruvkun, G., and Mello, C.C. (2001). Genes and mechanisms related to RNA interference regulate expression of the small temporal RNAs that control C. elegans developmental timing. Cell 106, 23-34.
Grosswendt, S., Filipchyk, A., Manzano, M., Klironomos, F., Schilling, M., Herzog, M., Gottwein, E., and Rajewsky, N. (2014). Unambiguous Identification of miRNA:Target Site Interactions by Different Types of Ligation Reactions. Molecular Cell.
Gu, S., Jin, L., Zhang, F.J., Sarnow, P., and Kay, M.A. (2009). Biological basis for restriction of microRNA targets to the 3' untranslated region in mammalian mRNAs. Nat Struct Mol Biol 16, 144-150.
Gumienny, R., and Zavolan, M. (2015). Accurate transcriptome-wide prediction of microRNA targets and small interfering RNA off-targets with MIRZA-G. Nucleic Acids Res.
Guo, H., Ingolia, N.T., Weissman, J.S., and Bartel, D.P. (2010). Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature 466, 835-840.
Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., Rothballer, A., Ascano, M., Jungkamp, A.C., Munschauer, M., et al. (2010). Transcriptome-wide Identification of RNA-Binding Protein and MicroRNA Target Sites by PAR-CLIP. Cell 141, 129-141.
Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J., Macisaac, K.D., Danford, T.W., Hannett, N.M., Tagne, J.B., Reynolds, D.B., Yoo, J., et al. (2004). Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99-104.
Hausser, J., Landthaler, M., Jaskiewicz, L., Gaidatzis, D., and Zavolan, M. (2009). Relative contribution of sequence and structure features to the mRNA binding of Argonaute/EIF2C-miRNA complexes and the degradation of miRNA targets. Genome Research 19, 2009-2020.
Helwak, A., Kudla, G., Dudnakova, T., and Tollervey, D. (2013). Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding. Cell 153, 654-665.
Hornstein, E., Mansfield, J.H., Yekta, S., Hu, J.K.H., Harfe, B.D., McManus, M.T., Baskerville, S., Bartel, D.P., and Tabin, C.J. (2005). The microRNA miR-196 acts upstream of Hoxb8 and Shh in limb development. Nature 438, 671-674.
Hutvagner, G., McLachlan, J., Pasquinelli, A.E., Balint, E., Tuschl, T., and Zamore, P.D. (2001). A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. Science 293, 834-838.
Jackson, A.L., Burchard, J., Leake, D., Reynolds, A., Schelter, J., Guo, J., Johnson, J.M., Lim, L., Karpilow, J., Nichols, K., et al. (2006a). Position-specific chemical modification of siRNAs reduces "off-target" transcript silencing. RNA 12, 1197-1205.
Jackson, A.L., Burchard, J., Schelter, J., Chau, B.N., Cleary, M., Lim, L., and Linsley, P.S. (2006b). Widespread siRNA "off-target" transcript silencing mediated by seed region sequence complementarity. RNA 12, 1179-1187.
Jacob, F., and Monod, J. (1961). Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3, 318-356.
Jan, C.H., Friedman, R.C., Ruby, J.G., and Bartel, D.P. (2011). Formation, regulation and evolution of Caenorhabditis elegans 3'UTRs. Nature.
Jaskiewicz, L., Bilen, B., Hausser, J., and Zavolan, M. (2012). Argonaute CLIP – a method to identify in vivo targets of miRNAs. Methods 58, 106-112.
Johnnidis, J.B., Harris, M.H., Wheeler, R.T., Stehling-Sun, S., Lam, M.H., Kirak, O., Brummelkamp, T.R., Fleming, M.D., and Camargo, F.D. (2008). Regulation of progenitor cell proliferation and granulocyte function by microRNA-223. Nature 451, 1125-1129.
Johnston, R.J., and Hobert, O. (2003). A microRNA controlling left/right neuronal asymmetry in Caenorhabditis elegans. Nature 426, 845-849.
Jones-Rhoades, M.W., and Bartel, D.P. (2004). Computational identification of plant MicroRNAs and their targets, including a stress-induced miRNA. Molecular Cell 14, 787-799.
Jovanovic, M., Rooney, M.S., Mertins, P., Przybylski, D., Chevrier, N., Satija, R., Rodriguez, E.H., Fields, A.P., Schwartz, S., Raychowdhury, R., et al. (2015). Immunogenetics. Dynamic profiling of the protein life cycle in response to pathogens. Science 347, 1259038.
Kaessmann, H. (2010). Origins, evolution, and phenotypic impact of new genes. Genome Res 20, 1313-1326.
Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U., and Segal, E. (2007). The role of site accessibility in microRNA target recognition. Nat Genet 39, 1278-1284. Ketting, R.F., Fischer, S.E.J., Bernstein, E., Sijen, T., Hannon, G.J., and Plasterk, R.H.A. (2001). Dicer functions in RNA interference and in synthesis of small RNA involved in developmental timing in C-elegans. Genes & Development 15, 2654- 2659. Khorshid, M., Hausser, J., Zavolan, M., and van Nimwegen, E. (2013). A biophysical miRNA-mRNA interaction model infers canonical and noncanonical targets. Nat Methods 10, 253-255. Khvorova, A., Reynolds, A., and Jayasena, S.D. (2003). Functional siRNAs and miRNAs exhibit strand bias. Cell 115, 505-505. Kishore, S., Jaskiewicz, L., Burger, L., Hausser, J., Khorshid, M., and Zavolan, M. (2011). A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nat Methods 8, 559-564. Knight, S.W., and Bass, B.L. (2001). A role for the RNase III enzyme DCR-1 in RNA interference and germ line development in Caenorhabditis elegans. Science 293, 2269-2271. Kozomara, A., and Griffiths-Jones, S. (2014). miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Research 42, D68-D73. Krek, A., Grun, D., Poy, M.N., Wolf, R., Rosenberg, L., Epstein, E.J., MacMenamin, P., da Piedade, I., Gunsalus, K.C., Stoffel, M., et al. (2005). Combinatorial microRNA target predictions. Nat Genet 37, 495-500. Krutzfeldt, J., Rajewsky, N., Braich, R., Rajeev, K.G., Tuschl, T., Manoharan, M., and Stoffel, M. (2005). Silencing of microRNAs in vivo with 'antagomirs'. Nature 438, 685-689. Krzeszinski, J.Y., Wei, W., Huynh, H., Jin, Z., Wang, X., Chang, T.C., Xie, X.J., He, L., Mangala, L.S., Lopez-Berestein, G., et al. (2014). miR-34a blocks osteoporosis and bone metastasis by inhibiting osteoclastogenesis and Tgif2. Nature 512, 431- 435. Lagos-Quintana, M., Rauhut, R., Lendeckel, W., and Tuschl, T. (2001). 
Identification of novel genes coding for small expressed RNAs. Science 294, 853-858. Lambert, N., Robertson, A., Jangi, M., McGeary, S., Sharp, P.A., and Burge, C.B. (2014). RNA Bind-n-Seq: quantitative assessment of the sequence and structural binding specificity of RNA binding proteins. Mol Cell 54, 887-900. Lau, N.C., Lim, L.P., Weinstein, E.G., and Bartel, D.P. (2001). An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294, 858-862. Lee, R.C., and Ambros, V. (2001). An extensive class of small RNAs in Caenorhabditis elegans. Science 294, 862-864. Lee, R.C., Feinbaum, R.L., and Ambros, V. (1993). The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75, 843-854. Lee, Y., Ahn, C., Han, J.J., Choi, H., Kim, J., Yim, J., Lee, J., Provost, P., Radmark, O., Kim, S., et al. (2003). The nuclear RNase III Drosha initiates microRNA processing. Nature 425, 415-419. Lee, Y., Kim, M., Han, J.J., Yeom, K.H., Lee, S., Baek, S.H., and Kim, V.N. (2004). MicroRNA genes are transcribed by RNA polymerase II. Embo Journal 23, 4051-4060. Levine, M., and Tjian, R. (2003). Transcription regulation and animal diversity. Nature 424, 147-151. Lewis, B.P., Burge, C.B., and Bartel, D.P. (2005). Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15-20. Lewis, B.P., Shih, I.H., Jones-Rhoades, M.W., Bartel, D.P., and Burge, C.B. (2003). Prediction of mammalian microRNA targets. Cell 115, 787-798. Li, J.J., Bickel, P.J., and Biggin, M.D. (2014). System wide analyses have underestimated protein abundances and the importance of transcription in mammals. Peerj 2. Lim, L.P., Glasner, M.E., Yekta, S., Burge, C.B., and Bartel, D.P. (2003). Vertebrate microRNA genes. Science 299, 1540. Lim, L.P., Lau, N.C., Garrett-Engele, P., Grimson, A., Schelter, J.M., Castle, J., Bartel, D.P., Linsley, P.S., and Johnson, J.M. (2005).
Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 433, 769-773. Linsley, P.S., Schelter, J., Burchard, J., Kibukawa, M., Martin, M.M., Bartz, S.R., Johnson, J.M., Cummins, J.M., Raymond, C.K., Dai, H., et al. (2007). Transcripts targeted by the microRNA-16 family cooperatively regulate cell cycle progression. Mol Cell Biol 27, 2240-2252. Lipchina, I., Elkabetz, Y., Hafner, M., Sheridan, R., Mihailovic, A., Tuschl, T., Sander, C., Studer, L., and Betel, D. (2011). Genome-wide identification of microRNA targets in human ES cells reveals a role for miR-302 in modulating BMP response. Genes & Development 25, 2173-2186. Liu, H., Yue, D., Chen, Y., Gao, S.J., and Huang, Y. (2010). Improving performance of mammalian microRNA target prediction. BMC Bioinformatics 11, 476. Loeb, G.B., Khan, A.A., Canner, D., Hiatt, J.B., Shendure, J., Darnell, R.B., Leslie, C.S., and Rudensky, A.Y. (2012). Transcriptome-wide miR-155 Binding Map Reveals Widespread Noncanonical MicroRNA Targeting. Molecular Cell 48, 760-770. Lund, E., Guttinger, S., Calado, A., Dahlberg, J.E., and Kutay, U. (2004). Nuclear export of microRNA precursors. Science 303, 95-98. Majoros, W.H., Lekprasert, P., Mukherjee, N., Skalsky, R.L., Corcoran, D.L., Cullen, B.R., and Ohler, U. (2013). MicroRNA target site identification by integrating sequence and binding information. Nat Methods 10, 630-633. Marin, R.M., Sulc, M., and Vanicek, J. (2013). Searching the coding region for microRNA targets. RNA 19, 467-474. McGlinn, E., Yekta, S., Mansfield, J.H., Soutschek, J., Bartel, D.P., and Tabin, C.J. (2009). In ovo application of antagomiRs indicates a role for miR-196 in patterning the chick axial skeleton through Hox gene regulation. Proceedings of the National Academy of Sciences of the United States of America 106, 18610-18615. Meister, G., Landthaler, M., Patkaniowska, A., Dorsett, Y., Teng, G., and Tuschl, T. (2004).
Human Argonaute2 mediates RNA cleavage targeted by miRNAs and siRNAs. Mol Cell 15, 185-197. Miranda, K.C., Huynh, T., Tay, Y., Ang, Y.S., Tam, W.L., Thomson, A.M., Lim, B., and Rigoutsos, I. (2006). A pattern-based method for the identification of microRNA binding sites and their corresponding heteroduplexes. Cell 126, 1203-1217. Miska, E.A., Alvarez-Saavedra, E., Abbott, A.L., Lau, N.C., Hellman, A.B., McGonagle, S.M., Bartel, D.P., Ambros, V.R., and Horvitz, H.R. (2007). Most Caenorhabditis elegans microRNAs are individually not essential for development or viability. Plos Genetics 3, 2395-2403. Nielsen, C.B., Shomron, N., Sandberg, R., Hornstein, E., Kitzman, J., and Burge, C.B. (2007). Determinants of targeting by endogenous and exogenous microRNAs and siRNAs. RNA 13, 1894-1910. Park, C.Y., Jeker, L.T., Carver-Moore, K., Oh, A., Liu, H.J., Cameron, R., Richards, H., Li, Z.M., Adler, D., Yoshinaga, Y., et al. (2012). A Resource for the Conditional Ablation of microRNAs in the Mouse. Cell Reports 1, 385-391. Pasquinelli, A.E., Reinhart, B.J., Slack, F., Martindale, M.Q., Kuroda, M.I., Maller, B., Hayward, D.C., Ball, E.E., Degnan, B., Muller, P., et al. (2000). Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA. Nature 408, 86-89. Reczko, M., Maragkakis, M., Alexiou, P., Grosse, I., and Hatzigeorgiou, A.G. (2012). Functional microRNA targets in protein coding sequences. Bioinformatics 28, 771-776. Reinhart, B.J., Slack, F.J., Basson, M., Pasquinelli, A.E., Bettinger, J.C., Rougvie, A.E., Horvitz, H.R., and Ruvkun, G. (2000). The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature 403, 901-906. Reinhart, B.J., Weinstein, E.G., Rhoades, M.W., Bartel, B., and Bartel, D.P. (2002). MicroRNAs in plants. Genes Dev 16, 1616-1626. Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., et al. (2000). 
Genome-wide location and function of DNA binding proteins. Science 290, 2306-2309. Rhoades, M.W., Reinhart, B.J., Lim, L.P., Burge, C.B., Bartel, B., and Bartel, D.P. (2002). Prediction of plant microRNA targets. Cell 110, 513-520. Robins, H., and Press, W.H. (2005). Human microRNAs target a functionally distinct population of genes with AT-rich 3' UTRs. Proc Natl Acad Sci USA 102, 15557-15562. Rodriguez, A., Vigorito, E., Clare, S., Warren, M.V., Couttet, P., Soond, D.R., van Dongen, S., Grocock, R.J., Das, P.P., Miska, E.A., et al. (2007). Requirement of bic/microRNA-155 for normal immune function. Science 316, 608-611. Schirle, N.T., Sheu-Gruttadauria, J., and MacRae, I.J. (2014). Structural basis for microRNA targeting. Science 346, 608-613. Schmiedel, J.M., Klemm, S.L., Zheng, Y., Sahay, A., Bluthgen, N., Marks, D.S., and van Oudenaarden, A. (2015). Gene expression. MicroRNA control of protein expression noise. Science 348, 128-132. Schwanhausser, B., Busse, D., Li, N., Dittmar, G., Schuchhardt, J., Wolf, J., Chen, W., and Selbach, M. (2011). Global quantification of mammalian gene expression control. Nature 473, 337-342. Schwarz, D.S., Ding, H.L., Kennington, L., Moore, J.T., Schelter, J., Burchard, J., Linsley, P.S., Aronin, N., Xu, Z.S., and Zamore, P.D. (2006). Designing siRNA that distinguish between genes that differ by a single nucleotide. PLoS Genetics 2, 1307-1318. Schwarz, D.S., Hutvagner, G., Du, T., Xu, Z.S., Aronin, N., and Zamore, P.D. (2003). Asymmetry in the assembly of the RNAi enzyme complex. Cell 115, 199-208. Selbach, M., Schwanhausser, B., Thierfelder, N., Fang, Z., Khanin, R., and Rajewsky, N. (2008). Widespread changes in protein synthesis induced by microRNAs. Nature 455, 58-63. Shin, C., Nam, J.W., Farh, K.K.H., Chiang, H.R., Shkumatava, A., and Bartel, D.P. (2010). Expanding the MicroRNA Targeting Code: Functional Sites with Centered Pairing. Molecular Cell 38, 789-802. Sturm, M., Hackenberg, M., Langenberger, D., and Frishman, D.
(2010). TargetSpy: a supervised machine learning approach for microRNA target prediction. BMC Bioinformatics 11. Tan, S.M., Kirchner, R., Jin, J., Hofmann, O., McReynolds, L., Hide, W., and Lieberman, J. (2014). Sequencing of Captive Target Transcripts Identifies the Network of Regulated Genes and Functions of Primate-Specific miR-522. Cell Reports 8, 1225-1239. Tautz, D., and Domazet-Loso, T. (2011). The evolutionary origin of orphan genes. Nat Rev Genet 12, 692-702. Thai, T.H., Calado, D.P., Casola, S., Ansel, K.M., Xiao, C., Xue, Y., Murphy, A., Frendewey, D., Valenzuela, D., Kutok, J.L., et al. (2007). Regulation of the germinal center response by microRNA-155. Science 316, 604-608. Ule, J., Jensen, K.B., Ruggiu, M., Mele, A., Ule, A., and Darnell, R.B. (2003). CLIP identifies Nova-regulated RNA networks in the brain. Science 302, 1212-1215. van Rooij, E., Sutherland, L.B., Qi, X., Richardson, J.A., Hill, J., and Olson, E.N. (2007). Control of stress-dependent cardiac growth and gene expression by a microRNA. Science 316, 575-579. Vaucheret, H., Vazquez, F., Crete, P., and Bartel, D.P. (2004). The action of ARGONAUTE1 in the miRNA pathway and its regulation by the miRNA pathway are crucial for plant development. Genes Dev 18, 1187-1197. Vejnar, C.E., and Zdobnov, E.M. (2012). MiRmap: comprehensive prediction of microRNA target repression strength. Nucleic Acids Res 40, 11673-11683. Ventura, A., Young, A.G., Winslow, M.M., Lintault, L., Meissner, A., Erkeland, S.J., Newman, J., Bronson, R.T., Crowley, D., Stone, J.R., et al. (2008). Targeted deletion reveals essential and overlapping functions of the miR-17 through 92 family of miRNA clusters. Cell 132, 875-886. Wang, X.W., and El Naqa, I.M. (2008). Prediction of both conserved and nonconserved microRNA targets in animals. Bioinformatics 24, 325-332. Wen, J., Parker, B.J., Jacobsen, A., and Krogh, A. (2011). MicroRNA transfection and AGO-bound CLIP-seq data sets reveal distinct determinants of miRNA action. 
RNA 17, 820-834. Wheeler, B.M., Heimberg, A.M., Moy, V.N., Sperling, E.A., Holstein, T.W., Heber, S., and Peterson, K.J. (2009). The deep evolution of metazoan microRNAs. Evol Dev 11, 50-68. Wightman, B., Burglin, T.R., Gatto, J., Arasu, P., and Ruvkun, G. (1991). Negative regulatory sequences in the lin-14 3'-untranslated region are necessary to generate a temporal switch during Caenorhabditis elegans development. Genes Dev 5, 1813-1824. Wightman, B., Ha, I., and Ruvkun, G. (1993). Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell 75, 855-862. Wray, G.A. (2007). The evolutionary significance of cis-regulatory mutations. Nat Rev Genet 8, 206-216. Xiao, C., Calado, D.P., Galler, G., Thai, T.H., Patterson, H.C., Wang, J., Rajewsky, N., Bender, T.P., and Rajewsky, K. (2007). MiR-150 controls B cell differentiation by targeting the transcription factor c-Myb. Cell 131, 146-159. Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad-Toh, K., Lander, E.S., and Kellis, M. (2005). Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 434, 338-345. Yekta, S., Shih, I.H., and Bartel, D.P. (2004). MicroRNA-directed cleavage of HOXB8 mRNA. Science 304, 594-596. Yi, R., Qin, Y., Macara, I.G., and Cullen, B.R. (2003). Exportin-5 mediates the nuclear export of pre-microRNAs and short hairpin RNAs. Genes & Development 17, 3011-3016. Zhao, Y., Ransom, J.F., Li, A., Vedantham, V., von Drehle, M., Muth, A.N., Tsuchihashi, T., McManus, M.T., Schwartz, R.J., and Srivastava, D. (2007). Dysregulation of cardiogenesis, cardiac conduction, and cell cycle in mice lacking miRNA-1-2. Cell 129, 303-317.

Chapter 2. Predicting effective microRNA target sites in mammalian mRNAs

Vikram Agarwal1,2,3, George W. Bell1, Jin-Wu Nam1,2,4, David P.
Bartel1,2

1Howard Hughes Medical Institute and Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA
2Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
3Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
4Department of Life Science, College of Natural Sciences, Hanyang University, Seoul 133-791, Korea

V.A. carried out computational analysis. G.W.B. overhauled the TargetScan website. J-W.N. helped process 3P-seq data. V.A. and D.P.B. conceived of the project, designed the analyses, and wrote the paper.

Published as: Agarwal V, Bell GW, Nam J-W, Bartel DP. "Predicting effective microRNA target sites in mammalian mRNAs". eLife 4:e05005.

Abstract

MicroRNA targets are often recognized through pairing between the miRNA seed region and complementary sites within target mRNAs, but not all of these canonical sites are equally effective, and both computational and in vivo UV-crosslinking approaches suggest that many mRNAs are targeted through non-canonical interactions. Here, we show that recently reported non-canonical sites do not mediate repression despite binding the miRNA, which indicates that the vast majority of functional sites are canonical. Accordingly, we developed an improved quantitative model of canonical targeting, using a compendium of experimental datasets that we pre-processed to minimize confounding biases. This model, which considers site type and another 14 features to predict the most effectively targeted mRNAs, performed significantly better than existing models and was as informative as the best high-throughput in vivo crosslinking approaches. It drives the latest version of TargetScan (v7.0; targetscan.org), thereby providing a valuable resource for placing miRNAs into gene-regulatory networks.

Introduction

MicroRNAs (miRNAs) are ~22-nt RNAs that mediate post-transcriptional gene repression (Bartel, 2004).
Bound with an Argonaute protein to form a silencing complex, miRNAs function as sequence-specific guides, directing the silencing complex to transcripts, primarily through Watson–Crick pairing between the miRNA seed (miRNA nucleotides 2–7) and complementary sites within the 3′ untranslated regions (3′ UTRs) of target RNAs (Lewis et al., 2005; Bartel, 2009). The miRNAs conserved to fish have been grouped into 87 families, each with a unique seed region. On average, each of these families has >400 conserved targeting interactions, and together these interactions involve most mammalian mRNAs (Friedman et al., 2009). In addition, many nonconserved interactions also function to reduce mRNA levels and protein output (Farh et al., 2005; Krutzfeldt et al., 2005; Lim et al., 2005; Baek et al., 2008; Selbach et al., 2008). Accordingly, miRNAs have been implicated in a wide range of biological processes in worms, flies, and mammals (Kloosterman and Plasterk, 2006; Bushati and Cohen, 2007; Stefani and Slack, 2008).

Critical for understanding miRNA biology is the accurate prediction of miRNA–target interactions. Although numerous advances have been made, accurate and specific target predictions remain a challenge. Analysis of preferentially conserved miRNA-pairing motifs within 3′ UTRs has led to the identification of several classes of target sites (Bartel, 2009). The most effective canonical site types, listed in order of decreasing preferential conservation and efficacy, are the 8mer site [Watson–Crick match to miRNA positions 2–8 with an A opposite position 1 (Lewis et al., 2005)], 7mer-m8 site [position 2–8 match (Lewis et al., 2003; Brennecke et al., 2005; Krek et al., 2005; Lewis et al., 2005)], and 7mer-A1 site [position 2–7 match with an A opposite position 1 (Lewis et al., 2005)].
Experiments have confirmed that the preference for an adenosine opposite position 1 is independent of the miRNA nucleotide identity (Grimson et al., 2007; Nielsen et al., 2007; Baek et al., 2008) and due to the specific recognition of the target adenosine within a binding pocket of Argonaute (Schirle et al., 2014). Two other canonical site types, each associated with weaker preferential conservation and much lower efficacy (Friedman et al., 2009), are the 6mer [position 2–7 match (Lewis et al., 2005)] and offset 6mer [position 3–8 match (Friedman et al., 2009)]. Pairing to the 3′ end of the miRNA can supplement canonical sites, although evidence for the use of this 3′-supplementary pairing is observed for no more than 5% of the seed-matched sites (Brennecke et al., 2005; Lewis et al., 2005; Grimson et al., 2007; Friedman et al., 2009).

Some effective sites lack canonical seed pairing. For example, very extensive pairing to the 3′ region of the miRNA can compensate for a wobble or mismatch to one of the seed positions (Brennecke et al., 2005; Bartel, 2009), as exemplified by the two let-7 sites within the 3′ UTR of C. elegans lin-41 (Reinhart et al., 2000). Although these 3′-compensatory sites can be detected above background when searching for preferentially conserved pairing configurations, they are exceedingly rare, with conserved miRNA families in mammals and nematodes each averaging <1 preferentially conserved 3′-compensatory site (Friedman et al., 2009; Jan et al., 2011). Other relatively rare, yet effective sites include centered sites, which have 11–12 contiguous Watson–Crick pairs to the center of the miRNA (Shin et al., 2010), and cleavage sites, which have the very extensive pairing required for Argonaute-catalyzed slicing of the mRNA (Yekta et al., 2004; Davis et al., 2005; Karginov et al., 2010; Shin et al., 2010).
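The canonical site definitions given above (8mer, 7mer-m8, 7mer-A1, 6mer, and offset 6mer) can be illustrated with a short Python sketch. This is only a minimal illustration of the definitions, not the TargetScan implementation; the function names are hypothetical, the example guide sequence is assumed to be that of miR-1 and should be checked against miRBase, and the rule that a match nested inside a stronger site is not reported separately is an assumption made here for clarity.

```python
# Watson-Crick complements for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Site types in the order of decreasing efficacy described in the text.
SITE_ORDER = ["8mer", "7mer-m8", "7mer-A1", "6mer", "offset 6mer"]


def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def site_motifs(mirna):
    """Return the mRNA motif (written 5'->3') for each canonical site type."""
    guide = mirna.upper().replace("T", "U")
    match_2_8 = reverse_complement(guide[1:8])  # pairs to miRNA positions 2-8
    match_2_7 = reverse_complement(guide[1:7])  # pairs to miRNA positions 2-7
    match_3_8 = reverse_complement(guide[2:8])  # pairs to miRNA positions 3-8
    return {
        "8mer": match_2_8 + "A",     # positions 2-8 plus A opposite position 1
        "7mer-m8": match_2_8,        # positions 2-8
        "7mer-A1": match_2_7 + "A",  # positions 2-7 plus A opposite position 1
        "6mer": match_2_7,           # positions 2-7
        "offset 6mer": match_3_8,    # positions 3-8
    }


def find_all(text, motif):
    """Yield the start index of every (possibly overlapping) occurrence."""
    start = text.find(motif)
    while start != -1:
        yield start
        start = text.find(motif, start + 1)


def classify_sites(utr, mirna):
    """Return sorted (start, end, site_type) tuples for a 3' UTR.

    Assumption: a weaker match nested inside a stronger site is not reported.
    """
    utr = utr.upper().replace("T", "U")
    motifs = site_motifs(mirna)
    claimed = []
    for site_type in SITE_ORDER:
        motif = motifs[site_type]
        for start in find_all(utr, motif):
            end = start + len(motif)
            nested = any(s <= start and end <= e for s, e, _ in claimed)
            if not nested:
                claimed.append((start, end, site_type))
    return sorted(claimed)


if __name__ == "__main__":
    MIR1 = "UGGAAUGUAAAGAAGUAUGUAU"  # assumed miR-1 guide sequence
    print(site_motifs(MIR1))
    print(classify_sites("GGGACAUUCCACCCCACAUUCCGUUU", MIR1))
```

Because every 8mer contains the four weaker motifs, scanning in order of decreasing efficacy and skipping nested matches keeps each site from being counted more than once.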
The existence of additional, still-to-be-characterized types of non-canonical sites is suggested by the large number of mRNA regions that crosslink to the silencing complex in vivo yet lack known site types matching the cognate miRNA (Chi et al., 2012; Loeb et al., 2012; Helwak et al., 2013; Khorshid et al., 2013; Grosswendt et al., 2014).

With the prediction of hundreds of conserved targets for most mammalian miRNAs (and even more nonconserved targets), knowing which targets are expected to be most responsive to each miRNA provides important information for both large-scale network analyses and detailed experimental follow-up. As previously mentioned, the type of site (e.g., whether the site is an 8mer or a 7mer-A1) strongly influences the efficacy of repression. The number of sites also influences efficacy, with each additional site typically acting independently to impart additional repression (Grimson et al., 2007; Nielsen et al., 2007), although sites between 8–40 nt of each other tend to act cooperatively, and those <8 nt of each other act competitively (Grimson et al., 2007). Additional features of site context help explain why a given site (e.g., a 7mer-m8 site to miR-1) can be more effective in one 3′ UTR than it is in another.
These features include the positioning of the site outside of the path of the ribosome [which includes the first 15 nt of the 3′ UTR (Grimson et al., 2007)] and the positioning of the site within 3′-UTR segments that are more accessible to the silencing complex, as measured by either high local AU content (Grimson et al., 2007; Nielsen et al., 2007), high AU content of the entire 3′ UTR (Robins and Press, 2005; Hausser et al., 2009), shorter distance from a 3′-UTR terminus (Grimson et al., 2007), shorter 3′-UTR length (Hausser et al., 2009; Betel et al., 2010; Wen et al., 2011; Reczko et al., 2012), or less stable predicted competing secondary structure (Robins et al., 2005; Ameres et al., 2007; Kertesz et al., 2007; Long et al., 2007; Tafer et al., 2008). Conserved sites are also more effective, in part because they tend to reside in more favorable contexts (Grimson et al., 2007; Nielsen et al., 2007). Features of the miRNA can also influence site efficacy, with sites being more effective if the miRNA has lower target-site abundance (TA) (Arvey et al., 2010; Garcia et al., 2011) and stronger predicted seed-pairing stability (SPS) (Garcia et al., 2011).

Multiple features can be considered together to build quantitative models of targeting efficacy (Grimson et al., 2007; Nielsen et al., 2007; Wang and El Naqa, 2008; Betel et al., 2010; Liu et al., 2010; Garcia et al., 2011; Wen et al., 2011; Reczko et al., 2012; Vejnar and Zdobnov, 2012; Marin et al., 2013; Gumienny and Zavolan, 2015). Our recent model, called the context-plus (context+) model, considers the features of our original context scores [i.e., site type, 3′-supplementary pairing, local AU content, and distance from the closest 3′-UTR end (Grimson et al., 2007)], plus two miRNA features [TA and SPS (Garcia et al., 2011)].
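Several of the site-context features listed above are simple functions of where a site sits within its 3′ UTR. The following Python sketch computes rough versions of four of them for a single site. It is an illustration only: the function name, the 30-nt flank, and the unweighted AU fraction are assumptions made here (the published context scores weight flanking positions rather than counting them equally), and only the 15-nt ribosome path comes from the text.

```python
def context_features(utr, site_start, site_end, flank=30, ribosome_path=15):
    """Compute rough context descriptors for one site in a 3' UTR.

    site_start and site_end are 0-based, end-exclusive UTR coordinates.
    flank (30 nt) and the unweighted AU fraction are illustrative assumptions.
    """
    utr = utr.upper().replace("T", "U")
    upstream = utr[max(0, site_start - flank):site_start]
    downstream = utr[site_end:site_end + flank]
    window = upstream + downstream
    if window:
        local_au = (window.count("A") + window.count("U")) / len(window)
    else:
        local_au = 0.0
    return {
        # Sites starting within the first 15 nt lie in the path of the ribosome.
        "in_ribosome_path": site_start < ribosome_path,
        # Fraction of A or U in the flanks, a crude proxy for accessibility.
        "local_au": local_au,
        # Distance to the nearer end of the 3' UTR.
        "min_dist_to_utr_end": min(site_start, len(utr) - site_end),
        "utr_length": len(utr),
    }


if __name__ == "__main__":
    example_utr = "A" * 20 + "ACAUUCCA" + "G" * 10 + "U" * 10  # toy sequence
    print(context_features(example_utr, 20, 28))
```

In a quantitative model, descriptors of this kind would be scaled and combined with site type and miRNA features rather than used as raw values.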
Although the context+ model was trained using multiple regression on 74 high-throughput datasets, the features used to distinguish effective sites (the three features of the original context scores) were identified using only 11 datasets, implying that additional features might be identified through analysis of the additional datasets.

Here, we examined the function of non-canonical binding sites identified in recent studies and found that mRNAs with these sites are not more repressed than mRNAs without sites, despite finding compelling evidence that many of these non-canonical sites bind the silencing complex in vivo. This finding justified a focus on the statistical modeling of canonical, seed-matched sites within 3′ UTRs, which mediate the vast majority of repression that can be predicted with current methods. To this end, we pre-processed the 74 datasets to minimize confounding biases and then used stepwise regression to identify the most informative features from a large set of potential targeting features. This approach unbiasedly selected 14 features, which were combined to develop the context++ model of miRNA targeting efficacy. The context++ model was more predictive than any published model and at least as predictive as the most informative in vivo crosslinking approaches. As the engine powering the latest version of TargetScan (v7.0; targetscan.org), this model provides a valuable resource for placing the miRNAs of human, mouse, zebrafish, and other vertebrate species into their respective gene-regulatory networks.

Results

Inefficacy of recently reported non-canonical binding sites

Several high-throughput crosslinking-immunoprecipitation (CLIP) approaches have been applied to identify sites that bind Argonaute in vivo (Chi et al., 2009; Hafner et al., 2010; Helwak et al., 2013; Grosswendt et al., 2014).
These experiments all observe significant enrichment for cognate seed-matched sites in the vicinity of the crosslinks, which validates their ability to detect authentic sites. Despite this enrichment, some crosslinks do not correspond to canonical sites to the relevant miRNAs, raising the prospect that these results might reveal novel types of non-canonical binding that could mediate repression. Indeed, five studies have reported crosslinking to non-canonical binding sites proposed to mediate repression (Chi et al., 2012; Loeb et al., 2012; Helwak et al., 2013; Khorshid et al., 2013; Grosswendt et al., 2014). In addition, another biochemical study has reported the identification of non-canonical sites without using any crosslinking (Tan et al., 2014). Reasoning that these experimental datasets might provide a resource for defining novel types of sites to be used in target prediction, we re-examined the functionality of these sites in mediating target mRNA repression.

We first examined the efficacy of “nucleation-bulge” sites (Chi et al., 2012), which were identified from analysis of differential CLIP (dCLIP) results reporting the clusters that appear in the presence of miR-124 (Chi et al., 2009). Nucleation-bulge sites consist of 8 nt motifs paired to positions 2–8 of their cognate miRNA seed, with the nucleotide opposing position 6 protruding as a bulge but sharing Watson–Crick complementarity to miRNA position 6. Meta-analysis of miRNA and small-RNA transfection datasets revealed significant repression of mRNAs with the canonical site types but found no evidence for repression of mRNAs that contain nucleation-bulge sites but lack perfectly paired seed-matched sites in their 3′ UTRs (Figure 1–figure supplements 1A–B). Reasoning that the nucleation-bulge site might be only marginally effective, we examined the early zebrafish embryo with and without Dicer, analyzing the targeting by miR-430, the most highly expressed miRNA of the early embryo.
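The nucleation-bulge definition above can be made concrete with a small Python sketch that builds the 8-nt motif from a guide sequence and flags 3′ UTRs that contain it while lacking a seed-matched site. This is a hedged illustration rather than the analysis code used here: the function names are hypothetical, the 8-nt string is assumed to be the 5′ end of miR-124, and treating any 6mer or offset 6mer as a seed-matched site is an assumption.

```python
# Watson-Crick complements for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def nucleation_bulge_motif(mirna):
    """Return the 8-nt nucleation-bulge motif (5'->3') for a guide RNA."""
    guide = mirna.upper().replace("T", "U")
    match_2_8 = reverse_complement(guide[1:8])
    # match_2_8[0], [1] and [2] pair to miRNA positions 8, 7 and 6.
    # The bulged nucleotide is also complementary to position 6, so the
    # nucleotide that pairs to position 6 appears twice in a row.
    return match_2_8[:3] + match_2_8[2] + match_2_8[3:]


def has_only_nucleation_bulge(utr, mirna):
    """True if the UTR has a nucleation-bulge motif and no seed-matched site.

    Assumption: any 6mer (positions 2-7) or offset 6mer (positions 3-8)
    match counts as a seed-matched site, since all canonical sites
    contain at least one of these two motifs.
    """
    utr = utr.upper().replace("T", "U")
    guide = mirna.upper().replace("T", "U")
    six_mers = (reverse_complement(guide[1:7]), reverse_complement(guide[2:8]))
    has_seed_match = any(motif in utr for motif in six_mers)
    return nucleation_bulge_motif(guide) in utr and not has_seed_match


if __name__ == "__main__":
    MIR124_5P_END = "UAAGGCAC"  # assumed first 8 nt of miR-124
    print(nucleation_bulge_motif(MIR124_5P_END))
    print(has_only_nucleation_bulge("CCCGUGGCCUUAAA", MIR124_5P_END))
```

Under these assumptions, the motif printed for miR-124 differs from the 7mer-m8 match by a single duplicated G.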
Even in this system, one of the most sensitive systems for detecting the effects of targeting (where a robust repression is observed for mRNAs with only a single 6mer or offset 6mer site to miR-430), we observed no evidence for repression of mRNAs with nucleation-bulge sites to miR-430 (Figure 1A, Figure 1–figure supplement 1C, and Figure 1–figure supplement 4A). Because the nucleation-bulge sites were originally identified and characterized as sites to miR-124, we next tried focusing on only miR-124–mediated repression. However, even in this more limited context, the mRNAs with nucleation-bulge sites were no more repressed than mRNAs without sites (Figure 1–figure supplements 1D–F).

Another study examined the response of 32 mRNAs that lack canonical miR-155 sites yet crosslink to Argonaute in wild-type T cells but not T cells isolated from miR-155 knockout mice (Loeb et al., 2012). As previously observed, we found that the levels of these mRNAs tended to increase in T cells lacking miR-155 (Figure 1B). However, a closer look at the distribution of mRNA fold changes between wild-type and knockout cells revealed a pattern not normally observed for mRNAs with a functional site type. As illustrated for the mRNAs with canonical sites (including those supported by CLIP), when a miRNA is knocked out, the cumulative distribution of fold changes for mRNAs with functional site types diverges most from the no-site distribution at the top of the curve, which represents the most strongly derepressed mRNAs (Figure 1B). However, for the mRNAs harboring non-canonical miR-155 sites, the distribution of fold changes converged with the no-site distribution at the top of the curve (Figure 1B), raising doubt as to whether non-canonical binding of these mRNAs mediates repression.
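The comparison of cumulative distributions described above can be sketched in a few lines of Python. The code builds an empirical cumulative distribution for each set of log2 fold changes, reports the largest vertical gap between the two curves, and compares the two sets at a high quantile, where functional sites are expected to diverge most after a miRNA knockout. This is an illustrative sketch only: the function names, the nearest-rank quantile, and the choice of quantile are assumptions, and the numbers in the usage example are made-up toy values, not data from this study.

```python
import math
from bisect import bisect_right


def ecdf(sample):
    """Return a function giving the fraction of the sample <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


def max_cdf_gap(site_fc, no_site_fc):
    """Largest vertical distance between the two empirical CDFs."""
    f_site = ecdf(site_fc)
    f_none = ecdf(no_site_fc)
    points = sorted(set(site_fc) | set(no_site_fc))
    return max(abs(f_site(x) - f_none(x)) for x in points)


def quantile(sample, q):
    """Nearest-rank quantile (an assumed, deliberately simple definition)."""
    ordered = sorted(sample)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


def upper_tail_shift(site_fc, no_site_fc, q=0.9):
    """Difference between site and no-site fold changes at a high quantile.

    After a knockout, a clearly positive value suggests that the most
    derepressed site-containing mRNAs exceed the no-site background.
    """
    return quantile(site_fc, q) - quantile(no_site_fc, q)


if __name__ == "__main__":
    # Toy log2 fold changes (knockout / wild type), for illustration only.
    no_site = [-0.2, -0.1, 0.0, 0.1, 0.2]
    with_site = [0.0, 0.1, 0.3, 0.5, 0.8]
    print(max_cdf_gap(with_site, no_site))
    print(upper_tail_shift(with_site, no_site))
```

A distribution that is shifted overall yet converges with the no-site curve at the top would show a small upper-tail shift despite a nonzero maximal gap, which is the pattern described for the non-canonical miR-155 sites.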
To investigate these mRNAs further, we examined their response to the miR-155 loss in helper T cell subtypes 1 and 2 (Th1 and Th2, respectively) and B cells, which are other lymphocytic cells in which significant derepression of miR-155 targets is observed in cells lacking miR-155 (Rodriguez et al., 2007; Eichhorn et al., 2014). In contrast to mRNAs with canonical sites, the mRNAs with non-canonical sites showed no evidence of derepression in the knockout cells of each of these cell types, which reinforced the conclusion that non-canonical binding of miR-155 does not lead to repression of these mRNAs (Figure 1C and Figure 1–figure supplement 2).

We next probed the functionality of non-canonical interactions identified by CLASH (crosslinking, ligation, and sequencing of hybrids), a high-throughput technique that generates miRNA–mRNA chimeras, which each identify a miRNA and the mRNA region that it binds (Helwak et al., 2013). As previously observed, mRNAs with CLASH-identified non-canonical interactions involving miR-92 tended to be slightly up-regulated upon knockdown of miR-92 in HEK293 cells (Figure 1D). However, a closer look at the mRNA fold-change distributions again revealed a pattern not typically observed for mRNAs with a functional site type, with convergence with the no-site distribution in the region expected to be most divergent. Therefore, we examined a second dataset monitoring mRNA changes after knocking down miR-92 and other miRNAs in HEK293 cells (Hafner et al., 2010). As reported recently (Wang, 2014), the slight up-regulation observed for mRNAs with CLASH-identified non-canonical interactions in the original dataset was not reproducible in the second dataset (Figure 1E). Moreover, mRNAs with non-canonical interactions to other miRNAs showed no sign of derepression when the cognate miRNAs were knocked down (Figure 1–figure supplement 3A–B).
To mirror the original analysis of CLASH-identified interactions (Helwak et al., 2013), our analysis included sites located in any region of the mature mRNA (Figures 1D–E and Figure 1–figure supplement 3A). No significant difference from the no-site control distribution was observed when restricting our analysis to mRNAs with CLASH-identified non-canonical sites in their 3′ UTRs (Figure 1–figure supplement 3B).

Many miRNA–mRNA chimeras can also be found in standard AGO CLIP datasets, presumably generated by an endogenous ligase acting in cell lysates during workup (Grosswendt et al., 2014). Global experiments examining function of these interactions group the mRNAs with non-canonical interactions together with those with canonical interactions (Grosswendt et al., 2014), and thus the signal for function might arise from only canonical interactions. Indeed, when we re-examined the response of these mRNAs to miRNA knockdown, those with chimera-identified canonical sites tended to be derepressed, whereas those with only chimera-identified non-canonical sites did not (Figure 1F and Figure 1–figure supplements 3C–E). Although at first glance this finding might seem at odds with the elevated evolutionary conservation of chimera-identified non-canonical sites (Grosswendt et al., 2014), we found that this conservation signal was not smaller for the sites of less conserved miRNAs and therefore was not indicative of functional miRNA binding (Figure 1–figure supplement 5). Instead, this signal might occur for the same reason that artificial sRNAs tend to target conserved regions of 3′ UTRs (Nielsen et al., 2007).

Next, we evaluated the response of non-canonical sites modeled by MIRZA, an algorithm that utilizes CLIP data in conjunction with a biophysical model to predict target sites (Khorshid et al., 2013).
As noted by others (Majoros et al., 2013), the definition of non-canonical MIRZA sites was more expansive than that used elsewhere and did not exclude sites with canonical 6mer or offset 6mer seed matches. Indeed, when focusing on only targets without 6mer or offset 6mer seed matches, the top 100 non-canonical MIRZA targets showed no sign of efficacy (Figure 1G).

Finally, we examined non-canonical clusters identified by IMPACT-seq (identification of miRNA-responsive elements by pull-down and alignment of captive transcripts–sequencing), a method that sequences mRNA fragments that co-purify with a biotinylated miRNA without crosslinking (Tan et al., 2014). Although the mRNAs with an IMPACT-seq-supported canonical site were down-regulated upon the transfection of the cognate miRNA, those with an IMPACT-seq-supported non-canonical site responded no differently than mRNAs lacking a site (Figure 1H).

Collectively, the novel non-canonical sites recently identified in high-throughput CLIP and other biochemical studies imparted no detectable repression when monitoring mRNA changes. The same was true when examining ribosome-profiling or proteomic datasets to capture repression also occurring at the level of translation (Figure 1–figure supplement 4).

All of our analyses of experimentally identified non-canonical sites examined the ability of the sites to act in mRNAs that had no seed-matched site to the same miRNA in their 3′ UTRs. Any non-canonical site found in a 3′ UTR that also had a seed-matched site to the same miRNA was not considered because any response could be attributed to the canonical site. At first glance, excluding these co-occurring sites might seem to allow for the possibility that the experimentally identified non-canonical sites could contribute to repression when in the same 3′ UTR as a canonical site, even though they are ineffective in 3′ UTRs without canonical sites.
However, in mammals, canonical sites to the same miRNA typically act independently (Grimson et al., 2007; Nielsen et al., 2007), and we have no reason to think that non-canonical sites would behave differently. More importantly, although the non-canonical sites examined were in mRNAs that had no seed-matched 3′-UTR site to the same miRNA, most were in mRNAs that had seed-matched 3′-UTR sites to other miRNAs that were highly expressed in the cells. Therefore, even if the non-canonical sites could only function when coupled to a canonical site, we would have observed a signal for their function in our analyses.

Confirmation that miRNAs bind to non-canonical sites despite their inefficacy

The inefficacy of recently reported non-canonical sites was surprising when considering evidence that the dCLIP clusters without cognate seed matches are nonetheless enriched for imperfect pairing to the miRNA, which would not be expected if those clusters were merely non-specific background (Chi et al., 2012; Loeb et al., 2012). Indeed, our analysis of motifs within the dCLIP clusters for miR-124 and miR-155 confirmed that those without a canonical site to the miRNA were enriched for miRNA pairing (Figure 2A). Although one of the motifs identified within CLIP clusters that appeared after transfection of miR-124 into HeLa cells yet lacked a canonical miR-124 site did not match the miRNA (Figure 2–figure supplement 1C), the top motif, as identified by MEME (Bailey and Elkan, 1994), had striking complementarity to the miR-124 seed region (Figure 2A). This human miR-124 non-canonical motif matched the “nucleation-bulge” motif originally found for miR-124 in the mouse brain (Chi et al., 2012). Although the top motif identified within the subset of miR-155 dCLIP clusters that lacked a canonical site to miR-155 was not identified with confidence, it had only a single mismatch to the miR-155 seed, which would not have been expected for a motif identified by chance.
Previous analysis of CLASH-identified interactions shows that the top MEME-identified motifs usually pair to the miRNA, although for many miRNAs this pairing falls outside of the seed region (Helwak et al., 2013). Repeating this analysis, but focusing on only interactions without canonical sites, confirmed this result (Figure 2B) (Helwak et al., 2013). Applying this type of analysis to non-canonical interactions identified from miRNA–mRNA chimeras in standard AGO CLIP datasets confirmed that these interactions are also enriched for pairing to the miRNA (Grosswendt et al., 2014). As previously shown (Grosswendt et al., 2014), these interactions were more specific to the seed region than were the CLASH-identified interactions (Figure 2B). Comparison of all the chimera data with all the CLASH data showed that a higher fraction of the chimeras captured canonical interactions and that a higher fraction captured interactions within 3′ UTRs (Figure 2–figure supplement 1A). These results, implying that the chimera approach is more effective than CLASH at capturing functional sites that mediate repression, motivated a closer look at the chimera-identified interactions that lacked a canonical site, despite our finding that these interactions do not mediate repression. In the human and nematode datasets (and less so in the mouse dataset), these interactions were enriched for motifs that corresponded to non-canonical sites that paired to the miRNA seed region (Figure 2B and Figure 2–figure supplement 2). Inspection of these motifs revealed that the most enriched nucleotides typically preserved Watson–Crick pairing in a core 4–5 nts within the seed region, with tolerance to mismatches or G:U wobbles observed at varied positions, depending on the miRNA, potentially reflecting seed-specific structural or energetic features, or perhaps context-dependent biases in crosslinking or ligation (Figure 2C and Figure 2–figure supplement 1B).
Motifs for only a few miRNAs had a bulged nucleotide, and if a bulge was observed it was in the mRNA strand and not in the miRNA strand, as expected if the Argonaute protein imposed geometric constraints in the seed of the miRNA. The miR-124 nucleation-bulge site was enriched in mouse chimera interactions (Figure 2–figure supplement 2A), as it had been in the human and mouse dCLIP clusters (Figure 2A) (Chi et al., 2012). However, despite identification of this miR-124 interaction in datasets from two methods and two species, this style of bulged pairing was not detected for any other miRNA. Interestingly, for all other cases in which a bulge in the recognition motif was observed (human miR-33 and miR-374, and C. elegans miR-50 and miR-58), the bulge was between the nucleotides that paired to miRNA nucleotides 4 and 5 (Figure 2–figure supplement 1B and Figure 2–figure supplement 2B). A bulge is observed between the analogous nucleotides of validated targets of Arabidopsis miR398 (Jones-Rhoades and Bartel, 2004), whereas single-nucleotide bulges between other seed-pairing positions have not been reported in other validated plant targets. A bulge between these nucleotides is also observed in the first let-7 site in the C. elegans lin-41 3′ UTR, one of the archetypal 3′-compensatory sites (Reinhart et al., 2000; Bartel, 2009). Taken together, these observations suggest that the most tolerated bulge in miRNA seed pairing is between the target nucleotides that pair to miRNA nucleotides 4 and 5. Some motifs, particularly the more degenerate ones, were found in most of the interactions, whereas other motifs were found in only a small minority (Figure 2C and Figure 1–figure supplement 1B). We suspect that many of the interactions lacking the top-scoring motifs also involve non-canonical binding sites, some of which might function through degenerate versions of the motif that happened to have scored highest in the MEME analysis.
Nonetheless, some interactions or CLIP clusters lacking the top-scoring motifs might represent background (Friedersdorf and Keene, 2014), and indeed a few with the motif or even with a canonical site might represent background. In sum, our analyses of the CLIP datasets confirmed that many of the CLIP clusters and CLASH/chimera interactions lacking a seed match nonetheless capture authentic miRNA-binding sites; otherwise the top enriched motifs would not pair so often to the cognate miRNA. Despite this ability to bind the miRNA in vivo and to function in the sense that they contribute to cellular target-site abundance (Denzler et al., 2014), we classify the CLIP-identified non-canonical sites as non-functional with respect to repression because they showed no sign of mediating repression (Figure 1 and Figure 1–figure supplements 1–4). Thus, the only known non-canonical site types that mediate repression are the 3′-supplementary, centered, and cleavage site types, which together comprise <1% of the effective sites that currently can be predicted (Friedman et al., 2009; Shin et al., 2010). Although we cannot exclude the possibility that additional types of functional non-canonical sites might exist but have not yet been characterized to the point that they can be used for miRNA target prediction (Lal et al., 2009), our analysis of the CLIP results justified a focus on the abundant site types that are predictive of targeting and at least marginally functional, i.e., the canonical seed-matched sites, including 6mer and offset 6mer sites.

Improving dataset quality for model development

To identify features involved in mammalian miRNA targeting, we analyzed the results of microarray datasets reporting the mRNA changes after transfecting either a miRNA or siRNA (together referred to as small RNAs, abbreviated as sRNAs) into HeLa cells.
From the published datasets, we used the set of 74 experiments that had previously been selected because each 1) had a clear signal for sRNA-based repression, 2) was acquired using the same Agilent array platform, and 3) reported on the effects of a unique seed sequence (Garcia et al., 2011). Despite the differences among the 74 transfected sRNAs, mRNA fold changes of some arrays were highly correlated with those of others, which indicated that sRNA-independent effects dominated (Figure 3A). When all 74 datasets were compared against each other, those from either the same group of experiments (Anderson et al., 2008) or the same transfection protocol (Jackson et al., 2006a; Jackson et al., 2006b; Grimson et al., 2007) tended to cluster strongly together based on their common transcriptome-wide responses to different transfected sRNAs (Figure 3B), indicating the likely presence of batch effects (Leek et al., 2010) that could obscure detection of features associated with miRNA targeting. A parameter known to confound the accurate measurement of mRNA responses on microarrays is the relative AU content within 3′ UTRs (Elkon and Agami, 2008). Indeed, when considering mRNAs without a canonical site to the transfected sRNA, we found that 3′-UTR AU content often correlated with mRNA fold changes. Moreover, the extent and direction of the correlation were similar for different datasets from the same publication but differed when comparing to datasets from other publications (Figure 3C). A second parameter that helped explain the correlated sRNA-independent effects for related datasets was 3′-UTR length (Saito and Satrom, 2012), which exhibited patterns of correlation similar to those observed for 3′-UTR AU content (Figure 3C).
Our observation that AU content and 3′-UTR length correlated so differently with global expression changes when comparing results from different publications helps explain why different 3′-UTR features previously seemed to have such variable predictive power in different experimental contexts (Hausser et al., 2009; Wen et al., 2011; Gumienny and Zavolan, 2015). Another phenomenon known to systematically perturb the levels of mRNAs without sites to the transfected sRNA is the derepression of mRNAs with sites for endogenous miRNAs, presumably through competition between the transfected sRNA and the endogenous miRNAs for limiting components of the silencing pathway (Khan et al., 2009; Saito and Satrom, 2012). Statistically significant derepression was indeed observed for mRNAs with sites to eight of the 10 miRNA families most frequently sequenced in HeLa cells (Figure 3–figure supplements 1A–B). To correct for biases that were independent of the sequence of the introduced sRNA, we used partial least-squares regression (PLSR) to estimate, for each transfection experiment, the component of the transcriptome response that was similar in other highly correlated experiments, and we then subtracted this estimate from the observed response (Supplementary file 1). Applying our technique to all the mRNAs in each of the 74 datasets largely eliminated the correlations observed between datasets (Figures 3D–E), as well as the correlations observed between mRNA fold changes and either AU content or 3′-UTR length (Figure 3F), which lowered the risk that these effects that are independent of the sRNA sequence would confound subsequent analyses of sRNA targeting efficacy. Moreover, our technique eliminated the signal for derepression of endogenous miRNA targets (Figure 3–figure supplement 1C), suggesting that it did the same for any other biases unrelated to the sequence of the transfected sRNA that have yet to be identified.
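The general idea behind the correction described above can be illustrated with a small, self-contained Python sketch. It is not the procedure from Supplementary file 1, whose details are not reproduced here. It assumes a single PLS component and mean-centered inputs, and the names `pls1_fitted` and `remove_shared_component` are hypothetical. The sketch estimates the part of one experiment's fold changes that can be predicted from correlated experiments and then subtracts it.

```python
# Illustrative sketch only: a one-component PLS fit in pure Python.
def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pls1_fitted(predictors, response):
    """Centered fitted values from a one-component PLS regression.

    predictors: list of columns, one per correlated experiment.
    response:   fold changes of the experiment being corrected.
    """
    cols = [center(c) for c in predictors]
    y = center(response)
    # Weight each predictor by its covariance with the response.
    w = [sum(a * b for a, b in zip(col, y)) for col in cols]
    norm = sum(x * x for x in w) ** 0.5
    if norm == 0:
        return [0.0] * len(y)
    w = [x / norm for x in w]
    # Scores: the single latent component shared across the predictors.
    t = [sum(w[j] * cols[j][i] for j in range(len(cols)))
         for i in range(len(y))]
    # Regress the response on the scores.
    q = sum(a * b for a, b in zip(t, y)) / sum(a * a for a in t)
    return [q * a for a in t]


def remove_shared_component(response, predictors):
    """Subtract the component of the response shared with the predictors."""
    shared = pls1_fitted(predictors, response)
    return [obs - est for obs, est in zip(response, shared)]


# Invented numbers: a batch effect plus an sRNA-specific signal.
batch_effect = [1.0, -1.0, 2.0, -2.0, 0.5, -0.5]
specific = [-1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
observed = [b + s for b, s in zip(batch_effect, specific)]
print(remove_shared_component(observed, [batch_effect]))
```

In this toy case the shared component is removed while the sRNA-specific signal is retained, which is the behavior the normalization is intended to achieve on real arrays.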
Reducing these biases substantially reduced the variance in the response for mRNAs without sites to the sRNA, which substantially enhanced the net signal for sRNA-mediated repression of site-containing mRNAs observed in individual arrays (Figure 3G) and all arrays in aggregate (Figure 3H). Previous studies of miRNA targeting have relied on 3′-UTR annotations from databases such as RefSeq, without accounting for abundant alternative 3′-UTR isoforms present in the tissue or cell line of interest (Tian et al., 2005). The presence of more than one abundant 3′-UTR isoform for a gene would confound interpretation of 3′-UTR-related features, such as 3′-UTR length, or distance from the closest 3′-UTR end (Nam et al., 2014). Moreover, the shorter 3′-UTR isoforms might not include some target sites, which would cause these sites to appear ineffective when in fact they are not present (Sandberg et al., 2008; Mayr and Bartel, 2009; Lianoglou et al., 2013; Nam et al., 2014). To avoid these complications, we examined 3′-UTR isoform quantifications previously generated for HeLa cells (Nam et al., 2014) using poly(A)-position profiling by sequencing (3P-seq) (Jan et al., 2011), and developed our model using the dominant mRNA from the subset of genes for which ≥90% of the 3P-seq tags corresponded to a single 3′-UTR isoform. To isolate the effects of single sites, we also used the subset of these mRNAs for which the 3′ UTR possessed a single seed match to the transfected sRNA (Supplementary file 1).

Selecting features and building a regression model for target prediction

To improve our model of mammalian target-site efficacy, we considered 26 features as potentially informative of efficacy. These included features of the sRNAs, features of the sites (including their contexts and positions within the mRNAs), and features of the mRNAs, many of which had been used or at least considered in previous efforts (Table 1).
One of the 26 features was site PCT (probability of conserved targeting), which estimates the probability of the site being preferentially conserved because it is targeted by the cognate miRNA (Friedman et al., 2009). Prior to use, our PCT scores were updated to take advantage of improvements in both mouse and human 3′-UTR annotations (Harrow et al., 2012; Flicek et al., 2014), the additional sequenced vertebrate genomes aligned to the mouse and human genomes (Karolchik et al., 2014), and our expanded set of miRNA families broadly conserved among vertebrate species, which increased from 87 to 111 families. Using these updates increased sensitivity, with our estimate for the number of human 3′-UTR sites conserved above background increasing from ~46,400 (Friedman et al., 2009) to ~62,300. The PCT score on its own correlates with site efficacy, and when using the same set of 3′ UTRs this correlation increased only modestly for the new scores (data not shown), consistent with the notion that the evolutionary signal was already nearly saturated in the previous analysis of 23 species spanning the vertebrate tetrapods (Friedman et al., 2009). Nonetheless, we used our updated PCT score as a feature for sites of broadly conserved miRNAs within our training set. A second feature that we re-evaluated was the predicted structural accessibility of the site. As scored previously, the degree to which the site nucleotides were predicted to be free of pairing to flanking 3′-UTR regions was not informative after controlling for the contribution of local AU content (Grimson et al., 2007). However, analysis inspired by work on siRNA site accessibility (Tafer et al., 2008) suggested an improved scoring scheme for this feature.
For this analysis we used RNAplfold (Bernhart et al., 2006) to predict the unpaired probabilities for variable-sized windows in the proximity of the site and then examined the relationship between these probabilities and the repression associated with sites in our compendium of normalized datasets, while controlling for local AU content and other features of the context+ model (Figure 4A). Based on these results, which resembled those reported previously (Tafer et al., 2008), we scored predicted structural accessibility (SA) as proportional to the log10 value of the unpaired probability for a 14-nt region centered on the match to miRNA nucleotides 7 and 8. Having assembled a set of candidate features, we used the stepAIC function from the “MASS” R package (Venables and Ripley, 2002) to determine which features were most useful for modeling site efficacy. This function uses stepwise regression to build models with increasing numbers of features until it reaches the optimal Akaike Information Criterion (AIC) value. The AIC evaluates the tradeoff between the benefit of increasing the likelihood of the regression fit and the cost of increasing the complexity of the model by adding more variables. For each of the four seed-matched site types, models were built for 1000 samples of the dataset. Each sample included 70% of the mRNAs with single sites to the transfected sRNA from each experiment (randomly selected without replacement), reserving the remaining 30% as a test set. Compared to our context-only and context+ models (Grimson et al., 2007; Garcia et al., 2011), the new stepwise regression models were significantly better at predicting site efficacy when evaluated using their corresponding held-out test sets, as illustrated for each of the four site types (Figure 4B). Reasoning that features most predictive would be robustly selected, we focused on 14 features selected in nearly all 1000 bootstrap samples for at least two site types (Table 1).
These included all three features considered in our original context-only model (minimum distance from 3′-UTR ends, local AU composition and 3′-supplementary pairing), the two added in our context+ model (SPS and TA), as well as nine additional features (3′-UTR length, ORF length, predicted structural accessibility, the number of offset 6mer sites in the 3′ UTR and 8mer sites in the ORF, the nucleotide identity of position 8 of the target, the nucleotide identity of positions 1 and 8 of the sRNA, and site conservation). Other features were frequently selected for only one site type (e.g., ORF 7mer-A1 sites, ORF 7mer-m8 sites, and 5′-UTR length; Table 1). Presumably these and other features were not robustly selected because either their correlation with targeting efficacy was very weak (e.g., the 7 nt ORF sites) or they were strongly correlated to a more informative feature, such that they provided little additional value beyond that of the more informative feature (e.g., 3′-UTR AU content compared to the more informative feature, local AU content). Using the 14 robustly selected features, we trained multiple linear regression models on all of the data. The resulting models, one for each of the four site types, were collectively called the context++ model (Figure 4C and Figure 4–Source data 1). For each feature, the sign of the coefficient indicated the nature of the relationship. For example, mRNAs with either longer ORFs or longer 3′ UTRs tended to be more resistant to repression (indicated by a positive coefficient), whereas mRNAs with either structurally accessible target sites or ORF 8mer sites tended to be more prone to repression (indicated by a negative coefficient). Based on the relative magnitudes of the regression coefficients, some newly incorporated features such as 3′-UTR length and ORF length contributed similarly to features previously incorporated in the context+ model, such as SPS, TA, and local AU (Figure 4C).
New features with an intermediate level of influence included the number of ORF 8mer sites and site conservation as well as the presence of a 5′ G in the sRNA (Figure 4C), the latter perhaps a consequence of differential sRNA loading efficiency. The weakest features included the sRNA and target position 8 identities as well as the number of offset 6mer sites. The identity of sRNA nucleotide 8 exhibited a complex pattern that was site-type dependent. Relative to a position-8 U in the sRNA, a position-8 C further decreased efficacy of sites with a mismatch at this position (6mer or 7mer-A1 sites), whereas a position-8 A had the opposite effect (Figure 4C). Similarly, a position-8 C in the site also conferred decreased efficacy of 6mer and 7mer-A1 sites relative to a position-8 U in the target (Figure 4C). Allowing interaction terms when developing the model, including a term that captured the potential interplay between these positions, did not provide sufficient benefit to justify the more complex model.
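The feature-selection logic of this section can be sketched in a few lines. The block below is a Python stand-in written for illustration, not the stepAIC call used in this work. It assumes forward selection only, an ordinary least-squares model with an intercept, and the criterion AIC = n ln(RSS/n) + 2k, where k counts the fitted coefficients and additive constants are dropped. The feature names and data are invented, and the toy solver assumes non-collinear features and a nonzero residual.

```python
# Illustrative sketch only: forward stepwise selection by AIC.
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(rows[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(bj * xj for bj, xj in zip(beta, rows[i]))) ** 2
               for i in range(n))


def aic(columns, y):
    """AIC of a Gaussian linear model, up to an additive constant."""
    n = len(y)
    return n * math.log(rss(columns, y) / n) + 2 * (len(columns) + 1)


def forward_stepwise(features, y):
    """Add the feature that lowers AIC most until no feature helps."""
    selected, best = [], aic([], y)
    while True:
        trials = [(aic([features[f] for f in selected + [name]], y), name)
                  for name in features if name not in selected]
        if not trials:
            break
        score, name = min(trials)
        if score >= best:
            break
        selected.append(name)
        best = score
    return selected


toy_features = {
    "informative": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "uninformative": [1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0],
}
toy_response = [0.1, 1.9, 4.05, 5.95, 8.1, 9.9, 12.05, 13.95]
print(forward_stepwise(toy_features, toy_response))
```

Repeating such a selection on many random 70% subsets and counting how often each feature is retained would mirror the robustness criterion described above.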
Improvement over previous methods

We compared the predictive performance of our context++ model to that of the most recent versions of seventeen in silico tools for predicting miRNA targets, including AnTar (Wen et al., 2011), DIANA-microT-CDS (Reczko et al., 2012), ElMMo (Gaidatzis et al., 2007), MBSTAR (Bandyopadhyay et al., 2015), miRanda-MicroCosm (Griffiths-Jones et al., 2008), miRmap (Vejnar and Zdobnov, 2012), mirSVR (Betel et al., 2010), miRTarget2 (Wang and El Naqa, 2008), MIRZA-G (Gumienny and Zavolan, 2015), PACCMIT-CDS (Marin et al., 2013), PicTar2 implemented for predictions conserved through mammals, chicken, or fish (PicTarM, PicTarC, and PicTarF, respectively) (Anders et al., 2012), PITA (Kertesz et al., 2007), RNA22 (Miranda et al., 2006), SVMicrO (Liu et al., 2010), TargetRank (Nielsen et al., 2007), and TargetSpy (Sturm et al., 2010); as well as successive versions of TargetScan, which offer context scores (Grimson et al., 2007), PCT scores (Friedman et al., 2009), or context+ scores (Garcia et al., 2011) as options for ranking predictions (TargetScan5, TargetScan.PCT, or TargetScan6, respectively) for either all mRNAs with a canonical 7–8 nt 3′-UTR site (TargetScan.All) or those with only broadly conserved sites (TargetScan.Cons). To the best of our knowledge, algorithms excluded from the comparison either were not de novo prediction algorithms (i.e., relying purely on consensus techniques or experimental data), did not provide a pre-computed database of results, or lacked a numerical value (or ranking) of either target-prediction confidence or mRNA responsiveness. To test the performance of the included methods, we used the results of seven microarray datasets that each monitor mRNA changes after transfection of a conserved miRNA into HCT116 cells containing a hypomorphic mutant for Dicer (Linsley et al., 2007).
These datasets differ from those used during development and training of our model with respect to both the cell type and the identities of the sRNAs. To prevent our model from gaining an advantage over methods that used standard 3′-UTR annotations, we used RefSeq-annotated 3′ UTRs (rather than 3P-seq–supported annotations) to generate the context++ test-set predictions, choosing the longest 3′ UTR to represent genes with multiple annotated 3′ UTRs. For each 3′ UTR containing multiple sites to the cognate miRNA, the context++ scores of individual sites were summed to generate the total context++ score to be used to rank that predicted target. The number of potential miRNA–mRNA interactions considered by the different methods varied greatly (Figure 5A), which reflected the varied strategies and priorities of these prediction efforts. Out of a concern for prediction specificity, many efforts only consider interactions involving 7–8 nt seed-matched sites. Accordingly, we first tested how well each of the methods predicted the repression of mRNAs with at least one canonical 7–8 nt 3′-UTR site (Figure 5B). The context++ model performed substantially better than the most predictive published model, which was TargetScan6.All. Of algorithms derived from other groups, DIANA-microT-CDS, miRTarget2, miRanda-mirSVR, MIRZA-G (and its derivatives), and TargetRank were the most predictive, with performance within range of TargetScan5.All (Figure 5B). Part of the reason that some algorithms performed more poorly is that they consider relatively few potential miRNA–target interactions (Figure 5A). For example, the drop in performance observed between TargetScan.All and TargetScan.Cons illustrates the effect of limiting analysis to the more highly conserved sites.
Nonetheless, the performance of TargetScan.Cons relative to other methods that consider relatively few sites shows that a signal can be observed in this assay even when a very limited number of interactions are scored (Figures 5A–B), presumably because much of the functional targeting is through conserved interactions. Indeed, the performance of ElMMo and TargetScan.PCT illustrates what can be achieved by scoring just the extent of site conservation and no other parameter. In an attempt to maximize prediction sensitivity, some efforts consider many interactions that lack a canonical 7–8 nt 3′-UTR site (Figure 5A). However, all of these algorithms performed poorly in predicting the response of mRNAs lacking such sites (Figure 5C). The two algorithms achieving any semblance of prediction accuracy did so by predicting some of the canonical interactions with known marginal efficacy. These were DIANA-microT-CDS, which captured modest effects of canonical sites in ORFs (Reczko et al., 2012; Marin et al., 2013), and the context++ model, which captured the modest effects of canonical 6mers in 3′ UTRs (as modified by the 14 features, which included offset 6mers and 8mer ORF sites) (Figure 5C). The algorithms designed to identify many non-canonical sites performed much more poorly in this test (r² < 0.004), consistent with the idea that the vast majority of mRNAs without canonical sites either do not change in response to the miRNA or change in an unpredictable fashion as a secondary effect of introducing the miRNA. Another way to evaluate the performance of targeting algorithms is to examine the repression of the top predicted targets. Compared to the r² test, this approach does not penalize efforts that either impose more stringent cutoffs to achieve higher prediction specificity or implement scoring schemes that are not designed to correlate directly with site efficacy.
Perhaps most importantly, this approach aligns with the goals of a biologist considering the top-ranked predictions in an attempt to focus on those most likely to undergo substantial repression. When choosing an average of 16 predicted targets for each of the seven test-set miRNAs, we found that these top 112 predictions of the context++ model were significantly more repressed than the top predictions from earlier versions of TargetScan (Figure 5D) and the top predictions of the other algorithms (Figure 5–figure supplement 1A). Despite the success of the context++ model, not all of the fold changes for its top predicted targets were negative; for the test set, the distribution of these fold changes intersected 0.0 at a cumulative fraction of 0.92, indicating that mRNAs for 8% of the top predictions increased rather than decreased with transfection of the cognate miRNA (Figure 5D). In principle, these mRNAs could still be authentic targets that are repressed in these cells but nonetheless had increased expression values because of either experimental noise or secondary effects of introducing the miRNA overwhelming the signal for miRNA-mediated repression. Alternatively, some or all of these mRNAs could be false-positive predictions. Because only half of the false-positive predictions would be expected to have positive fold changes in the presence of the miRNA, our best estimate of the upper limit on the false-positive predictions was 2 × 8%, or 16%, at this cutoff (for which an average of 16 top predictions per miRNA are considered). At the same cutoff, the distribution of fold changes for each of the previous algorithms intersected 0.0 at cumulative fractions ranging from 0.58 to 0.88 (Figure 5–figure supplement 1A), which implied lower prediction specificity than that observed for the context++ model, with correspondingly higher estimates for the upper limits of false positives among their top predictions, ranging from 24–84%.
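The arithmetic behind these upper limits is simple enough to write down explicitly. The short Python sketch below is an illustration with hypothetical function names. It assumes, as stated above, that a false positive is equally likely to increase or decrease, so the false-positive fraction is at most twice the fraction of top predictions with positive fold changes.

```python
# Illustrative sketch only: function names are hypothetical.
def fraction_increased(fold_changes):
    """Fraction of predicted targets whose log fold change is above zero."""
    return sum(1 for fc in fold_changes if fc > 0) / len(fold_changes)


def false_positive_upper_limit(cumulative_fraction_at_zero):
    """Upper limit on false positives among the top predictions.

    cumulative_fraction_at_zero is where the cumulative distribution of
    fold changes crosses 0.0; the remainder increased, and that remainder
    is doubled because only half of the false positives are expected to
    increase.
    """
    increased = 1.0 - cumulative_fraction_at_zero
    return min(1.0, 2.0 * increased)


# Values quoted in the text: 0.92 for context++, 0.58 to 0.88 for others.
for crossing in (0.92, 0.88, 0.58):
    print(crossing, round(false_positive_upper_limit(crossing), 2))
```

Applied to the crossing points quoted above, this reproduces the 16% estimate for the context++ model and the 24–84% range for the earlier algorithms.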
To evaluate the performance of top-ranked predictions more systematically, we examined median repression of the predicted targets over a broad spectrum of cutoffs, ranging from an average of 4–4096 predictions per miRNA (Figure 5E). Regardless of the cutoff, the top context++ predictions were the most repressed. The top predictions of most other algorithms were repressed significantly more than expected by chance, although the median repression of some (MBSTAR, RNA22, PACCMIT-CDS, and AnTarCLIP) did not exceed the median repression of all mRNAs with a canonical 7–8 nt 3′-UTR site (Figure 5E). Plotting average fold changes rather than median fold changes resulted in very similar relative performances (Figure 5–figure supplements 1B–C). After eliminating interactions that could involve canonical 7–8 nt 3′-UTR sites, the remaining top predictions were modestly repressed at best (Figure 5F). The most repressed predicted targets without canonical 7–8 nt 3′-UTR sites were those of the context++ model, which scored predictions with canonical 6mer 3′-UTR sites. For algorithms designed to identify many non-canonical sites, the top predictions without 7–8 nt 3′-UTR sites were essentially unresponsive to the transfected miRNA, which indicated that if effective non-canonical sites for these seven miRNAs exist, they are not enriched among the predictions of these algorithms.

Similar response of targets predicted from the model and the most informative CLIP experiments

We used our context++ model to overhaul the TargetScan predictions (as described in the next section), and as a third way of testing this model, we compared the performance of these TargetScan7 predictions with that of in vivo CLIP experiments.
When doing this comparison we took care to evaluate sets of predictions that each were the same size as the cognate set of CLIP-supported targets, whereas some previous analyses compare expansive sets of computational predictions (e.g., all mRNAs with a 6mer site) to relatively small sets of biochemically supported predictions (Chi et al., 2009; Lipchina et al., 2011; Loeb et al., 2012; Grosswendt et al., 2014; Tan et al., 2014). mRNAs with expression signals approaching the array background were not considered. This exclusion was particularly important when comparing to CLIP results; CLIP can only evaluate mRNAs expressed in the cells, which would impart a trivial relative advantage if the computational predictions included targets that appeared unresponsive because they were expressed below the array background. The non-canonical CLIP-supported targets were also not considered, as we had already shown that they do not respond to the miRNA (Figure 1 and Figure 1–figure supplements 1–4) and we did not want the inclusion of these easily recognized false positives to impart a disadvantage to CLIP. Regardless of the set of canonical CLIP-supported targets examined, we did not find a setting in which they responded significantly better than did the cohort of TargetScan7 predictions, and in some cases, the TargetScan7 predictions performed significantly better (Figures 6A–J). Similar results were observed when comparing the repression of our predictions to that of mRNAs identified biochemically without crosslinking, using either pulldown-seq or IMPACT-seq (Tan et al., 2014), again focusing on only mRNAs with canonical sites (Figures 6K–L). Thus, for identifying consequential miRNA–target interactions, the TargetScan7 model is not only more convenient than experimental determination of binding sites, it is also at least as effective.
The analogous conclusion was reached from analyses using the context++ model without the use of improved annotation and quantitation of 3′-UTR isoforms (data not shown). As mentioned earlier, mRNAs that increase rather than decrease in the presence of the miRNA can indicate the presence of false positives in a set of candidate targets. Examination of the mRNA fold-change distributions from the perspective of false positives revealed no advantage of the experimental approaches over our predictions. When compared to the less informative CLIP datasets, the TargetScan7 predictions included fewer mRNAs that increased, and when compared to the CLIP datasets that performed as well as the predictions, the TargetScan7 predictions included a comparable number of mRNAs that increased, implying that the TargetScan7 predictions had no more false-positive predictions than did the best experimental datasets. Because some sets of canonical biochemically supported targets performed as well as their cohort of top TargetScan7 predictions, we considered the utility of focusing on mRNAs identified by both approaches. In each comparison, the set of mRNAs that were both canonical biochemically supported targets and within the cohort of top TargetScan7 predictions tended to be more responsive. However, these intersecting subsets included far fewer mRNAs than the original sets, and when compared to an equivalent number of top TargetScan7 predictions, each intersecting set performed no better than did its cohort of top TargetScan7 predictions (Figure 6). Therefore, considering the CLIP results to restrict the top predictions to a higher-confidence set is useful but not more useful than simply implementing a more stringent computational cutoff.
Likewise, taking the union of the CLIP-supported targets and the cohort of predictions, rather than the intersection, did not generate a set of targets that was more responsive than an equivalent number of top TargetScan7 predictions (data not shown).

The TargetScan database (v7.0)

As already mentioned, we used the context++ model to rank miRNA target predictions presented in the most recent version (v7.0) of the TargetScan database (targetscan.org), thereby making our results accessible to others working on miRNAs. For simplicity, we developed the context++ model using mRNAs without abundant alternative 3′-UTR isoforms, and to make fair comparisons with the output of previous models, we tested the context++ model using only the longest RefSeq-annotated isoform. Nevertheless, considering the usage of alternative 3′-UTR isoforms, which can influence both the presence and scoring of target sites, significantly improves the performance of miRNA targeting models (Nam et al., 2014). Thus, our overhaul of the TargetScan predictions incorporated both the context++ scores and current isoform information when ranking mRNAs with canonical 7–8 nt miRNA sites in their 3′ UTRs. The resulting improvements applied to the predictions centered on human, mouse, and zebrafish 3′ UTRs (TargetScanHuman, TargetScanMouse, and TargetScanFish, respectively); and by 3′-UTR homology, to the conserved and nonconserved predictions in chimp, rhesus, rat, cow, dog, opossum, chicken, and frog; as well as to the conserved predictions in 74 other sequenced vertebrate species, thereby providing a valuable resource for placing miRNAs into gene-regulatory networks.
Because the main gene-annotation databases (e.g., RefSeq and Ensembl/Gencode) are still in the process of incorporating the information available on 3′-UTR isoforms, the first step in the TargetScan overhaul was to compile a set of reference 3′ UTRs that represented the longest 3′-UTR isoforms for representative ORFs of human, mouse, and zebrafish. These representative ORFs were chosen among the set of transcript annotations sharing the same stop codon, with alternative last exons generating multiple representative ORFs per gene. The human and mouse databases started with Gencode annotations (Harrow et al., 2012), for which 3′ UTRs were extended, when possible, using RefSeq annotations (Pruitt et al., 2012), recently identified long 3′-UTR isoforms (Miura et al., 2013), and 3P-seq clusters marking more distal cleavage and polyadenylation sites (Nam et al., 2014). Zebrafish reference 3′ UTRs were similarly derived in a recent 3P-seq study (Ulitsky et al., 2012). For each of these reference 3′-UTR isoforms, 3P-seq datasets were used to quantify the relative abundance of tandem isoforms, thereby generating the isoform profiles needed to score features that vary with 3′-UTR length (len_3UTR, min_dist, and off6m) and assign a weight to the context++ score of each site, which accounted for the fraction of 3′-UTR molecules containing the site (Nam et al., 2014). For each representative ORF, our new web interface depicts the 3′-UTR isoform profile and indicates how the isoforms differ from the longest Gencode annotation (Figure 7). 3P-seq data were available for seven developmental stages or tissues of zebrafish, enabling isoform profiles to be generated and predictions to be tailored for each of these.
For human and mouse, however, 3P-seq data were available for only a small fraction of tissues/cell types that might be most relevant for end users, and thus results from all 3P-seq datasets available for each species were combined to generate a meta 3′-UTR isoform profile for each representative ORF. Although this approach reduces accuracy of predictions involving differentially expressed tandem isoforms, it nonetheless outperforms the previous approach of not considering isoform abundance at all, presumably because isoform profiles for many genes are highly correlated in diverse cell types (Nam et al., 2014). For each 6–8mer site, we used the corresponding 3′-UTR profile to compute the context++ score and to weight this score based on the relative abundance of tandem 3′-UTR isoforms that contained the site (Nam et al., 2014). Scores for the same miRNA family were also combined to generate cumulative weighted context++ scores for the 3′-UTR profile of each representative ORF, which provided the default approach for ranking targets with at least one 7–8 nt site to that miRNA family. Effective non-canonical site types, i.e., 3′-compensatory and centered sites, were also predicted. Using either the human or mouse as a reference, predictions were also made for orthologous 3′ UTRs of other vertebrate species. As an option for tetrapod species, the user can also request predicted targets of broadly conserved miRNAs to be ranked based on their aggregate PCT scores (Friedman et al., 2009), as updated in this study. The user can also obtain predictions from the perspective of each protein-coding gene, viewed either as a table of miRNAs (ranked by either total context++ score or aggregate PCT score) or as the mapping of 7–8 nt sites (as well as non-canonical sites) shown beneath the 3′-UTR profile and above the 3′-UTR sequence alignment (Figure 7). A flowchart summarizing the TargetScan overhaul is provided (Figure 7–figure supplement 1).
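The isoform weighting of site scores described above can be illustrated with a minimal sketch. It assumes a simplified representation that is not taken from the TargetScan code: each tandem isoform is described only by the position of its 3′ end and its fractional abundance, each site by its position and an unweighted context++ score, and the scores for one miRNA family are combined by summation. All class and function names are hypothetical, and the rescoring of length-dependent features for each isoform is omitted.

```python
from dataclasses import dataclass

@dataclass
class Isoform:
    utr_end: int      # 3' end of this tandem isoform (nt from the stop codon)
    fraction: float   # relative abundance; fractions of one profile sum to 1

@dataclass
class Site:
    position: int     # position of the site within the longest 3' UTR
    score: float      # unweighted context++ score (more negative = stronger)

def site_weight(site, isoforms):
    """Fraction of 3'-UTR molecules that extend far enough to contain the site."""
    return sum(iso.fraction for iso in isoforms if iso.utr_end >= site.position)

def cumulative_weighted_score(sites, isoforms):
    """Sum of site scores, each weighted by the fraction of isoforms containing it
    (summation is an assumption made for this illustration)."""
    return sum(s.score * site_weight(s, isoforms) for s in sites)

# Hypothetical profile: 60% of molecules end at nt 500, 40% extend to nt 2000.
isoforms = [Isoform(500, 0.6), Isoform(2000, 0.4)]
# One proximal site present in all isoforms, one distal site present in 40%.
sites = [Site(300, -0.30), Site(1500, -0.20)]
total = cumulative_weighted_score(sites, isoforms)
```

In this toy profile the distal site contributes only 40% of its score, which is how a site that is absent from the predominant short isoform is down-weighted in the ranking.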
Discussion

Starting with an expanded and improved compendium of sRNA transfection datasets, we identified 14 features that each correlate with target repression and add predictive value when incorporated into a quantitative model of miRNA targeting efficacy. This model performed better than previous models and at least as well as the best high-throughput CLIP approaches. Because our model was trained on data derived from a single cell type, a potential concern was its generalizability to other cell types. Heightening this concern is the recent report of widespread dependency of miRNA-mediated repression on cellular context (Erhard et al., 2014). However, other work addressing this question shows that after accounting for the different cellular repertoires of expressed mRNAs, the target response is remarkably consistent between different cell types, with alternative usage of 3′-UTR isoforms being the predominant mechanism shaping cell-type-specific differences in miRNA targeting (Nam et al., 2014). Testing the model across diverse cell types confirmed its generalizability; it performed at least as well as the best high-throughput CLIP approaches in each of the contexts examined (Figure 6). Of course, this testing was restricted to only those targets that were expressed in each cellular context. Likewise, to achieve this highest level of performance, any future use of our model or its predictions would also require filtering of the predictions to focus on only the miRNAs and mRNAs co-expressed in the cells of interest. One of the more interesting features incorporated into the context++ model is SA (the predicted structural accessibility of the site).
Freedom from occlusive mRNA structure has long been considered a site-efficacy determinant (Robins et al., 2005; Ameres et al., 2007; Kertesz et al., 2007; Long et al., 2007; Tafer et al., 2008) and proposed as the underlying mechanistic explanation for the utility of other features, including global 3′-UTR AU content (Robins and Press, 2005; Hausser et al., 2009), local AU content (Grimson et al., 2007; Nielsen et al., 2007), minimum distance of the site (Grimson et al., 2007), and 3′-UTR length (Hausser et al., 2009; Betel et al., 2010; Wen et al., 2011; Reczko et al., 2012). The challenge has been to predict and score site accessibility in a way that is informative after controlling for local AU content, which is needed to distinguish the contribution of less occlusive secondary structure from the involvement of some AU-binding activity (Grimson et al., 2007). The selection of the SA feature in all 1000 bootstrap samples of all four site types showed that it provided discriminatory power apart from that provided by local AU content and other correlated features, which reinforced the idea that occlusive RNA structure does indeed limit site efficacy. This being said, local AU content, minimum distance of the site, and 3′-UTR length were each also selected in nearly all 1000 bootstrap samples for most site types (Table 1), which suggests that either these features were selected for reasons other than their correlation with site accessibility or the definition and scoring of our SA feature has additional room for improvement. Our ability to confidently identify additional features that each contribute to improved prediction of targeting efficacy was enhanced by our pre-processing of the experimental datasets, which minimized variation from biases unrelated to the sRNA sequence.
Yet despite applying this same normalization procedure to our test set, the observed r² value of 0.14 implied that our model explained only 14% of the variability observed among mRNAs with canonical 7–8 nt 3′-UTR sites (Figure 4B). The r² value increased to 0.15 when considering the usage of alternative 3′-UTR isoforms, but 85% of the variability remained unexplained. Error in the microarray measurements, different sRNA transfection efficiencies, variable incorporation of sRNAs into the silencing complex, and secondary effects of introducing the sRNA presumably made major contributions to the unexplained variability. Nonetheless, imperfections of the context++ model also contributed, raising the question of how much the model might be improved by identifying additional features or developing better methods for scoring and combining existing features. In analysis not described, we evaluated the utility of other types of regression (e.g., linear regression models with interaction terms, lasso/elastic net-regularized regression, multivariate adaptive regression splines, random forest, boosted regression trees, and iterative Bayesian model averaging) and found their performance to be comparable to that of stepwise regression but their resulting models to be considerably more complex and thus less interpretable. One way to evaluate the extent to which the context++ model might be improved is to consider the degree to which its performance depends on the site-conservation feature. Because sites under selective pressure preferentially possess molecular features required for efficacy, inclusion of the site-conservation feature indirectly recovers some of the information that would otherwise be lost when informative molecular features are missing or imperfectly scored.
As more informative molecular features are identified and included in a model, less information remains to be captured, and thus the site-conservation feature cannot contribute as much to the performance of the model. The site-conservation feature (PCT) was chosen in all 1000 bootstrap samples of each of the three major site types, which showed that the molecular features of our model still do not fully capture all the determinants under selective pressure. However, PCT was not one of the most informative features (Figure 4C). Moreover, when tested as in Figure 5B, a model trained on only site type and the other 13 molecular features performed nearly as well as the full context++ model (r² of 0.126, compared to 0.139 for the full model). This drop in r² of only 0.013 was substantially less than the 0.044 r² observed for the site-conservation feature on its own (Figure 5B, TargetScan.PCT), which suggested that when predicting the response of the test-set mRNAs with the major canonical site types, the context++ model captured 70% (calculated as [0.044–0.013]/0.044) of the information potentially imparted by molecular features. The relatively minor contribution of site conservation highlights the ability of the context++ model to predict the efficacy of nonconserved sites. Although, everything else being equal, its score for a conserved site is slightly better than that for a nonconserved site, this difference does not prevent inclusion of nonconserved sites from the top predictions. Its general applicability to all canonical sites is useful for evaluating not only nonconserved sites to conserved miRNAs but also all sites for nonconserved miRNAs (e.g., Figures 6K–L), including viral miRNAs, as well as the off-targets of synthetic siRNAs and shRNAs.
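The 70% estimate quoted above follows directly from the three r² values reported in the text. As a minimal check of that arithmetic (the variable names are ours and purely illustrative):

```python
# r^2 values quoted in the text (test set, major canonical site types)
r2_full = 0.139          # full context++ model
r2_without_pct = 0.126   # site type plus the other 13 molecular features
r2_pct_alone = 0.044     # site conservation (PCT) on its own

# Performance lost when the site-conservation feature is left out
drop = round(r2_full - r2_without_pct, 3)

# Share of the conservation-associated signal already captured by the
# molecular features: (0.044 - 0.013) / 0.044
captured = (r2_pct_alone - drop) / r2_pct_alone
print(f"drop = {drop}, captured = {captured:.0%}")
```

The smaller the drop relative to the r² of conservation alone, the more of the information carried by conservation is already accounted for by the molecular features.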
Our analyses show that recent computational and experimental approaches, including the different types of CLIP, all fail to identify non-canonical targets that are repressed more than control transcripts (Figure 1, Figure 5C, and Figure 5F), which reopens the question of whether more than a minuscule fraction of miRNA-mediated repression is mediated through non-canonical sites. Although CLIP approaches can identify non-canonical sites that bind the miRNA with some degree of specificity (Figure 2), these non-canonical binding sites do not function to mediate detectable repression. Thus far, the only functional non-canonical sites that can be predicted are 3′-compensatory sites, cleavage sites, and centered sites, which together comprise only a very small fraction (<1%) of the functional sites that can be predicted with comparable accuracy (Bartel, 2009; Shin et al., 2010). The failure of computational methods to find many functional non-canonical sites cannot rule out the possibility that many of these sites might still exist; if such sites are recognized through unimagined determinants, computational efforts might have missed them. CLIP approaches, on the other hand, provide information that is independent of proposed pairing rules or other hypothesized recognition determinants. Therefore, our analysis of the CLIP results, which detected no residual repression after accounting for canonical interactions, provides the most compelling evidence to date on this issue. Unless there is a substantial technical bias in the CLIP approach (such as a large unanticipated disparity in the propensity of non-canonical interactions to crosslink), the inability of current CLIP approaches to identify non-canonical targets that are repressed more than control transcripts argues strongly against the existence of many functional non-canonical targets.
Why might the CLIP-identified non-canonical sites fail to mediate repression (Figure 1) despite binding the miRNA in vivo (Figure 2)? Perhaps these sites are ineffective because perfect seed pairing is required for repression. For example, perfect seed pairing might favor binding of a downstream effector, either directly by contributing to its binding site or indirectly through an Argonaute conformational change that favors its binding. However, this explanation is difficult to reconcile with the activity of 3′-compensatory and centered sites, which can mediate repression despite their lack of perfect seed pairing (Bartel, 2009; Shin et al., 2010), and the activity of Argonaute artificially tethered to an mRNA, which can mediate repression without any pairing to the miRNA (Pillai et al., 2004; Eulalio et al., 2008). Therefore, a more plausible explanation is that the CLIP-identified non-canonical sites bind the miRNA too transiently to mediate repression. This explanation for the inefficacy of the recently identified non-canonical sites in the 3′ UTRs resembles that previously proposed for the inefficacy of most canonical sites in ORFs: In both cases the ineffective sites bind to the miRNA very transiently, with the canonical sites in ORFs dissociating quickly because of displacement by the ribosome (Grimson et al., 2007; Gu et al., 2009), and the CLIP-identified non-canonical sites in 3′ UTRs dissociating quickly because they lack both seed pairing and the extensive pairing outside the seed characteristic of effective non-canonical sites (3′-compensatory and centered sites) and thus have intrinsically fast dissociation rates. The idea that newly identified non-canonical sites bind the miRNA too transiently to mediate repression raises the question of how CLIP could have identified so many of these sites in the first place; shouldn’t crosslinking be a function of site occupancy, and shouldn’t occupancy be a function of dissociation rates?
Part of the answer to these questions comes with the realization that the transcriptome has many more non-canonical binding sites than canonical ones. The motifs identified in the non-canonical interactions have information contents as low as 5.6 bits, and thus are much more common in 3′ UTRs than canonical 6mer or 7mer sites (12 bits and 14 bits, respectively). This high abundance of the non-canonical binding sites would help offset the low occupancy of individual non-canonical sites, such that at any moment more than half of the bound miRNA might reside at non-canonical sites, yielding more non-canonical than canonical sites when using experimental approaches with such high specificity that they can identify a site with only a single read (Figure 2–figure supplement 1A). Although the high abundance of non-canonical sites partly explains why CLIP identifies these sites in such high numbers, it cannot provide the complete answer. Some non-canonical sites in the CLASH and chimera datasets are supported by multiple reads, and all the dCLIP-identified non-canonical sites of the miR-155 study (Loeb et al., 2012) are supported by multiple reads. How could some CLIP clusters with ineffective, non-canonical sites have as much read support as some with effective, canonical sites? Our answer to this question rests on the recognition that cluster read density does not perfectly correspond to site occupancy (Friedersdorf and Keene, 2014), with the other key factors being mRNA expression levels and crosslinking efficiency. In principle, normalizing the CLIP tag numbers to the mRNA levels minimizes the first factor, preventing a low-occupancy site in a highly expressed mRNA from appearing as well supported as a high-occupancy site in a lowly expressed mRNA (Chi et al., 2009; Jaskiewicz et al., 2012). Accounting for differential crosslinking efficiencies is a far greater challenge.
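The information-content comparison made earlier in this passage (5.6 bits versus 12 and 14 bits) can be made concrete with a rough sketch. It rests on a simplifying assumption that is ours rather than the source's: nucleotides are taken to be independent and equally frequent, so that a fully specified k-mer carries 2k bits and a motif of b bits is expected to match a random position with probability 2^-b. The function names are hypothetical.

```python
import math

def kmer_bits(k):
    """Information content of a fully specified k-mer over a 4-letter alphabet."""
    return k * math.log2(4)

def expected_match_rate(bits):
    """Expected matches per random position, assuming uniform, independent bases."""
    return 2.0 ** -bits

bits_6mer = kmer_bits(6)   # 12 bits, as quoted for canonical 6mer sites
bits_7mer = kmer_bits(7)   # 14 bits, as quoted for canonical 7mer sites

# How much more often a 5.6-bit motif is expected to occur than a 6mer site
ratio_vs_6mer = expected_match_rate(5.6) / expected_match_rate(bits_6mer)
print(f"5.6-bit motif vs 6mer: ~{ratio_vs_6mer:.0f}-fold more frequent")
```

Under these assumptions, each bit of information that a motif lacks doubles its expected frequency, which is why low-information motifs are so much more abundant in 3′ UTRs than canonical sites.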
RNA–protein UV crosslinking is expected to be highly sensitive to the identity, geometry, and environment of the crosslinking constituents, leading to the possibility that the crosslinking efficiency of some sites is orders of magnitude greater than that of others. When considered together with the high abundance of non-canonical sites, variable crosslinking efficiency might explain why so many ineffective non-canonical sites are identified. Overlaying a wide distribution of crosslinking efficiencies onto the many thousands of ineffective, non-canonical sites could yield a substantial number of sites at the high-efficiency tail of the distribution for which the tag support matches that of effective canonical sites. Similar conclusions are drawn for other types of RNA-binding interactions when comparing CLIP results with binding results (Lambert et al., 2014). Variable crosslinking efficiency also explains why many top predictions of the context++ model are missed by the CLIP methods, as indicated by the modest overlap in the CLIP-identified targets and the top predictions (Figure 6). The crosslinking results are not only variable from site to site, which generates false negatives for perfectly functional sites, but they are also variable between biological replicates (Loeb et al., 2012), which imposes a challenge for assigning dCLIP clusters to a miRNA. Although this challenge is mitigated in the CLASH and chimera approaches, which provide unambiguous assignment of the miRNAs to the sites, the ligation step of these approaches occurs at low frequency and presumably introduces additional biases, as suggested by the different profile of non-canonical sites identified by the two approaches (Figure 2B and Figure 2–figure supplement 1A). For example, CLASH identifies non-canonical pairing to the 3′ region of miR-92 (Helwak et al., 2013), whereas the chimera approach identifies non-canonical pairing to the 5′ region of this same miRNA (Figure 2C).
Because of the false negatives and biases of the CLIP approaches, the context++ model, which has its own flaws, achieves an equal or better performance than the published CLIP studies. Our observation that CLIP-identified non-canonical sites fail to mediate repression reasserts the primacy of canonical seed pairing for miRNA-mediated gene regulation. Compared to canonical sites, effective non-canonical sites (i.e., 3′-compensatory sites and centered sites) are rare because they require many more base pairs to the miRNA (Bartel, 2009; Shin et al., 2010) and thus together make up <1% of the effective target sites predicted to date. The requirement of so much additional pairing to make up for a single mismatch to the seed is proposed to arise from several sources. The advantage of propagating continuous pairing past miRNA nucleotide 8 (as occurs for centered sites) might be largely offset by the cost of an unfavorable conformational change (Bartel, 2009; Schirle et al., 2014). Likewise, the advantage of resuming pairing at the miRNA 3′ region (as occurs for 3′-compensatory sites) might be partially offset by either the relative disorder of these nucleotides (Bartel, 2009) or their unfavorable arrangement prior to seed pairing (Schirle et al., 2014). In contrast, the seed backbone is pre-organized to favor A-form pairing, with bases of nucleotides 2–5 accessible to nucleate pairing (Nakanishi et al., 2012; Schirle and MacRae, 2012). Moreover, perfect pairing propagated through miRNA nucleotide 7 creates the opportunity for favorable contacts to the minor groove of the seed:target duplex (Schirle et al., 2014). Our overhaul of the TargetScan website integrated the output of the context++ model with the most current 3′-UTR-isoform data to provide any biologist with an interest in either a miRNA or a potential miRNA target convenient access to the predictions, with an option of downloading code or bulk output suitable for more global analyses.
In our continuing efforts to improve the website, several additional functionalities will also soon be provided. To facilitate the exploration of co-targeting networks involving multiple miRNAs (Tsang et al., 2010; Hausser and Zavolan, 2014), we will provide the option of ranking predictions based on the simultaneous action of several independent miRNA families, to which relative weights (e.g., accounting for relative miRNA expression levels or differential miRNA activity in a cell type of interest) can be optionally assigned. To offer predictions for transcripts not already in the TargetScan database (e.g., novel 3′ UTRs or long non-coding RNAs, including circular RNAs), we will provide a mechanism to compute context++ scores interactively for a user-specified transcript. Likewise, to offer predictions for a novel sRNA sequence (e.g., off-target predictions for an siRNA), we will provide a mechanism to retrieve context++ scores interactively for a user-specified sRNA. To visualize the expression signature that results from perturbing a miRNA, we will provide a tool for the user to input mRNA/protein fold changes from high-throughput experiments and obtain a cumulative distribution plot showing the response of predicted targets relative to that of mRNAs without sites. Thus, with the current and future improvements to TargetScan, we hope to enhance the productivity of miRNA research and the understanding of this intriguing class of regulatory RNAs.

Materials and Methods

Microarray, RNA-seq, and RPF dataset processing

A list of microarray, ribosome profiling, and proteomic datasets used for analyses, as well as the corresponding figures in which they were used, is provided (Table 2). We considered developing the model using RNA-seq data rather than microarray data, but microarray datasets were still much more plentiful and were equally suitable for measuring the effects of sRNAs.
Unless pre-processed microarray data were provided by previous studies (as indicated in Table 2), raw data were processed using Bioconductor release 2.14 in the R programming language v3.1.1 (Gentleman et al., 2004; Team, 2014). Affymetrix data were first background-corrected with the “gcrma” R package (Wu et al., 2004), whereas Illumina BeadArray data from the miR-302 knockdown and miR-522 transfection datasets (Lipchina et al., 2011; Tan et al., 2014) were processed and background-corrected using the “lumiR” and “lumiExpresso” functions in the “lumi” R package (Du et al., 2008). A robust linear regression model was then fit to the probe intensities using the “lmFit” function (parameter “method=‘robust’”) in the “limma” R package v3.6.9 (Smyth, 2004; Smyth, 2005), computing differential expression information with the provided eBayes function. Probe IDs were then converted to RefSeq or Ensembl IDs (e.g., using the hgu133plus2ENSEMBL and IlluminaID2nuID/lumiHumanAllENSEMBL functions to convert Affymetrix and BeadArray probe IDs, respectively), and the fold change for each mRNA was computed as the median fold change for all probes corresponding to the mRNA. Finally, because about half of the genes in the genome were either not expressed in the cell type examined, or were expressed at a level that was so close to the background that they were prone to have noisy fold-change measurements, the following filters were applied:
i) For microarray datasets examining the effect of knocking down either miR-92 or 25 miRNA families in HEK293 cells (Hafner et al., 2010; Helwak et al., 2013), transfecting miR-7 or miR-124 into HEK293 cells (Hausser et al., 2009), knocking out miR-155 in Th1 or Th2 cells (Rodriguez et al., 2007), or transfecting each of the 7 miRNAs in HCT116 cells (Linsley et al., 2007), we computed the mean signal for each mRNA (averaging the signal with and without the miRNA), and retained mRNAs exceeding the median of this distribution.
ii) For microarray datasets examining the effect of injecting miR-430 into MZDicer embryos (Giraldez et al., 2006) or knocking out miR-155 in T cells (Loeb et al., 2012), we required the mean signal intensity of an mRNA to exceed 3.0 and 2.5, respectively.
iii) For Illumina BeadArray datasets examining the effect of either knocking down miR-302/367 (Lipchina et al., 2011) or transfecting miR-522 (Tan et al., 2014), we required the mean signal intensity to exceed 7.5 and 7.0, respectively.
iv) For all 74 small-RNA transfections, we required mRNA expression levels to exceed 10 reads per million (RPM), as quantified by RNA-seq in mock-transfected HeLa cells (Guo et al., 2010).
v) For analysis of RNA-seq or RPF datasets examining the effect of either losing Dicer in zebrafish embryos (Bazzini et al., 2012), transfecting miR-124 into HEK293, HeLa, or Huh7 cells (Nam et al., 2014), or knocking out miR-155 in B cells (Eichhorn et al., 2014), we required mRNA expression levels to exceed 10 RPM, as quantified in the condition lacking the perturbed miRNA.
vi) For analysis of proteomic results, we used the pre-computed data provided in the table of significantly detectable peptides (Selbach et al., 2008).
These thresholds were chosen based upon visual inspection of plots evaluating the relationship between mean expression level and fold change (commonly known as “MA plots” in the context of microarrays), attempting to balance the tradeoff between maximal sample size and reduced noise. The overall conclusions were robust to the choice of the threshold. After imposing the threshold, all fold-change values were centered by subtracting the median fold-change value of the “no-site” mRNAs in each sRNA perturbation experiment, except in the case of Figure 5–figure supplements 1B–C, in which data were mean-centered.
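The final two processing steps described above (applying an expression threshold, then centering on the no-site mRNAs) can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analyses: the data are held in plain dictionaries keyed by gene, the function name and toy values are hypothetical, and the threshold of 10 mimics the RPM cutoff of filter iv.

```python
from statistics import median

def filter_and_center(log2_fc, expression, has_site, min_expression):
    """Keep mRNAs above an expression threshold, then subtract the median
    log2 fold change of the retained mRNAs that lack a site."""
    kept = {g: fc for g, fc in log2_fc.items() if expression[g] > min_expression}
    no_site = [fc for g, fc in kept.items() if not has_site[g]]
    offset = median(no_site)
    return {g: fc - offset for g, fc in kept.items()}

# Hypothetical toy data: gene E falls below the expression threshold.
log2_fc = {"A": -0.8, "B": 0.1, "C": 0.3, "D": -0.1, "E": 2.0}
expression = {"A": 50, "B": 20, "C": 15, "D": 30, "E": 2}
has_site = {"A": True, "B": False, "C": False, "D": False, "E": False}

centered = filter_and_center(log2_fc, expression, has_site, min_expression=10)
```

Centering on the median of the no-site mRNAs places the reference population at zero, so that the centered value for a site-containing mRNA (gene A here) reads directly as repression relative to mRNAs without sites.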
Crosslinking and other interactome datasets

When available, target genes identified using high-throughput CLIP data were collected from the supplemental materials of the corresponding studies (Lipchina et al., 2011; Loeb et al., 2012; Helwak et al., 2013; Grosswendt et al., 2014). For the original PAR-CLIP study (Hafner et al., 2010), targets were inferred from an online resource of all endogenous HEK293 clusters (http://www.mirz.unibas.ch/restricted/clipdata/RESULTS/CLIP_microArray/Antago_mir_vs_ALL_AGO.txt) as well as clusters observed after transfection of either miR-7 (http://www.mirz.unibas.ch/restricted/clipdata/RESULTS/miR7_TRANSFECTION/miR7_TRANSFECTION.html) or miR-124 (http://www.mirz.unibas.ch/restricted/clipdata/RESULTS/miR124_TRANSFECTION/miR124_TRANSFECTION.html). For dCLIP-supported miR-124 sites identified in the original high-throughput CLIP study (Chi et al., 2009), we used clusters whose genomic coordinates were provided by S.-W. Chi (Supplementary file 3), extracting the corresponding sequences using the “getfasta” utility in BEDTools v2.20.1 (parameters “-s -name -tab”) (Quinlan and Hall, 2010). When evaluating the function of non-canonical sites supported by CLIP or IMPACT-seq (Figures 1A–H and Figure 1–figure supplements 1–4), a cluster (or CLASH/chimera interaction) with a 6–8mer site (but not only an offset 6mer site, unless otherwise indicated in the figure legends) corresponding to the cognate miRNA was classified as harboring a canonical site. Otherwise, the cluster (or CLASH/chimera interaction) was classified as containing a non-canonical site, and the corresponding mRNA was carried forward for functional evaluation as a non-canonical CLIP-supported target if it also had no cognate 6–8mer sites (but allowing offset 6mer sites) in its 3′ UTR (using either RefSeq or Ensembl 3′-UTR annotations as appropriate for the gene IDs published by the CLIP study).
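The classification rule above can be sketched in a few lines. The sketch assumes the standard definitions of the canonical site types (8mer: match to miRNA nucleotides 2–8 followed by an A; 7mer-m8: match to nucleotides 2–8; 7mer-A1: match to nucleotides 2–7 followed by an A; 6mer: match to nucleotides 2–7; offset 6mer: match to nucleotides 3–8) and sequences written in the RNA alphabet. The function names are hypothetical, and the example miRNA is only the 5′ portion of miR-124, used for illustration.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def revcomp(rna):
    """Reverse complement of an RNA sequence."""
    return rna.translate(COMPLEMENT)[::-1]

def site_patterns(mirna):
    """Canonical site sequences (5'->3' on the mRNA), strongest type first."""
    m2_8 = revcomp(mirna[1:8])   # pairs to miRNA nucleotides 2-8
    m2_7 = revcomp(mirna[1:7])   # pairs to miRNA nucleotides 2-7
    m3_8 = revcomp(mirna[2:8])   # pairs to miRNA nucleotides 3-8
    return [("8mer", m2_8 + "A"), ("7mer-m8", m2_8), ("7mer-A1", m2_7 + "A"),
            ("6mer", m2_7), ("offset 6mer", m3_8)]

def classify_cluster(cluster_seq, mirna, count_offset_6mer=False):
    """Label a cluster canonical if it contains a 6-8mer site to the miRNA;
    an offset 6mer alone counts only when count_offset_6mer is True."""
    for name, pattern in site_patterns(mirna):
        if name == "offset 6mer" and not count_offset_6mer:
            continue
        if pattern in cluster_seq:
            return "canonical", name
    return "non-canonical", None

mir124_5p = "UAAGGCACGCGG"  # illustrative 5' portion of miR-124
print(classify_cluster("CCAGUGCCUUAGG", mir124_5p))
print(classify_cluster("CCAGUGCCUCAGG", mir124_5p))
```

The `count_offset_6mer` switch mirrors the default treatment in the text, in which a cluster containing only an offset 6mer site is not counted as canonical unless a figure legend indicates otherwise.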
When comparing the response of canonical CLIP-supported targets to that of TargetScan7 predictions (Figure 6), the canonical CLIP-supported sites were additionally required to fall within (and on the same DNA strand as) annotated 3′ UTRs, as evaluated by the intersectBED utility in BEDTools v2.20.1 (parameter “-s”) (Quinlan and Hall, 2010).

Motif discovery for non-canonical binding sites

To identify non-canonical modes of binding, all CLASH interactions assigned to a particular miRNA family (defined as all mature miRNA sequences sharing a common sequence in nucleotide positions 2–8) were collected. Interactions containing the cognate canonical site type (offset 6mer, 6mer, 7mer-m8, 7mer-A1, or 8mer) were removed. For all miRNA families with at least 50 unique CLASH interactions remaining, enriched motifs were evaluated using MEME version 4.9.0 (parameters “-p 100 -dna -mod zoops -nmotifs 10 -minw 4 -maxw 8 -maxsize 1000000000”) (Bailey and Elkan, 1994). All motifs with an E-value < 10⁻³ are reported along with their E-values rounded to the nearest log-unit. Instances in which a top-ranked motif exceeded this E-value were also reported if the motif was an approximate complementary match to the miRNA. For each miRNA family, the top motif identified by MEME was aligned to a representative mature miRNA using FIMO (parameters “--norc --motif 1 --thresh 0.01”) (Grant et al., 2011), considering the reverse complement of the mature miRNA with the last nucleotide of this reverse complement changed to an A (to capture the enrichment of an adenosine across from the 5′ nucleotide of a miRNA, as occurs in 8mer and 7mer-A1 sites). Logos were also manually examined to determine if any mapped to the mature miRNA with a bulged nucleotide. The same procedure was performed for chimera interactions, for dCLIP clusters reported for miR-124 and miR-155, and for IMPACT-seq clusters reported for miR-522.
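The sequence against which the top motif was aligned (the reverse complement of the mature miRNA with its last nucleotide changed to an A) can be constructed as in the following sketch. The function name and the example sequence are hypothetical, the RNA alphabet is assumed, and writing the result to a file for FIMO is not shown.

```python
def motif_alignment_target(mirna):
    """Reverse complement of the miRNA, with the final nucleotide (the one
    opposite miRNA position 1) replaced by an A."""
    rc = mirna.translate(str.maketrans("ACGU", "UGCA"))[::-1]
    return rc[:-1] + "A"

# Hypothetical miRNA beginning with C: the plain reverse complement would
# end in G, which is replaced by A.
example = motif_alignment_target("CAAGGCACGC")
print(example)
```

For a miRNA that begins with U, the reverse complement already ends in A, so the substitution changes the sequence only for miRNAs with another 5′ nucleotide.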
Microarray dataset normalization For each of the 74 transfection experiments of the compendium (Table 2), data were first partitioned into the mRNA fold changes (log2) measured in the given experiment (the response variable) as well as a matrix of the corresponding mRNA fold changes for the remaining 73 datasets (the predictor variables). A PLSR model was then trained to predict the response using information from the predictor variables. When training the model, PLSR took into account the correlated structure of the predictor matrix, decomposing it into a low-dimensional representation that maximally explained the response variable. Stating the procedure more formally, let Z be an n x m matrix consisting of log2(mRNA fold change) measurements of n mRNAs in response to the sRNA transfected in each of m experiments. Let yi represent measurements for all mRNAs in the ith experiment of Z, and Xī represent measurements for all mRNAs from all experiments except for the ith experiment in Z. Finally, let Tī be a matrix with identical dimensions as Xī, with entries tj,k = 1 if the 3′ UTR of mRNA j in Xī contains a canonical 7–8 nt match to the small RNA transfected in experiment k in Xī, and tj,k = 0 otherwise. Missing values in Z represent cases in which the mRNA signal in the microarray was too low to be reliably measured. The following algorithm was used to normalize each yi for i ∈ {1…74}: 83 i) For values in Tī in which tj,k = 1, the corresponding value xj,k in Xī was removed, which prevented the loss of signal in yi,j due to sRNA-mediated regulation of the mRNA in two independent experiments. ii) mRNAs in yi, Xī, and Tī were removed if the log2(mRNA fold change) was either undefined in yi or undefined in greater than 50% of experiments in Xī. iii) For the remaining missing values in Xī, values were imputed using the k-nearest neighbors algorithm, using k = 20, as implemented in the impute.knn function in the “impute” R package (Troyanskaya et al., 2001). 
Results were robust to the choice of imputation algorithm (data not shown).
iv) To remove biases afflicting $y_i$, $y_i$ was predicted from $X_{\bar{i}}$ using partial least squares regression, as implemented in the plsr function in the “pls” R package (Mevik and Wehrens, 2007). Ten-fold cross-validation was used to choose an appropriate number of components in the regression. Values of $y_i$ were then adjusted to their residuals: $y_i \leftarrow y_i - \hat{y}_i$, where $\hat{y}_i$ was the vector of predicted values of $y_i$ from the regression.

An analogous normalization procedure was performed for each of the seven transfection experiments of the test set (Supplementary File 2).

RNA structure prediction

3′ UTRs were folded locally using RNAplfold (Bernhart et al., 2006), allowing the maximal span of a base pair to be 40 nucleotides, and averaging pair probabilities over an 80 nt window (parameters “-L 40 -W 80”), parameters found to be optimal when evaluating siRNA efficacy (Tafer et al., 2008). For each position 15 nt upstream and downstream of a target site, and for 1–15 nt windows beginning at each position, the partial correlation of the log10(unpaired probability) to the log2(mRNA fold change) associated with the site was plotted, controlling for known determinants of targeting used in the context+ model, which include min_dist, local_AU, 3P_score, SPS, and TA (Garcia et al., 2011). For the final predicted structural accessibility score used as a feature, we computed the log10 of the probability that a 14-nt segment centered on the match to sRNA positions 7 and 8 was unpaired.
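The final accessibility feature can be illustrated with a short sketch. It assumes that unpaired probabilities for 14-nt windows are already available in a lookup keyed by the 3′ end of each window (the layout used here is an assumption, and both function names are hypothetical); it does not run RNAplfold itself.

```python
import math

def centered_window(left, width=14):
    """Return 1-based (start, end) of a window centered on two adjacent nucleotides.

    `left` is the 5'-most of the two target nucleotides that pair to sRNA
    positions 8 and 7. With width 14, six nucleotides flank the pair on each side.
    """
    flank = (width - 2) // 2
    return left - flank, left + 1 + flank

def accessibility_score(unpaired_prob_by_end, left):
    """log10 of the probability that the centered 14-nt window is unpaired.

    `unpaired_prob_by_end` is an assumed lookup: window end position -> probability.
    """
    _, end = centered_window(left)
    return math.log10(unpaired_prob_by_end[end])

start, end = centered_window(20)
print(start, end, end - start + 1)
print(accessibility_score({27: 0.01}, 20))
```

Because the score is a log10 probability, it is never positive, and values closer to zero indicate a more accessible site.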
Calculation of PCT scores

We updated human PCT scores using the following datasets: i) 3′ UTRs derived from 19,800 human protein-coding genes annotated in Gencode version 19 (Harrow et al., 2012), and ii) 3′-UTR multiple sequence alignments (MSAs) across 84 vertebrate species derived from the 100-way multiz alignments in the UCSC genome browser, which used the human genome release hg19 as a reference species (Kent et al., 2002; Karolchik et al., 2014). We did not use all 100 species because, with the exception of coelacanth (a lobe-finned fish more closely related to the tetrapods), the fish species were excluded due to their poor quality of alignment within 3′ UTRs. Likewise, we updated the mouse scores using: i) 3′ UTRs derived from 19,699 mouse protein-coding genes annotated in Ensembl 77 (Flicek et al., 2014), and ii) 3′-UTR MSAs across 52 vertebrate species derived from the 60-way multiz alignments in the UCSC genome browser, which used the mouse genome release mm10 as a reference species (Kent et al., 2002; Karolchik et al., 2014). As before, we partitioned 3′ UTRs into ten conservation bins based upon the median branch-length score (BLS) of the reference-species nucleotides (Friedman et al., 2009). However, to estimate branch lengths of the phylogenetic trees for each bin, we concatenated alignments within each bin using the “msa_view” utility in the PHAST package v1.1 (parameters “--unordered-ss --in-format SS --out-format SS --aggregate $species_list --seqs $species_subset”, where $species_list contains the entire species tree topology and $species_subset contains the topology of the subtree spanning the placental mammals) (Siepel and Haussler, 2004).
We then fit trees for each bin using the “phyloFit” utility in the PHAST package v1.1, utilizing the generalized time-reversible substitution model and a fixed-tree topology provided by UCSC (parameters “-i SS --subst-mod REV --tree $tree”, where $tree is the Newick format tree of the placental mammals) (Siepel and Haussler, 2004). PCT parameters and scores were then calculated as described, estimating the signal of conservation for each seed family relative to that of its corresponding 50 control k-mers, matched for k-mer length and rate of dinucleotide conservation at varying branch-length windows (Friedman et al., 2009). All phylogenetic trees and PCT parameters are available for download at the TargetScan website (targetscan.org).

Selection of mRNAs for regression modeling

The mRNAs were selected to avoid those from genes with multiple highly expressed alternative 3′-UTR isoforms, which would have otherwise obscured the accurate measurement of features such as len_3UTR or min_dist, and also created situations in which the response was diminished because some isoforms lacked the target site. HeLa 3P-seq results (Nam et al., 2014) were used to identify genes in which a dominant 3′-UTR isoform comprised ≥90% of the transcripts (Supplementary file 1). For each of these genes, the mRNA with the dominant 3′-UTR isoform was carried forward, together with the ORF and 5′-UTR annotations previously chosen from RefSeq (Garcia et al., 2011). Sequences of these mRNA models are provided as Supplemental Material at http://bartellab.wi.mit.edu/publication.html. To prevent the presence of multiple 3′-UTR sites to the transfected sRNA from confounding attribution of an mRNA change to an individual site, these mRNAs were further filtered within each dataset to consider only mRNAs that contained a single 3′-UTR site (either an 8mer, 7mer-m8, 7mer-A1, or 6mer) to the cognate sRNA.
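The single-site filter can be sketched as follows, using the standard definitions of the four canonical site types (a match to sRNA positions 2–7, optionally extended by a match to position 8 and/or an A across from position 1). The function names and the toy 3′-UTR sequences are hypothetical, and sequences are assumed to be in the RNA alphabet.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def revcomp(rna):
    return "".join(COMPLEMENT[nt] for nt in reversed(rna))

def canonical_sites(utr, srna):
    """List (0-based start of the 6-nt core match, site type) for each site in a 3' UTR."""
    core = revcomp(srna[1:7])        # pairs to sRNA positions 2-7
    match8 = COMPLEMENT[srna[7]]     # pairs to sRNA position 8
    sites = []
    start = utr.find(core)
    while start != -1:
        end = start + 6
        has_m8 = start > 0 and utr[start - 1] == match8
        has_a1 = end < len(utr) and utr[end] == "A"
        if has_m8 and has_a1:
            kind = "8mer"
        elif has_m8:
            kind = "7mer-m8"
        elif has_a1:
            kind = "7mer-A1"
        else:
            kind = "6mer"
        sites.append((start, kind))
        start = utr.find(core, start + 1)
    return sites

def has_single_site(utr, srna):
    """True if the 3' UTR contains exactly one canonical site to the sRNA."""
    return len(canonical_sites(utr, srna)) == 1

let7a = "UGAGGUAGUAGGUUGUAUAGUU"
print(canonical_sites("GGGCUACCUCAGGG", let7a))    # [(4, '8mer')]
print(has_single_site("GGUACCUCGGGCUACCUCAG", let7a))  # False: two sites
```

Restricting the training data to single-site mRNAs means that each observed fold change can be attributed to one site, at the cost of discarding mRNAs with multiple sites.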
Scaling the scores of each feature

Features that exhibited skewed distributions, such as len_5UTR, len_ORF, and len_3UTR, were log10 transformed (Table 1), which made their distributions approximately normal. These and other continuous features were then normalized to the [0, 1] interval as described [e.g., see Supplementary Figure 5 in (Garcia et al., 2011)], except a trimmed normalization was implemented to prevent outlier values from distorting the normalized distributions. For each value, the 5th percentile of the feature was subtracted from the value, and the resulting quantity was divided by the difference between the 95th and 5th percentiles of the feature. Percentile values are provided for the subset of continuous features that were scaled (Table 3). The trimmed normalization facilitated comparison of the contributions of different features to the model, with absolute values of the coefficients serving as a rough indication of their relative importance.

Stepwise regression and multiple linear regression models

We generated 1000 bootstrap samples, each including 70% of the data from each transfection experiment of the compendium of 74 datasets (Supplementary file 1), with the remaining data reserved as a held-out test set. For each bootstrap sample, stepwise regression, as implemented in the stepAIC function from the “MASS” R package (Venables and Ripley, 2002), was used to both select the most informative combination of features and train a model. Feature selection minimized the Akaike information criterion (AIC), defined as $-2 \ln(L) + 2k$, where $L$ was the likelihood of the data given the linear regression model and $k$ was the number of features or parameters selected. The 1000 resulting models were each evaluated based on their r² to the corresponding test set.
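As a rough illustration of the criterion defined above, the sketch below evaluates -2 ln(L) + 2k for a linear model with Gaussian errors, using the maximum-likelihood variance estimate. It is not the stepAIC implementation, which may differ by an additive constant and in how parameters are counted; the function names are hypothetical.

```python
import math

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood given the residuals of a linear fit."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(log_likelihood, k):
    """Akaike information criterion: -2 ln(L) + 2k."""
    return -2.0 * log_likelihood + 2.0 * k

# Two toy models with the same number of parameters but different fits.
tight_fit = aic(gaussian_log_likelihood([0.1, -0.1, 0.1, -0.1]), k=2)
loose_fit = aic(gaussian_log_likelihood([1.0, -1.0, 1.0, -1.0]), k=2)
print(tight_fit < loose_fit)  # True
```

Under this definition a better fit (larger L) lowers the value, whereas each additional parameter raises it by 2, so a feature is retained only when its improvement to the fit outweighs that penalty.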
To illustrate the utility of adding features not included in our previous models, these r² values were compared to those obtained when re-training the multiple linear regression coefficients on each bootstrap sample using only the features of either the context-only or the context+ model, and computing r² values on the corresponding test sets. The stepwise regression was implemented independently for each of the site types, and a final set of features was chosen as those that were selected for at least 99% of the bootstrap samples of at least two site types. Using this group of features and the entire compendium of 74 datasets as a training set, we trained a multiple linear regression model for each site type (Figure 4–Source data 1). As done previously for TargetScan6 predictions, scores for 8mer, 7mer-m8, 7mer-A1, and 6mer sites were bounded to be no greater than –0.03, –0.02, –0.01, and 0, respectively, thereby creating a piece-wise linear function for each site type.

Collection and processing of previous predictions

To compare predictions from different miRNA target prediction tools, we collected the following freely downloadable predictions: AnTar (predictions from either miRNA-transfection or CLIP-seq models) (Wen et al., 2011), DIANA-microT-CDS (September 2013) (Reczko et al., 2012), ElMMo v5 (January 2011) (Gaidatzis et al., 2007), MBSTAR (all predictions) (Bandyopadhyay et al., 2015), miRanda-MicroCosm v5 (Griffiths-Jones et al., 2008), miRmap v1.1 (September 2013) (Vejnar and Zdobnov, 2012), mirSVR (August 2010) (Betel et al., 2010), miRTarget2 (from miRDB v4.0, January 2012) (Wang, 2008; Wang and El Naqa, 2008), MIRZA-G (sets predicted either with or without conservation features and either with or without more stringent seed-match requirements, March 2015) (Gumienny and Zavolan, 2015), PACCMIT-CDS (sets predicted either with or without conservation features) (Marin et al., 2013), PicTar2 (from the doRiNA web resource; sets conserved to either
fish, chicken, or mammals) (Krek et al., 2005; Anders et al., 2012), PITA Catalog v6 (3/15 flank for either “All” or “Top” predictions, August 2008) (Kertesz et al., 2007), RNA22 (May 2011) (Miranda et al., 2006), SVMicrO (Feb 2011) (Liu et al., 2010), TargetRank (all scores from web server) (Nielsen et al., 2007), TargetSpy (all predictions) (Sturm et al., 2010), TargetScan v5.2 (either conserved or all predictions, June 2011) (Grimson et al., 2007), and TargetScan v6.2 (either conserved predictions ranked by the context+ model or all predictions ranked by either the context+ model or PCT scores, June 2012) (Friedman et al., 2009; Garcia et al., 2011). For algorithms providing site-level predictions (i.e., ElMMo, MBSTAR, mirSVR, PITA, RNA22, and TargetScan), scores were summed within genes or transcripts (if available) to acquire an aggregate score. For algorithms providing multiple transcript-level predictions (i.e., miRanda-MicroCosm, PACCMIT-CDS, and TargetSpy), the transcript with the best score was selected as the representative transcript isoform. In all cases, predictions with gene symbol or Ensembl ID formats were translated into RefSeq format. When computing r² to the test sets, mRNAs that were not predicted by the algorithm to be a target were assigned the worst score in the range of all scores generated by the algorithm.

3′-UTR profiles for TargetScan7 predictions

To build databases of human and mouse 3′-UTR profiles, we began with the “basic” set of protein-coding gene models deposited in Gencode v19 (human hg19 assembly) and Gencode vM3 (mouse mm10 assembly), respectively (Harrow et al., 2012). For each unique stop codon in each set of gene models, we selected the transcript with the longest 3′ UTR as its representative transcript.
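The choice of a representative transcript for each unique stop codon can be sketched as a simple grouping step. The record layout below (dictionary keys such as `stop_codon` and `utr3_len`) and the toy identifiers are hypothetical and are not the Gencode file format.

```python
def pick_representatives(transcripts):
    """For each unique stop codon, keep the transcript with the longest 3' UTR.

    Each transcript is a dict with hypothetical keys: 'id', 'chrom', 'strand',
    'stop_codon' (genomic coordinate), and 'utr3_len' (3'-UTR length in nt).
    """
    best = {}
    for tx in transcripts:
        key = (tx["chrom"], tx["strand"], tx["stop_codon"])
        if key not in best or tx["utr3_len"] > best[key]["utr3_len"]:
            best[key] = tx
    return best

toy = [
    {"id": "tx1", "chrom": "chr1", "strand": "+", "stop_codon": 1000, "utr3_len": 350},
    {"id": "tx2", "chrom": "chr1", "strand": "+", "stop_codon": 1000, "utr3_len": 1200},
    {"id": "tx3", "chrom": "chr1", "strand": "+", "stop_codon": 5000, "utr3_len": 80},
]
reps = pick_representatives(toy)
print(sorted(tx["id"] for tx in reps.values()))  # ['tx2', 'tx3']
```

Keying on the stop codon rather than the gene keeps separate representatives for isoforms that terminate translation at different positions.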
If other datasets indicated that the 3′ UTRs of these representative transcripts have longer tandem isoforms, we extended them accordingly, using additional annotations provided by i) the “comprehensive” set of Gencode gene models (Harrow et al., 2012), ii) all mRNAs in the RefSeq database (Pruitt et al., 2012), downloaded from the refGene database through the UCSC table browser (Kent et al., 2002), and iii) 3′-UTR extensions supported by RNA-seq evidence (Miura et al., 2013), after transforming mm9 to mm10 coordinates using liftOver (Hinrichs et al., 2006). We then used 3P-seq clusters from human and mouse (Nam et al., 2014) (again after transforming coordinates with liftOver) to further extend 3′ UTRs when possible, searching within a 5400 nt region downstream of the stop codon (excluding the regions containing annotated introns) for a cleavage and polyadenylation site supported by at least one 3P-seq cluster, while preventing the search from extending beyond the start position of any annotated downstream exon. The 5400 nt window was chosen because the 99th percentile of the lengths of previously annotated mouse and human 3′ UTRs was ~5400 nt. Zebrafish 3′ UTRs for TargetScanFish were identical to those annotated previously (Ulitsky et al., 2012). For each representative transcript, 3P-seq clusters mapping within the extended 3′ UTR were used to quantify the relative levels of alternative tandem isoforms, thereby generating a 3′-UTR profile. For human and mouse transcripts, all 3P-seq datasets for cell lines/tissues of each species were combined, after normalizing for the sequencing depth (i.e., number of uniquely mapping tags) of each dataset, to generate meta profiles. To perform this normalization, the number of tags overlapping the 3′ UTR of each annotated transcript was first summed. A matrix of summed tag counts for each cell line/tissue and for each transcript was then compiled, removing transcripts with no tags in any cell type.
This matrix was then upper-quartile normalized by calculating the 75th percentile of counts in each cell type, using the calcNormFactors function (parameter “method='upperquartile'”) in the “edgeR” R package (Robinson et al., 2010). These scaling factors were then applied to all tags, and the normalized tag counts corresponding to each 3P-seq cluster from different cell lines/tissues were summed. To accommodate cases in which the longest annotated 3′ UTR did not have tag support, a one-tag pseudocount was added to the longest tandem 3′-UTR isoform. The 3′-UTR profiles were then generated and used to compute the affected isoform ratio (AIR) and weighted context++ score for each predicted target site as depicted in Figures 2A and 3A, respectively, of Nam et al. (2014). For zebrafish transcripts, profiles were generated for each developmental stage with a 3P-seq dataset. All input and output annotation files as well as scripts are available for download at TargetScan (targetscan.org).

MicroRNA sets for TargetScan7

When partitioning miRNA families according to their conservation level, we began with a high-confidence set of human miRNAs supported by small-RNA sequencing (T. Tuschl, personal communication) that shared nucleotides 2–8 with a mouse miRNA supported by small-RNA sequencing (Chiang et al., 2010). We then extracted 100-way multiz alignments of each mature miRNA from the UCSC Genome Browser and counted the number of species for which nucleotides 2–8 of the miRNA did not change. As an initial pass, those conserved among ≥40 species were classified as mammalian conserved, and those conserved among >60 species were classified as more broadly conserved among vertebrate species. Due to poorer quality alignments for more distantly related species, this procedure misclassified several more broadly conserved miRNAs as mammalian conserved.
Therefore, mammalian conserved miRNAs that aligned with >90% homology to a mature miRNA from chicken, frog, or zebrafish, as annotated in miRBase release 21 (Kozomara and Griffiths-Jones, 2014), were re-classified as more broadly conserved. In addition, miR-489 was included in the broadly conserved set of TargetScanHuman (but not TargetScanMouse) despite having a seed substitution in mouse.

Some mammalian pri-miRNAs give rise to two or three abundant miRNA isoforms that have different seeds, either because both strands of the miRNA duplex load into Argonaute with near-equal efficiencies or because processing heterogeneity gives rise to alternative 5′ termini (Azuma-Mukai et al., 2008; Morin et al., 2008; Wu et al., 2009; Chiang et al., 2010). To annotate these abundant isoforms, we identified all isoforms whose start position was supported by at least 33% as many reads as the most abundantly mapped start position on the precursor hairpin. These isoforms were carried forward as mammalian conserved isoforms if they also satisfied this property in the mouse small-RNA sequencing data (Chiang et al., 2010), and as broadly conserved isoforms if they satisfied this property in zebrafish small-RNA sequencing data available in miRBase. Adhering to the miRNA naming convention, if two isoforms mapped to the 5′ and 3′ arms of the hairpin they were named “–5p” and “–3p”, respectively, and if two isoforms were processed from the same arm they were named “.1” and “.2” in decreasing order of their abundance, as detected in the human.

All mature miRNAs were downloaded from miRBase release 21 (Kozomara and Griffiths-Jones, 2014). Those that matched a conserved miRNA at nucleotides 2–8 were considered part of that miRNA family.
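Grouping mature miRNAs into families by the identity of nucleotides 2–8 can be sketched as follows. The function names are hypothetical, and the three sequences are included only as a small example rather than as a reproduction of the miRBase download.

```python
from collections import defaultdict

def seed_2_to_8(mirna):
    """Nucleotides 2-8 of a mature miRNA (1-based positions, 5' to 3')."""
    return mirna[1:8]

def group_into_families(mirnas):
    """Map each 7-nt seed (positions 2-8) to the sorted names of miRNAs sharing it."""
    families = defaultdict(list)
    for name, sequence in mirnas.items():
        families[seed_2_to_8(sequence)].append(name)
    return {seed: sorted(names) for seed, names in families.items()}

example = {
    "let-7a": "UGAGGUAGUAGGUUGUAUAGUU",
    "let-7b": "UGAGGUAGUAGGUUGUGUGGUU",
    "miR-1": "UGGAAUGUAAAGAAGUAUGUAU",
}
print(group_into_families(example))
```

Here let-7a and let-7b fall into one family (seed GAGGUAG) even though their 3′ regions differ, whereas miR-1 forms a separate family.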
All miRNAs and miRNA isoforms annotated in miRBase but not meeting our criteria for conservation in mammals or beyond were also grouped into families based on the identity of nucleotides 2–8 and were classified as poorly conserved miRNAs (which included many small RNAs misclassified as miRNAs). All mammalian conserved, broadly conserved, and poorly conserved miRNA seed families are available for download at TargetScan (targetscan.org).

TargetScan7 predictions

TargetScan (v7.0) provides the option of ranking predicted targets of mammalian miRNAs according to either cumulative weighted context++ score (CWCS), which ranks based upon the predicted repression, or aggregate PCT score of the longest 3′-UTR isoform, which ranks based upon the confidence that targeting is evolutionarily conserved (Figure 7–figure supplement 1). For each predicted target, the CWCS estimated the total repression expected from multiple sites to the same miRNA. This score was calculated using the 3′-UTR profiles to weight the marginal effect of each additional site to the miRNA while also taking into account the predicted mRNA depletion resulting from any downstream sites to the same miRNA. This approach improved upon the one we used previously to calculate total wContext+ scores (Nam et al., 2014), in that it did not over-estimate the aggregate effect of multiple sites in distal isoforms. For each miRNA family, 8mer, 7mer-m8, 7mer-A1, and 6mer sites were first filtered to remove overlapping sites, and for each reference 3′ UTR, nonoverlapping sites to the same miRNA were numbered from 1 to n, starting at the distal end of the 3′ UTR.
For each site $i$, from 1 to $n$, the cumulative predicted repression at that site ($C_i$) was calculated as $C_i = C_{i-1} + (1 - 2^{\mathrm{CS}_i})(\mathrm{AIR}_i - C_{i-1})$, in which $\mathrm{CS}_i$ and $\mathrm{AIR}_i$ were the context++ score and AIR of site $i$, and the $(1 - 2^{\mathrm{CS}_i})(\mathrm{AIR}_i - C_{i-1})$ term predicted the marginal repression of site $i$, in which the predicted repression at the site $(1 - 2^{\mathrm{CS}_i})$ was modified based on the fraction of mRNAs containing that site ($\mathrm{AIR}_i$) as reduced by the mRNA depletion predicted to occur from the action of any more distal sites ($C_{i-1}$, assigning $C_0$ as 0). The CWCS was then calculated as $\log_2(1 - C_n)$, in which $C_n$ was the $C_i$ at the most proximal site of the reference 3′ UTR. For each reference 3′ UTR, CWCSs were calculated for each member of a miRNA family, and the score from the member with the greatest predicted repression was chosen to represent that family, and the reference 3′ UTR with the most 3P-seq tags was chosen to represent the gene.

When scoring features that can vary with 3′-UTR length (Min_dist, Len_3UTR, and Off6m), a weighted score was used that accounted for the abundance of each 3′-UTR tandem isoform in which the site existed, as estimated from a compendium of 3P-seq datasets from the same species (Nam et al., 2014). Although 6mer sites are used to calculate cumulative weighted context++ scores, and 6mer sites are tallied in the tables, the locations of these 6mer sites are not displayed, and targets with only 6mer sites are not listed. When calculating PCT scores, the most abundant 3′-UTR isoform as defined by 3P-seq was used to determine the conservation bin to which the 3′ UTR belonged. Sites corresponding to poorly conserved and mammalian-conserved miRNA seed families or sites overlapping annotated ORF regions were assigned PCT scores of zero. For TargetScanFish, genome-wide alignments in zebrafish 3′ UTRs were not of sufficient quality to compute PCT scores, so a PCT value of zero was assigned to all sites when computing context++ scores.
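The CWCS recursion described above can be sketched in a few lines. This is an illustrative re-implementation rather than the distributed Perl scripts; the function name is hypothetical, and sites are assumed to be supplied already filtered for overlap and ordered from the most distal to the most proximal.

```python
import math

def cumulative_weighted_context_score(sites):
    """CWCS for one miRNA and one reference 3' UTR.

    `sites` is a list of (context_score, air) pairs ordered distal to proximal,
    where context_score is a log2 fold change (<= 0) and air is the affected
    isoform ratio (between 0 and 1).
    """
    c = 0.0  # C_0 = 0
    for context_score, air in sites:
        # C_i = C_{i-1} + (1 - 2^CS_i) * (AIR_i - C_{i-1})
        c = c + (1.0 - 2.0 ** context_score) * (air - c)
    return math.log2(1.0 - c)

# One site present in every isoform: the CWCS equals the site's own score.
print(cumulative_weighted_context_score([(-0.5, 1.0)]))
# Two such sites with AIR = 1 combine additively in log2 space.
print(cumulative_weighted_context_score([(-1.0, 1.0), (-1.0, 1.0)]))
# A site found in only half of the isoforms contributes less.
print(cumulative_weighted_context_score([(-1.0, 0.5)]))
```

When every site has an AIR of 1, the recursion reduces to summing the context++ scores; an AIR below 1 shrinks the marginal contribution of a site, which is how the score avoids over-estimating the effect of sites found only in distal isoforms.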
All PCT parameters and parameters for tree branch lengths and regression models, along with pre-computed context++ scores for human, mouse, zebrafish, and other vertebrate species, are available for download (targetscan.org). Perl scripts using these parameters to compute context++ scores, weighted context++ scores, CWCSs, and aggregate PCT scores are also provided (targetscan.org). Predictions are also made for homologous 3′ UTRs of other vertebrate species, using either human-centric or mouse-centric 3′-UTR definitions and corresponding MSAs.

Acknowledgements

We thank the Bioinformatics and Research Computing group at the Whitehead Institute (I. Barrasa, B. Yuan, Y. Huang, and P. Thiru) for help implementing improvements to the TargetScan website, A. Subtelny for providing insight into positional effects of the miRNA seed, I. Ulitsky for initial help with 3P-seq analysis, R. Friedman for discussions regarding the computation of PCT parameters, T. Tuschl for sharing an unpublished list of the most frequently sequenced human miRNA isoforms, G. Agarwal for discussions regarding normalization techniques, G. Kudla for help processing the microarray data from the CLASH study, S.-W. Chi and R. B. Darnell for confirmation of the mRNAs identified as miR-124 targets in their dCLIP study, O. Rissland and J. Guo for critical reading of the manuscript, and members of the Bartel lab for helpful discussions. This work was supported by a National Science Foundation Graduate Research Fellowship (to V.A.) and an NIH grant GM067031 (to D.P.B.). D.P.B. is an investigator of the Howard Hughes Medical Institute.

References

Ameres, S.L., Martinez, J., and Schroeder, R. (2007). Molecular basis for target RNA recognition and cleavage by human RISC. Cell 130, 101-112. Anders, G., Mackowiak, S.D., Jens, M., Maaskola, J., Kuntzagk, A., Rajewsky, N., Landthaler, M., and Dieterich, C. (2012). doRiNA: a database of RNA interactions in post-transcriptional regulation.
Nucleic Acids Res 40, D180-D186. Anderson, E.M., Birmingham, A., Baskerville, S., Reynolds, A., Maksimova, E., Leake, D., Fedorov, Y., Karpilow, J., and Khvorova, A. (2008). Experimental validation of the importance of seed complement frequency to siRNA specificity. RNA 14, 853-861. Arvey, A., Larsson, E., Sander, C., Leslie, C.S., and Marks, D.S. (2010). Target mRNA abundance dilutes microRNA and siRNA activity. Mol Syst Biol 6, 363. Azuma-Mukai, A., Oguri, H., Mituyama, T., Qian, Z.R., Asai, K., Siomi, H., and Siomi, M.C. (2008). Characterization of endogenous human Argonautes and their miRNA partners in RNA silencing. Proceedings of the National Academy of Sciences of the United States of America 105, 7964-7969. Baek, D., Villen, J., Shin, C., Camargo, F.D., Gygi, S.P., and Bartel, D.P. (2008). The impact of microRNAs on protein output. Nature 455, 64-71. Bailey, T.L., and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28-36. Bandyopadhyay, S., Ghosh, D., Mitra, R., and Zhao, Z. (2015). MBSTAR: multiple instance learning for predicting specific functional binding sites in microRNA targets. Sci Rep 5, 8004. Bartel, D.P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281-297. Bartel, D.P. (2009). MicroRNAs: target recognition and regulatory functions. Cell 136, 215-233. Bazzini, A.A., Lee, M.T., and Giraldez, A.J. (2012). Ribosome Profiling Shows That miR-430 Reduces Translation Before Causing mRNA Decay in Zebrafish. Science 336, 233-237. Bernhart, S.H., Hofacker, I.L., and Stadler, P.F. (2006). Local RNA base pairing probabilities in large sequences. Bioinformatics 22, 614-615. Betel, D., Koppal, A., Agius, P., Sander, C., and Leslie, C. (2010). Comprehensive modeling of microRNA targets predicts functional non-conserved and non- canonical sites. Genome Biol 11, R90. 
Birmingham, A., Anderson, E.M., Reynolds, A., Ilsley-Tyree, D., Leake, D., Fedorov, Y., Baskerville, S., Maksimova, E., Robinson, K., Karpilow, J., et al. (2006). 3' UTR seed matches, but not overall identity, are associated with RNAi off-targets. Nat Methods 3, 199-204. Brennecke, J., Stark, A., Russell, R.B., and Cohen, S.M. (2005). Principles of microRNA-target recognition. PLoS Biol 3, e85. Bushati, N., and Cohen, S.M. (2007). MicroRNA functions. Annual Review of Cell and Developmental Biology 23, 175-205. Chi, S.W., Hannon, G.J., and Darnell, R.B. (2012). An alternative mode of microRNA target recognition. Nat Struct Mol Biol 19, 321-327. Chi, S.W., Zang, J.B., Mele, A., and Darnell, R.B. (2009). Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature 460, 479-486. Chiang, H.R., Schoenfeld, L.W., Ruby, J.G., Auyeung, V.C., Spies, N., Baek, D., Johnston, W.K., Russ, C., Luo, S., Babiarz, J.E., et al. (2010). Mammalian microRNAs: experimental evaluation of novel and previously annotated genes. Genes & Development 24, 992-1009. Davis, E., Caiment, F., Tordoir, X., Cavaille, J., Ferguson-Smith, A., Cockett, N., Georges, M., and Charlier, C. (2005). RNAi-mediated allelic trans-interaction at the imprinted Rtl1/Peg11 locus. Current Biology 15, 743-749. Denzler, R., Agarwal, V., Stefano, J., Bartel, D.P., and Stoffel, M. (2014). Assessing the ceRNA Hypothesis with Quantitative Measurements of miRNA and Target Abundance. Molecular Cell 54, 766-776. Du, P., Kibbe, W.A., and Lin, S.M. (2008). lumi: a pipeline for processing Illumina microarray. Bioinformatics 24, 1547-1548. Eichhorn, S.W., Guo, H.L., McGeary, S.E., Rodriguez-Mias, R.A., Shin, C., Baek, D., Hsu, S.H., Ghoshal, K., Villen, J., and Bartel, D.P. (2014). mRNA Destabilization Is the Dominant Effect of Mammalian MicroRNAs by the Time Substantial Repression Ensues. Molecular Cell 56, 104-115. Elkon, R., and Agami, R. (2008).
Removal of AU bias from microarray mRNA expression data enhances computational identification of active microRNAs. PLoS Comput Biol 4, e1000189. Erhard, F., Haas, J., Lieber, D., Malterer, G., Jaskiewicz, L., Zavolan, M., Dolken, L., and Zimmer, R. (2014). Widespread context dependency of microRNA-mediated regulation. Genome Res 24, 906-919. Eulalio, A., Huntzinger, E., and Izaurralde, E. (2008). GW182 interaction with Argonaute is essential for miRNA-mediated translational repression and mRNA decay. Nat Struct Mol Biol 15, 346-353. Farh, K.K., Grimson, A., Jan, C., Lewis, B.P., Johnston, W.K., Lim, L.P., Burge, C.B., and Bartel, D.P. (2005). The widespread impact of mammalian MicroRNAs on mRNA repression and evolution. Science 310, 1817-1821. Flicek, P., Amode, M.R., Barrell, D., Beal, K., Billis, K., Brent, S., Carvalho-Silva, D., Clapham, P., Coates, G., Fitzgerald, S., et al. (2014). Ensembl 2014. Nucleic Acids Res 42, D749-755. Friedersdorf, M.B., and Keene, J.D. (2014). Advancing the functional utility of PAR-CLIP by quantifying background binding to mRNAs and lncRNAs. Genome Biol 15, R2. Friedman, R.C., Farh, K.K., Burge, C.B., and Bartel, D.P. (2009). Most mammalian mRNAs are conserved targets of microRNAs. Genome Research 19, 92-105. Gaidatzis, D., Nimwegen, E., Hausser, J., and Zavolan, M. (2007). Inference of miRNA targets using evolutionary conservation and pathway analysis. BMC Bioinformatics 8, 248. Garcia, D.M., Baek, D., Shin, C., Bell, G.W., Grimson, A., and Bartel, D.P. (2011). Weak seed-pairing stability and high target-site abundance decrease the proficiency of lsy-6 and other microRNAs. Nat Struct Mol Biol 18, 1139-1146. Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80.
Giraldez, A.J., Mishima, Y., Rihel, J., Grocock, R.J., Van Dongen, S., Inoue, K., Enright, A.J., and Schier, A.F. (2006). Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Science 312, 75-79. Grant, C.E., Bailey, T.L., and Noble, W.S. (2011). FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017-1018. Griffiths-Jones, S., Saini, H.K., van Dongen, S., and Enright, A.J. (2008). miRBase: tools for microRNA genomics. Nucleic Acids Res 36, D154-158. Grimson, A., Farh, K.K., Johnston, W.K., Garrett-Engele, P., Lim, L.P., and Bartel, D.P. (2007). MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Molecular Cell 27, 91-105. Grosswendt, S., Filipchyk, A., Manzano, M., Klironomos, F., Schilling, M., Herzog, M., Gottwein, E., and Rajewsky, N. (2014). Unambiguous Identification of miRNA:Target Site Interactions by Different Types of Ligation Reactions. Molecular Cell. Gu, S., Jin, L., Zhang, F.J., Sarnow, P., and Kay, M.A. (2009). Biological basis for restriction of microRNA targets to the 3 ' untranslated region in mammalian mRNAs. Nat Struct Mol Biol 16, 144-150. Gumienny, R., and Zavolan, M. (2015). Accurate transcriptome-wide prediction of microRNA targets and small interfering RNA off-targets with MIRZA-G. Nucleic Acids Res. Guo, H., Ingolia, N.T., Weissman, J.S., and Bartel, D.P. (2010). Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature 466, 835-840. Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., Rothballer, A., Ascano, M., Jungkamp, A.C., Munschauer, M., et al. (2010). Transcriptome-wide Identification of RNA-Binding Protein and MicroRNA Target Sites by PAR-CLIP. Cell 141, 129-141. Harrow, J., Frankish, A., Gonzalez, J.M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B.L., Barrell, D., Zadissa, A., Searle, S., et al. (2012). GENCODE: the reference human genome annotation for The ENCODE Project. Genome Research 22, 1760-1774. 
Hausser, J., Landthaler, M., Jaskiewicz, L., Gaidatzis, D., and Zavolan, M. (2009). Relative contribution of sequence and structure features to the mRNA binding of Argonaute/EIF2C-miRNA complexes and the degradation of miRNA targets. Genome Research 19, 2009-2020. Hausser, J., and Zavolan, M. (2014). Identification and consequences of miRNA-target interactions--beyond repression of gene expression. Nat Rev Genet 15, 599-612. Helwak, A., Kudla, G., Dudnakova, T., and Tollervey, D. (2013). Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding. Cell 153, 654-665. Hinrichs, A.S., Karolchik, D., Baertsch, R., Barber, G.P., Bejerano, G., Clawson, H., Diekhans, M., Furey, T.S., Harte, R.A., Hsu, F., et al. (2006). The UCSC Genome Browser Database: update 2006. Nucleic Acids Res 34, D590-598. Jackson, A.L., Burchard, J., Leake, D., Reynolds, A., Schelter, J., Guo, J., Johnson, J.M., Lim, L., Karpilow, J., Nichols, K., et al. (2006a). Position-specific chemical modification of siRNAs reduces "off-target" transcript silencing. RNA 12, 1197-1205. Jackson, A.L., Burchard, J., Schelter, J., Chau, B.N., Cleary, M., Lim, L., and Linsley, P.S. (2006b). Widespread siRNA "off-target" transcript silencing mediated by seed region sequence complementarity. RNA 12, 1179-1187. Jan, C.H., Friedman, R.C., Ruby, J.G., and Bartel, D.P. (2011). Formation, regulation and evolution of Caenorhabditis elegans 3'UTRs. Nature. Jaskiewicz, L., Bilen, B., Hausser, J., and Zavolan, M. (2012). Argonaute CLIP--a method to identify in vivo targets of miRNAs. Methods 58, 106-112. Jones-Rhoades, M.W., and Bartel, D.P. (2004). Computational identification of plant MicroRNAs and their targets, including a stress-induced miRNA. Molecular Cell 14, 787-799. Karginov, F.V., Cheloufi, S., Chong, M.M.W., Stark, A., Smith, A.D., and Hannon, G.J. (2010). Diverse Endonucleolytic Cleavage Sites in the Mammalian Transcriptome Depend upon MicroRNAs, Drosha, and Additional Nucleases.
Molecular Cell 38, 781-788. Karolchik, D., Barber, G.P., Casper, J., Clawson, H., Cline, M.S., Diekhans, M., Dreszer, T.R., Fujita, P.A., Guruvadoo, L., Haeussler, M., et al. (2014). The UCSC Genome Browser database: 2014 update. Nucleic Acids Research 42, D764- D770. Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. (2002). The human genome browser at UCSC. Genome Research 12, 996-1006. Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U., and Segal, E. (2007). The role of site accessibility in microRNA target recognition. Nat Genet 39, 1278-1284. Khan, A.A., Betel, D., Miller, M.L., Sander, C., Leslie, C.S., and Marks, D.S. (2009). Transfection of small RNAs globally perturbs gene regulation by endogenous microRNAs. Nature Biotechnology 27, 549-555. Khorshid, M., Hausser, J., Zavolan, M., and van Nimwegen, E. (2013). A biophysical miRNA-mRNA interaction model infers canonical and noncanonical targets. Nat Methods 10, 253-255. Kishore, S., Jaskiewicz, L., Burger, L., Hausser, J., Khorshid, M., and Zavolan, M. (2011). A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nat Methods 8, 559-564. Kloosterman, W.P., and Plasterk, R.H.A. (2006). The diverse functions of MicroRNAs in animal development and disease. Developmental Cell 11, 441-450. Kozomara, A., and Griffiths-Jones, S. (2014). miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Research 42, D68-D73. Krek, A., Grun, D., Poy, M.N., Wolf, R., Rosenberg, L., Epstein, E.J., MacMenamin, P., da Piedade, I., Gunsalus, K.C., Stoffel, M., et al. (2005). Combinatorial microRNA target predictions. Nat Genet 37, 495-500. Krutzfeldt, J., Rajewsky, N., Braich, R., Rajeev, K.G., Tuschl, T., Manoharan, M., and Stoffel, M. (2005). Silencing of microRNAs in vivo with 'antagomirs'. Nature 438, 685-689. 
Lal, A., Navarro, F., Maher, C.A., Maliszewski, L.E., Yan, N., O'Day, E., Chowdhury, D., Dykxhoorn, D.M., Tsai, P., Hofmann, O., et al. (2009). miR-24 Inhibits cell proliferation by targeting E2F2, MYC, and other cell-cycle genes via binding to "seedless" 3'UTR microRNA recognition elements. Molecular Cell 35, 610-625. Lambert, N., Robertson, A., Jangi, M., McGeary, S., Sharp, P.A., and Burge, C.B. (2014). RNA Bind-n-Seq: quantitative assessment of the sequence and structural binding specificity of RNA binding proteins. Mol Cell 54, 887-900. Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., and Irizarry, R.A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11, 733-739. Lewis, B.P., Burge, C.B., and Bartel, D.P. (2005). Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15-20. Lewis, B.P., Shih, I.H., Jones-Rhoades, M.W., Bartel, D.P., and Burge, C.B. (2003). Prediction of mammalian microRNA targets. Cell 115, 787-798. Lianoglou, S., Garg, V., Yang, J.L., Leslie, C.S., and Mayr, C. (2013). Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression. Genes & Development 27, 2380-2396. Lim, L.P., Lau, N.C., Garrett-Engele, P., Grimson, A., Schelter, J.M., Castle, J., Bartel, D.P., Linsley, P.S., and Johnson, J.M. (2005). Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 433, 769-773. Linsley, P.S., Schelter, J., Burchard, J., Kibukawa, M., Martin, M.M., Bartz, S.R., Johnson, J.M., Cummins, J.M., Raymond, C.K., Dai, H., et al. (2007). Transcripts targeted by the microRNA-16 family cooperatively regulate cell cycle progression. Mol Cell Biol 27, 2240-2252. Lipchina, I., Elkabetz, Y., Hafner, M., Sheridan, R., Mihailovic, A., Tuschl, T., Sander, C., Studer, L., and Betel, D. (2011).
Genome-wide identification of microRNA targets in human ES cells reveals a role for miR-302 in modulating BMP response. Genes & Development 25, 2173-2186. Liu, H., Yue, D., Chen, Y., Gao, S.J., and Huang, Y. (2010). Improving performance of mammalian microRNA target prediction. BMC Bioinformatics 11, 476. Loeb, G.B., Khan, A.A., Canner, D., Hiatt, J.B., Shendure, J., Darnell, R.B., Leslie, C.S., and Rudensky, A.Y. (2012). Transcriptome-wide miR-155 Binding Map Reveals Widespread Noncanonical MicroRNA Targeting. Molecular Cell 48, 760-770. Long, D., Lee, R., Williams, P., Chan, C.Y., Ambros, V., and Ding, Y. (2007). Potent effect of target structure on microRNA function. Nat Struct Mol Biol 14, 287-294. Majoros, W.H., Lekprasert, P., Mukherjee, N., Skalsky, R.L., Corcoran, D.L., Cullen, B.R., and Ohler, U. (2013). MicroRNA target site identification by integrating sequence and binding information. Nat Methods 10, 630-633. Marin, R.M., Sulc, M., and Vanicek, J. (2013). Searching the coding region for microRNA targets. RNA 19, 467-474. Mayr, C., and Bartel, D.P. (2009). Widespread shortening of 3'UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 138, 673-684. Mevik, B.H., and Wehrens, R. (2007). The pls package: Principal component and partial least squares regression in R. Journal of Statistical Software 18. Miranda, K.C., Huynh, T., Tay, Y., Ang, Y.S., Tam, W.L., Thomson, A.M., Lim, B., and Rigoutsos, I. (2006). A pattern-based method for the identification of microRNA binding sites and their corresponding heteroduplexes. Cell 126, 1203-1217. Miura, P., Shenker, S., Andreu-Agullo, C., Westholm, J.O., and Lai, E.C. (2013). Widespread and extensive lengthening of 3' UTRs in the mammalian brain. Genome Res 23, 812-825. Morin, R.D., O'Connor, M.D., Griffith, M., Kuchenbauer, F., Delaney, A., Prabhu, A.L., Zhao, Y., McDonald, H., Zeng, T., Hirst, M., et al. (2008).
Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res 18, 610-621. Nakanishi, K., Weinberg, D.E., Bartel, D.P., and Patel, D.J. (2012). Structure of yeast Argonaute with guide RNA. Nature 486, 368-374. Nam, J.W., Rissland, O.S., Koppstein, D., Abreu-Goodger, C., Jan, C.H., Agarwal, V., Yildirim, M.A., Rodriguez, A., and Bartel, D.P. (2014). Global Analyses of the Effect of Different Cellular Contexts on MicroRNA Targeting. Molecular Cell 53, 1031-1043. Nielsen, C.B., Shomron, N., Sandberg, R., Hornstein, E., Kitzman, J., and Burge, C.B. (2007). Determinants of targeting by endogenous and exogenous microRNAs and siRNAs. RNA 13, 1894-1910. Pillai, R.S., Artus, C.G., and Filipowicz, W. (2004). Tethering of human Ago proteins to mRNA mimics the miRNA-mediated repression of protein synthesis. RNA 10, 1518-1525. Pruitt, K.D., Tatusova, T., Brown, G.R., and Maglott, D.R. (2012). NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res 40, D130-135. Quinlan, A.R., and Hall, I.M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842. Reczko, M., Maragkakis, M., Alexiou, P., Grosse, I., and Hatzigeorgiou, A.G. (2012). Functional microRNA targets in protein coding sequences. Bioinformatics 28, 771-776. Reinhart, B.J., Slack, F.J., Basson, M., Pasquinelli, A.E., Bettinger, J.C., Rougvie, A.E., Horvitz, H.R., and Ruvkun, G. (2000). The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature 403, 901-906. Robins, H., Li, Y., and Padgett, R.W. (2005). Incorporating structure to predict microRNA targets. Proc Natl Acad Sci USA 102, 4006-4009. Robins, H., and Press, W.H. (2005). Human microRNAs target a functionally distinct population of genes with AT-rich 3' UTRs. Proc Natl Acad Sci USA 102, 15557-15562. Robinson, M.D., McCarthy, D.J., and Smyth, G.K. (2010).
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140. Rodriguez, A., Vigorito, E., Clare, S., Warren, M.V., Couttet, P., Soond, D.R., van Dongen, S., Grocock, R.J., Das, P.P., Miska, E.A., et al. (2007). Requirement of bic/microRNA-155 for normal immune function. Science 316, 608-611. Saito, T., and Satrom, P. (2012). Target gene expression levels and competition between transfected and endogenous microRNAs are strong confounding factors in microRNA high-throughput experiments. Silence 3, 3. Sandberg, R., Neilson, J.R., Sarma, A., Sharp, P.A., and Burge, C.B. (2008). Proliferating cells express mRNAs with shortened 3' untranslated regions and fewer microRNA target sites. Science 320, 1643-1647. Schirle, N.T., and MacRae, I.J. (2012). The crystal structure of human Argonaute2. Science 336, 1037-1040. Schirle, N.T., Sheu-Gruttadauria, J., and MacRae, I.J. (2014). Structural basis for microRNA targeting. Science 346, 608-613. Schwarz, D.S., Ding, H.L., Kennington, L., Moore, J.T., Schelter, J., Burchard, J., Linsley, P.S., Aronin, N., Xu, Z.S., and Zamore, P.D. (2006). Designing siRNA that distinguish between genes that differ by a single nucleotide. PLoS Genetics 2, 1307-1318. Selbach, M., Schwanhausser, B., Thierfelder, N., Fang, Z., Khanin, R., and Rajewsky, N. (2008). Widespread changes in protein synthesis induced by microRNAs. Nature 455, 58-63. Shin, C., Nam, J.W., Farh, K.K.H., Chiang, H.R., Shkumatava, A., and Bartel, D.P. (2010). Expanding the MicroRNA Targeting Code: Functional Sites with Centered Pairing. Molecular Cell 38, 789-802. Siepel, A., and Haussler, D. (2004). Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol 21, 468-488. Smyth, G.K. (2004). Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3, Article3. Smyth, G.K. (2005).
Limma: linear models for microarray data. In Bioinformatics and computational biology solutions using R and Bioconductor (Springer), pp. 397-420. Stefani, G., and Slack, F.J. (2008). Small non-coding RNAs in animal development. Nature Reviews Molecular Cell Biology 9, 219-230. Sturm, M., Hackenberg, M., Langenberger, D., and Frishman, D. (2010). TargetSpy: a supervised machine learning approach for microRNA target prediction. BMC Bioinformatics 11. Tafer, H., Ameres, S.L., Obernosterer, G., Gebeshuber, C.A., Schroeder, R., Martinez, J., and Hofacker, I.L. (2008). The impact of target site accessibility on the design of effective siRNAs. Nature Biotechnology 26, 578-583. Tan, S.M., Kirchner, R., Jin, J., Hofmann, O., McReynolds, L., Hide, W., and Lieberman, J. (2014). Sequencing of Captive Target Transcripts Identifies the Network of Regulated Genes and Functions of Primate-Specific miR-522. Cell Reports 8, 1225-1239. Team, R.C. (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing Vienna, Austria. Tian, B., Hu, J., Zhang, H., and Lutz, C.S. (2005). A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res 33, 201-212. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R.B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520-525. Tsang, J.S., Ebert, M.S., and van Oudenaarden, A. (2010). Genome-wide Dissection of MicroRNA Functions and Cotargeting Networks Using Gene Set Signatures. Molecular Cell 38, 140-153. Ulitsky, I., Shkumatava, A., Jan, C.H., Subtelny, A.O., Koppstein, D., Bell, G.W., Sive, H., and Bartel, D.P. (2012). Extensive alternative polyadenylation during zebrafish development. Genome Res 22, 2054-2066. Vejnar, C.E., and Zdobnov, E.M. (2012). MiRmap: comprehensive prediction of microRNA target repression strength. Nucleic Acids Res 40, 11673-11683.
Venables, W.N., and Ripley, B.D. (2002). Modern applied statistics with S, 4th edn (New York: Springer). Wang, X. (2014). Composition of seed sequence is a major determinant of microRNA targeting patterns. Bioinformatics 30, 1377-1383. Wang, X.W. (2008). miRDB: A microRNA target prediction and functional annotation database with a wiki interface. RNA 14, 1012-1017. Wang, X.W., and El Naqa, I.M. (2008). Prediction of both conserved and nonconserved microRNA targets in animals. Bioinformatics 24, 325-332. Wen, J., Parker, B.J., Jacobsen, A., and Krogh, A. (2011). MicroRNA transfection and AGO-bound CLIP-seq data sets reveal distinct determinants of miRNA action. RNA 17, 820-834. Wu, H., Ye, C., Ramirez, D., and Manjunath, N. (2009). Alternative processing of primary microRNA transcripts by Drosha generates 5' end variation of mature microRNA. PLoS One 4, e7566. Wu, Z.J., Irizarry, R.A., Gentleman, R., Martinez-Murillo, F., and Spencer, F. (2004). A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association 99, 909-917. Yekta, S., Shih, I.H., and Bartel, D.P. (2004). MicroRNA-directed cleavage of HOXB8 mRNA. Science 304, 594-596.

Figures and figure legends

Figure 1. Inefficacy of recently reported non-canonical sites. (A) Response of mRNAs to the loss of miRNAs, comparing mRNAs that contain either a canonical or nucleation-bulge site to miR-430 to those that do not contain a miR-430 site. Plotted are cumulative distributions of mRNA fold changes observed when comparing embryos that lack miRNAs (MZDicer) to those that have miRNAs (WT), focusing on mRNAs possessing a single site of the indicated type in their 3′ UTR. Similarity of site-containing distributions to the no-site distribution was tested [one-sided Kolmogorov–Smirnov (K–S) test, P values]; the number of mRNAs analyzed in each category is listed in parentheses. See also Figure 1–figure supplement 1C and Figure 1–figure supplement 4A.
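The comparison of a site-containing distribution with the no-site distribution described in panel (A) can be sketched as follows. This is a minimal pure-Python illustration and not the analysis code used for the figure: the function name `ks_one_sided` and the toy fold-change values are hypothetical, and the P value uses only the leading-order asymptotic approximation for the one-sided statistic, so it should be read as approximate.

```python
import math
from bisect import bisect_right


def ks_one_sided(site, no_site):
    """One-sided two-sample K-S test (illustrative sketch).

    Tests whether fold changes in `site` tend to be larger than those in
    `no_site`, i.e. whether the site-containing cumulative distribution
    lies below (to the right of) the no-site cumulative distribution.
    Returns (D_plus, approximate_p).
    """
    a = sorted(site)
    b = sorted(no_site)
    n, m = len(a), len(b)
    d_plus = 0.0
    # Evaluate both empirical CDFs at every observed value.
    for x in sorted(set(a) | set(b)):
        cdf_site = bisect_right(a, x) / n
        cdf_no_site = bisect_right(b, x) / m
        d_plus = max(d_plus, cdf_no_site - cdf_site)
    # Leading-order asymptotic approximation (assumption: large samples).
    p = math.exp(-2.0 * d_plus ** 2 * n * m / (n + m))
    return d_plus, p


# Hypothetical log2 fold changes (MZDicer / WT) for two small cohorts.
site_fc = [0.5, 0.6, 0.7, 0.8]
no_site_fc = [0.0, 0.1, 0.2, 0.3]
d, p = ks_one_sided(site_fc, no_site_fc)
print(f"D+ = {d:.2f}, approximate one-sided P = {p:.3g}")
```

With real cohorts of hundreds or thousands of mRNAs the asymptotic approximation is more appropriate than it is for this toy input.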
(B and C) Response of mRNAs to the loss of miR-155, focusing on mRNAs that contain either a single canonical or ≥1 CLIP-supported non-canonical site to miR-155. These panels are as in (A), but compare fold changes for mRNAs with the indicated site type following genetic ablation of mir-155 in either T cells (B) or Th1 cells (C). See also Figure 1–figure supplement 2 and Figure 1–figure supplement 4B. (D and E) Response of mRNAs to the knockdown of miR-92, focusing on mRNAs that contain either a single canonical or ≥1 CLASH-identified non-canonical site to miR-92. These panels are as in (A), except CLASH-supported non-canonical sites were the same as those defined previously (Helwak et al., 2013) and thus were permitted to reside in any region of the mature mRNA, and these panels compare fold changes for mRNAs with the indicated site type following either knockdown of miR-92 (D) or combined knockdown of miR-92 and 24 other miRNAs (E) in HEK293 cells. See also Figure 1–figure supplements 3A–B. (F) As in (D), but focusing on mRNAs that contain ≥1 chimera-identified site. See also Figure 1–figure supplements 3C–E. (G) Response of mRNAs to the transfection of 16 miRNAs, focusing on mRNAs that contain either a canonical or MIRZA-predicted non-canonical site. This panel is as in (A), but compares the fold changes for mRNAs with the indicated site type after introducing miRNAs, aggregating results from 16 individual transfection datasets. Fold changes are plotted for the top 100 non-canonical predictions for each of 16 miRNAs compiled either before (MIRZA, top 100) or after (MIRZA, no 6mers) removing mRNAs containing 6mer or offset 6mer 3′-UTR sites. (H) Response of mRNAs to a transfection of miR-522, focusing on mRNAs that contain either a single canonical or ≥1 IMPACT-seq-supported non-canonical site to miR-522.
This panel is as in (A), except IMPACT-seq-supported non-canonical sites were the same as those defined previously (Tan et al., 2014) and thus were permitted in any region of the mature mRNA.

Figure 1–figure supplement 1. Inefficacy of nucleation-bulge sites. (A and B) These panels are as in Figure 1A but compare the response of cognate site-containing mRNAs in a compendium of either 11 miRNA transfection datasets (A) or 74 sRNA transfection datasets (B). The datasets were pre-processed (Figure 3) and are provided in Supplementary file 1. (C) This panel is as in Figure 1A but compares the response of mRNAs in MZDicer embryos in which miR-430 has been injected. (D–F) These panels are as in Figure 1A but compare the response of mRNAs with the indicated miR-124 site types after transfecting miR-124 into either HEK293 cells (D), HeLa cells (E), or Huh7 cells (F).

Figure 1–figure supplement 2. Inefficacy of CLIP-supported non-canonical miR-155 sites. (A and B) These panels are as in Figure 1B but compare the response of mRNAs after genetic ablation of miR-155 in Type 2 helper T cells (Th2, A) or B cells (B).

Figure 1–figure supplement 3. Inefficacy of CLASH- and chimera-supported non-canonical sites. (A–D) These panels are as in Figure 1D but compare the response of mRNAs with sites cognate to any one of four miRNA families (miR-15/16, miR-19, miR-17/20/93/106, or miR-25/92), for either all CLASH-supported targets (A), mRNAs with CLASH-supported 3′-UTR sites (B), all chimera-supported targets (C), or mRNAs with chimera-supported 3′-UTR sites (D). These four miRNA families were chosen because their predicted targets were the most responsive to knockdown of the 25 miRNAs. P values reflect the median P value (as evaluated by a K–S test) across 100 trials in which a no-site control cohort with matched 3′-UTR lengths was chosen for each site-containing distribution.
Length-matched no-site controls were required for this analysis because longer 3′ UTRs had a greater chance of containing additional sites to at least one of the many miRNAs that were knocked down, and thus had a greater chance of being derepressed as a result of interactions otherwise not considered in the analysis. To populate each control cohort, 500 different no-site mRNAs were chosen, considering the 3′-UTR length of each site-containing mRNA and selecting (without replacement) control mRNAs from among the 10 no-site mRNAs with the most similar 3′-UTR lengths. Shown is the response of a control cohort for mRNAs containing non-canonical sites. mRNAs with 3′ UTRs >2000 nt were excluded from the analysis because so many of the 3′ UTRs >2000 nt had a site to at least one of the four miRNA families, making it impossible to select appropriate length-matched controls. (E) This panel is as in Figure 1F but compares the response of mRNAs with the indicated miR-302 site types after knocking down miR-302/367 in hESCs.

Figure 1–figure supplement 4. Inefficacy of non-canonical sites in mediating translational repression. (A) This panel is as in Figure 1A but compares the response of mRNAs using ribosome footprint profiling (Bazzini et al., 2012), which captures changes in both mRNA stability and translational efficiency through the high-throughput sequencing of ribosome-protected mRNA fragments (RPFs). (B) This panel is as in Figure 1–figure supplement 2B but compares fold changes in RPFs after genetic ablation of miR-155 in B cells. (C) This panel is as in (B) but compares protein fold changes for chimera-supported targets, as evaluated by pulsed SILAC (Selbach et al., 2008) after transfection of miR-155 in HeLa cells.

Figure 1–figure supplement 5. Re-evaluating conservation of chimera-supported non-canonical sites. (A) Conservation of chimera-supported non-canonical sites detected in an analysis modeled after that of Grosswendt et al.
(2014) but modified to control for background conservation. Plotted for the indicated miRNAs is the average conservation of chimera-supported non-canonical sites, as measured by branch-length score (BLS), compared to the average conservation of 100 equally sized cohorts of controls; error bars, standard deviation of cohort averages; **, P < 0.01; *, P < 0.05, one-sided Z test. We considered chimera-supported non-canonical sites that mapped within 3′ UTRs and contained a single mismatch to the 6 nt seed of the miRNA. This set of sites mirrored that analyzed previously (Grosswendt et al., 2014), and excluded offset 6mers, which as a class was already known to mediate repression and exhibit preferential conservation (Friedman et al., 2009). Cohorts of control sites were generated such that for each chimera-supported site, each control cohort contained a single example of the identical 6 nt motif that was present in the indicated region (either an AGO cluster or 3′ UTR) but not supported by chimeric reads. To control for local background conservation and thereby avoid treating sites within slowly evolving 3′ UTRs the same as those within rapidly evolving 3′ UTRs, we used the binning procedure developed for calculating PCT scores (Friedman et al., 2009); 3′ UTRs were partitioned into 10 conservation bins (based on the median BLS of the nucleotides of the human sequence), and control sites were randomly selected (with replacement) from 3′ UTRs in the same bin as the actual site. Control AGO clusters were collected as was done previously (Grosswendt et al., 2014), using genome-wide data downloaded from clipz.unibas.ch and derived from multiple AGO PAR-CLIP experiments performed in HEK293 cells (Kishore et al., 2011). The union of AGO clusters for all experiments was computed and filtered for overlap with Ensembl-annotated 3′ UTRs, using the “merge” and “intersectBED” utilities, respectively, found in BEDTools v2.20.1 (parameter “-s”) (Quinlan and Hall, 2010).
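The control-cohort procedure of panel (A) can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used for the analysis: the function names, the `(motif, bin)` keying of control sites, and all BLS values are hypothetical, and the step that assigns 3′ UTRs to the 10 conservation bins is assumed to have been done beforehand.

```python
import math
import random
from statistics import mean, stdev


def control_cohort_means(site_keys, controls_by_key, n_cohorts=100, seed=0):
    """Average BLS of each control cohort (illustrative sketch).

    site_keys: one (motif, conservation_bin) key per chimera-supported site.
    controls_by_key: maps each key to the BLS values of candidate control
    sites sharing that motif in 3' UTRs of the same conservation bin.
    """
    rng = random.Random(seed)
    cohort_means = []
    for _ in range(n_cohorts):
        # One control per actual site, drawn with replacement from its bin.
        cohort = [rng.choice(controls_by_key[key]) for key in site_keys]
        cohort_means.append(mean(cohort))
    return cohort_means


def one_sided_z_test(observed_mean, cohort_means):
    """Z score and upper-tail normal P value for the observed mean BLS."""
    z = (observed_mean - mean(cohort_means)) / stdev(cohort_means)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p


# Hypothetical toy data: two sites, each with a small pool of controls.
site_keys = [("GCACTT", 3), ("GCACTT", 7)]
controls_by_key = {
    ("GCACTT", 3): [0.2, 0.4, 0.6],
    ("GCACTT", 7): [1.0, 1.4, 1.8],
}
cohorts = control_cohort_means(site_keys, controls_by_key)
z, p = one_sided_z_test(observed_mean=1.5, cohort_means=cohorts)
print(f"z = {z:.2f}, one-sided P = {p:.3g}")
```

Keying the controls by both motif and bin keeps each control matched to its site on sequence and on background conservation, which is the point of the binning step.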
(B) Attribution of the conservation signal to the confounding effects of conserved regions. Considered are 1443 non-canonical chimera-supported sites selected as in (A) but including sites of all miRNA families. For each chimera-supported site, a z score was generated using the distribution of BLSs for 100 control sites chosen as in panel (A) from either AGO clusters or 3′ UTRs, as indicated. Each z score reflected how the conservation of the actual site differed from that of its controls. Compared are cumulative distributions of the z scores for sites of broadly conserved miRNAs and those of less conserved miRNAs. If the chimera-supported non-canonical sites were preferentially conserved because of their function in mediating repression, then sites of broadly conserved miRNAs would be expected to have a right-shifted distribution compared to sites of less conserved miRNAs, as explained in the next paragraph. However, no significant difference was discerned between each pair of z-score distributions. One way to reconcile the conservation signal observed in panel (A) with our conclusion that a large majority if not all of these sites bind miRNA but do not mediate repression is to consider the potentially confounding biochemical properties of conserved regions, which are illustrated by the observation that artificial siRNAs preferentially target sites that are evolutionarily conserved over those that are not (Nielsen et al., 2007). Because these siRNAs are not natural (and do not share a seed with conserved miRNAs) the evolutionary conservation of these preferred sites could not have arisen because they function to mediate sRNA-guided repression. Instead, some other function of these 3′-UTR regions, such as greater accessibility to RNA-binding factors, must explain their preferential conservation and also endow them with properties that favor sRNA binding (Nielsen et al., 2007).
To examine whether confounding properties of conserved 3′-UTR regions might similarly explain the elevated conservation of chimera-supported sites, we compared the z scores for sites bound by broadly conserved miRNAs (miRNAs in families conserved beyond mammals, as listed in TargetScan7) with those bound by less conserved miRNAs. MicroRNAs conserved among mammals but not more broadly were grouped with the less conserved miRNAs because canonical 6mer and 7mer sites to these miRNAs have no conservation signal above background, presumably because these miRNAs have not been present long enough for the number of preferentially conserved 6mer and 7mer sites to rise above the background (Friedman et al., 2009). We reasoned that the same would be true of non-canonical sites, to the extent that any are preferentially conserved. If the conservation signal observed in panel (A) were related to miRNA binding, we would have expected a difference between the scores for the sites of broadly and less conserved miRNAs. The lack of a significant difference supports the idea that chimera-supported non-canonical sites tend to be conserved for the same reason that functional sites to artificial siRNAs tend to be conserved.

Figure 2. Confirmation of experimentally identified non-canonical miRNA binding sites. (A) Sequence logos corresponding to motifs enriched in dCLIP clusters that either appear following transfection of miR-124 into HeLa cells (Chi et al., 2009) (top) or disappear following knockout of miR-155 in T cells (Loeb et al., 2012) (bottom). Shown to the right of each logo is its E-value among clusters lacking a seed-matched or offset 6mer canonical site and the fraction of these clusters that matched the logo. Shown below each logo are the complementary regions of the cognate miRNA family, highlighting nucleotides 2–8 in capital letters.
(B) Position of the top-ranked motif corresponding to non-canonical sites enriched in CLASH (Helwak et al., 2013) (left) or chimera (Grosswendt et al., 2014) (right) data for each human miRNA family supported by at least 50 interactions without a seed-matched or offset 6mer canonical site. For each family the most enriched logo was aligned to the reverse complement of the miRNA. In cases in which a logo mapped to multiple positions along the miRNA, the positions with the best and second best scores are indicated (red and blue, respectively). (C) Sequence logos of motifs enriched in chimera interactions that lack canonical sites. As in (A), but displaying sequence logos identified in the chimera data of part (B) for a sample of nine human miRNAs. Logos identified for the other human miRNAs are also provided (Figure 2–figure supplement 1B). A nucleotide that differs between miRNA family members is indicated as a black “n”.

Figure 2–figure supplement 1. Comparison of CLASH and chimera data and identification of motifs enriched in human chimera interactions that lack canonical sites. (A) Comparison of CLASH (left) and chimera (right) reads from human cells, showing the proportion possessing a canonical site (blue) and overlapping 3′ UTRs (red). In total, 18,514 CLASH and 10,567 chimera interactions were analyzed. (B) Sequence logos of motifs enriched in chimera interactions that lack canonical sites. This panel is as in Figure 2C but displays the remaining motifs identified from the chimera data analyzed in Figure 2B. In cases of alignment ambiguity, both alignments are shown below the logo. For some miRNA families, multiple motifs were significantly enriched (E ≤ 0.001) and are shown separately. Significantly enriched motifs (or a top-ranked motif matching the miRNA) were not found for miR-21, and miR-3168 was excluded from the analysis due to poor support for its authenticity as a miRNA.
(C) Sequence logos of motifs that do not match the cognate miRNA but are nonetheless enriched in miR-124 dCLIP (Chi et al., 2009) and miR-522 IMPACT-seq (Tan et al., 2014) clusters that lack canonical sites to the miRNA. The miR-124 logo was nearly identical to a non-specific motif previously identified as enriched in CLIP data from the mouse brain (Chi et al., 2012). The miR-522 logo was found instead of the previously reported miRNA-matching logo (Tan et al., 2014).

Figure 2–figure supplement 2. Identification of motifs enriched in mouse and nematode chimera interactions that lack canonical sites. (A) Sequence logos of motifs enriched in M. musculus chimera interactions that lack canonical sites; otherwise as in Figure 2C. Significantly enriched motifs (or a top-ranked motif matching the miRNA) were not found for let-7 and miR-142-3p. (B) Sequence logos of motifs enriched in C. elegans chimera interactions that lack canonical sites; otherwise as in Figure 2C. Significantly enriched motifs (or a top-ranked motif matching the miRNA) were not found for miR-1.

Figure 3. Pre-processing the microarray datasets to minimize nonspecific effects and technical biases. (A) Example of the correlated response of mRNAs after transfecting two unrelated sRNAs (sRNA 1 and 2, respectively). Results for mRNAs containing at least one canonical 7–8 nt 3′-UTR site for either sRNA 1, sRNA 2, or both sRNAs are highlighted in red, blue, and green, respectively. Values for mRNAs without such sites are in grey. All mRNAs were used to calculate the Spearman correlation (rs). (B) Correlated responses observed in a compendium of 74 transfection experiments from six studies (colored as indicated in the publications list). For each pair of experiments, the rs value was calculated as in panel (A), colored as indicated in the key, and used for hierarchical clustering.
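The pairwise Spearman correlations of panels (A) and (B) can be sketched as follows. This is a minimal pure-Python illustration, assuming that every experiment reports fold changes for the same mRNAs in the same order; the function names and the toy values are hypothetical, and the hierarchical clustering step is not shown.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(experiments):
    """Map each unordered pair of experiment names to its r_s value."""
    names = sorted(experiments)
    return {
        (a, b): spearman(experiments[a], experiments[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical log2 fold changes for four mRNAs in three transfections.
experiments = {
    "sRNA_1": [-0.4, -0.1, 0.2, 0.5],
    "sRNA_2": [-0.3, 0.1, 0.0, 0.6],
    "sRNA_3": [0.5, 0.2, -0.1, -0.4],
}
for pair, rs in pairwise_spearman(experiments).items():
    print(pair, round(rs, 2))
```

The resulting pairwise values are what a clustering routine would take as input to group experiments with similar nonspecific responses.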
(C) Study-dependent relationships between the responses of mRNAs to the transfected sRNA and either 3′-UTR length or 3′-UTR AU content, focusing on mRNAs without a canonical 7–8 nt 3′-UTR site to the sRNA. Boxplots indicate the median rs (bar), 25th and 75th percentiles (box), and the minimum of either 1.5 times the interquartile range or the most extreme data point (whiskers), with the width of the box proportional to the number of datasets used from each study. The studies are colored as in panel (B), abbreviating the first author and year. (D) Reduced correlation between the responses of mRNAs to unrelated sRNAs after applying the PLSR technique. This panel is as in (A) but plots the normalized mRNA fold changes. (E) Reduced correlations in results of the compendium experiments after applying the PLSR technique. This panel is as in (B) but plots the correlations after normalizing the mRNA fold changes. (F) Reduced study-dependent relationships between mRNA responses and either 3′-UTR length or 3′-UTR AU content. This panel is as in (C) but plots the correlations after normalizing the mRNA fold changes. (G and H) Cumulative distributions of fold changes for mRNAs containing at least one canonical 7–8 nt 3′-UTR site or no site either before normalization (raw) or after normalization (normalized). Panel (G) plots the results from experiments shown in (A) and (D), and (H) plots results from all 74 datasets.

Figure 3–figure supplement 1. Reduced biases from derepression of endogenous miRNA targets. (A) Pie chart reflecting the relative proportions of reads for the indicated miRNA families observed when sequencing small RNAs from HeLa cells. Relative miRNA levels were quantified as described previously (Denzler et al., 2014). (B and C) Cumulative distributions of fold changes for mRNAs with at least one canonical 7–8 nt 3′-UTR site to the indicated miRNA family in the compendium of 74 sRNA transfection datasets, either before (B) or after (C) normalization.
P values were computed using a one-sided Wilcoxon rank-sum test, comparing each of the site-containing distributions to the no-site distribution. This test was a more stringent alternative to the K–S test, which led to highly significant P values for very slight differences, due to the large number of mRNAs in each distribution. To account for multiple hypotheses, an appropriate Bonferroni-corrected significance threshold would be P < 0.005, which was not achieved for any comparison in panel (C).

Figure 4. Developing a regression model to predict miRNA targeting efficacy. (A) Optimizing the scoring of predicted structural accessibility. Predicted RNA structural accessibility scores were computed for variable-length windows within the region centered on each canonical 7–8 nt 3′-UTR site. The heatmap displays the partial correlations between these values and the repression associated with the corresponding sites, determined while controlling for local AU content and other features of the context+ model (Garcia et al., 2011). (B) Performance of the models generated using stepwise regression compared to that of either the context-only or context+ models. Shown are boxplots of r² values for each of the models across all 1000 sampled test sets, for mRNAs possessing a single site of the indicated type. For each site type, all groups significantly differ (P < 10⁻¹⁵, paired Wilcoxon signed-rank test). Boxplots are as in Figure 3C. (C) The contributions of site type and each of the 14 features of the context++ model. For each site type, the coefficients for the multiple linear regression are plotted for each feature. Because features are each scored on a similar scale, the relative contribution of each feature in discriminating between more or less effective sites is roughly proportional to the absolute value of its coefficient. Also plotted are the intercepts, which roughly indicate the discriminatory power of site type.
Dashed bars indicate the 95% confidence intervals of each coefficient.

Figure 4–Source data 1. Coefficients of the trained context++ model corresponding to each site type. Using these coefficients and corresponding scaling factors (Table 3), context++ scores can be computed essentially as illustrated in Supplementary Figure 5 of Garcia et al. (2011).

Feature        8mer      7mer-m8   7mer-A1   6mer
(Intercept)    –0.589    –0.224    –0.195    –0.079
TA_3UTR         0.222     0.139     0.117     0.058
SPS             0.210     0.135     0.095     0.035
sRNA1A         –0.018     0.010    –0.025    –0.002
sRNA1C         –0.021     0.014    –0.021     0.004
sRNA1G          0.060     0.062     0.030     0.018
sRNA8A          0.022     0.004    –0.049    –0.015
sRNA8C          0.012    –0.031     0.033     0.016
sRNA8G          0.015    –0.008    –0.017     0.006
Site8A          N/A       N/A       0.000    –0.002
Site8C          N/A       N/A       0.036     0.015
Site8G          N/A       N/A       0.015     0.012
Local_AU       –0.254    –0.177    –0.075    –0.040
3P_score       –0.040    –0.055    –0.060    –0.024
SA             –0.115    –0.134    –0.077    –0.028
Min_dist        0.118     0.056     0.045     0.036
Len_ORF         0.205     0.100     0.063     0.029
Len_3UTR        0.310     0.154     0.129     0.045
Off6m          –0.020    –0.011    –0.020    –0.010
ORF8m          –0.118    –0.044    –0.058    –0.060
PCT            –0.103    –0.048    –0.048     0.005

Figure 5. Performance of target prediction algorithms on a test set of seven experiments in which miRNAs were individually transfected into HCT116 cells. (A) Average number of targets predicted by the indicated algorithm for each of the seven miRNAs in the test set. The numbers of predictions with at least one canonical 7–8 nt 3′-UTR site to the transfected miRNA (dark blue) are distinguished from the remaining predictions (light blue). Names of algorithms are colored according to whether they consider only sequence or thermodynamic features of site pairing (grey), only site conservation (orange), pairing and contextual features of a site (red), or pairing, contextual features, and site conservation (purple). The most recently updated predictions were downloaded, with the year in which those predictions were released indicated in parentheses.
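Returning to the coefficients listed in Figure 4–Source data 1, the way a score might be assembled for a single 8mer site can be sketched as follows. The sketch assumes the usual multiple-linear-regression form (intercept plus the sum of each coefficient times its scaled feature value). The scaling factors of Table 3 are not reproduced here, so the feature values passed in are hypothetical and assumed to be already scaled; the function name `context_score_8mer` is likewise an illustration rather than part of TargetScan.

```python
# 8mer column of Figure 4-Source data 1 (Site8A/C/G are N/A for 8mer sites).
INTERCEPT_8MER = -0.589
COEF_8MER = {
    "TA_3UTR": 0.222, "SPS": 0.210,
    "sRNA1A": -0.018, "sRNA1C": -0.021, "sRNA1G": 0.060,
    "sRNA8A": 0.022, "sRNA8C": 0.012, "sRNA8G": 0.015,
    "Local_AU": -0.254, "3P_score": -0.040, "SA": -0.115,
    "Min_dist": 0.118, "Len_ORF": 0.205, "Len_3UTR": 0.310,
    "Off6m": -0.020, "ORF8m": -0.118, "PCT": -0.103,
}


def context_score_8mer(scaled_features):
    """Linear-model score for one 8mer site (illustrative sketch).

    scaled_features: dict of feature name -> already-scaled value.
    Features that are omitted contribute nothing; unknown names raise
    KeyError so that typos are not silently ignored.
    """
    score = INTERCEPT_8MER
    for name, value in scaled_features.items():
        score += COEF_8MER[name] * value
    return score


# Hypothetical scaled values for one site (not taken from any dataset).
example_site = {"Local_AU": 0.8, "SA": 0.6, "Len_3UTR": 0.3, "PCT": 0.9}
print(round(context_score_8mer(example_site), 3))
```

More negative scores correspond to greater predicted repression, so in this form features with negative coefficients strengthen the prediction as their scaled values increase.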
(B and C) Extent to which the predictions explain the mRNA fold changes observed in the test set. For predictions tallied in panel (A), the explanatory power, as evaluated by the r² value for the relationship between the scores of the predictions and the observed mRNA fold changes in the test set, is plotted for either mRNAs with 3′ UTRs containing at least one canonical 7–8 nt 3′-UTR site (B) or other mRNAs (C). Algorithms designed to evaluate only targets with seed-matched 7–8 nt 3′-UTR sites are labeled “N/A” in (C).
(D) Repression of the top predictions of the context++ model and of our previous two models, focusing on an average of 16 top predicted targets per miRNA in the test set. The dotted lines indicate the median fold-change value for each distribution, otherwise as in Figure 1A.
(E and F) Median mRNA fold changes observed in the test set for top-ranked predicted targets, considering either all predictions (E) or only those with 3′ UTRs lacking at least one canonical 7–8 nt site (F). For each algorithm listed in panel (A), all reported predictions for the seven miRNAs were ranked according to their scores, and the indicated sliding threshold of top predictions was implemented. For example, at the threshold of 4, the 28 predictions with the top scores were identified (an average of 4 predictions per miRNA, allowing miRNAs with more top scores to contribute more predictions), mRNA fold-change values from the cognate transfections were collected, and the median value was plotted. When the threshold exceeded the number of reported predictions, no value was plotted.
Also plotted are the median mRNA fold change for all mRNAs with at least one cognate canonical 7–8 nt site in their 3′ UTR (dashed line; an average of 1366 mRNAs per miRNA), the median fold change for all mRNAs with at least one conserved cognate canonical 7–8 nt site in their 3′ UTR (dotted line; an average of 461 mRNAs per miRNA), and the 95% interval for the median fold change of randomly selected mRNAs, determined using 1000 resamplings (without replacement) at each cutoff (shading). Conserved sites were defined as in TargetScan6, with conservation cutoffs for each site type set at different branch-length scores (cutoffs of 0.8, 1.3, and 1.6 for 8mer, 7mer-m8, and 7mer-A1 sites, respectively).

Figure 5–figure supplement 1. Performance of miRNA prediction algorithms on the test set.
(A) This panel is as in Figure 5D, but shows the results for all algorithms evaluated in Figure 5A. Algorithm names are listed in the order of the median fold change for their top predictions, with each name colored using the color used for its cumulative distribution.
(B and C) These panels are as in Figures 5E–F, respectively, but compare mean fold changes instead of median fold changes.

Figure 6. Response of predictions and mRNAs with experimentally supported canonical binding sites.
(A–E) Comparison of the top TargetScan7 predicted targets to mRNAs with canonical sites identified from dCLIP in either HeLa cells with and without transfected miR-124 (Chi et al., 2009) or T cells with and without miR-155 (Loeb et al., 2012). Plotted are cumulative distributions of mRNA fold changes after transfection of miR-124 in HeLa cells (A), or after genetic ablation of miR-155 in T cells (B), Th1 cells (C), Th2 cells (D), or B cells (E) (one-sided K–S test, P values). For genes with alternative last exons, the analysis considered the score of the most abundant alternative last exon, as assessed by 3P-seq tags (as is the default for TargetScan7 when ranking predictions).
Each dCLIP-identified mRNA was required to have a 3′-UTR CLIP cluster with at least one canonical site to the cognate miRNA (including 6mers but not offset 6mers). Each intersection mRNA (red) was found in both the dCLIP set and top TargetScan7 set. Similarity between performance of the TargetScan7 and dCLIP sets (purple and green, respectively) and TargetScan7 and intersection sets (blue and red, respectively) was tested (two-sided K–S test, P values); the number of mRNAs analyzed in each category is in parentheses. TargetScan7 scores for mouse mRNAs were generated using human parameters for all features.
(F–H) Comparison of top TargetScan7 predicted targets to mRNAs with canonical binding sites identified using photoactivatable-ribonucleoside-enhanced CLIP (PAR-CLIP) (Hafner et al., 2010; Lipchina et al., 2011). Plotted are cumulative distributions of mRNA fold changes after either transfecting miR-7 (F) or miR-124 (G) into HEK293 cells, or knocking down miR-302/367 in hESCs (H). Otherwise these panels are as in (A–E).
(I) Comparison of top TargetScan7 predicted targets to mRNAs with canonical sites identified using CLASH (Helwak et al., 2013). Plotted are cumulative distributions of mRNA fold changes after knockdown of 25 miRNAs from 14 miRNA families in HEK293 cells. For each of these miRNA families, a cohort of top TargetScan7 predictions was chosen to match the number of mRNAs with CLASH-identified canonical sites, and the union of these TargetScan7 cohorts was analyzed. The total number of TargetScan7 predictions did not match the number of CLASH-identified targets due to slightly different overlap between mRNAs targeted by different miRNAs. Otherwise this panel is as in (A–E).
(J) Comparison of top TargetScan7 predicted targets to mRNAs with chimera-identified canonical sites (Grosswendt et al., 2014). Otherwise this panel is as in (I).
(K) Comparison of top TargetScan7 predicted targets to mRNAs with canonical binding sites identified using pulldown-seq (Tan et al., 2014). Plotted are cumulative distributions of mRNA fold changes after transfecting miR-522 into triple-negative breast cancer (TNBC) cells. Otherwise this panel is as in (A–E).
(L) Comparison of top TargetScan7 predicted targets to mRNAs with canonical sites identified using IMPACT-seq (Tan et al., 2014). Otherwise this panel is as in (K).

Figure 7. Example display of TargetScan7 predictions.
The example shows a TargetScanHuman page for the 3′ UTR of the LRRC1 gene. At the top is the 3′-UTR profile, showing the relative expression of tandem 3′-UTR isoforms, as measured using 3P-seq (Nam et al., 2014). Shown on this profile is the end of the longest Gencode annotation (blue vertical line) and the total number of 3P-seq reads (339) used to generate the profile (labeled on the y-axis). Below the profile are predicted conserved sites for miRNAs broadly conserved among vertebrates (colored according to the key), with options to display conserved sites for mammalian conserved miRNAs, or poorly conserved sites for any set of miRNAs. Boxed are the predicted miR-124 sites, with the site selected by the user indicated with a darker box. The multiple sequence alignment shows the species in which an orthologous site can be detected (white highlighting) among representative vertebrate species, with options to display site conservation among all 84 vertebrate species. Below the alignment is the predicted consequential pairing between the selected miRNA and its sites, showing also for each site its position, site type, context++ score, context++ score percentile, weighted context++ score, branch-length score, and PCT score.

Figure 7–figure supplement 1. Flowchart of the computational pipeline used to build the TargetScan7 database.
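The sliding-threshold evaluation described for Figure 5E–F can be sketched as follows. This is a minimal illustration rather than the code used for the thesis: the tuple layout, the function name `median_fold_change_of_top`, and the `lower_is_better` flag are assumptions introduced here, and "no value was plotted" is interpreted as returning `None` when the requested number of top predictions exceeds the number reported.

```python
from statistics import median


def median_fold_change_of_top(predictions, threshold, n_mirnas, lower_is_better=True):
    """Median log2 fold change of the top-ranked predictions of one algorithm.

    predictions: list of (mirna, mrna, score, log2_fold_change) tuples pooled
        over all miRNAs in the test set (hypothetical layout).
    threshold: average number of top predictions considered per miRNA.
    n_mirnas: number of miRNAs in the test set (seven in Figure 5).
    lower_is_better: True when more negative scores mean stronger predicted
        repression; set to False for algorithms that score the other way.
    """
    n_top = threshold * n_mirnas  # e.g. threshold 4 with 7 miRNAs gives 28
    if n_top > len(predictions):
        return None  # too few reported predictions: nothing to plot
    # Rank all predictions together, so a miRNA with more top-scoring
    # predictions contributes more of the selected set.
    ranked = sorted(predictions, key=lambda p: p[2], reverse=not lower_is_better)
    return median(p[3] for p in ranked[:n_top])


# Toy data for two hypothetical miRNAs.
toy = [
    ("miR-a", "gene1", -0.5, -1.0),
    ("miR-a", "gene2", -0.4, -0.8),
    ("miR-b", "gene3", -0.3, -0.2),
    ("miR-b", "gene4", -0.1, 0.1),
]
print(median_fold_change_of_top(toy, threshold=1, n_mirnas=2))
```

In the toy data, both selected predictions come from the same miRNA, which mirrors the statement that miRNAs with more top scores are allowed to contribute more predictions.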
[Agarwal et al. Fig 1. Panels A–H: cumulative fraction versus mRNA fold change (log2). Panel titles: HeLa cells, 16 miRNA transfections; 6 hr zebrafish embryo, MZDicer vs WT, miR-430 targets; HEK293 cells, miR-92a knockdown; HEK293 cells, knockdown of 25 miRNAs, miR-92a targets; Th1 cells, miR-155 knockout; T cells, miR-155 knockout; TNBC cells, miR-522 transfection.]

[Agarwal et al. Fig 1-figure supplement 1. Panels A–F: cumulative fraction versus mRNA fold change (log2). Panel titles: HEK293 cells, miR-124 transfection; HeLa cells, miR-124 transfection; Huh7 cells, miR-124 transfection; HeLa cells, 74 sRNA transfections; 9 hr zebrafish embryo, MZDicer+miR430 vs MZDicer, miR-430 targets; HeLa cells, 11 miRNA transfections.]

[Agarwal et al. Fig 1-figure supplement 2. Panels A–B: cumulative fraction versus mRNA fold change (log2). Panel titles: Th2 cells, miR-155 knockout; B cells, miR-155 knockout.]

[Agarwal et al. Fig 1-figure supplement 3. Panels A–E: cumulative fraction versus mRNA fold change (log2). Panel titles: HEK293 cells, knockdown of 25 miRNAs, targets of 4 families, all CLASH-supported sites; all chimera-supported sites; 3′ UTR CLASH-supported sites; 3′ UTR chimera-supported sites; hESC cells, miR-302/367 knockdown, miR-302 targets.]

[Agarwal et al. Fig 1-figure supplement 4. Panels A–C: cumulative fraction versus RPF fold change (log2) or protein fold change (log2). Panel titles: 6 hr zebrafish embryo, MZDicer vs WT, miR-430 targets; B cells, knockout of miR-155; HeLa, miR-155 transfection.]

[Agarwal et al. Fig 1-figure supplement 5. Panels A–B: cumulative fraction versus conservation z scores; mean branch-length score by human miRNA family. Titles: Conservation of non-canonical binding sites of broadly conserved and less conserved miRNAs; Conservation of 2–7 match with one mismatch.]

[Agarwal et al. Fig 2. Panels A–C: sequence logos (bits) under the headings "Non-canonical CLASH motifs" and "Non-canonical chimera motifs"; axis label: position from miRNA 5′ end.]

[Agarwal et al. Fig 2-figure supplement 1. Panels A–B: sequence logos (bits) for CLASH- and chimera-supported sites, including miR-522 transfection in TNBC cells (IMPACT-seq) and miR-124 transfection in HeLa cells (dCLIP).]

[Agarwal et al. Fig 2-figure supplement 2. Panels A–B: sequence logos (bits).]

[Agarwal et al. Fig 3. Panels A–H: correlation (rs) to mRNA fold change for 3′ UTR length and 3′ UTR AU content across datasets (Birmingham 2006, Anderson 2008, Grimson 2007, Lim 2005, Jackson 2006a & b, Schwartz 2006); mRNA fold change (log2) of sRNA 1 versus sRNA 2; cumulative fraction versus mRNA fold change (log2) for raw and normalized data.]

[Agarwal et al. Fig 3-figure supplement 1. Panels A–C: cumulative fraction versus mRNA fold change (log2) for sites to let-7, miR-21, miR-17, miR-24, miR-27, miR-30, miR-15, miR-26, miR-25, and miR-29.]

[Agarwal et al. Fig 4. Panel A: heatmap of partial correlation by window size and position relative to seed match. Panel B: r² to held-out data for the context-only, context+, and stepwise models, by site type. Panel C: coefficients by feature for 8mer, 7mer-m8, 7mer-A1, and 6mer sites.]

[Agarwal et al. Fig 5. Panel A: predicted miRNA–target interactions (average per miRNA) for each algorithm. Panels B–C: r² to test set. Panel D: cumulative fraction versus mRNA fold change (log2), HCT116 cells, 7 miRNA transfections. Panels E–F: median mRNA fold change (log2) versus average top predictions considered per miRNA.]

[Agarwal et al. Fig 5-figure supplement 1. Panel A: cumulative fraction versus mRNA fold change (log2), HCT116 cells, 7 miRNA transfections. Panels B–C: mean mRNA fold change (log2) versus average top predictions considered per miRNA.]

[Agarwal et al. Fig 6. Panels A–L: cumulative fraction versus mRNA fold change (log2), comparing top TargetScan7 predictions with dCLIP-, PAR-CLIP-, CLASH-, chimera-, pulldown-seq-, and IMPACT-seq-supported canonical sites.]
[Agarwal et al. Fig 7.]

[Agarwal et al. Fig 7-figure supplement 1. Text of the flowchart boxes:]
Get 3′ UTR coordinates of protein-coding Gencode transcripts
Compare to other gene model resources
Link 3P-seq clusters to gene models
Infer longest 3′ UTR for each stop codon
Collect aligned 3′ UTRs
Calculate median branch length score (BLS) of each 3′ UTR alignment
Partition 3′ UTRs into 10 conservation bins
Partition 3′ UTRs by conservation
Calculate site conservation metrics
Calculate BLS of each site for sites to broadly conserved miRNAs
Assign conservation status using BLS thresholds
Calculate PCT from BLS
Aggregate normalized 3P-seq clusters for each reference 3′ UTR
Calculate 3′ UTR isoform ratios along UTR length
Find seed-matched sites
Find 6mer, 7mer-A1, 7mer-m8, and 8mer sites in all reference 3′ UTRs and their orthologs
Collect ORFs
Identify set of representative ORF coordinates corresponding to each reference 3′ UTR
Extract ORF sequences from multiz alignments
Create web interface
Design scripts to access database and display results by miRNA family or gene/transcript ID for each organism
Provide options to rank targets for each miRNA and miRNAs targeting each mRNA
Get coordinates of reference 3′ UTRs
Mask regions overlapping ORFs in other transcripts
Extract multiz alignments
Group miRNAs into families
Acquire miRNA annotations for key vertebrate species
Modify annotation of conserved miRNAs based on miRNA catalogs
Summarize target predictions
Calculate total weighted context++ scores
Calculate aggregate PCTs (for sites to broadly conserved miRNA families) for reference 3′ UTRs
For each miRNA family, tally the number of sites of each type per target
Load all data into MySQL database
Group miRNAs with the same sequence at positions 2–8 into families
Identify miRNA families that are conserved among mammals or are more broadly conserved among vertebrates
Curate alternative isoforms of conserved families
Calculate context++ score for each site
Score features of miRNA families
Score features of mRNAs
Score features of sites

Tables

Table 1. The 26 features considered in the models, highlighting the 14 robustly selected through stepwise regression (bold). The feature description does not include the scaling performed (Table 3) to generate more comparable regression coefficients. The last four columns give the frequency with which each feature was chosen for each site type.

| Category | Feature | Abbreviation | Description | 8mer | 7mer-m8 | 7mer-A1 | 6mer |
|---|---|---|---|---|---|---|---|
| miRNA | 3′-UTR target-site abundance | TA_3UTR | Number of sites in all annotated 3′ UTRs (Arvey et al., 2010; Garcia et al., 2011) | 100% | 100% | 100% | 100% |
| miRNA | ORF target-site abundance | TA_ORF | Number of sites in all annotated ORFs (Garcia et al., 2011) | 9.4% | 0.7% | 68.1% | 93.4% |
| miRNA | Predicted seed-pairing stability | SPS | Predicted thermodynamic stability of seed pairing (Garcia et al., 2011) | 100% | 100% | 100% | 100% |
| miRNA | sRNA position 1 | sRNA1 | Identity of nucleotide at position 1 of the sRNA | 68% | 100% | 99.7% | 97.7% |
| miRNA | sRNA position 8 | sRNA8 | Identity of nucleotide at position 8 of the sRNA | 0% | 0.8% | 100% | 100% |
| site | Site position 1 | site1 | Identity of nucleotide at position 1 of the site | N/A | 57.1% | N/A | 2% |
| site | Site position 8 | site8 | Identity of nucleotide at position 8 of the site | 0.8% | 95.1% | 99.4% | 100% |
| site | Site position 9 | site9 | Identity of nucleotide at position 9 of the site (Lewis et al., 2005; Nielsen et al., 2007) | 15.4% | 7.1% | 0.9% | 93.7% |
| site | Site position 10 | site10 | Identity of nucleotide at position 10 of the site (Nielsen et al., 2007) | 0.1% | 100% | 8.5% | 26.3% |
| site | Local AU content | local_AU | AU content near the site (Grimson et al., 2007; Nielsen et al., 2007) | 100% | 100% | 100% | 100% |
| site | 3′ supplementary pairing | 3P_score | Supplementary pairing at the miRNA 3′ end (Grimson et al., 2007) | 42.5% | 100% | 100% | 100% |
| site | Distance from stop codon | dist_stop | log10(Distance of site from stop codon) | 62.4% | 10.8% | 8.7% | 25.7% |
| site | Predicted structural accessibility | SA | log10(Probability that a 14-nt segment centered on the match to sRNA positions 7 and 8 is unpaired) | 100% | 100% | 100% | 100% |
| site | Minimum distance | min_dist | log10(Minimum distance of site from stop codon or polyadenylation site) (Grimson et al., 2007) | 99.9% | 100% | 87.4% | 100% |
| site | Probability of conserved targeting | PCT | Probability of site conservation, controlling for dinucleotide evolution and site context (Friedman et al., 2009) | 100% | 100% | 100% | 20.8% |
| mRNA | 5′-UTR length | len_5UTR | log10(Length of the 5′ UTR) | 98.2% | 8.2% | 4.6% | 17.2% |
| mRNA | ORF length | len_ORF | log10(Length of the ORF) | 100% | 100% | 100% | 100% |
| mRNA | 3′-UTR length | len_3UTR | log10(Length of the 3′ UTR) (Hausser et al., 2009) | 100% | 100% | 100% | 100% |
| mRNA | 5′-UTR AU content | AU_5UTR | Fraction of AU nucleotides in the 5′ UTR | 13% | 38.9% | 91.1% | 31.3% |
| mRNA | ORF AU content | AU_ORF | Fraction of AU nucleotides in the ORF | 1.2% | 72.4% | 28.4% | 35.8% |
| mRNA | 3′-UTR AU content | AU_3UTR | Fraction of AU nucleotides in the 3′ UTR (Robins and Press, 2005; Hausser et al., 2009) | 5.4% | 73.3% | 65.3% | 80.6% |
| mRNA | 3′-UTR offset 6mer sites | off6m | Number of offset 6mer sites in the 3′ UTR (Friedman et al., 2009) | 65.9% | 89.6% | 99.8% | 100% |
| mRNA | ORF 8mer sites | ORF8m | Number of 8mer sites in the ORF (Lewis et al., 2005; Reczko et al., 2012) | 99.5% | 99.1% | 100% | 100% |
| mRNA | ORF 7mer-m8 sites | ORF7m8 | Number of 7mer-m8 sites in the ORF (Reczko et al., 2012) | 4.7% | 4.3% | 85.3% | 100% |
| mRNA | ORF 7mer-A1 sites | ORF7A1 | Number of 7mer-A1 sites in the ORF (Reczko et al., 2012) | 68.4% | 34.2% | 97.8% | 98.4% |
| mRNA | ORF 6mer sites | ORF6m | Number of 6mer sites in the ORF (Reczko et al., 2012) | 91% | 13.3% | 0.7% | 36.7% |

Table 2. Summary of datasets analyzed in this study, and corresponding figures using the datasets. Supplemental figures are abbreviated (e.g., “Figure 1–figure supplement 2A” is shortened to “1–FS2A”).
Figure | Gene Expression Omnibus (GEO) ID, ArrayExpress ID, or data source | Reference
--- | --- | ---
1A, 1–FS4A | GSM854425, GSM854430, GSM854431, GSM854436, GSM854437, GSM854442, GSM854443 | (Bazzini et al., 2012)
1B, 6B | GSM1012118, GSM1012119, GSM1012120, GSM1012121, GSM1012122, GSM1012123 | (Loeb et al., 2012)
1C, 1–FS2A, 6C-D | E-TABM-232 | (Rodriguez et al., 2007)
1D, 1F | GSM1122217, GSM1122218, GSM1122219, GSM1122220, GSM1122221, GSM1122222, GSM1122223, GSM1122224, GSM1122225, GSM1122226 | (Helwak et al., 2013)
1E, 1–FS3A-D, 6I-J | GSM538818, GSM538819, GSM538820, GSM538821 | (Hafner et al., 2010)
1G | GSM156524, GSM156532, GSM210897, GSM210898, GSM210901, GSM210903, GSM210904, GSM210907, GSM210909, GSM210911, GSM210913, GSM37599, http://psilac.mdc-berlin.de/download/ (let7b_32h, miR-30_32h, miR-155_32h, miR-16_32h) | (Lim et al., 2005; Grimson et al., 2007; Linsley et al., 2007; Selbach et al., 2008)
1H, 6K-L | E-MTAB-2110 | (Tan et al., 2014)
1–FS1A | GSM210897, GSM210898, GSM210901, GSM210903, GSM210904, GSM210907, GSM210909, GSM210911, GSM210913, GSM37599, GSM37601 | (Lim et al., 2005; Grimson et al., 2007)
1–FS1B, 3, 3–FS1B-C, 4 | 74 datasets compiled in Supplementary data 4 of Garcia et al. (2011), used as is or after normalization (Supplementary file 1); GSM119707, GSM119708, GSM119710, GSM119743, GSM119745, GSM119746, GSM119747, GSM119749, GSM119750, GSM119759, GSM119761, GSM119762, GSM119763, GSM133685, GSM133689, GSM133699, GSM133700, GSM134325, GSM134327, GSM134466, GSM134480, GSM134483, GSM134485, GSM134511, GSM134512, GSM134551, GSM210897, GSM210898, GSM210901, GSM210903, GSM210904, GSM210907, GSM210909, GSM210911, GSM210913, GSM37599, GSM37601; E-MEXP-1402 (1595297366, 1595297383, 1595297389, 1595297394, 1595297399, 1595297422, 1595297427, 1595297432, 1595297491, 1595297496, 1595297501, 1595297507, 1595297513, 1595297518, 1595297524, 1595297530, 1595297535, 1595297564, 1595297588, 1595297595, 1595297605, 1595297614, 1595297621, 1595297627, 1595297644, 1595297650, 1595297662); E-MEXP-668 (16012097016666, 16012097016667, 16012097016668, 16012097016669, 16012097017938, 16012097017939, 16012097017952, 16012097017953, 16012097018568, 251209725411) | (Lim et al., 2005; Birmingham et al., 2006; Jackson et al., 2006a; Jackson et al., 2006b; Schwarz et al., 2006; Grimson et al., 2007; Anderson et al., 2008)
1–FS1C | GSM95614, GSM95615, GSM95616, GSM95617, GSM95618, GSM95619 | (Giraldez et al., 2006)
1–FS1D-F | GSM1269344, GSM1269345, GSM1269348, GSM1269349, GSM1269350, GSM1269351, GSM1269354, GSM1269355, GSM1269356, GSM1269357, GSM1269360, GSM1269361, GSM1269362, GSM1269363 | (Nam et al., 2014)
1–FS2B, 1–FS4B, 6E | GSM1479572, GSM1479576, GSM1479580, GSM1479584 | (Eichhorn et al., 2014)
1–FS3E, 6H, S3E | http://icb.med.cornell.edu/faculty/betel/lab/betelab_v1/Data.html | (Lipchina et al., 2011)
1–FS4C | http://psilac.mdc-berlin.de/download/pSILAC_all_protein_ratios_OE.txt (miR155) | (Selbach et al., 2008)
3–FS1A | GSM416753 | (Mayr and Bartel, 2009)
5, 5–FS1 | GSM156522, GSM156580, GSM156557, GSM156548, GSM156533, GSM156532, GSM156524, processed and normalized (Supplementary file 2) | (Linsley et al., 2007)
6A | GSM37601 | (Lim et al., 2005)
6F-G | GSM363763, GSM363766, GSM363769, GSM363772, GSM363775, GSM363778 | (Hausser et al., 2009)

Table 3. Scaling parameters used to normalize data to the [0, 1] interval. Provided are the 5th and 95th percentile values for continuous features that were scaled, after the values of the feature were appropriately transformed as indicated (Table 1).

Feature | 8mer 5th % | 8mer 95th % | 7mer-m8 5th % | 7mer-m8 95th % | 7mer-A1 5th % | 7mer-A1 95th % | 6mer 5th % | 6mer 95th %
--- | --- | --- | --- | --- | --- | --- | --- | ---
3P_score | 1.000 | 3.500 | 1.000 | 3.500 | 1.000 | 3.500 | 1.000 | 3.500
SPS | –11.130 | –5.520 | –11.130 | –5.490 | –8.410 | –3.330 | –8.570 | –3.330
TA_3UTR | 3.113 | 3.865 | 3.067 | 3.887 | 3.145 | 3.887 | 3.113 | 3.887
Len_3UTR | 2.392 | 3.637 | 2.409 | 3.615 | 2.413 | 3.630 | 2.405 | 3.620
Len_ORF | 2.788 | 3.753 | 2.773 | 3.729 | 2.773 | 3.730 | 2.775 | 3.731
Min_dist | 1.415 | 3.113 | 1.491 | 3.096 | 1.431 | 3.117 | 1.477 | 3.106
Local_AU | 0.308 | 0.814 | 0.277 | 0.782 | 0.342 | 0.801 | 0.295 | 0.772
SA | –4.356 | –0.661 | –5.218 | –0.725 | –4.230 | –0.588 | –5.082 | –0.666
PCT | 0.000 | 0.816 | 0.000 | 0.364 | 0.000 | 0.449 | 0.000 | 0.193

Chapter 3. Independent regulation of vertebral number and vertebral identity by microRNA-196 paralogs

Siew Fen Lisa Wong1*, Vikram Agarwal2,3,4*, Jennifer H. Mansfield5,6, Nicolas Denans7, Matthew G. Schwartz5, Haydn M. Prosser8, Olivier Pourquié5,9, David P. Bartel2,3, Clifford J. Tabin5 and Edwina McGlinn1,5

1EMBL Australia, Australian Regenerative Medicine Institute, Monash University, Clayton, Vic, 3800, Australia.
2Howard Hughes Medical Institute and Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA.
3Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
4Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
5Department of Genetics, Harvard Medical School, Boston, MA 02115, USA.
6Barnard College, Department of Biological Sciences, 1306 Altschul Hall, 3009 Broadway, New York, NY, 10027.
7Stanford School of Medicine, Department of Developmental Biology and Genetics, Stanford, CA 94305.
8The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
9Department of Pathology, Brigham and Women’s Hospital, Boston, MA 02115, USA.
* These authors contributed equally to this work.

V.A. performed computational and statistical analysis with D.P.B.’s guidance. S.F.L.W., J.H.M., M.G.S., and E.M. performed mouse experiments. N.D. performed chick experiments with O.P.’s guidance. H.M.P. helped generate RNA sequencing data. C.J.T. and E.M. designed the study. V.A. and E.M. produced figures and wrote the manuscript.

Published as: Wong SFL*, Agarwal V*, Mansfield JH, Denans N, Schwartz MG, Prosser HM, Pourquié O, Bartel DP, Tabin CJ, McGlinn E. "Independent regulation of vertebral number and vertebral identity by microRNA-196 paralogs". 2015. Proceedings of the National Academy of Sciences USA. doi: 10.1073/pnas.1512655112.

Abstract

The Hox genes play a central role in patterning the anterior-to-posterior axis. An important function of Hox activity in vertebrates is the specification of different vertebral morphologies, with an additional role in axis elongation emerging. The miR-196 family of microRNAs is predicted to extensively target Hox 3′ UTRs, although the full extent to which miR-196 regulates Hox expression dynamics and influences mammalian development remains to be elucidated. Here we used an extensive allelic series of mouse knockouts to show that the miR-196 family of microRNAs is essential both for properly patterning vertebral identity at different axial levels and for modulating the total number of vertebrae. All three miR-196 paralogs, 196a1, 196a2 and 196b, act redundantly to pattern the mid-thoracic region, whereas 196a2 and 196b have an additive role in controlling the number of rib-bearing vertebrae and the positioning of the sacrum. Independent of this, 196a1, 196a2 and 196b act redundantly to constrain total vertebral number.
Loss of miR-196 leads to a collective upregulation of numerous trunk Hox target genes with a concomitant delay in activation of caudal Hox genes, which are proposed to signal the end of axis extension. Additionally, we identified altered molecular signatures associated with the Wnt, Fgf and Notch/segmentation pathways, and demonstrated that miR-196 has the potential to regulate Wnt activity by multiple mechanisms. By feeding into, and thereby integrating, multiple genetic networks controlling vertebral number and identity, miR-196 is a critical player defining axial formulae.

Introduction

A defining feature of the vertebrate animals is the presence of a segmented vertebral column. Species are uniquely characterized by the total number of vertebrae that form, and by the regionalization of these vertebrae along the anterior-to-posterior axis into groups with distinct morphologies (e.g. cervical, thoracic, lumbar and sacral). The genetic determinants of vertebral number and vertebral identity have largely been considered as separate; thus how, or even whether, these processes are molecularly integrated remains to be clearly elucidated.

Vertebral precursors, known as somites, arise by continued expansion and segmentation of a region of the caudal embryo, the presomitic mesoderm (PSM) (Benazeraf and Pourquie, 2013). Expansion of the PSM requires a self-renewing axial progenitor population which initially resides in the node-streak border of the epiblast, and subsequently repositions to the tailbud (Psychoyos and Stern, 1996; Cambray and Wilson, 2002, 2007; Iimura et al., 2007; McGrew et al., 2008). These progenitors provide a source of cells that, following ingression through the primitive streak, populate the PSM and other derivatives to drive posterior elongation. Key players in this process include genes involved in Wnt and Fgf signaling, in addition to the Cdx transcription factors, as evidenced by severe axis truncations when each is mutated (Neijts et al., 2014).
Balancing the expansion of this cell population, cells of the anterior PSM bud off to form somites with a rhythmic periodicity inherent to each species. The eventual exhaustion of progenitor self-renewal capacity is thought to halt axis elongation, the timing of which is a critical factor in establishing species-specific vertebral number (Gomez et al., 2008).

Within vertebral precursors, specific combinations of Hox transcription factors impart positional information that governs vertebral identity (Wellik, 2007). In mammals, the 39 Hox genes are clustered at four separate genomic loci (HoxA, HoxB, HoxC and HoxD), with each gene classified into one of 13 paralogous groups depending on sequence similarities and relative positions within the respective clusters (Fig. 1A). These genes are expressed in partially overlapping domains during embryonic development, with a spatio-temporal collinearity that reflects genomic ordering (Duboule and Dolle, 1989; Graham et al., 1989). Exhaustive analysis of Hox mouse mutants over more than 20 years has revealed individual and cumulative Hox function in conferring specific positional identities to the forming vertebral column (Wellik, 2007). For instance, the central/trunk Hox genes (paralogs 5–8) primarily pattern thoracic vertebrae, whereas Hox 11 paralogs pattern sacral and caudal vertebrae (Wellik and Capecchi, 2003) and position the sacrum (Favier et al., 1995; Favier et al., 1996).

In addition to transcripts encoding the Hox proteins, transcription within the genomic Hox clusters produces non-coding regulatory RNAs, including several microRNAs (Fig. 1A) (Heimberg and McGlinn, 2012). In mouse, these include the miR-10 family, which is found throughout most bilaterian animals, miR-615, which is found in eutherian mammals, and the miR-196 family, which is found in vertebrates and tunicates.
Three murine miR-196 paralogs exist (referred to as 196a1, 196a2 and 196b), each with essentially identical targeting potential (Yekta et al., 2004; Bartel, 2009). The three miR-196 paralogs exhibit deep conservation across all vertebrate lineages analyzed to date, both in terms of their genomic positioning upstream of Hox9 paralogs, and in their extensive predicted targeting of Hox 3′ UTRs primarily of the trunk region (Fig. 1A) (Yekta et al., 2004; Yekta et al., 2008; Vonk et al., 2013). In an early developmental context, in vivo validation of these interactions has focused primarily on a single Hox target, Hoxb8 (Mansfield et al., 2004; Yekta et al., 2004; Hornstein et al., 2005; McGlinn et al., 2009; Asli and Kessel, 2010; He et al., 2011), with no evidence for additional Hox target regulation observed in miR-196 knockdown studies in zebrafish (He et al., 2011). Thus, the extent to which collective Hox output is regulated by miR-196, either in terms of the number of genes affected or the relative levels of regulation, is unknown.

The extent to which the developmental modules that define total vertebral number are integrated with those that impart positional information has not been well established, although these processes can be uncoupled (Dubrulle et al., 2001; Schroter and Oates, 2010; Harima et al., 2013). A function for Hox genes in establishing total vertebral number has been largely dismissed because, with the exception of Hoxb13–/– (Economides et al., 2003), Hox knockouts do not phenotypically support such a role (Wellik, 2007). However, ectopic trunk Hox activity can, under certain conditions, drive axis elongation (Young et al., 2009). Conversely, posterior Hox activity slows axis elongation and terminates the main body axis (Young et al., 2009; Denans et al., 2015), suggesting an alternative view of Hox activity in this context.
In this light, phenotypic observations following reduced activity of miR-196, a repressor of Hox activity, are quite remarkable. Knockdown studies in chick and zebrafish support a role for miR-196 in regulating vertebral identity (McGlinn et al., 2009; He et al., 2011). Additionally, miR-196 morphant zebrafish exhibit an extended vertebral column, with what appears to be an “insertion” of a rib-bearing precaudal element (He et al., 2011). How this latter phenotype arises developmentally is not known, and is difficult to reconcile with de-repression of trunk Hox target genes alone (Pollock et al., 1992; Pollock et al., 1995). These knockdown approaches could not shed light on individual paralog contributions for this highly related miRNA family, and importantly, the molecular networks downstream of miR-196, which have the potential to drive phenotypic alterations, remain uncharacterized.

Here, we have generated individual knockout alleles for each of the three miR-196 family members in mouse. This has allowed us to build an entire allelic deletion series to reveal the individual and additive roles of miR-196 paralogs in patterning vertebral identity at many axial levels and in controlling the total number of vertebrae. We have characterized the detailed molecular landscape controlled by miR-196 activity in the early embryo to show that miR-196 regulates, and therefore has the ability to integrate, multiple key signaling pathways to drive developmental processes.

Results

Differential transcription of miR-196a1 and miR-196a2 in the developing embryo

To reveal the individual expression patterns, and therefore potential for functional redundancy, of identical miRNAs 196a1 and 196a2, we generated eGFP knock-in alleles termed 196a1GFP and 196a2GFP (Fig. S1). Expression of reporter mRNA reflects sites of active transcription, though it does not reveal additional post-transcriptional regulation that endogenous miRNAs may undergo.
Whole-mount in situ hybridization analysis of reporter mRNA indicated that both miRNAs were expressed specifically in the posterior embryonic derivatives of all three germ layers, and revealed striking differences in their spatio-temporal kinetics that have not previously been delineated (Fig. 1B-J). miR-196a1 is expressed at the onset of somitogenesis [embryonic day (E) 8.0; data not shown] and throughout the posterior growth zone at E8.5 (Fig. 1B). Strong expression is maintained in the PSM until the end of axis elongation, with a discrete band of low expression in the anterior PSM from E10.5 (see inset Fig. 1E and F). The anterior boundary of somitic and neural expression extends to approximately somite 13/14 [prevertebra (pv) 9, thoracic (T) 2] at E9.5 with a caudal shift in somitic tissue and a rostral shift in neural tissue as development proceeds (Fig. 1C and D). This expression profile indicates that miR-196a1 exhibits a classic collinear profile relative to the adjacent Hox gene, Hoxb9 (anterior limit at E9.5, pv3) (Chen and Capecchi, 1997). miR-196a2 expression is temporally delayed relative to miR-196a1, with faint expression ventral to the PSM at E8.5-9.0 (arrows in Fig. 1F and G). Strong expression is then observed throughout the PSM and neural plate at E9.5 (Fig. 1H). A stable anterior somitic limit at approximately somite 21/22 (pv17, T10) and neural limit 2 somites rostral to this is established soon after, consistent with its positioning between Hoxc9 and Hoxc10 (Burke et al., 1995). This analysis revealed both unique and overlapping expression patterns of miR-196a1 and miR-196a2, suggesting these identical miRNAs might have both unique functions where individually expressed and either redundant or additive functions at sites of co-expression.

Genetic deletion of miR-196 leads to altered vertebral identity

The collective function of miR-196 family members had yet to be assessed in mammals.
Moreover, the dissection of paralog contributions to overall miR-196 activity had not been achieved in any system. To address this, we generated straight knockout alleles at each of the three murine miR-196 loci (Fig. S2), allowing us to create the complete allelic series of single, double and triple miR-196 knockout embryos. This allowed us to demonstrate an essential requirement for miR-196 activity in patterning the mid-thoracic, the thoraco-lumbar transition and lumbo-sacral regions, with both paralog-specific and additive effects revealed.

Removal of individual miR-196 paralogs alone revealed partially penetrant homeotic patterning defects (Fig. 2A, Table S1). In 196a2 or 196b single-mutant embryos, the presence of an ectopic rudimentary rib nubbin on the first lumbar vertebra indicated an anterior homeotic transformation of this element (Fig. 2A). Additionally, in approximately one quarter of cases, we observed anterior homeotic transformations encompassing all subsequent lumbar and sacral elements, resulting in a posterior displacement of the sacrum (schematized in Fig. 2D). Although this latter phenotype could be interpreted as an “insertion” of a thoracic element, the repositioned last lumbar vertebra (L6* in Fig. 1l) was often asymmetric, with both lumbar and sacral characteristics (Table S1), which supports the interpretation of serial identity changes, beginning at L1 and encompassing all subsequent elements. We did not observe a similar L1-to-T anterior homeotic transformation in 196a1 single-mutant embryos, which, for the most part, exhibited no overt vertebral alterations (Fig. 2A). However, at very low penetrance (Table S1), 196a1 single-mutant embryos displayed an anterior displacement of the sacrum, with or without a reduction in rib length of the last thoracic element (T13), suggesting these paralogs may have an opposing role at this axial level.
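The distinction drawn above, between a change confined to L1 and a serial change beginning at L1, can be made concrete with a small sketch. The Python below is illustrative only: the regional counts (7 cervical, 13 thoracic, 6 lumbar, 4 sacral, plus an arbitrarily truncated run of 4 caudal elements) and all function names are assumptions introduced here, not values or methods taken from this study. It shows that both kinds of transformation leave the total number of elements unchanged, and that only the serial one displaces the sacrum posteriorly.

```python
from itertools import groupby

# Assumed regional counts, for illustration only (not taken from the text);
# the caudal run is arbitrarily truncated to 4 elements.
WILD_TYPE = [("C", 7), ("T", 13), ("L", 6), ("S", 4), ("Ca", 4)]


def expand(formula):
    """Return one region label per vertebral position, anterior to posterior."""
    return [region for region, count in formula for _ in range(count)]


def summarize(column):
    """Collapse per-position labels back into (region, count) runs."""
    return [(region, len(list(run))) for region, run in groupby(column)]


def transform_single(column, index):
    """Only the element at `index` adopts the identity of its anterior neighbor."""
    changed = list(column)
    changed[index] = column[index - 1]
    return changed


def transform_serial(column, index):
    """The element at `index` and every posterior element each adopt the
    identity of the element one position anterior to them."""
    return column[:index] + column[index - 1:-1]


wild_type = expand(WILD_TYPE)
first_lumbar = wild_type.index("L")  # position of L1

single = transform_single(wild_type, first_lumbar)
serial = transform_serial(wild_type, first_lumbar)

print(summarize(wild_type))
print(summarize(single))
print(summarize(serial))
```

In this toy model the serial case gains a thoracic identity at the expense of the last caudal position rather than a lumbar one, which is why it can resemble an "insertion" even though no element has been added.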
We hypothesized that the penetrance and severity of the phenotypes observed after mutating single miR-196 paralogs could be enhanced by combining these mutations. Indeed, 196a2–/–;196b–/– double-mutant skeletons exhibited a fully penetrant phenotype, with two pairs of supernumerary ribs and anterior homeotic transformation of all subsequent elements (Fig. 2A,B,D). Relative to this double-mutant phenotype, triple knockout embryos, 196a1–/–;196a2–/–;196b–/–, displayed no additional patterning defects (Fig. 2A,D). We also hypothesized that combining these mutations might reveal additional defects not observed in single mutants. Indeed, all double-mutant skeletons, or skeletons with a triple-knockout combination of 4 or more alleles removed, exhibited a partially penetrant increase in the number of ribs attached to the sternum (Table S1, Fig. 2C), indicating a transformation of the 8th thoracic element to a more anterior identity.

Together, our analysis has shown that 1) 196a2 and 196b have single and additive effects in patterning the thoraco-lumbar transition and in positioning the sacrum, with a possible opposing role of miR-196a1 at this axial level; 2) miR-196a1, miR-196a2 and miR-196b act redundantly to pattern the mid-thoracic region, with phenotypic alterations observed only when two or more paralogs are removed. As such, our work has provided the first genetic proof for miR-196 as a homeotic family of genes and revealed identity changes at multiple axial levels.

Genetic deletion of miR-196 leads to an increase in vertebral number

Homeotic transformations do not alter the number of vertebrae, simply their identity. It was therefore surprising that, in zebrafish, miR-196 has been shown to constrain total vertebral number (He et al., 2011). We assessed whether this was an evolutionarily conserved function of miR-196, and found that the three murine miR-196 paralogs constrain total vertebral number in a redundant fashion.
Wildtype C57BL6J/N mice exhibit small variations in the total number of vertebrae (Fig. 2E). Compared with the wildtype mean, we observed a statistically significant increase of approximately one vertebral element in various allelic combinations, including 196a1–/–;196a2–/–, 196a2–/–;196b–/– and triple-knockout combinations with four or more alleles deleted (Fig. 2E). Depending on the exact allelic combination, this additional element was patterned as a thoracic (e.g. in 196a2–/–;196b–/– mice) or a post-sacral (e.g. in 196a1–/–;196a2–/– mice) element. Together, these results indicate that miR-196-mediated control of vertebral number and patterning of segment identity are separable processes. All three miR-196 paralogs contribute additively to establishing vertebral number within mouse. Layered on top of this, individual miR-196 paralogs have a differential impact on positional identity and ultimate axial formulae, likely as a result of their differential spatio-temporal kinetics (Fig. 1B-K) relative to target mRNAs.

Transcriptome alterations are detected following allelic removal of miR-196 activity

To elucidate the molecular mechanism and targets downstream of miR-196, we examined the response of mRNAs to the loss of miR-196 alleles in E9.5 embryos. To focus these molecular analyses on the relevant cells, i.e., those cells that normally express miR-196, we used only embryos with at least one eGFP knock-in allele and performed RNA-seq on RNA isolated from cells that were GFP positive (Fig. 3A). With mRNA profiled across ten genotypes (Table S2), we then compared mRNA changes as increasing numbers and differing combinations of alleles were deleted (Table S3). We first examined the effect of allelic miR-196 deletion on predicted miR-196 target genes.
Utilizing the total context+ score from TargetScan 6.2, which considers the number and type of miRNA binding sites as well as additional features to predict the genes most effectively targeted by each miRNA (Garcia et al., 2011), we observed that the top predicted targets of miR-196 exhibited significant de-repression upon the loss of additional miR-196 alleles (Fig. 3B, Fig. S4). The de-repression of these predicted targets increased with the number of additional alleles deleted (Fig. 3B), revealing miR-196 dosage sensitivity. The direct interaction between miR-196 and its target transcripts could occur in any of the three germ layer derivatives in which miR-196 was expressed, and indeed, an unbiased analysis of all differentially expressed genes revealed statistically altered molecular signatures reflecting this (Fig. S5). Of particular interest, we observed statistical enrichment in genes controlling skeletal morphology (Fig. 3C,D and Fig. S5), indicating the presence of a molecular signature consistent with the vertebral abnormalities observed at the phenotypic level.

Hox cluster expression dynamics are altered in miR-196 mutant embryos

It was not known exactly how many of the ten predicted murine miR-196 Hox target genes are in fact bona fide targets in an in vivo developmental context, nor was the relative level of regulation that these predicted targets undergo known. When specifically interrogating our transcriptome datasets to assess effects on Hox gene expression, a significant and dose-dependent upregulation of predicted miR-196 Hox targets was observed (Fig. 4A), which paralleled the dose-dependent patterning defects (Fig. 2A). Comparison of 196a2–/–;196b–/– versus 196a2+/– profiles identified 7/10 predicted miR-196 Hox targets as significantly de-repressed in double-mutant cells at this developmental stage. Those predicted Hox targets exhibiting no significant de-repression in our analysis included Hoxb1, Hoxa4 and Hoxa5.
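As a rough illustration of what dose-dependent de-repression means operationally, the sketch below summarizes each comparison by the median log2 fold change of a predicted-target set relative to a background set, then checks that this shift grows as more alleles are deleted. The gene names and fold-change values are invented toy numbers, and the median-shift summary is an assumption chosen for simplicity; it is not the statistical procedure used for Fig. 3B or Fig. 4A.

```python
from statistics import median


def median_shift(log2fc, targets, background):
    """Median log2 fold change of the targets minus that of the background genes."""
    target_median = median(log2fc[gene] for gene in targets)
    background_median = median(log2fc[gene] for gene in background)
    return target_median - background_median


def increases_with_dose(shifts):
    """True if each successive shift is strictly larger than the one before it."""
    return all(later > earlier for earlier, later in zip(shifts, shifts[1:]))


# Hypothetical comparisons keyed by the number of additional alleles deleted.
# All values are made up for illustration.
comparisons = {
    1: {"t1": 0.05, "t2": 0.10, "t3": 0.08, "b1": 0.01, "b2": -0.02, "b3": 0.00},
    2: {"t1": 0.15, "t2": 0.22, "t3": 0.18, "b1": 0.02, "b2": -0.01, "b3": 0.00},
    3: {"t1": 0.30, "t2": 0.41, "t3": 0.35, "b1": -0.03, "b2": 0.01, "b3": 0.02},
}
targets = ["t1", "t2", "t3"]      # stand-ins for top predicted targets
background = ["b1", "b2", "b3"]   # stand-ins for genes without a site

shifts = [
    median_shift(comparisons[dose], targets, background)
    for dose in sorted(comparisons)
]
print(shifts, increases_with_dose(shifts))
```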
The most highly de-repressed Hox targets were Hoxc8 and Hoxa7, both of which harbor multiple predicted miR-196 binding sites in their 3′ UTRs, and Hoxb8, which exhibits unusually extensive complementarity to miR-196 (Mansfield et al., 2004; Yekta et al., 2004). Further, the measurements of differential expression (Fig. 4A) were almost certain to be underestimates, since our strategy utilized eGFP-positive control samples in which at least one miR-196 allele had been removed. Whole-mount in situ hybridization (WISH) further revealed that the de-repression of Hoxb8 and Hoxc8 target transcripts in 196a2–/–;196b–/– E9.5 embryos relative to wildtype manifested as a posterior expansion of endogenous expression domains in both the PSM and neural tube (Fig. 4B,C; n = 3/3 per genotype, respectively). In light of previous reports (Pollock et al., 1992; Pollock et al., 1995), this failure in timely clearance of the trunk Hox program from more posterior locations is likely to drive supernumerary rib formation observed in miR-196 mutant embryos.

Importantly, we also identified a dose-dependent downregulation of posterior Hox genes following progressive removal of miR-196 alleles (Fig. 4A). This was particularly evident for Hoxd10-d13 genes, and was also significant for posterior genes of the HoxA and HoxC clusters. Although the absence of predicted miR-196 sites within these mRNAs, together with the direction of the regulation (down instead of up with diminished miRNA), indicated that this regulation was indirect, it was nonetheless notable for three reasons. First, given the potential for phenotypic dominance of posterior over anterior Hox gene function (e.g. rib-suppression role of Hox10 paralogs) (Wellik and Capecchi, 2003; Carapuco et al., 2005), a timely activation of a posterior developmental program in miR-196 mutants would be expected to suppress supernumerary rib formation.
Second, these posterior Hox proteins, particularly Hoxd11 and Hoxa11, are known to position the lumbo-sacral junction (Davis and Capecchi, 1994; Favier et al., 1995; Spitz et al., 2001), providing a molecular explanation for how the sacrum was repositioned in miR-196 mutants. Finally, in addition to understanding vertebral identity defects, these molecular alterations may provide important experimental support for a proposed model whereby maintenance of tailbud cell divisions, and therefore total vertebral number, is promoted by trunk Hox genes and antagonized by caudal Hox genes (Economides et al., 2003; Young et al., 2009; Denans et al., 2015). Our results place miR-196 activity at this critical junction, coordinating a reproducible trunk-to-tail Hox code transition. We suggest that such a delay in Hox-code transition could contribute to the formation of an additional vertebral element observed following genetic removal of miR-196 activity in mouse. This is likely to be a broadly conserved role for miR-196 across vertebrate species, as supported by regionalized vertebral expansion observed in zebrafish (He et al., 2011).

Identification of additional direct targets of miR-196

The statistical enrichment of Hox genes amongst all miR-196 predicted targets (Yekta et al., 2008) prioritized these mRNAs for immediate analysis. However, microRNAs can simultaneously repress extensive suites of target genes (Bartel, 2009). To provide experimental support for additional direct targets of miR-196 that have the potential to function in this developmental context, we identified the most highly up-regulated genes in our RNA-seq dataset that either contained a conserved binding site or were predicted to respond strongly to the miRNA (i.e., context+ score ≤ –0.2) (Fig. 5A). For the top three evolutionarily conserved miR-196 target genes identified, we assessed whether regulation of their expression by miR-196 required direct binding to sites within their 3′ UTR.
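The selection rule described above can be written as a simple filter. The following Python sketch is hypothetical: the `Gene` record, its field names, the default `top_n`, and the toy values are assumptions for illustration. Only the two criteria (a conserved binding site, or a context+ score ≤ –0.2) and the focus on the most highly up-regulated genes come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gene:
    name: str
    log2_fold_change: float          # mutant relative to control
    context_score: Optional[float]   # None when no site is predicted
    has_conserved_site: bool


def candidate_targets(genes: List[Gene], score_cutoff: float = -0.2,
                      top_n: int = 3) -> List[Gene]:
    """Up-regulated genes that carry a conserved site or score at or below
    the cutoff, ranked from most to least up-regulated."""
    hits = [
        gene for gene in genes
        if gene.log2_fold_change > 0
        and (
            gene.has_conserved_site
            or (gene.context_score is not None
                and gene.context_score <= score_cutoff)
        )
    ]
    hits.sort(key=lambda gene: gene.log2_fold_change, reverse=True)
    return hits[:top_n]


# Toy records with invented values.
genes = [
    Gene("geneA", 1.2, -0.35, False),  # strong predicted response
    Gene("geneB", 0.9, -0.05, True),   # conserved site, weak score
    Gene("geneC", 1.5, -0.10, False),  # up-regulated but fails both criteria
    Gene("geneD", -0.4, -0.50, True),  # meets criteria but down-regulated
    Gene("geneE", 0.3, None, False),   # no predicted site
]

selected = [gene.name for gene in candidate_targets(genes)]
print(selected)
```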
Using a luciferase-based reporter assay system in cell culture, miR-196 was shown to repress each of the target genes in a sequence-specific manner (Fig. 5B). Of particular interest within this set were a cell-adhesion molecule (Prtg) involved in the ingression of PSM progenitors (Ito et al., 2011), and an orphan nuclear receptor (Nr6a1) that is essential for somitogenesis in mouse (Chung et al., 2001) and is one of the very few genes that have been associated with variation of vertebral number (Mikawa et al., 2007). These experimentally supported miR-196 targets highlight important avenues for future investigation, not only with respect to axial patterning and elongation but also the many other developmental processes (Hornstein et al., 2005; Asli and Kessel, 2010; He et al., 2011) and pathological conditions (Li et al., 2012; Velu et al., 2014) involving miR-196.

miR-196 activity is required for signaling pathways associated with axis elongation, segmentation and the trunk-to-tail transition

miR-196 activity has been shown to negatively regulate retinoic acid pathway activity in the context of pectoral fin formation (He et al., 2011), but regulation of additional developmental signaling pathways in the early embryo, either directly or indirectly, has not been systematically assessed. Upon further interrogation of our RNA-seq data, we found altered molecular signatures of both axis elongation and somite segmentation across many allelic comparisons (Fig. 6). We observed a clear upregulation of the Wnt negative feedback inhibitor Dkk1 (Chamorro et al., 2005). In addition, the collective down-regulation of numerous direct and indirect downstream targets of Wnt signaling (Takahashi et al., 2002; Buttitta et al., 2003; Lickert et al., 2005; Weidinger et al., 2005; Dequeant et al., 2006) (Fig. 6), and the prediction of diminished β-catenin/CTNNB1 activity following global pathway analysis (Fig. S6), indicated an overall reduction in Wnt activity in mutant embryos.
Wnt and Fgf signaling positively reinforce one another in the mouse tailbud (Aulehla et al., 2008; Dunty et al., 2008; Naiche et al., 2011), and consistent with diminished Wnt activity in miR-196 mutants, we also observed a downregulation of the Fgf8 ligand and numerous Fgf downstream effectors (Fig. 6). We observed a robust down-regulation of Notch signaling components and anterior PSM genes Mesp2, Epha4 and Ripply2, likely as a consequence of diminished Wnt activity acting via the Notch ligand Dll1 (Galceran et al., 2004; Hofmann et al., 2004; Dunty et al., 2008). Interestingly, these molecular alterations described for miR-196 mutant embryos resembled alterations observed following removal of all mature miRNAs in the mesoderm lineage (Zhang et al., 2011), which in the latter case resulted in a caudal displacement of the hindlimb by 3 somites. Finally, a coordinated temporal delay in the trunk-to-tail Hox code transition has been observed in mice null for Gdf11 (McPherron et al., 1999), which as heterozygotes, bear striking phenotypic resemblance to 196a2–/–;196b–/– or miR-196 triple knockout mouse embryos. We therefore specifically interrogated our RNA-seq data to assess the levels of Gdf11 and its direct downstream effector Isl1 (Jurberg et al., 2013). In 196a2–/–;196b–/– embryos, which exhibit 100% penetrant L-to-T transformation and sacral displacement, we observed a statistically significant reduction in Gdf11 and Isl1 levels by 15% (Table S3). As mentioned, this is likely to be an underestimate of the level of regulation, given the experimental strategy employed. The requirement for Gdf11 in defining presacral vertebral number is dose-dependent (McPherron et al., 1999). The exact threshold requirement for Gdf11 signaling is not known, and it remains to be determined whether subtle down-regulation of Gdf11 contributes to phenotypic alterations observed in miR-196 mutant mice.
Together, our transcriptome analysis revealed multiple developmental networks that require miR-196 activity for appropriate control of gene expression and suggest intriguing avenues for future experimental exploration.

miR-196 has the potential to modulate Wnt signaling by multiple mechanisms

Vertebral progenitors in the epiblast and tailbud are sensitive to the levels of Wnt signaling. Genetic removal of the Wnt3a ligand (Takada et al., 1994), or conversely, ectopic activation of Wnt3a in the epiblast (Jurberg et al., 2014), results in severe axis truncation posterior to the forelimb. Wnt3a expression has been shown to decrease as progenitor cells commit to a paraxial mesoderm fate (Takemoto et al., 2011; Nowotschin et al., 2012), and sustained Wnt activity disrupts somite formation (Aulehla et al., 2008) and somite polarity (Jurberg et al., 2014), dependent on timing and method of activation. These observations indicate that careful titration of Wnt levels is essential throughout the process of somite formation. Our data suggest that miR-196 activity is required to maintain precise levels of Wnt activity (Fig. 6). Mechanistically, this could be achieved in at least two ways. First, miR-196 could directly target genes in the Wnt pathway. Specifically, the potent Wnt antagonist Dkk1 harbors a single predicted miR-196 site within its 3′ UTR, and Dkk1 expression was upregulated following removal of miR-196 activity (Fig. 6). Using WISH, we confirmed increased expression of Dkk1 in 196a1–/–;196a2–/– embryos relative to 196a1+/–;196a2+/– (Fig. 7A; n = 2/2 per genotype). To test whether miR-196 can act directly to repress Dkk1, we utilized a luciferase-based reporter assay system in cell culture to show that, indeed, miR-196 negatively regulates the Dkk1 3′ UTR in a sequence-specific manner (Fig. 7B). However, the repression in the reporter assay was more modest than that observed in vivo using RNA-seq (Fig.
6), and Dkk1 is not a conserved target of miR-196, suggesting that indirect regulation by miR-196 also plays a role. Second, miR-196 control over Wnt activity might work in part via Hox intermediates, which have the potential to either activate or repress Wnt signaling (Young et al., 2009; Denans et al., 2015). We have recently shown using chick in vivo electroporation and imaging that the collinear activation of a subset of Hox9-13 posterior Hox genes within paraxial mesoderm progenitors translates into a graded increase in Wnt repression and a slowing down of axis elongation (Denans et al., 2015). One Hox gene that was found to significantly repress Wnt activity using this in vivo luciferase-based Wnt reporter assay was the miR-196 target Hoxa9. We therefore went on to test whether additional miR-196 Hox targets have the ability to repress Wnt activity in this context. We co-electroporated a Wnt/β-catenin reporter (BATLuc) and a CMV-Renilla construct in paraxial mesoderm progenitors together with an expression vector containing either Venus or one of Hoxb1, Hoxa5, Hoxa7, Hoxb7, Hoxb8 or Hoxc8. Of these six Hox genes tested, four (Hoxa7, Hoxb7, Hoxb8 and Hoxc8) showed strong repression of luciferase activity, while two (Hoxb1 and Hoxa5) did not (Fig. 7C). Interestingly, the two Hox genes that do not influence Wnt/β-catenin reporter activity in early chick paraxial mesoderm progenitors are the same Hox genes that show no indication of direct regulation by miR-196 in E9.5 mouse tissue (Fig. 4A). Together, these data demonstrate that miR-196 has the potential to directly and indirectly regulate the precise levels of Wnt activity in the developing embryo.

Discussion

Our work demonstrates the essential role for murine miR-196 in regulating vertebral identity across different levels of the body axis, and reveals evolutionary conservation in the role of miR-196 in constraining total vertebral number.
Importantly, our strategy has allowed us to comprehensively dissect paralog contribution to resultant phenotypes, allowing us to distinguish a patterning role for miR-196 from its role in modulating vertebral number. Moreover, we have characterized the detailed molecular landscape controlled by miR-196 activity in the early embryo to show that miR-196 regulates, and therefore has the ability to integrate, multiple key signaling pathways to drive developmental processes.

miR-196 activity is essential for vertebral identity

Despite the clear potential for functional redundancy between miR-196 paralogs (Yekta et al., 2004), homeotic transformation of vertebral elements could be observed at low penetrance following removal of an individual miR-196 paralog (e.g. 196a2–/– or 196b–/– single mutants). With increasing loss of miR-196 family members (e.g. 196a2–/–;196b–/– double mutants), fully penetrant vertebral phenotypes were observed that were equivalent in severity to many single and compound Hox mutants (Favier et al., 1996; van den Akker et al., 2001). Vertebral identity changes were observed at sites where loss-of-function phenotypes have previously been described for numerous direct targets of Hox genes (van den Akker et al., 2001), reinforcing the view that miR-196 acts within endogenous Hox domains rather than simply as a fail-safe mechanism to clear an anterior developmental program at more posterior locations (McGlinn et al., 2009). Paradoxically, the 196a2–/–;196b–/– or triple knockout phenotypes are remarkably similar to either Hoxc8–/– or Hoxc8–/–;Hoxd8–/– skeletons, with 8 ribs attached to the sternum, L1-to-T transformation and a posterior displacement of the sacrum (van den Akker et al., 2001). However, with respect to number of sternal rib attachments and L1-to-T transformation, Hoxc8 loss-of-function and gain-of-function mutant mice exhibit identical phenotypes (Pollock et al., 1992; van den Akker et al., 2001).
These data indicate that exquisite regulation of a quantitative Hox code is essential in defining vertebral identity at this axial location. Interestingly, deletion of Hoxb8 rescues many defects observed in Hoxc8 null mice, highlighting that there are aspects of a qualitative Hox code that we are yet to understand. Nonetheless, similar to Hoxc8, ectopic Hoxb8 expression results in supernumerary rib formation throughout the lumbar region (Pollock et al., 1995), supporting the view that a collective up-regulation of direct targets of Hox genes drives homeotic alterations of the mid-thoracic to upper lumbar region in miR-196 mutant mice. A shift in the position of the sacrum observed in miR-196 mutant embryos was not easily reconcilable with the function of miR-196 in directly repressing trunk Hox target genes (Pollock et al., 1992; Pollock et al., 1995). However, we show that in addition to direct Hox gene regulation, miR-196 indirectly regulates the expression levels or temporal activation of many caudal Hox genes, including those that are known to control positioning of the sacrum, such as Hoxa10, Hoxd10 and Hoxd11 (Davis et al., 1995; Favier et al., 1995; Favier et al., 1996; Zakany et al., 1997). The mechanisms leading to a delay in posterior Hox gene activation in miR-196 mutant mice are currently unknown. A similar coordinated temporal shift in the trunk-to-tail Hox code has been demonstrated in Gdf11–/– mice (McPherron et al., 1999), which show conservation in the types of vertebral transformations we observe here in miR-196 mutant embryos. In this context, Gdf11 appears to work via retinoic acid signaling (Lee et al., 2010; Jurberg et al., 2013), and whether altered Gdf11 and retinoic acid signaling contribute to miR-196 phenotypic alterations remains to be tested.
miR-196 activity constrains total vertebral number

Total vertebral number of a given species is highly reproducible, and mutations that extend the vertebral column of model organisms are very rare. Amongst vertebrate species, however, great diversity in vertebral number has arisen. Cross-species comparison (Gomez et al., 2008) or direct genetic perturbation (Dubrulle et al., 2001; Schroter and Oates, 2010) demonstrates that the periodicity of segmentation clock oscillation relative to the rate of PSM growth is the central parameter in defining vertebral number. It remains to be determined how an additional vertebral element seen here in miR-196 mutant mice, or in miR-196 morphant zebrafish (He et al., 2011), is generated at a cellular level (i.e., does the clock tick faster, or does it tick at the same rate for longer). Our analysis does, however, reveal molecular alterations in miR-196 mutant embryos which have the potential to affect vertebral number. First, altered expression of Notch, Wnt and Fgf pathways could alter the periodicity of segment formation (Benazeraf and Pourquie, 2013). However, diminished Wnt and Fgf would be predicted to increase somite size (Dubrulle et al., 2001; Sawada et al., 2001; Aulehla et al., 2003; Bajard et al., 2014), which, if axis elongation were unaltered, would lead to a reduction in vertebral number. Further work is required to clarify any functional role for miR-196 in the molecular networks coordinating segmentation. Second, we have shown that miR-196 activity can modulate the expression levels of many Hox genes, either directly or indirectly. It is well documented that Hox genes control mesodermal ingression, thus regulating cell injection into the PSM (Iimura and Pourquie, 2006; Denans et al., 2015). The rate of PSM growth is not uniform along the A-P axis (Gomez et al., 2008), with a switch to PSM shortening occurring at about the trunk-to-tail transition in most amniotes.
This switch correlates with activation of a posterior Hox code (Hox9 onwards), and a subset of posterior Hox genes slow axis elongation by controlling the ingression of PM progenitors via Wnt repression (Denans et al., 2015). We show here that the ability to repress Wnt signaling is not exclusive to posterior Hox genes, but that Hox7/8 paralogs also downregulate Wnt signaling in a collinear manner in the chick epiblast. This fits well with previous observations that Hoxb7 and Hoxb9 have a collinear effect on ingression (Iimura et al., 2007). The role of this repression might be to help maintain cells with progressively more posterior identity in the epiblast, in order to get a progressive deposition of collinear Hox domains. A delay in posterior Hox gene activation would result in delayed commencement of axis elongation slow-down, potentially allowing the formation of additional vertebral elements. The repression of Wnt by posterior Hox genes as a means to slow down and terminate axis elongation (Young et al., 2009; Denans et al., 2015) is consistent with the known function of Wnt3a in driving axis elongation (Takada et al., 1994). The repression of Wnt by trunk Hox genes is less intuitive, and not consistent with a study in mouse (Young et al., 2009). However, the importance of precise Wnt levels in the early steps of axis formation, and of cellular context, are beginning to be appreciated. In vitro analysis of epiblast stem cells demonstrates that low levels of Wnt induce a primitive streak-like pluripotent state, whereas higher levels of Wnt activity promote lineage commitment (Tsakiridis et al., 2014). Additionally, high levels of Wnt3a in the mouse epiblast appear to exhaust the progenitor pool (Jurberg et al., 2014).
It is therefore possible that, in vivo, as the axis is rapidly elongating and Wnt activity is already high, the trunk Hox7/8 genes negatively feed back on Wnt activity to avoid the immediate depletion of the pool of progenitors by ingression and hence to regulate the progressive formation of the axis. Although a heterochronic shift in the trunk-to-tail Hox code transition could be predicted to vary vertebral number, morphological evidence for this has been scarce. Analysis of total vertebral number in Gdf11–/– mice, which exhibit a dramatic heterochronic shift in Hox code, is hampered by caudal truncation (McPherron et al., 1999). Although ectopic trunk Hox gene expression (Hoxa5 and Hoxb8) has the ability to rescue axis truncation defects of a genetically engineered mutant (Young et al., 2009), these genes do not appear to increase vertebral number on a wildtype background (Pollock et al., 1992; Young et al., 2009). This is possibly due to the fact that posterior prevalence still holds; caudal Hox genes and miR-196 would be expressed at the usual time and place to regulate and terminate axis elongation. In the case of miR-196 knockouts, the cumulative effect on both trunk and caudal Hox gene expression could permit continued maintenance of progenitor divisions whilst delaying commencement of axis elongation slow-down, resulting in increased vertebral number. Together, our results highlight an essential requirement for miR-196 activity in reinforcing a timely trunk-to-tail transition and reproducibility of axial formulae. Given the ancestral role of Hox activity in species that utilize a posterior growth zone (Ryan and Baxevanis, 2007), and the recurrent acquisition of miRNAs within the Hox clusters across metazoan taxa (Lagos-Quintana et al., 2003; Yekta et al., 2004; Heimberg and McGlinn, 2012; Moran et al., 2014), variation in Hox-miRNA interactions may represent an important mechanism for the evolution of animal body plans.
Materials and Methods

miR-196a1GFP and miR-196a2GFP knock-in construction

A 72 bp (miR-196a1) or 52 bp (miR-196a2) genomic fragment encompassing each mature miRNA sequence was replaced with a cassette containing eGFP fused to the rabbit β-Globin 3′ UTR followed by FRT-flanked PGKem7-Neomycin. A Kozak sequence was inserted upstream of the eGFP start codon. Targeting constructs were generated using 129/Sv sequence and electroporated into J1 embryonic stem cells. Correctly targeted ES cells were identified and used to generate germline transmitting knock-in lines. Prior to analysis, the Neomycin selection cassette was removed by crossing to a ubiquitous FLPe-deleter mouse line. Resulting lines were bred onto a C57Bl/6J background and confirmed as isogenic by SNP genotyping.

miR-196a1–/–, miR-196a2–/– and miR-196b–/– generation

Targeted ES cells at each of the three miR-196 loci have been generated previously (Prosser et al., 2011). Correctly targeted JM8A3 ES cells were reconfirmed by Southern blot and used to generate germline transmitting knockout lines. Prior to analysis, the puDeltaTK selection cassette was removed by crossing to a ubiquitous Cre-deleter mouse. Resulting lines are on a mixed C57Bl/6J and C57Bl/6N background.

Mouse skeletal preparation and analysis

Skeletal preparation was performed on E18.5 embryos or P0 postnatal pups as previously described (McLeod, 1980).

In situ hybridization

Whole mount in situ hybridization was performed as previously described (McGlinn and Mansfield, 2011).

FACS sorting and RNA-seq sample preparation

Freshly dissected E9.5 embryos were dissociated in 0.25% trypsin/2% chick serum, neutralized in DMEM + 10% FBS and washed into PBS + 2% FBS for FACS sorting. GFP-positive cells were FACS sorted directly into RLT buffer (Qiagen) and RNA was isolated using RNeasy with added on-column DNase treatment (Qiagen).
RNA quality was assessed using a Bioanalyzer and 200 ng per individual embryo was used as input for RNA-seq library generation (unstranded Illumina TruSeq Kit). Libraries were multiplexed and sequenced using an Illumina HiSeq 2000 instrument, generating 50 bp single-end reads.

RNA-seq and category enrichment analysis

Quantification of the transcriptome using RNA-seq data was performed as previously described (Denzler et al., 2014). Raw reads were aligned to the latest build of the mouse genome (mm10) using STAR v. 2.3.1n (options --outFilterType BySJout --outFilterMultimapScoreRange 0 --readMatesLengthsIn Equal --outFilterIntronMotifs RemoveNoncanonicalUnannotated --clip3pAdapterSeq TCGTATGCCGTCTTCTGCTTG --outSAMstrandField intronMotif --outStd SAM) (Dobin et al., 2013). Considering all replicates of a particular genotype, differential expression statistics were computed between genotypes of interest using cuffdiff v. 2.1.1 (options --library-type fr-unstranded -c 100 -b mm10.fa -u --max-bundle-frags 100000000) (Trapnell et al., 2013), using mouse transcript models of protein-coding genes annotated in Ensembl release 72. Before all subsequent analysis, we filtered out genes annotated by cuffdiff as “NOTEST” in all genotypes, indicating that the genes were expressed at levels too low for their abundances to be quantified accurately. To evaluate functional gene categories that were statistically enriched, we loaded differentially expressed genes (i.e. genes with a Q value < 0.05) into the Core Analysis function of IPA software (Ingenuity Systems), testing gene categories related to development and function. All P values reported from this analysis were adjusted using the Benjamini-Hochberg method to control the false discovery rate.

miRNA target analysis

To identify predicted miRNA targets, the 3′ UTR sequences of protein-coding genes were searched to identify 6mer, 7mer-A1, 7mer-m8 and 8mer miRNA binding sites cognate to the miR-196 seed (Grimson et al., 2007; Garcia et al., 2011).
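The site search described above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the function names are hypothetical, the miR-196a sequence is an assumption that should be checked against miRBase, and the site definitions follow the conventional ones (6mer: match to miRNA nucleotides 2–7; 7mer-m8: additional match to nucleotide 8; 7mer-A1: an A opposite nucleotide 1; 8mer: both). Context+ scoring and conservation analysis are not reproduced here.

```python
# Minimal sketch: classify canonical seed-matched sites in a 3' UTR.
# Assumption: MIR196A is the mature miR-196a sequence (verify against miRBase).
MIR196A = "UAGGUAGUUUCAUGUUGUUGGG"

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_canonical_sites(mirna, utr):
    """Return (position, site_type) tuples for every canonical site.

    position is the 0-based index in the UTR of the 6-nt core that pairs
    with miRNA nucleotides 2-7. Both inputs may be given as DNA or RNA.
    """
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    core = reverse_complement(mirna[1:7])  # pairs with miRNA nt 2-7
    m8 = reverse_complement(mirna[7])      # pairs with miRNA nt 8
    sites = []
    for i in range(len(utr) - 5):
        if utr[i:i + 6] != core:
            continue
        has_m8 = i > 0 and utr[i - 1] == m8               # upstream of the core
        has_a1 = i + 6 < len(utr) and utr[i + 6] == "A"   # downstream of the core
        if has_m8 and has_a1:
            site_type = "8mer"
        elif has_m8:
            site_type = "7mer-m8"
        elif has_a1:
            site_type = "7mer-A1"
        else:
            site_type = "6mer"
        sites.append((i, site_type))
    return sites


if __name__ == "__main__":
    # Toy UTR fragment containing a single 8mer site (ACUACCUA).
    print(find_canonical_sites(MIR196A, "GGGACUACCUAGGG"))
```

Each core match is reported once under its most extensive site type, so an 8mer is not also counted as a 7mer or 6mer; this mirrors the usual convention of treating the site classes as mutually exclusive.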
A context+ score was computed for each target site within a given 3′ UTR, and scores were summed to produce a total context+ score for each gene, which was used for all miRNA-related analyses (Garcia et al., 2011). TargetScanMouse 6.2 was further utilized to assess target site conservation, or to include predicted miR-196 targets containing non-canonical 3′ compensatory sites, such as in the case of Hoxb8 (Friedman et al., 2009).

Permutation test for significance testing

A permutation test was devised to evaluate the significance of differences in vertebral number. Briefly, given two groups of count-based data of size n and m, we randomly partitioned the counts (without replacement) from the union of the two groups to generate 100,000 pairs of data, again of size n and m. To compute an empirical one-sided P value, we then computed the proportion of pairs that satisfied the condition that the difference in the means of each pair exceeded the difference in means of the original two groups.

In vitro luciferase assay

3′ UTR sequences (300–700 nucleotides) of protein-coding genes of interest were commercially synthesized and cloned into psiCheck2 vector. For each, a mutant version containing 4 nucleotide substitutions within the miR-196 seed sequence was generated. Constructs were transfected into NIH3T3 cells with or without 25 pmol mmu-miR-196b duplex. Transfection (Lipofectamine2000, Life Technologies) and luciferase analysis (Dual Luciferase Reporter Assay System, Promega) were performed as per the manufacturer’s instructions.

Chick electroporation and in vivo BatLuc reporter analysis

Chicken embryos were harvested at stage 5 HH (Hamburger and Hamilton, 1951) and electroporated ex ovo as described (Denans et al., 2015) with a DNA mix containing BATLuc (1 µg/µL final), CMV-Renilla (Promega; 0.2 µg/µL final; used as a control to normalize differences in electroporation intensity between embryos), and either a control pCAGGS-Venus vector (gift from K.
Hadjantonakis) or a Hox gene of interest (Hoxb1, a5, a7, b7, b8 or c8) cloned in pCAGGS-IRES2-Venus (5 µg/µL final). Electroporated embryos were cultured in a humidified incubator at 38°C for 20 h. Embryos were analyzed using a fluorescent microscope and only embryos showing restricted expression of Venus in the paraxial mesoderm were selected (90 to 100% of the electroporated embryos) for luciferase assay (between 3 and 5 embryos for each condition). The posterior region (from somite 1 to tail-bud) of the selected embryos was dissected and lysed in passive lysis buffer (Promega) for 15 minutes at room temperature. Lysates were then distributed in a 96-well plate and luciferase assays were performed using a Centro LB 960 luminometer (Berthold Technology) and the dual luciferase kit (Promega) following the manufacturer’s instructions. Raw intensity values for Firefly luciferase signal were normalized with corresponding Renilla luciferase values (RLU) and the control experiment was set to 1.

Acknowledgements

We thank Xin Sun and Denis Duboule for supplying in situ probes and A. Dobin for help in understanding RNA-seq mapping parameters. We thank Allan Bradley for providing three knockout mouse ES cell lines used in these studies. We thank Christophe Marcelle, Eran Hornstein and Jan Manent for critical reading of the manuscript. This work was supported by an NSF Graduate Research Fellowship (V.A.), NIH grant GM067031 (D.B.), NIH grant R37HD032443-19 (C.J.T.) and NH&MRC Project Grant APP1051792 (E.M.). E.M. thanks Bioplatforms Australia for support. D.B. is a Howard Hughes Medical Institute Investigator. The Australian Regenerative Medicine Institute is supported by grants from the State Government of Victoria and the Australian Government. The data reported in this paper are compiled in Supplementary Information; all raw and processed RNA-seq data are deposited in the NCBI Gene Expression Omnibus (GEO) under accession number GSE53018.
The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to edwina.mcglinn@emblaustralia.org.

References

Asli, N.S., and Kessel, M. (2010). Spatiotemporally restricted regulation of generic motor neuron programs by miR-196-mediated repression of Hoxb8. Developmental biology 344, 857-868.
Aulehla, A., Wehrle, C., Brand-Saberi, B., Kemler, R., Gossler, A., Kanzler, B., and Herrmann, B.G. (2003). Wnt3a plays a major role in the segmentation clock controlling somitogenesis. Developmental cell 4, 395-406.
Aulehla, A., Wiegraebe, W., Baubet, V., Wahl, M.B., Deng, C., Taketo, M., Lewandoski, M., and Pourquie, O. (2008). A beta-catenin gradient links the clock and wavefront systems in mouse embryo segmentation. Nature cell biology 10, 186-193.
Bajard, L., Morelli, L.G., Ares, S., Pecreaux, J., Julicher, F., and Oates, A.C. (2014). Wnt-regulated dynamics of positional information in zebrafish somitogenesis. Development 141, 1381-1391.
Bartel, D.P. (2009). MicroRNAs: target recognition and regulatory functions. Cell 136, 215-233.
Benazeraf, B., and Pourquie, O. (2013). Formation and segmentation of the vertebrate body axis. Annual review of cell and developmental biology 29, 1-26.
Burke, A.C., Nelson, C.E., Morgan, B.A., and Tabin, C. (1995). Hox genes and the evolution of vertebrate axial morphology. Development 121, 333-346.
Buttitta, L., Tanaka, T.S., Chen, A.E., Ko, M.S., and Fan, C.M. (2003). Microarray analysis of somitogenesis reveals novel targets of different WNT signaling pathways in the somitic mesoderm. Developmental biology 258, 91-104.
Cambray, N., and Wilson, V. (2002). Axial progenitors with extensive potency are localised to the mouse chordoneural hinge. Development 129, 4855-4866.
Cambray, N., and Wilson, V. (2007). Two distinct sources for a population of maturing axial progenitors. Development 134, 2829-2840.
Carapuco, M., Novoa, A., Bobola, N., and Mallo, M. (2005).
Hox genes specify vertebral types in the presomitic mesoderm. Genes & development 19, 2116-2121.
Chamorro, M.N., Schwartz, D.R., Vonica, A., Brivanlou, A.H., Cho, K.R., and Varmus, H.E. (2005). FGF-20 and DKK1 are transcriptional targets of beta-catenin and FGF-20 is implicated in cancer and development. EMBO J 24, 73-84.
Chen, F., and Capecchi, M.R. (1997). Targeted mutations in hoxa-9 and hoxb-9 reveal synergistic interactions. Developmental biology 181, 186-196.
Chung, A.C., Katz, D., Pereira, F.A., Jackson, K.J., DeMayo, F.J., Cooney, A.J., and O'Malley, B.W. (2001). Loss of orphan receptor germ cell nuclear factor function results in ectopic development of the tail bud and a novel posterior truncation. Molecular and cellular biology 21, 663-677.
Davis, A.P., and Capecchi, M.R. (1994). Axial homeosis and appendicular skeleton defects in mice with a targeted disruption of hoxd-11. Development 120, 2187-2198.
Davis, A.P., Witte, D.P., Hsieh-Li, H.M., Potter, S.S., and Capecchi, M.R. (1995). Absence of radius and ulna in mice lacking hoxa-11 and hoxd-11. Nature 375, 791-795.
Denans, N., Iimura, T., and Pourquie, O. (2015). Hox genes control vertebrate body elongation by collinear Wnt repression. Elife 4.
Denzler, R., Agarwal, V., Stefano, J., Bartel, D.P., and Stoffel, M. (2014). Assessing the ceRNA hypothesis with quantitative measurements of miRNA and target abundance. Mol Cell 54, 766-776.
Dequeant, M.L., Glynn, E., Gaudenz, K., Wahl, M., Chen, J., Mushegian, A., and Pourquie, O. (2006). A complex oscillating network of signaling genes underlies the mouse segmentation clock. Science 314, 1595-1598.
Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T.R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21.
Duboule, D., and Dolle, P. (1989). The structural and functional organization of the murine HOX gene family resembles that of Drosophila homeotic genes.
EMBO J 8, 1497-1505.
Dubrulle, J., McGrew, M.J., and Pourquie, O. (2001). FGF signaling controls somite boundary position and regulates segmentation clock control of spatiotemporal Hox gene activation. Cell 106, 219-232.
Dunty, W.C., Jr., Biris, K.K., Chalamalasetty, R.B., Taketo, M.M., Lewandoski, M., and Yamaguchi, T.P. (2008). Wnt3a/beta-catenin signaling controls posterior body development by coordinating mesoderm formation and segmentation. Development 135, 85-94.
Economides, K.D., Zeltser, L., and Capecchi, M.R. (2003). Hoxb13 mutations cause overgrowth of caudal spinal cord and tail vertebrae. Developmental biology 256, 317-330.
Favier, B., Le Meur, M., Chambon, P., and Dolle, P. (1995). Axial skeleton homeosis and forelimb malformations in Hoxd-11 mutant mice. Proceedings of the National Academy of Sciences of the United States of America 92, 310-314.
Favier, B., Rijli, F.M., Fromental-Ramain, C., Fraulob, V., Chambon, P., and Dolle, P. (1996). Functional cooperation between the non-paralogous genes Hoxa-10 and Hoxd-11 in the developing forelimb and axial skeleton. Development 122, 449-460.
Friedman, R.C., Farh, K.K., Burge, C.B., and Bartel, D.P. (2009). Most mammalian mRNAs are conserved targets of microRNAs. Genome Research 19, 92-105.
Galceran, J., Sustmann, C., Hsu, S.C., Folberth, S., and Grosschedl, R. (2004). LEF1-mediated regulation of Delta-like1 links Wnt and Notch signaling in somitogenesis. Genes & development 18, 2718-2723.
Garcia, D.M., Baek, D., Shin, C., Bell, G.W., Grimson, A., and Bartel, D.P. (2011). Weak seed-pairing stability and high target-site abundance decrease the proficiency of lsy-6 and other microRNAs. Nat Struct Mol Biol 18, 1139-1146.
Gomez, C., Ozbudak, E.M., Wunderlich, J., Baumann, D., Lewis, J., and Pourquie, O. (2008). Control of segment number in vertebrate embryos. Nature 454, 335-339.
Graham, A., Papalopulu, N., and Krumlauf, R. (1989).
The murine and Drosophila homeobox gene complexes have common features of organization and expression. Cell 57, 367-378.
Grimson, A., Farh, K.K., Johnston, W.K., Garrett-Engele, P., Lim, L.P., and Bartel, D.P. (2007). MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Molecular Cell 27, 91-105.
Hamburger, V., and Hamilton, H.L. (1951). A series of normal stages in the development of the chick embryo. J Morphol 88, 49-92.
Harima, Y., Takashima, Y., Ueda, Y., Ohtsuka, T., and Kageyama, R. (2013). Accelerating the tempo of the segmentation clock by reducing the number of introns in the Hes7 gene. Cell Rep 3, 1-7.
He, X., Yan, Y.L., Eberhart, J.K., Herpin, A., Wagner, T.U., Schartl, M., and Postlethwait, J.H. (2011). miR-196 regulates axial patterning and pectoral appendage initiation. Developmental biology 357, 463-477.
Heimberg, A., and McGlinn, E. (2012). Building a robust a-p axis. Current genomics 13, 278-288.
Hofmann, M., Schuster-Gossler, K., Watabe-Rudolph, M., Aulehla, A., Herrmann, B.G., and Gossler, A. (2004). WNT signaling, in synergy with T/TBX6, controls Notch signaling by regulating Dll1 expression in the presomitic mesoderm of mouse embryos. Genes & development 18, 2712-2717.
Hornstein, E., Mansfield, J.H., Yekta, S., Hu, J.K., Harfe, B.D., McManus, M.T., Baskerville, S., Bartel, D.P., and Tabin, C.J. (2005). The microRNA miR-196 acts upstream of Hoxb8 and Shh in limb development. Nature 438, 671-674.
Iimura, T., and Pourquie, O. (2006). Collinear activation of Hoxb genes during gastrulation is linked to mesoderm cell ingression. Nature 442, 568-571.
Iimura, T., Yang, X., Weijer, C.J., and Pourquie, O. (2007). Dual mode of paraxial mesoderm formation during chick gastrulation. Proc Natl Acad Sci U S A 104, 2744-2749.
Ito, K., Nakamura, H., and Watanabe, Y. (2011). Protogenin mediates cell adhesion for ingression and re-epithelialization of paraxial mesodermal cells. Developmental biology 351, 13-24.
Jurberg, A.D., Aires, R., Novoa, A., Rowland, J.E., and Mallo, M. (2014). Compartment-dependent activities of Wnt3a/beta-catenin signaling during vertebrate axial extension. Dev Biol 394, 253-263.
Jurberg, A.D., Aires, R., Varela-Lasheras, I., Novoa, A., and Mallo, M. (2013). Switching axial progenitors from producing trunk to tail tissues in vertebrate embryos. Dev Cell 25, 451-462.
Kramer, A., Green, J., Pollard, J., Jr., and Tugendreich, S. (2014). Causal analysis approaches in Ingenuity Pathway Analysis. Bioinformatics 30, 523-530.
Lagos-Quintana, M., Rauhut, R., Meyer, J., Borkhardt, A., and Tuschl, T. (2003). New microRNAs from mouse and human. RNA 9, 175-179.
Lee, Y.J., McPherron, A., Choe, S., Sakai, Y., Chandraratna, R.A., Lee, S.J., and Oh, S.P. (2010). Growth differentiation factor 11 signaling controls retinoic acid activity for axial vertebral development. Dev Biol 347, 195-203.
Li, Z., Huang, H., Chen, P., He, M., Li, Y., Arnovitz, S., Jiang, X., He, C., Hyjek, E., Zhang, J., et al. (2012). miR-196b directly targets both HOXA9/MEIS1 oncogenes and FAS tumour suppressor in MLL-rearranged leukaemia. Nature communications 3, 688.
Lickert, H., Cox, B., Wehrle, C., Taketo, M.M., Kemler, R., and Rossant, J. (2005). Dissecting Wnt/beta-catenin signaling during gastrulation using RNA interference in mouse embryos. Development 132, 2599-2609.
Mansfield, J.H., Harfe, B.D., Nissen, R., Obenauer, J., Srineel, J., Chaudhuri, A., Farzan-Kashani, R., Zuker, M., Pasquinelli, A.E., Ruvkun, G., et al. (2004). MicroRNA-responsive 'sensor' transgenes uncover Hox-like and other developmentally regulated patterns of vertebrate microRNA expression. Nature genetics 36, 1079-1083.
McGlinn, E., and Mansfield, J.H. (2011). Detection of gene expression in mouse embryos and tissue sections. Methods in molecular biology 770, 259-292.
McGlinn, E., Yekta, S., Mansfield, J.H., Soutschek, J., Bartel, D.P., and Tabin, C.J. (2009).
In ovo application of antagomiRs indicates a role for miR-196 in patterning the chick axial skeleton through Hox gene regulation. Proceedings of the National Academy of Sciences of the United States of America 106, 18610- 18615. McGrew, M.J., Sherman, A., Lillico, S.G., Ellard, F.M., Radcliffe, P.A., Gilhooley, H.J., Mitrophanous, K.A., Cambray, N., Wilson, V., and Sang, H. (2008). Localised axial progenitor cell populations in the avian tail bud are not committed to a posterior Hox identity. Development 135, 2289-2299. McLeod, M.J. (1980). Differential staining of cartilage and bone in whole mouse fetuses by alcian blue and alizarin red S. Teratology 22, 299-301. McPherron, A.C., Lawler, A.M., and Lee, S.J. (1999). Regulation of anterior/posterior patterning of the axial skeleton by growth/differentiation factor 11. Nat Genet 22, 260-264. Mikawa, S., Morozumi, T., Shimanuki, S., Hayashi, T., Uenishi, H., Domukai, M., Okumura, N., and Awata, T. (2007). Fine mapping of a swine quantitative trait locus for number of vertebrae and analysis of an orphan nuclear receptor, germ cell nuclear factor (NR6A1). Genome research 17, 586-593. Moran, Y., Fredman, D., Praher, D., Li, X.Z., Wee, L.M., Rentzsch, F., Zamore, P.D., Technau, U., and Seitz, H. (2014). Cnidarian microRNAs frequently regulate targets by cleavage. Genome Res 24, 651-663. Naiche, L.A., Holder, N., and Lewandoski, M. (2011). FGF4 and FGF8 comprise the wavefront activity that controls somitogenesis. Proc Natl Acad Sci U S A 108, 4018-4023. Neijts, R., Simmini, S., Giuliani, F., van Rooijen, C., and Deschamps, J. (2014). Region- specific regulation of posterior axial elongation during vertebrate embryogenesis. Dev Dyn 243, 88-98. Nowotschin, S., Ferrer-Vaquer, A., Concepcion, D., Papaioannou, V.E., and Hadjantonakis, A.K. (2012). Interaction of Wnt3a, Msgn1 and Tbx6 in neural versus paraxial mesoderm lineage commitment and paraxial mesoderm differentiation in the mouse embryo. Dev Biol 367, 1-14. 
Pollock, R.A., Jay, G., and Bieberich, C.J. (1992). Altering the boundaries of Hox3.1 expression: evidence for antipodal gene regulation. Cell 71, 911-923.
Pollock, R.A., Sreenath, T., Ngo, L., and Bieberich, C.J. (1995). Gain of function mutations for paralogous Hox genes: implications for the evolution of Hox gene function. Proceedings of the National Academy of Sciences of the United States of America 92, 4492-4496.
Prosser, H.M., Koike-Yusa, H., Cooper, J.D., Law, F.C., and Bradley, A. (2011). A resource of vectors and ES cells for targeted deletion of microRNAs in mice. Nature biotechnology 29, 840-845.
Psychoyos, D., and Stern, C.D. (1996). Fates and migratory routes of primitive streak cells in the chick embryo. Development 122, 1523-1534.
Ryan, J.F., and Baxevanis, A.D. (2007). Hox, Wnt, and the evolution of the primary body axis: insights from the early-divergent phyla. Biology direct 2, 37.
Sawada, A., Shinya, M., Jiang, Y.J., Kawakami, A., Kuroiwa, A., and Takeda, H. (2001). Fgf/MAPK signalling is a crucial positional cue in somite boundary formation. Development 128, 4873-4880.
Schroter, C., and Oates, A.C. (2010). Segment number and axial identity in a segmentation clock period mutant. Curr Biol 20, 1254-1258.
Spitz, F., Gonzalez, F., Peichel, C., Vogt, T.F., Duboule, D., and Zakany, J. (2001). Large scale transgenic and cluster deletion analysis of the HoxD complex separate an ancestral regulatory module from evolutionary innovations. Genes & development 15, 2209-2214.
Takada, S., Stark, K.L., Shea, M.J., Vassileva, G., McMahon, J.A., and McMahon, A.P. (1994). Wnt-3a regulates somite and tailbud formation in the mouse embryo. Genes & development 8, 174-189.
Takahashi, M., Fujita, M., Furukawa, Y., Hamamoto, R., Shimokawa, T., Miwa, N., Ogawa, M., and Nakamura, Y. (2002). Isolation of a novel human gene, APCDD1, as a direct target of the beta-Catenin/T-cell factor 4 complex with probable involvement in colorectal carcinogenesis. Cancer research 62, 5651-5656.
Takemoto, T., Uchikawa, M., Yoshida, M., Bell, D.M., Lovell-Badge, R., Papaioannou, V.E., and Kondoh, H. (2011). Tbx6-dependent Sox2 regulation determines neural or mesodermal fate in axial stem cells. Nature 470, 394-398.
Trapnell, C., Hendrickson, D.G., Sauvageau, M., Goff, L., Rinn, J.L., and Pachter, L. (2013). Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature biotechnology 31, 46-53.
Tsakiridis, A., Huang, Y., Blin, G., Skylaki, S., Wymeersch, F., Osorno, R., Economou, C., Karagianni, E., Zhao, S., Lowell, S., et al. (2014). Distinct Wnt-driven primitive streak-like populations reflect in vivo lineage precursors. Development 141, 1209-1221.
van den Akker, E., Fromental-Ramain, C., de Graaff, W., Le Mouellic, H., Brulet, P., Chambon, P., and Deschamps, J. (2001). Axial skeletal patterning in mice lacking all paralogous group 8 Hox genes. Development 128, 1911-1921.
Velu, C.S., Chaubey, A., Phelan, J.D., Horman, S.R., Wunderlich, M., Guzman, M.L., Jegga, A.G., Zeleznik-Le, N.J., Chen, J., Mulloy, J.C., et al. (2014). Therapeutic antagonists of microRNAs deplete leukemia-initiating cell activity. The Journal of clinical investigation 124, 222-236.
Vonk, F.J., Casewell, N.R., Henkel, C.V., Heimberg, A.M., Jansen, H.J., McCleary, R.J., Kerkkamp, H.M., Vos, R.A., Guerreiro, I., Calvete, J.J., et al. (2013). The king cobra genome reveals dynamic gene evolution and adaptation in the snake venom system. Proceedings of the National Academy of Sciences of the United States of America 110, 20651-20656.
Weidinger, G., Thorpe, C.J., Wuennenberg-Stapleton, K., Ngai, J., and Moon, R.T. (2005). The Sp1-related transcription factors sp5 and sp5-like act downstream of Wnt/beta-catenin signaling in mesoderm and neuroectoderm patterning. Current biology : CB 15, 489-500.
Wellik, D.M. (2007). Hox patterning of the vertebrate axial skeleton. Developmental dynamics : an official publication of the American Association of Anatomists 236, 2454-2463.
Wellik, D.M., and Capecchi, M.R. (2003). Hox10 and Hox11 genes are required to globally pattern the mammalian skeleton. Science 301, 363-367.
Yekta, S., Shih, I.H., and Bartel, D.P. (2004). MicroRNA-directed cleavage of HOXB8 mRNA. Science 304, 594-596.
Yekta, S., Tabin, C.J., and Bartel, D.P. (2008). MicroRNAs in the Hox network: an apparent link to posterior prevalence. Nature reviews Genetics 9, 789-796.
Young, T., Rowland, J.E., van de Ven, C., Bialecka, M., Novoa, A., Carapuco, M., van Nes, J., de Graaff, W., Duluc, I., Freund, J.N., et al. (2009). Cdx and Hox genes differentially regulate posterior axial growth in mammalian embryos. Developmental cell 17, 516-526.
Zakany, J., Gerard, M., Favier, B., and Duboule, D. (1997). Deletion of a HoxD enhancer induces transcriptional heterochrony leading to transposition of the sacrum. EMBO J 16, 4393-4402.
Zhang, Z., O'Rourke, J.R., McManus, M.T., Lewandoski, M., Harfe, B.D., and Sun, X. (2011). The microRNA-processing enzyme Dicer is dispensable for somite segmentation but essential for limb bud positioning. Developmental biology 351, 254-265.

Figures and figure legends

Fig 1. Unique and overlapping expression patterns of miR-196 paralogs in mouse. (A) Mouse Hox clusters, with the position of Hox-embedded microRNAs depicted. Predicted Hox targets of the miR-196 family are indicated in blue. (B-K) Detection of eGFP transcripts in miR-196a1GFP/+ (B-F) and miR-196a2GFP/+ (G-K) embryos demonstrates spatio-temporal expression differences for these identical miRNAs. Embryonic age is indicated; red and white arrowheads indicate the anterior boundary of somitic and neural expression, respectively. Arrows in (F,G) indicate weak ventral expression in miR-196a2GFP/+ embryos. Inset in (D,E) indicates reduced eGFP signal in the anterior PSM of 196a1GFP/+ embryos.

Fig 2. miR-196 paralogs function in establishing vertebral identity and number in mouse. (A) Identification of vertebral patterning defects in individual and compound mir-196 loss-of-function E18.5 embryos. Genotypes indicated. The positions of the 13th thoracic element (T13) and first sacral element (S1) are labelled. Inset displays the thoracic-lumbar junction. (B) Individual vertebra analysis to demonstrate identity alterations at the thoraco-lumbar and lumbo-sacral junctions. Genotypes indicated. The position of a rib-like nubbin on lumbar elements is marked with an arrow. The position of the sacral process is marked with an asterisk. (C) Rib fusion defects observed following loss of miR-196 alleles; genotypes indicated. Fusion of the 8th rib to the sternum was unilateral or bilateral, as indicated with arrows. (D) Summary of patterning defects identified across the miR-196 allelic series. An asterisk indicates a homeotic transformation of that vertebral element. (E) Quantification of vertebral number in single and compound mir-196 loss-of-function E18.5 embryos identifies a role for miR-196 in controlling axis length in mouse. Statistical comparisons of vertebral number relative to wildtype were performed using a permutation test, with P values corrected for multiple hypothesis testing using the Bonferroni method; * P < 0.05, *** P < 0.001, **** P < 0.0001.

Fig 3. Whole transcriptome analysis of miR-196 mutant cells reveals a dysregulation of miRNA targets and skeletal genes. (A) Overview of the experimental and computational strategy used to identify global transcriptome alterations following loss of miR-196 function. (B) Mean fold changes of genes associated with predicted targets of miR-196, partitioned into four context+ intervals according to predicted miRNA targeting efficacy (-0.2 < context+ < 0, n=2112; -0.3 < context+ ≤ -0.2, n=145; -0.4 < context+ ≤ -0.3, n=50; context+ ≤ -0.4, n=37), across seven genotype comparisons. Statistical comparison of observed up-regulation of genes relative to genes with no miRNA target site, as evaluated by a one-sided Kolmogorov-Smirnov (K-S) test; * P < 0.05, ** P < 0.001. (C) Top 10 significant categories related to gene development and function associated with differentially expressed genes. (D) Top 15 categories related to “skeletal and muscular development” activated in the 196a2–/–;196b–/– vs 196a2-/+ comparison, with corresponding activation z-scores and P values. An activation z-score is a measurement of the consistency between the observed pattern of up- and down-regulation of genes in a category and the predicted activation or inhibition pattern in networks stored in the Ingenuity Knowledgebase relative to a random pattern (Kramer et al., 2014). P values in (C,D) are Benjamini-Hochberg corrected P values, with dashed black lines indicating a significance threshold of 0.01.

Fig 4. Loss of miR-196 function alters global Hox signatures. (A) Extensive Hox gene dysregulation is identified following loss of miR-196. Quantitative expression analysis of all 39 Hox genes in cells isolated from E9.5 mutant embryos; genotype comparisons are color-coded. Hox genes with one or more predicted miR-196 target binding sites are indicated in red. Filled circles at the tips of fold changes represent a statistically significant change at q < 0.05. (B,C) WISH analysis of miR-196a2GFP/GFP;miR-196b–/– E9.5 embryos relative to wildtype identifies a caudal expansion of Hoxb8 (B; n=3/3) and Hoxc8 (C; n=3/3). Expression within the PSM is indicated with a red line/arrowhead, neural tube expression with a white arrowhead.

Fig 5. Identification of additional putative direct (non-Hox) miR-196 targets. (A) List of the most highly upregulated genes, and their associated fold changes in seven genotype comparisons, that either: i) contain a conserved miR-196 binding site, or ii) are predicted to respond strongly to the miRNA (i.e., have a context+ score ≤ -0.2). Genes with one or more conserved miR-196 target binding sites are indicated in green. (B) In vitro luciferase analysis confirms sequence-specific regulation of 3 experimentally supported target genes of miR-196. Renilla luciferase intensity values have been normalized to their respective Firefly values (RLU). Controls (wildtype 3′ UTR construct without miR-196b) were set to 1. MUT: mutated 3′ UTR construct destroying the miR-196 binding site. Error bars represent standard deviation. P values, Student's t-test, * P < 0.05, *** P < 0.0005, **** P < 0.0001.

Fig 6. Loss of miR-196 function alters signaling pathways known to control segmentation, axis elongation and the trunk-to-tail transition. (A) Quantitative expression analysis of pathways known to control segmentation and axial extension in cells isolated from E9.5 mutant embryos; genotype comparisons are color-coded. Filled circles at the tips of fold changes represent a statistically significant change at q < 0.05.

Fig 7. miR-196 has the potential to regulate Wnt signaling by both direct and indirect mechanisms. (A) WISH analysis confirms increased Dkk1 in 196a1–/–;196a2–/– E9.5 embryos relative to 196a1-/+;196a2-/+ (n = 2/2 for each genotype). (B) In vitro luciferase assay confirms sequence-specific regulation of Dkk1 by miR-196. Renilla luciferase intensity values have been normalized to their respective Firefly values (RLU). Controls (wildtype 3′ UTR construct without miR-196b) were set to 1. MUT: mutated 3′ UTR construct destroying the miR-196 binding site. (C) Luciferase assay measuring Wnt/β-catenin activity after over-expression of BATLuc together with CMV-Renilla and either control, Hoxb1, Hoxa5, Hoxa7, Hoxb7, Hoxb8, or Hoxc8; n=4-9 samples per gene assessed. Firefly luciferase intensity values have been normalized to their respective Renilla values (RLU). Control values were set to 1. In (B) and (C), error bars represent standard deviation. Reported P values are from the Student's t-test, * P < 0.05, ** P < 0.005, *** P < 0.0005, **** P < 0.0001.

Supplementary Fig S1. Generation of miR-196a1GFP and miR-196a2GFP knock-in mouse lines. (A) miR-196a1GFP knock-in targeting strategy and (B) confirmation of correct targeting by Southern blot analysis of BstZ17I genomic digestion. The position of the Southern probe is indicated with a blue box in (A). (C) miR-196a2GFP knock-in targeting strategy and (D) confirmation of correct targeting by Southern blot analysis of Swa1 genomic digestion. The position of the Southern probe is indicated with a blue box in (C).

Supplementary Fig S2. Generation of miR-196a1–/–, miR-196a2–/– and miR-196b–/– knockout mouse lines. (A) Generalized targeting strategy employed by the Wellcome Trust Sanger Institute to create miRNA knockout ES cells (Prosser et al., 2011). Prior to ES cell injection, correct targeting was confirmed in house by Southern blot analysis of the miR-196a1–/– (B), miR-196a2–/– (C) and miR-196b–/– (D) loci. The general Southern blot strategy is indicated in blue in (A).

Supplementary Fig S3. Summary of vertebral patterning alterations observed in miR-196 single and compound mutant mice. A cartoon summary of the main patterning defects observed in miR-196 mutant mice; homeotic transformations of the wildtype axial formula are marked with an asterisk. The numbers of skeletons analyzed for each genotype and their phenotypic spectrum are indicated.

Supplementary Fig S4. Predicted miRNA target genes are up-regulated upon the loss of miR-196. (A-G) Cumulative density plots of the fold changes of genes predicted as targets of miR-196, partitioned into four context+ intervals according to increasing predicted miRNA targeting efficacy (-0.2 < context+ < 0, n=2112; -0.3 < context+ ≤ -0.2, n=145; -0.4 < context+ ≤ -0.3, n=50; context+ ≤ -0.4, n=37), and genes with no predicted target site (n=6924), across seven genotype comparisons. The P values indicate a statistical comparison of the observed de-repression of genes relative to genes with no miRNA target site, as evaluated by a one-sided Kolmogorov-Smirnov (K-S) test.

Supplementary Fig S5. Significant functional categories associated with differentially expressed genes. (A-G) All significant categories related to gene development and function associated with differentially expressed genes, across seven genotype comparisons. All P values are Benjamini-Hochberg corrected, with dashed black lines indicating a significance threshold of 0.01.

Supplementary Fig S6. Inference of upstream regulators reveals a downregulation of Wnt activity. (A) Upstream regulators inferred by Ingenuity Pathway Analysis as being dysregulated based upon the behavior of differentially expressed genes in three genotype comparisons. Activation z-scores and P values are computed as described in Figure 3D. As a positive control, miR-196 is correctly inferred as the most significant miRNA to have diminished activity. β-catenin/CTNNB1 (Wnt) activity is also predicted to diminish with the loss of miR-196; in contrast, MYCN, MYC and SRF activity is predicted to increase. (B) Network of upstream and downstream interactions in the Ingenuity Knowledgebase that were used to infer decreased Wnt activity in the 196a2–/–;196b–/– vs 196a2-/+ comparison. Genes are shaded according to their observed up- or down-regulation in this comparison.

Supplementary Table 1. Removal of miR-196 family members causes axial patterning defects. Summary of vertebral malformations and vertebral transformations identified in single and compound miR-196 knockout mice.

Supplementary Table 2. RNA-seq library statistics. Summary of read mapping statistics associated with the 44 RNA-seq samples generated in this study, including total number of reads sequenced per sample and total mapped to the mouse genome (mm10).
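The one-sided K-S comparison named in the legends of Fig 3B and Supplementary Fig S4 can be illustrated with a short sketch. This is not the analysis code behind the figures; it is a minimal pure-Python version under stated assumptions: the function name and inputs are hypothetical, and the P value uses the standard large-sample approximation for the one-sided two-sample statistic rather than an exact calculation. It asks whether the fold changes of predicted targets are shifted toward higher values than those of genes with no site.

```python
import math
from bisect import bisect_right

def ks_one_sided_upshift(targets, background):
    """One-sided two-sample K-S test (hypothetical helper).

    Alternative hypothesis: values in `targets` tend to be larger than
    values in `background`, i.e. the empirical CDF of `targets` lies
    below that of `background`.
    Returns (D, approximate P value).
    """
    t = sorted(targets)
    b = sorted(background)
    n, m = len(t), len(b)
    d = 0.0
    # Evaluate both empirical CDFs at every observed value.
    for x in sorted(set(t) | set(b)):
        cdf_t = bisect_right(t, x) / n
        cdf_b = bisect_right(b, x) / m
        d = max(d, cdf_b - cdf_t)
    # Large-sample approximation: P(D+ >= d) ~ exp(-2 * d^2 * nm / (n + m)).
    p = math.exp(-2.0 * d * d * n * m / (n + m))
    return d, min(1.0, p)

# Toy log2 fold changes (made up for illustration only).
site_genes = [1.0, 2.0, 3.0, 4.0]
no_site_genes = [-1.0, -0.5, 0.0, 0.5]
d_stat, p_value = ks_one_sided_upshift(site_genes, no_site_genes)
```

With real data, one call per context+ interval and genotype comparison would give the grid of P values that the legends summarize with asterisks.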
[Figure 3 (Wong et al.): image not reproduced. (A) Workflow: FACS sort of GFP cells, RNA-seq of biological replicates per genotype, differential expression analysis (Cuffdiff), pathway analysis (Ingenuity), miR target analysis (TargetScan). (B) Mean fold change (log2) by context+ interval. (C) Functional categories plotted against -log10(P value). (D) Skeletal and muscular categories plotted against activation z-score and -log10(P value).]

[Figure S1 (Wong et al.): image not reproduced. Targeting schematics (wildtype locus, targeting construct, targeted allele, Neo removal) and Southern blots (wildtype vs targeted) for miR-196a1 (BstZ17I digest) and miR-196a2 (Swa1 digest).]

[Figure S2 (Wong et al.): image not reproduced. (A) Generalized targeting schematic (wildtype locus, targeting construct, targeted allele, selection removal). (B-D) Southern blots confirming targeting of miR-196a1 (Hpa1 digest), miR-196a2 (HindIII digest) and miR-196b (BspH1 digest).]

[Figure S3 (Wong et al.): image not reproduced. Numbers of skeletons analyzed per genotype. Wildtype, n=47. Single mutants: 196a1-/-, n=28; 196a2-/-, n=21; 196b-/-, n=26. Double mutants: 196a1-/-;196a2-/-, n=18; 196a2-/-;196ab-/-, n=13; 196a1-/-;196b-/-, n=5. Triple knockout allelic series: 196a1+/-;196a2+/-;196b+/-, n=11; 196a1+/-;196a2+/-;196b-/-, n=12; 196a1+/-;196a2-/-;196b+/-, n=7; 196a1+/-;196a2-/-;196b-/-, n=8; 196a1-/-;196a2+/-;196b+/-, n=7; 196a1-/-;196a2+/-;196b-/-, n=6; 196a1-/-;196a2-/-;196b+/-, n=3; 196a1-/-;196a2-/-;196b-/-, n=3. Phenotype classes: wildtype; L1-to-T; L1-to-T with posterior sacral displacement; L1-to-T and L2-to-T with posterior sacral displacement; anterior sacral displacement with or without T13 rib reduction.]

[Figure S4 (Wong et al.): image not reproduced. Panels A-G plot cumulative fraction against fold change (log2) for the comparisons a1-/- vs a1-/+, a2-/- vs a2-/+, a1-/-a2-/- vs a1-/+a2-/+, a1-/+a2-/- vs a1-/+a2-/+, a2-/+b-/- vs a2-/+, a2-/-b-/+ vs a2-/+, and a2-/-b-/- vs a2-/+.]

[Figure S5 (Wong et al.): image not reproduced. Panels A-G plot functional categories against -log10(P value) for the same seven genotype comparisons.]

[Figure S6 (Wong et al.): image not reproduced. (A) Inferred upstream regulators (CTNNB1, miR-196, SRF, MYC, MYCN) plotted against activation z-score and -log10(P value). (B) Network of 38 genes.]

[Supplementary Tables 1 and 2: not reproduced.]
Chapter 4. Future Directions

Quantitative models of miRNA targeting in Drosophila

Though much work has been done to understand the determinants of miRNA target recognition that enhance prediction in mammals, relatively little has been done in other clades of animal life, including important invertebrate model organisms such as the worm (C. elegans) and the fruit fly (D. melanogaster). Indeed, only a handful of miRNA target prediction models exist for these clades, and many are based purely upon evolutionary information, which captures little about the strength of repression conferred. Understanding the similarities and differences in these key model organisms relative to mammals would have several benefits: i) it would be interesting from an evolutionary perspective, providing a glimpse into the fundamental principles of miRNA targeting common to animals, and ii) it would give insight into the construction of gene regulatory networks in each of these species, which would aid in the interpretation of gene expression data and molecular pathways perturbed in different experimental conditions.

To elucidate the principles of miRNA target recognition in mammals, the mammalian miRNA field has benefited greatly from its ability to generate gene expression datasets derived from miRNA transfections in cell culture. While it is difficult to culture cell lines derived from the worm, such limitations do not exist for the fly due to the availability of cultured S2 cells derived from D. melanogaster (Schneider, 1972).
To explore the features associated with effective miRNA targeting in the fly, we performed a series of six miRNA transfection experiments in cultured S2 cells, quantifying the abundance of all expressed transcripts by RNA sequencing relative to the corresponding abundance in mock-transfected cells (using a scheme emulating the one illustrated in Figure 5A, pg. 23). We then computed log2 fold changes in mRNA levels between miRNA-transfected cells and day-matched mock transfections for all detectable genes. Initial analyses of these data confirmed that Drosophila employs at least five canonical target sites (i.e., 8mer, 7mer-m8, 7mer-A1, 6mer, and offset 6mer sites) resembling those of mammals and that the hierarchy of effectiveness of these sites also parallels that of mammals (as illustrated in Figure 5B, pg. 23). Preliminary efforts to characterize features associated with repression reveal that the determinants guiding site efficacy in Drosophila seem to be only a subset of those detected as being important in mammals, with RNA structural accessibility and 3′ UTR length being chosen most consistently as features useful for prediction.

As an orthogonal means of evaluating the usage of these canonical sites in the transcriptome, I investigated site conservation, which required extension of our comparative sequence methods to the insect clade (which consists of 12 species of Drosophila as well as three other insect species). This analysis revealed that at least 11,000 miRNA–target interactions have been selectively conserved in fly 3′ UTRs. It remains to be determined whether the computation of site conservation (PCT) values will improve the ability of a regression model to discern effective miRNA target sites.
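The five canonical site types named above can be made concrete with a small sketch. It is an illustration, not the pipeline used for these analyses, and it assumes the standard definitions: a 6mer pairs with miRNA nucleotides 2-7, a 7mer-m8 adds a match to nucleotide 8, a 7mer-A1 adds an A opposite miRNA position 1, an 8mer has both, and an offset 6mer pairs with nucleotides 3-8. The function names are hypothetical, and the let-7a sequence is used only because its seed is widely known; the text does not say which six miRNAs were transfected.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def revcomp_rna(seq):
    """Reverse complement of an RNA sequence (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def classify_sites(utr, mirna):
    """Return (position, site_type) for each canonical site in `utr`.

    `mirna` and `utr` are written 5'->3'. The position is the 0-based
    index of the 6-nt core match within the UTR. Each location is
    reported once, under its most specific site type.
    """
    utr = utr.upper().replace("T", "U")
    mirna = mirna.upper().replace("T", "U")
    six = revcomp_rna(mirna[1:7])      # pairs with miRNA nt 2-7
    m8_base = revcomp_rna(mirna[7])    # pairs with miRNA nt 8
    offset = revcomp_rna(mirna[2:8])   # pairs with miRNA nt 3-8
    sites = []
    for i in range(len(utr) - 5):
        window = utr[i:i + 6]
        if window == six:
            has_m8 = i > 0 and utr[i - 1] == m8_base
            has_a1 = i + 6 < len(utr) and utr[i + 6] == "A"
            if has_m8 and has_a1:
                kind = "8mer"
            elif has_m8:
                kind = "7mer-m8"
            elif has_a1:
                kind = "7mer-A1"
            else:
                kind = "6mer"
            sites.append((i, kind))
        elif window == offset and utr[i + 1:i + 7] != six:
            # Skip offset matches that are part of a 7mer-m8 or 8mer.
            sites.append((i, "offset 6mer"))
    return sites

# Example miRNA for illustration only (let-7a).
LET7A = "UGAGGUAGUAGGUUGUAUAGUU"
example_sites = classify_sites("AAACUACCUCAGGG", LET7A)
```

Grouping per-gene log2 fold changes by the site type found in each 3′ UTR would then give the kind of efficacy hierarchy described in the text.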
Collectively, I find that a core set of features informative for prediction is common to both flies and mammals, although the poor support in the fly for features that are important for prediction in mammals implies that several principles of miRNA targeting have diverged between the two clades.

Conservation of miRNA targeting networks among bilaterians

Several studies have identified 34 ancient miRNA families common to most bilaterian organisms (Figure 1, pg. 13) (Grimson et al., 2008; Wheeler et al., 2009). Despite evidence that these ancient miRNAs have conserved spatiotemporal dynamics in early animal development (Christodoulou et al., 2010), few efforts have attempted to determine whether they participate in similar regulatory networks across the large evolutionary timespans encompassing bilaterian life. A previous study that attempted to uncover ancient miRNA–target relationships among the vertebrate, fly, and worm clades failed to detect many such examples (Chen and Rajewsky, 2006), but was potentially limited for the following reasons: i) poor annotation of ancient miRNAs, ii) poor annotation of the 3′ UTRs of sequenced human, fly, and worm genomes, iii) limited methods for reconstructing orthologous/paralogous relationships across clades, and iv) the restricted ability to identify conserved miRNA target sites within clades due to lower quality multiple sequence alignments. These limitations provided the motivation to revisit these questions using enhanced methods of defining orthologous relationships among bilaterian gene families (Wu et al., 2014).

Having compiled a list of conserved miRNA target sites in the 3′ UTRs of the worm, fly, and vertebrate clades, I sought to estimate the number of ancient sites that have persisted in orthologous genes since the bilaterian ancestor arose ~600 million years ago.
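An estimate of this kind is usually a count in excess of a matched background. The following is a minimal sketch of that idea, not the code used in this work: control genes are resampled from the same covariate bin as each query gene, shared sites are counted in each resampled list, and the observed count is compared against the resulting distribution. All names are hypothetical, and the binning of covariates and the add-one empirical P value are assumptions made for illustration.

```python
import random

def matched_background_counts(query_genes, control_pool, bin_of,
                              has_conserved_site, n_samples=1000, seed=0):
    """Count sites in resampled control lists matched to the query genes.

    `bin_of` maps each gene to a covariate bin (for example, a tuple of
    binned covariate values); `has_conserved_site` maps each control
    gene to True/False. Every query gene needs at least one control in
    its bin, otherwise a KeyError is raised.
    """
    rng = random.Random(seed)
    pool_by_bin = {}
    for gene in control_pool:
        pool_by_bin.setdefault(bin_of[gene], []).append(gene)
    counts = []
    for _ in range(n_samples):
        # One matched control per query gene, drawn with replacement.
        sample = [rng.choice(pool_by_bin[bin_of[g]]) for g in query_genes]
        counts.append(sum(has_conserved_site[g] for g in sample))
    return counts

def excess_over_background(observed, counts):
    """Return (observed minus mean background, empirical one-sided P)."""
    background = sum(counts) / len(counts)
    p = (1 + sum(c >= observed for c in counts)) / (1 + len(counts))
    return observed - background, p
```

In this scheme the reported number of deeply conserved sites corresponds to the excess term, and the P value reflects how often a matched control list reaches the observed count by chance.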
Using a bootstrapping technique to generate 1000 sampled ortholog lists matched for 3′ UTR length, A/U content, and conservation rate, I have estimated there to be approximately 13 deeply conserved sites (p < 0.002) shared among the bilaterians and 19 (p < 0.047) shared among the protostomes (fly and worm clades). The paucity of such sites reinforces the model suggested by Chen and Rajewsky (2006), that there has been extensive rewiring of the miRNA networks across these three major clades of bilaterians.

Although there appear to be few miRNA–target relationships preserved from the common ancestor of bilaterians, these results must be interpreted in the context of sampling bias among the species used in the analysis. In particular, the fly and worm clades may not be representative of the ancestral state as they have undergone massive gene loss and genome compaction, with as many as 10% of ancestral genes having been lost in these species (Raible and Arendt, 2004). A better approach may be to utilize species that are early-branching bilaterians with a slow rate of molecular evolution, such as the acoel (Hofstenia miamia) and planarian (Schmidtea mediterranea), which would be more representative of the ancestral state of bilaterians (Srivastava et al., 2014). Future work would thus aim to extend the search for such ultraconserved sites to deep-branching phyla within the bilaterians, improving the annotation of miRNAs and 3′ UTRs in these species as a prerequisite to further analysis.

References

Chen, K., and Rajewsky, N. (2006). Deep conservation of microRNA-target relationships and 3'UTR motifs in vertebrates, flies, and nematodes. Cold Spring Harb Symp Quant Biol 71, 149-156.
Christodoulou, F., Raible, F., Tomer, R., Simakov, O., Trachana, K., Klaus, S., Snyman, H., Hannon, G.J., Bork, P., and Arendt, D. (2010). Ancient animal microRNAs and the evolution of tissue identity. Nature 463, 1084-1088.
Grimson, A., Srivastava, M., Fahey, B., Woodcroft, B.J., Chiang, H.R., King, N., Degnan, B.M., Rokhsar, D.S., and Bartel, D.P. (2008). Early origins and evolution of microRNAs and Piwi-interacting RNAs in animals. Nature 455, 1193-1197.
Raible, F., and Arendt, D. (2004). Metazoan evolution: some animals are more equal than others. Curr Biol 14, R106-108.
Schneider, I. (1972). Cell lines derived from late embryonic stages of Drosophila melanogaster. J Embryol Exp Morphol 27, 353-365.
Srivastava, M., Mazza-Curll, K.L., van Wolfswinkel, J.C., and Reddien, P.W. (2014). Whole-body acoel regeneration is controlled by Wnt and Bmp-Admp signaling. Curr Biol 24, 1107-1113.
Wheeler, B.M., Heimberg, A.M., Moy, V.N., Sperling, E.A., Holstein, T.W., Heber, S., and Peterson, K.J. (2009). The deep evolution of metazoan microRNAs. Evol Dev 11, 50-68.
Wu, Y.-C., Bansal, M.S., Rasmussen, M.D., Herrero, J., and Kellis, M. (2014). Phylogenetic Identification and Functional Characterization of Orthologs and Paralogs across Human, Mouse, Fly, and Worm.

Appendix 1. Global analysis of the effect of different cellular contexts on microRNA targeting

Jin-Wu Nam1,2,3,4,9, Olivia S. Rissland1,2,3,9, David Koppstein1,2,3, Cei Abreu-Goodger5,6, Calvin Jan1,2,3, Vikram Agarwal1,2,7, Muhammed A. Yildirim1,2,3, Antony Rodriguez5,8, and David P. Bartel1,2,3

1Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
2Howard Hughes Medical Institute
3Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
4Graduate School of Biomedical Science and Engineering, Hanyang University, Seoul, Korea.
5Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, 77030 USA
6Current address: Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), CINVESTAV, Irapuato, Guanajuato, México
7Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
8Current address: Department of Physical Therapy, University of Texas Medical Branch Galveston, 301 University Blvd, Galveston, Texas, 77555
9These authors contributed equally to this work

V.A. helped devise the weighted context+ model. J.W.N., C.A.G., and M.A.Y. performed computational analyses. O.S.R. performed microRNA transfections. D.K. and C.J. generated 3P-Seq libraries in mouse and human samples, respectively. A.R. created knockout mice. J.W.N., O.S.R., and D.P.B. designed the study. J.W.N., O.S.R., and D.P.B. wrote the manuscript.

Published as: Nam J-W, Rissland OS, Koppstein D, Abreu-Goodger C, Jan CH, Agarwal V, Yildirim MA, Rodriguez A, Bartel DP. "Global analysis of the effect of different cellular contexts on microRNA targeting". 2014. Molecular Cell 53(6):1031-43.

Molecular Cell Resource

Global Analyses of the Effect of Different Cellular Contexts on MicroRNA Targeting

Jin-Wu Nam,1,2,3,4,8 Olivia S. Rissland,1,2,3,8 David Koppstein,1,2,3 Cei Abreu-Goodger,5 Calvin H. Jan,1,2,3 Vikram Agarwal,1,2,6 Muhammed A. Yildirim,1,2,3 Antony Rodriguez,7,9 and David P.
Bartel1,2,3,*
1Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
2Howard Hughes Medical Institute
3Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
4Department of Life Science, College of Natural Science and Graduate School of Biomedical Science and Engineering, Hanyang University, Seoul 133-791, Korea
5Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), CINVESTAV, Irapuato, Guanajuato 36824, México
6Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
7Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
8These authors contributed equally to this work
9Present address: Department of Physical Therapy, University of Texas Medical Branch Galveston, 301 University Boulevard, Galveston, TX 77555, USA
*Correspondence: dbartel@wi.mit.edu
http://dx.doi.org/10.1016/j.molcel.2014.02.013

SUMMARY

MicroRNA (miRNA) regulation clearly impacts animal development, but the extent to which development, with its resulting diversity of cellular contexts, impacts miRNA regulation is unclear. Here, we compared cohorts of genes repressed by the same miRNAs in different cell lines and tissues and found that target repertoires were largely unaffected, with secondary effects explaining most of the differential responses detected. Outliers resulting from differential direct targeting were often attributable to alternative 3′ UTR isoform usage that modulated the presence of miRNA sites. More inclusive examination of alternative 3′ UTR isoforms revealed that they influence ~10% of predicted targets when comparing any two cell types. Indeed, considering alternative 3′ UTR isoform usage improved prediction of targeting efficacy significantly beyond the improvements observed when considering constitutive isoform usage.
Thus, although miRNA targeting is remarkably consistent in different cell types, considering the 3′ UTR landscape helps predict targeting efficacy and explain differential regulation that is observed.

INTRODUCTION

The control of gene output can be complex, with opportunities for regulation at each step of mRNA production, processing, localization, translation, and turnover. A widespread type of posttranscriptional control is that mediated by microRNAs (miRNAs) (Bartel, 2009). By base-pairing with complementary sites in their targets, miRNAs direct the repression of mRNAs, primarily through mRNA destabilization (Baek et al., 2008; Guo et al., 2010; Hendrickson et al., 2009). With each family of miRNAs capable of targeting messages from hundreds of genes, and over half of the human transcriptome containing preferentially conserved miRNA sites (Friedman et al., 2009), miRNAs are expected to impact essentially every mammalian developmental process and human disease.

Central for understanding this pervasive mode of genetic control is understanding miRNA-target interactions. One factor affecting the efficacy of miRNA-target interactions is the miRNA site type. Site types are primarily classified based on the extent to which they match the 5′ region of the miRNA. 6mer sites perfectly pair to only the miRNA seed (nucleotides 2–7 of the miRNA) and typically confer marginal repression, at best. Seed pairing can be augmented with an adenosine opposite miRNA nucleotide 1 or a Watson-Crick pair with miRNA nucleotide 8, giving a 7mer-A1 or 7mer-m8 site, respectively; sites augmented with both the adenosine and the match to nucleotide 8 are 8mer sites (Grimson et al., 2007; Lewis et al., 2005). On average, 8mer sites are more efficacious than 7mer-m8 sites, which are more efficacious than 7mer-A1 sites, with supplemental pairing to the 3′ region of the miRNA marginally increasing efficacy of each site type (Grimson et al., 2007).
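The site-type definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the TargetScan implementation: the function names, the 0-based coordinate convention, and the example sequences are assumptions introduced here, and the miRNA sequence used in the usage note is supplied purely for illustration.

```python
# Illustrative sketch (not TargetScan code): classify canonical seed-match sites.
# All names and coordinate conventions here are assumptions made for this example.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string written 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]


def find_canonical_sites(mirna, utr):
    """Return a list of (0-based start of the 6-nt seed match, site type)."""
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    seed_match = reverse_complement(mirna[1:7])  # pairs with miRNA nt 2-7
    m8_base = reverse_complement(mirna[7])       # pairs with miRNA nt 8
    sites = []
    start = utr.find(seed_match)
    while start != -1:
        # The UTR base 5' of the seed match faces miRNA nt 8;
        # the UTR base 3' of the seed match faces miRNA nt 1.
        has_m8 = start >= 1 and utr[start - 1] == m8_base
        has_a1 = start + 6 < len(utr) and utr[start + 6] == "A"
        if has_m8 and has_a1:
            site_type = "8mer"
        elif has_m8:
            site_type = "7mer-m8"
        elif has_a1:
            site_type = "7mer-A1"
        else:
            site_type = "6mer"
        sites.append((start, site_type))
        start = utr.find(seed_match, start + 1)
    return sites
```

For example, with an illustrative miRNA beginning UAAGGCAC, `find_canonical_sites("UAAGGCACGCGGUGAAUGCC", "AAGUGCCUUAGGGUGCCUUCGG")` reports one 8mer and one 7mer-m8. Supplemental 3′ pairing, 3′ compensatory sites, and centered sites are deliberately left out of this sketch.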
Two other site types are effective but so rare that together they are thought to constitute less than 1% of all targeting; these are 3′ compensatory sites (Bartel, 2009) and centered sites (Shin et al., 2010). Offset 6mer sites and each of the more recently proposed noncanonical site types (Betel et al., 2010; Chi et al., 2012; Helwak et al., 2013; Khorshid et al., 2013; Loeb et al., 2012; Majoros et al., 2013) are either not effective or less effective than 6mer sites (Friedman et al., 2009) (V.A. and D.P.B., unpublished data).

Molecular Cell 53, 1031–1043, March 20, 2014 ©2014 Elsevier Inc.

Early target predictions considered only the number and type of sites to rank predictions and thus had to rely on site conservation to refine the rankings (Bartel, 2009). However, the same site can be much more effective in the context of one mRNA than it is in the context of another; identifying and considering these context features surrounding the miRNA site can improve target predictions (Grimson et al., 2007; Gu et al., 2009; Kertesz et al., 2007; Nielsen et al., 2007). As part of the context model, three context features were originally used to improve the TargetScan algorithm: (1) the local AU content of the sequence surrounding the site (presumably a measure of occlusive secondary structure), (2) the distance between the site and the closest 3′ UTR end, and (3) whether or not the site lies in the path of the ribosome (Grimson et al., 2007). With these features of UTR context in the model, effective sites could be predicted above the false positives without considering the evolutionary conservation of the site (Baek et al., 2008; Grimson et al., 2007).
Additional improvements came with development of the context+ model, which incorporated two features of the miRNA seed region: (1) the predicted stability of matches to the seed region, which correlated with efficacy, and (2) the number of matches to the seed region within the 3′ UTRs of the transcriptome, which inversely correlated with efficacy (Garcia et al., 2011).

Despite the advances of the past decade that have come from defining the site types and building models of miRNA-targeting efficacy that consider (1) the influences of site type and number, (2) the 3′ UTR context of the site, and (3) certain miRNA properties, the accuracy of miRNA-target predictions still has substantial room for improvement. One consideration currently ignored in miRNA-targeting models is the potential influence of different biological and cellular contexts. Although predictions for miRNAs or mRNAs that are not present in the cell can be easily disregarded, other influences of cellular context are undoubtedly exerting effects in ways that compromise prediction utility.

One way that cellular context can exert its effect is through differential expression of mRNA-binding proteins, which can either increase or decrease the efficacy of miRNA sites. For instance, binding of Pumilio increases miRNA-mediated repression in the 3′ UTRs of the p27 and E2F3 mRNAs (Kedde et al., 2010; Miles et al., 2012), whereas Dnd1 binding occludes miRNA target sites to relieve miRNA-mediated repression of nanos and tdrd7 mRNAs (Kedde et al., 2007). These examples could represent just the tip of the iceberg, as the extent to which differential expression of such trans-acting factors affects miRNA targeting in different cell types has not been investigated across the transcriptome.

Another consideration largely ignored in miRNA target predictions is the impact of alternative 3′ UTR isoforms, which are generated through alternative cleavage and polyadenylation (APA).
For example, mRNAs with the same open reading frame (ORF) often have tandem UTR isoforms in which APA at proximal or distal poly(A) sites generates shorter or longer 3′ UTRs, respectively (Miyamoto et al., 1996; Tian et al., 2005). Regulatory elements, such as miRNA sites, in the commonly included (or "constant") region are present in both short and long isoforms, but those in the alternatively included (or "variable") region are present only in the long isoform, and thus a cell-type-specific shift in APA results in a corresponding shift in isoforms responding to the regulation (Ji et al., 2009; 2011; Mayr and Bartel, 2009; Sandberg et al., 2008; Ulitsky et al., 2012). Development of high-throughput poly(A)-site mapping techniques, such as 3P-seq (poly[A]-position profiling by sequencing; Jan et al., 2011), has allowed quantitative and precise detection of alternative 3′ UTR usage within a sample as well as differences over the course of development (Derti et al., 2012; Hoque et al., 2013; Jan et al., 2011; Lianoglou et al., 2013; Shepard et al., 2011; Ulitsky et al., 2012; Spies et al., 2013). Efforts to predict miRNA targets are only beginning to incorporate this information. For example, when predicting mammalian targets, the most recent version of TargetScan still considers only the longest annotated 3′ UTR isoform of each gene. When predicting nematode and zebrafish targets, TargetScan predicts the targeting of each 3P-seq-annotated UTR isoform but does not consider the relative abundance of each isoform when ranking these predictions.

The studied examples of differential expression of RNA-binding proteins and differential usage of 3′ UTR isoforms imply that these, or perhaps other phenomena, might broadly influence the impact of miRNAs, causing the targets of a miRNA to substantially differ in two different cellular contexts, even when only considering mRNAs expressed in both cell types.
Genome-wide studies of transcription factor binding show that cell type can influence transcriptional regulation (Cooper et al., 2007; Farnham, 2009), but global effects of cellular context on miRNA regulation or other forms of posttranscriptional regulation have not been reported. Understanding the frequency and magnitude of these effects is important for understanding the degree to which miRNA regulation itself is regulated. Knowing the extent to which experimental observations from one cell type can be extrapolated to another also has practical value for placing miRNAs into gene regulatory networks. For example, the heterologous reporter assay (in which the 3′ UTR of a suspected target is appended to a reporter gene and tested for its response to the miRNA, with and without mutation of the putative miRNA-binding sites) is a workhorse for testing the plausibility of proposed miRNA-target interactions, but its utility would be diminished if the sites that mediate repression in one cell type do not reliably do so in other cell types.

To begin to explore the frequency and magnitude of cell-type-specific effects on miRNA-mediated repression, we introduced the same miRNAs into three different human cell lines and monitored mRNA changes by RNA-seq. We also analyzed the effects of miRNA loss in different mouse and zebrafish tissues and stages. Most predicted targets responded similarly in different cellular contexts, and for those that did differ, these differences often resulted from secondary effects, not direct differences in miRNA-mediated targeting. When direct differences in targeting were detected, these differences often resulted from alternative 3′ UTR isoform usage. Experimental profiling of poly(A) sites showed that APA affects ~10% of predicted targets when comparing any pair of cell types. With this in mind, we incorporated 3′ UTR isoform usage as a parameter in miRNA target prediction and found that it significantly improved performance.
RESULTS

Most miRNA-Target Interactions Are Not Detectably Affected by Cell Type

To determine the extent to which cell type influences miRNA targeting, we transfected two different miRNA duplexes (miR-124 and miR-155) into three different cell lines (HeLa, human embryonic kidney 293 [HEK293], and Huh7 cells) and monitored mRNA changes using mRNA-seq. These cell lines were chosen for two reasons: (1) they had large differences in their expression of endogenous miRNAs (Landgraf et al., 2007; Mayr and Bartel, 2009), and (2) they could be transfected at high efficiency. For each miRNA/cell line combination, we examined two biological replicates, comparing the effects of the miRNA transfection relative to those of the mock-treated controls. Each of these transfection data sets exhibited the expected global targeting effects, as determined by analysis of fold changes for site-containing mRNAs (Figure S1A available online) and by unbiased analysis using the Sylamer tool (Figure S1B) (van Dongen et al., 2008).
Figure 1. Most miRNA-Target Interactions Are Unaffected by Cell Type
(A) Pairwise comparisons of mRNA changes after transfecting the same miRNA into different cell lines. Shown are changes for genes with at least one 7mer 3′ UTR site for the indicated miRNA, plotting the results for genes expressed in both cell lines. The region corresponding to a log2 change > –0.3 is shaded (gray); n, number of genes outside the gray region. Genes significantly differentially repressed are highlighted (blue) and tallied (number in parentheses). In some cases, not all of the differentially repressed genes fit within the plots.
(B) These panels are as in (A), but for control genes. For the miR-124 transfections, mRNA changes are plotted for genes with miR-155 sites (excluding any that contained sites to both miRNAs) and vice versa.
Each panel plots the mRNA change (log2) in one cell line against the mRNA change (log2) in another. Panel annotations, listed in panel order:
(A) Genes with sites
miR-124, HeLa vs. HEK293: Total = 2419, n = 1169 (4), FDR = 0.267
miR-124, HeLa vs. Huh7: Total = 1987, n = 1098 (13), FDR = 0.205
miR-124, HEK293 vs. Huh7: Total = 2067, n = 1164 (2), FDR = 0.131
miR-155, HeLa vs. HEK293: Total = 1714, n = 991 (137), FDR = 0.335
miR-155, HeLa vs. Huh7: Total = 1280, n = 921 (92), FDR = 0.377
miR-155, HEK293 vs. Huh7: Total = 1361, n = 1037 (29), FDR = 0.241
(B) Genes without sites
miR-124, HeLa vs. HEK293: Total = 1082, n = 238 (0), FDR = 0
miR-124, HeLa vs. Huh7: Total = 968, n = 218 (0), FDR = 0
miR-124, HEK293 vs. Huh7: Total = 933, n = 236 (3), FDR = 0.313
miR-155, HeLa vs. HEK293: Total = 1813, n = 328 (13), FDR = 0.381
miR-155, HeLa vs. Huh7: Total = 1693, n = 335 (2), FDR = 0.399
miR-155, HEK293 vs. Huh7: Total = 1778, n = 308 (3), FDR = 0.354

After the data were globally normalized to correct for general cell-type differences, as well as for experimental and technical biases, we investigated whether the differences observed between the cell types were significant, given the variance between replicate experiments. To do so, an expected difference was estimated using a permutation test for each target mRNA (Tusher et al., 2001). Then, a delta value (Δ), the difference between these observed and expected values, was calculated. This Δ value thus combines both the magnitude of the difference between the cell lines and the variability associated with each measurement (Figure S1C), and so as it increases, the statistical confidence in differential regulation also increases. Importantly, for all pairs of cell lines that we investigated, on average 1.1% (12) and 5.8% (57) of predicted targets (with a log2 change < –0.3 in either sample) were differentially repressed with a Δ ≥ 0.2 for miR-124 and miR-155, respectively (Figures 1A and S1D; Table S1). In contrast, on average, 0.1% and 0.3% of genes with control sites were affected differentially by miR-124 and miR-155, respectively (Figure 1B). The lower fraction of significantly differential targets for miR-124 targeting is partly due to a higher variance between replicates observed with miR-124 targeting (Figure S1E).
In some miR-124 comparisons, hardly any predicted targets were differentially repressed at these cutoffs. For example, when comparing the effects of miR-124 in HeLa and HEK293 cells, only 4 of 1,169 coexpressed predicted targets (with log2 change < –0.3 in either sample) were significantly differentially repressed (false discovery rate [FDR] = 0.267; Figure 1A). In the miR-155 pairwise comparisons, more, but still only a minority, of the predicted targets were differentially affected. For instance, when comparing effects of miR-155 in HeLa and HEK293 cells, 137 of the 991 coexpressed targets were differentially regulated (Δ ≥ 0.2, FDR = 0.335; Figure 1A). Similar results were obtained when we examined the effect of miR-124 in IMR90 cells, a normal diploid fibroblast cell line (Figure S1F). Together, these data suggest that, although the repression of some targets differs between cell lines, the miRNA-mediated repression of most targets is not detectably affected by the cellular environment.

3′ UTR Isoforms in Different Cell Types and Tissues

Because APA can affect the inclusion of regulatory sites in the 3′ UTR, we reasoned that some of the observed differential repression was due to differential use of alternative 3′ UTRs. To identify these cases, 3P-seq was used to quantify poly(A)-site usage in the three human cell lines (HeLa, HEK293, and Huh7), as illustrated for LRRC1 (Figure 2A). The accuracy of 3P-seq for quantifying alternative isoforms, previously inferred by its high accuracy in quantifying mRNA levels (Spies et al., 2013; Ulitsky et al., 2012), was further confirmed by comparison to the results of 3′-seq (Lianoglou et al., 2013), which has been extensively validated with RNA blots (Figures S2A–S2D).
Although human 3′ UTRs are relatively well annotated, our analysis improved these annotations: of the mRNAs with poly(A) sites supported by at least ten 3P tags, 30% had major 3′ UTR isoforms that were shorter than the RefSeq annotation, and 10% had major isoforms that were longer (Table S2C). Moreover, similar to previous studies (Derti et al., 2012; Hoque et al., 2013; Smibert et al., 2012; Ulitsky et al., 2012), we found that in each cell type, over half (51%–63%) of the genes with 3P-seq-supported poly(A) sites had multiple tandem isoforms that were each supported by at least 1% of the tags (Figure S2E), and 10,701 (70.1%) mRNAs displayed APA in at least one cell type.

Figure 2. The 3′ UTR Landscape Affects miRNA Targeting
(A) Different AIRs for miR-124 sites in the LRRC1 gene in different cell types. Shown is the RefSeq annotation track of LRRC1 (dark blue), with the associated 3P tags from the three cell lines assayed (above) and the corresponding AIRs (below).
(B and C) Extent to which APA affects miRNA site inclusion. Shown are the number and percentage of sites for which AIRs for miR-124 (B) or miR-155 (C) change by at least 0.3 in each pairwise cell-type comparison. The arrows point to the cell line with the higher AIR, and the width is proportional to the number of sites with differential AIR.
(D–G) Relationship between AIR and miRNA-mediated repression. For each site type, 8mer (D), 7mer-m8 (E), 7mer-A1 (F), and a representative pair of control sites (G), predicted targets were binned by their AIR. For each bin, the mean fold-change mediated by either miR-124 or miR-155 for each transfection of the various cell lines (HEK293, HeLa, and Huh7) is plotted. The red line is the least-squares best fit to the data (Pearson r², F test).
To confirm that this isoform heterogeneity resembled that found in other vertebrates, we used our pipeline to analyze 3P-seq data sets from two mouse cell lines (mouse embryonic stem cells [mESCs] and NIH 3T3 cells; Tables S2D and S2F) and published data sets from zebrafish tissues (brain, ovary, and testes) and developmental stages (2, 6, 24, and 72 hr postfertilization [hpf] and adult) (Ulitsky et al., 2012). As with human poly(A)-site usage, these data sets allowed further refinement of 3′ UTR ends from those currently annotated in RefSeq (30% and 40% in mouse and zebrafish, respectively; Tables S2G–S2I). Overall, the fraction of mRNAs with multiple tandem 3′ UTR isoforms was similar when comparing different cell lines, tissues, and vertebrate animals (Figures S2E–S2G).

Alternative Cleavage and Polyadenylation Affects miRNA Targeting

By quantitatively measuring poly(A)-site usage, the 3P-seq data sets allow examination of how APA varies in different cellular contexts (Ulitsky et al., 2012). When comparing the four human cell lines, 1,708 (11.2%) of the mRNAs had different dominant 3′ UTR ends (Figure S2H), and when comparing weighted 3′ UTR lengths, each cell type had a unique 3′ UTR length distribution (Figures S2I–S2K). Among the human cell lines examined, Huh7 cells tended to have the shortest 3′ UTRs, and HEK293 cells the longest. Moreover, although the percentage of genes with multiple UTR isoforms was relatively constant between cell types, the identities of these genes and the poly(A) sites used were more variable. Indeed, of the 7,563 mRNAs with multiple poly(A) sites in all four human cell lines, 51.2% had weighted 3′ UTR lengths that changed by more than 100 nt (Figure S2L).
As reported previously (Ulitsky et al., 2012), weighted 3′ UTR length differences were especially apparent during zebrafish development and in two mouse cell lines (Figures S2M and S2N). Taken together, these results confirmed that many transcripts have alternative 3′ UTR isoforms and that 3′ UTR lengths change across different vertebrate cell types and developmental stages.

To determine the extent to which APA affects miRNA targeting, we developed a metric called the affected isoform ratio (AIR), which, for each miRNA target site, indicates the fraction of mRNA transcripts containing that site (Figure 2A). To calculate AIRs, we first estimated the fraction of each tandem isoform based on the fraction of 3P tags at its poly(A) site relative to all the tags that mapped to the poly(A) sites contained within that exon (Figure 2A). These isoform fractions were then used to compute the 3′ UTR isoform ratio for different UTR regions, in which each constant region (present in all the tandem isoforms) had an isoform ratio of 1.0, whereas each variable region had an isoform ratio corresponding to the sum of the isoform fractions spanning that region (Figure 2A). For each miRNA site, the AIR was simply the isoform ratio at the region of the UTR containing the site. Consistent with Huh7 cells generally expressing shorter 3′ UTR isoforms, of the 3′ UTR sites for miR-124, 154 and 191 had lower AIRs (AIR difference ≥ 0.3) in Huh7 cells than in HeLa and HEK293 cells, respectively, but only 67 and 41 sites had higher AIRs (Figure 2B). A similar result was observed with miR-155 sites (Figure 2C).

To compare how miRNA targeting efficacy was affected by APA within a cell type, genes with multiple 3′ UTR isoforms were first partitioned by their site type; for genes containing multiple sites, the best site type was chosen (with 8mer > 7mer-m8 > 7mer-A1). Within each site-type partition, genes were binned by their AIRs, and the efficacies of sites within each bin were compared.
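The AIR calculation described above can be sketched as follows. This is a minimal sketch under stated assumptions: positions are measured in nucleotides from the start of the 3′ UTR, all poly(A) sites lie within the same terminal exon, and the function name and input format (a mapping from poly(A)-site position to 3P tag count) are hypothetical choices made for this example rather than the authors' pipeline.

```python
# Minimal sketch of the affected isoform ratio (AIR); names and input format
# are assumptions for illustration, not the published pipeline.
def affected_isoform_ratio(site_end, tags_by_polya_site):
    """Fraction of transcripts whose 3' UTR extends past a miRNA site.

    site_end: 3'-most position of the site, in nt from the start of the 3' UTR.
    tags_by_polya_site: dict mapping poly(A)-site position (same coordinates)
        to the number of 3P tags supporting that site.
    """
    total_tags = sum(tags_by_polya_site.values())
    if total_tags <= 0:
        raise ValueError("no 3P tags support any poly(A) site")
    # An isoform contains the site only if its poly(A) site lies at or beyond
    # the end of the site, so sum the tags of exactly those isoforms.
    containing = sum(
        tags for position, tags in tags_by_polya_site.items()
        if position >= site_end
    )
    return containing / total_tags
```

With hypothetical tag counts `{500: 30, 1200: 50, 2500: 20}`, a site ending at position 300 lies in the constant region (AIR of 1.0), whereas a site ending at position 800 is present only in the two longer isoforms (AIR of 0.7). The counts are invented to illustrate the arithmetic and do not come from the data sets described here.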
For each of the three site types, mean repression correlated with AIR such that sites with higher AIRs were more repressed than those with lower AIRs (Figures 2D–2G). Indeed, genes with sites having an AIR less than 0.25 were barely repressed by the corresponding miRNA. Similar results were obtained with a large precompiled microarray data set of miRNA/siRNA transfections (Garcia et al., 2011) (Figure S2O). When the analysis was repeated 100 times, each time with a different negative-control cohort in which genes lacking any target sites (including 6mers) were selected and partitioned based on a randomly selected pseudosite (e.g., Figure 2G), repression and AIR never significantly correlated.

Sites near the middle of long 3′ UTRs mediate less repression than those at the ends (Grimson et al., 2007). The distance between the site and the nearest end of the 3′ UTR (referred to as the minimum distance) is a feature incorporated into the model of site efficacy used by TargetScan to rank target predictions (Garcia et al., 2011; Grimson et al., 2007). Because this minimum-distance feature depends on the poly(A) site, we reasoned that APA might change this feature for some miRNA sites, with a corresponding effect on site efficacy. When examining transcripts with sites with minimum distances 25 nt shorter in HEK293 cells than in HeLa cells, more repression was observed in HEK293 cells than in HeLa cells (Figure S2Q); importantly, these differences were not attributable to differential target-site inclusion because the AIRs for these sites were unchanged (<0.01). Correspondingly, genes with minimum distances that were longer in HEK293 cells were more repressed in HeLa cells, whereas genes not predicted to be targets were unaffected (Figure S2Q). Together, these results indicate that APA, by shortening and lengthening 3′ UTRs, affects both the inclusion and the efficacy of miRNA sites.
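To make the dependence of the minimum-distance feature on poly(A)-site choice concrete, the sketch below computes the feature for a single site under two hypothetical 3′ UTR ends. The function name and the 1-based, inclusive coordinate convention are assumptions made for this example; the exact convention used by TargetScan may differ.

```python
# Sketch of the minimum-distance feature; the coordinate convention
# (1-based, inclusive, measured from the start of the 3' UTR) is an assumption.
def minimum_distance(site_start, site_end, utr_length):
    """Distance (nt) from a site to the nearest end of the 3' UTR."""
    if not 1 <= site_start <= site_end <= utr_length:
        raise ValueError("site must lie within the 3' UTR")
    distance_to_utr_start = site_start - 1
    distance_to_polya_site = utr_length - site_end
    return min(distance_to_utr_start, distance_to_polya_site)


# Hypothetical site at positions 900-907, evaluated for two tandem isoforms.
long_isoform = minimum_distance(900, 907, utr_length=2500)   # limited by the UTR start
short_isoform = minimum_distance(900, 907, utr_length=1000)  # limited by the poly(A) site
```

In this invented case the site is included in both isoforms, so its AIR is unchanged, yet use of the proximal poly(A) site places it much closer to a 3′ UTR end, which is the situation the comparison above isolates.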
Incorporating Poly(A)-Site Usage Improves miRNA Target Prediction

With the insights gained on the effects of APA on miRNA targeting (Figure S3A), we developed a revised prediction model, called the "weighted context+" (or wContext+) model. This model produced a cell-type-specific score for each site by calculating its context+ score using TargetScan linear regression models for each of its context and miRNA features (Garcia et al., 2011) and then weighting this score by the AIR of the site in each cell type (Figure 3A). For each miRNA, the wContext+ scores of multiple sites were summed (disregarding positive scores) to generate the total wContext+ score for each gene, in which the scores with lower negative values indicated greater predicted repression. To assess the advantage of weighting the scores based on the AIRs, and thereby considering the isoform heterogeneity of each cell type, we compared the performance of the wContext+ model with those of the current context+ model (Garcia et al., 2011) applied to a single 3′ UTR isoform for each gene, choosing either (1) the longest isoform annotated by RefSeq, (2) the longest isoform determined by 3P-seq, or (3) the major 3′ UTR isoform determined by 3P-seq. On average, the wContext+ model outperformed the previous model by ~50%, and although some of this improvement was attributable to more accurate identification of the major 3′ UTR isoforms, most was attributable to utilizing AIRs (Figure 3B). The wContext+ model also displayed better sensitivity and specificity when evaluating area under the curve in receiver operating characteristic (ROC) plots (Figure S3B).

Alternative Cleavage and Polyadenylation Is a Major Cause of Differential miRNA Targeting

We next examined the extent to which differential poly(A)-site usage caused differential miRNA targeting. Between any pair of the human cell lines, the AIRs of 7%–10% of miR-124 sites and 7%–12% of miR-155 sites changed by >30% (Figures 2B and 2C).
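The weighting and summation that define the total wContext+ score can be sketched as below. This sketch takes each site's context+ score as a given input rather than recomputing it from the TargetScan regression models, and the function name, input format, and numeric values are all hypothetical.

```python
# Sketch of the total wContext+ score for one miRNA, one gene, and one cell type.
# Context+ scores are treated as precomputed inputs; all names are illustrative.
def total_wcontext_score(sites):
    """Sum AIR-weighted context+ scores, disregarding positive scores.

    sites: iterable of (context_plus_score, air) pairs, one pair per site.
    More negative totals indicate greater predicted repression.
    """
    total = 0.0
    for context_plus_score, air in sites:
        if not 0.0 <= air <= 1.0:
            raise ValueError("AIR must lie between 0 and 1")
        weighted = context_plus_score * air
        if weighted < 0:  # positive scores do not contribute to the total
            total += weighted
    return total


# Hypothetical gene with one site whose inclusion differs between two cell types.
score_cell_type_1 = total_wcontext_score([(-0.40, 0.9)])
score_cell_type_2 = total_wcontext_score([(-0.40, 0.1)])
```

Because the AIR is the only cell-type-specific input in this sketch, any difference between the two totals for a gene reflects a difference in poly(A)-site usage, which is the quantity compared between cell types in this section.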
Similarly, 5%–9% of predicted miR-124 targets and 5%–10% of predicted miR-155 targets had wContext+ scores differing by ≥0.1 (Figures S4A and S4B; Table S3). When we repeated this analysis in mouse (with predicted miR-155 and miR-223 sites in mESCs and NIH 3T3 cells) and in zebrafish (with predicted miR-430 sites across the four developmental stages), similar ranges were observed, indicating that in diverse vertebrate species, APA affects ~10% of predicted miRNA target sites when comparing two cell types (Figures S4C and S4D).

Of the 126 predicted targets that were differentially repressed by miR-155, 11.1% had wContext+ scores with differences ≥0.03, a significant enrichment compared to that in nondifferential miRNA targets (p = 0.004, hypergeometric test; Figure 4A). For example, the CHURC1 gene had one 8mer and two 7mer-m8 sites for miR-155, but these sites were only present in the longer of its two major isoforms (Figure 4B). Because the longer isoform was more prevalent in HeLa cells, 66% of CHURC1 transcripts contained miR-155 target sites in HeLa cells, whereas only 3% contained the sites in HEK293 cells (Figure 4B). The consequently large difference in wContext+ scores explained why this gene was repressed more strongly in HeLa than HEK293 cells (Figure 4C). Reciprocally, the longer isoforms of the ATAD2B gene contained one 8mer and one 7mer-m8 site and were predominately expressed in HEK293 cells, whereas the short isoform that lacked these regulatory sites was expressed in HeLa cells (Figure 4D), and this gene was repressed more strongly in HEK293 cells than in HeLa cells (Figure 4E). Similar examples illustrating cases in which APA explained differential miRNA targeting were found in all pairs of cell types examined (Figures 4F–4I and S4E–S4Q).
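For readers unfamiliar with the enrichment test cited above, the upper tail of the hypergeometric distribution can be computed directly with the standard library. The sketch below is generic; the function name is an assumption, and the background counts in the usage line are hypothetical because the text reports only the 126 differentially repressed targets and the 11.1% (about 14) with wContext+ differences ≥0.03.

```python
from math import comb


# Generic upper-tail hypergeometric test; names are illustrative assumptions.
def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from a
    population of `population` items, `successes` of which are 'successes'."""
    if not (0 <= successes <= population and 0 <= draws <= population):
        raise ValueError("counts are inconsistent with the population size")
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favourable / comb(population, draws)


# Hypothetical background: 1000 predicted targets, 50 with score differences,
# of which 14 fall among the 126 differentially repressed targets.
p_value = hypergeom_upper_tail(14, population=1000, successes=50, draws=126)
```

The p value produced by the usage line depends entirely on the invented background counts and should not be compared with the p = 0.004 reported above.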
APA, however, did not explain most differentially repressed predicted targets (with D > 0.3; Table S3; Figure 4). These mRNAs might have responded differently because other cell-type-specific factors, such as RNA-binding proteins, differentially modulated site efficacy in the two cell types. Alternatively, these mRNAs might have had similar direct responses to the miRNA and only appeared to be differentially repressed because of differential secondary effects of transfecting the miRNA. For example, in one cell type, the miRNA might have repressed a transcriptional repressor, causing increased transcription of the predicted target. Indeed, we observed that in many of these cases, mRNAs were in fact upregulated in one of the two cell lines (Figure S4P), supporting the idea that the differences were mediated by secondary effects rather than differential site efficacy.

To distinguish between these possibilities, we used reporter assays to determine the extent to which the miRNA sites themselves mediated differential repression. For nine candidates, we placed either wild-type or mutated sites, embedded in 500 nucleotides of the surrounding 3′ UTR, downstream of Renilla luciferase and compared the repression mediated by miR-155 in HEK293 and HeLa cells. Although six were significantly repressed by miR-155 in both cell lines, only two (LPIN1 and LMBRD2) were significantly differentially repressed (Figure 4J; p = 0.0004 and 1.11 × 10⁻⁵, respectively, Mann-Whitney U test). Both were more repressed in HEK293 cells than in HeLa cells, consistent with the RNA-seq results. Although these two mRNAs are good candidates for APA-independent differential repression, the paucity of such candidates suggests that most instances of apparent differential repression are due to differential secondary effects rather than to modulations of miRNA targeting efficacy.
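As a rough illustration of how a Mann-Whitney U test compares reporter repression between two cell lines, the sketch below computes the U statistic and a two-sided p value in plain Python. It rests on several assumptions that are not taken from the paper: the repression values are invented, the p value uses the normal approximation without tie or continuity corrections, and the function name `mann_whitney_u` is chosen here for illustration.

```python
import math


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p value from the normal approximation).

    Simplified sketch: no tie correction and no continuity correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # U counts the pairs in which the x value exceeds the y value (ties count 0.5).
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability of a standard normal
    return u, p


# Hypothetical normalized reporter values (lower = more repressed); not data from the paper.
hek293 = [0.42, 0.45, 0.48, 0.50, 0.51, 0.53, 0.55, 0.57, 0.60]
hela = [0.70, 0.72, 0.75, 0.78, 0.80, 0.81, 0.84, 0.86, 0.90]

u_stat, p_value = mann_whitney_u(hek293, hela)
print(u_stat, p_value < 0.001)  # prints 0.0 True
```

For samples this small, an exact test would normally be preferred over the normal approximation; the approximation is used here only to keep the sketch short.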
AIR Correlates with Site Efficacy for Targets of Endogenous miRNAs

To extend our results to the effects of miRNAs in their endogenous contexts, we profiled both mRNA changes (by microarray) and poly(A)-site usage (by 3P-seq) in six different tissues (heart, kidney, liver, lung, muscle, and white adipose tissue [WAT]) from wild-type and miR-22 knockout mice (Table S4) (Gurha et al., 2012). As expected, predicted miR-22 targets were generally upregulated in the knockout tissues (Figure S5A). Although modest, this effect was significant in five of the six tissues (muscle, heart, kidney, liver, and WAT) and most pronounced for mRNAs with 8mer sites (Figure S5A).

Using the 3P-seq data sets, we generated tissue-specific 3′ UTR annotations. Interestingly, lung tissue had 1.5–2 times more poly(A) sites than did the other tissues and mouse cell lines (NIH 3T3 and mESCs), perhaps because of the more heterogeneous nature of this tissue. As observed with exogenously delivered miRNAs, miRNA-mediated repression significantly correlated with the AIR for 8mer and 7mer-m8 sites, but not for negative-control sites (Figure 5A; p = 0.00056, 0.0012, and 0.880, respectively). An insignificant correlation for 7mer-A1 sites (p = 0.487) was attributed to the weak derepression observed overall in the miR-22 data sets, which made it difficult for a signal from this weaker site type to appear.

Figure 3. The Weighted Context+ Model Improves Target Prediction
(A) Calculation of wContext+ scores. For each site, the context+ score, calculated using the TargetScan linear regression model, is weighted by a cell-type-specific AIR. For genes with multiple sites, the scores for each individual site are added to yield the total wContext+ score.
(B) Improved performance of the wContext+ model. Plotted are r² values calculated from the correlation (Pearson r) between score and observed change in the indicated transfection data set. For the previous model (context+), three different 3′ UTR annotations were used: the RefSeq annotation (dark blue); the longest isoform, as determined by 3P-seq (light blue); and the major isoform, as determined by 3P-seq (purple).

Figure 4. Differential miRNA-Mediated Repression Is Often Due to Alternative 3′ UTR Isoform Usage
(A) Genes with differential AIRs are enriched in genes that are differentially repressed. This panel is as in Figure 1A, but highlighting genes with significantly different repression that also have wContext+ score differences ≥0.03 (orange).
(B) Higher AIR of CHURC1 miR-155 sites in HeLa compared to HEK293 cells. Otherwise, this panel is as in Figure 2A.
(C) Greater miR-155 repression of CHURC1 in HeLa cells. Plotted are the wContext+ and expression change for CHURC1 in HeLa (pink) and HEK293 (blue) cells.
(D) This panel is as in (B), except for ATAD2B, a gene with higher AIR and greater miR-155 repression in HEK293 cells.
(E) This panel is as in (C), except for ATAD2B, a gene with higher AIR and greater miR-155 repression in HEK293 cells.
(F) This panel is as in (A), except comparing changes mediated by miR-124 in HeLa and HEK293 cells.
(G) This panel is as in (C), except for ANTXR2, a gene with higher AIR and greater miR-124 repression in HeLa cells.
(H) This panel is as in (A), except comparing changes mediated by miR-124 in HEK293 and HeLa cells.
(I) This panel is as in (C), except for CLDN1, a gene with higher AIR and greater miR-124 repression in HeLa cells.
(J) Direct measurements of miR-155-mediated repression of 3′ UTR segments from nine genes initially classified as differentially regulated, despite having similar AIRs. Renilla luciferase reporters followed by 3′ UTR segments (with either wild-type or mutated miR-155 sites) from the indicated genes were transfected into either HeLa or HEK293 cells in the presence of the cognate (miR-155) or a noncognate (miR-1) miRNA. Five genes were originally repressed more in HeLa cells in the genome-wide analyses (highlighted in pink), and four were originally repressed more in HEK293 cells (highlighted in blue). Plotted are the normalized repression values, with error bars representing the third largest and third smallest values. Significance was calculated with the Mann-Whitney U test (*p < 0.05, **p < 0.01, ***p < 0.001).

With these tissue-specific 3′ UTR annotations in mouse and published ones from zebrafish, we developed and evaluated wContext+ models for miR-22 targeting in mice and miR-430 targeting in zebrafish embryos. Although the overall repression differed in magnitude from that observed for the exogenous miRNAs in human cells, with the magnitude of endogenous miR-22 repression being much lower, and that of endogenous miR-430 being much higher, the results resembled those observed for targeting by exogenous miRNAs, with the wContext+ model outperforming the context+ model for all tissues except the kidney (Figure 5B). The greatest difference was observed in the zebrafish embryo, where the wContext+ model outperformed the context+ model by more than 70% (Figure 5C, r² = 0.194 and 0.112, respectively). As in human cell lines, some of the improvement was attributable to more accurate identification of the major 3′ UTR isoforms, but most was attributable to considering the AIRs, which capture the heterogeneity of the 3′ UTR landscape.
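The model comparison above rests on r² values from the Pearson correlation between predicted scores and observed expression changes. A minimal sketch of that evaluation follows, assuming per-gene lists of scores and log2 fold changes. The data and the function names `pearson_r2` and `relative_improvement` are hypothetical; only the two r² values (0.194 and 0.112) come from the text.

```python
def pearson_r2(scores, changes):
    """Squared Pearson correlation between predicted scores and observed changes."""
    n = len(scores)
    if n != len(changes) or n < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_s = sum(scores) / n
    mean_c = sum(changes) / n
    cov = sum((s - mean_s) * (c - mean_c) for s, c in zip(scores, changes))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_c = sum((c - mean_c) ** 2 for c in changes)
    if var_s == 0 or var_c == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov * cov / (var_s * var_c)


def relative_improvement(r2_new, r2_old):
    """Fractional gain in r² of one model over another."""
    return r2_new / r2_old - 1.0


# Hypothetical per-gene values, for illustration only.
wcontext_scores = [-0.40, -0.25, -0.10, -0.05, 0.0]
log2_fold_changes = [-0.80, -0.45, -0.25, -0.05, 0.02]
r2 = pearson_r2(wcontext_scores, log2_fold_changes)

# r² values reported for the zebrafish embryo (Figure 5C): wContext+ 0.194, context+ 0.112.
improvement = relative_improvement(0.194, 0.112)
print(f"{improvement:.0%}")  # prints 73%
```

The ratio 0.194/0.112 is about 1.73, which is the "more than 70%" improvement quoted for the zebrafish embryo.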
Alternative Cleavage and Polyadenylation Causes Differential Repression by Endogenous miRNAs

To determine the extent to which repression by miRNAs in their endogenous contexts varies between different tissues, we applied the D value score to the miR-22 data sets, focusing on the five tissues with significant repression. Although fold-change signals were more variable and weaker than those observed in the human cell lines, as judged by a higher D value cutoff, a similar fraction of predicted targets showed differential repression in any pairwise comparison (7.7%, on average; Figures S5B–S5F and Table S5). For instance, in comparing repression mediated by miR-22 in liver and heart cells (Figure S5C), 74 of 545 genes with 7mer or 8mer sites in their 3′ UTRs were differentially repressed (13.6%).

For each pair of cell types, APA affected a significant fraction of differentially repressed predicted targets (Figures S5G–S5K, p = 1.0 × 10⁻¹⁶ to 0.027). For instance, when comparing muscle and heart cells, APA explained 12.3% of differentially repressed targets (Figure S5G, p = 0.027). Mycbp, an example of such a target, was effectively targeted in muscle cells, where its longer isoform was more expressed, but not in the heart, where a shorter isoform predominated (Figure S5G). Reciprocally, Ctnnal1 was more effectively targeted in heart cells, where its longer isoform was more expressed, than in the muscle (Figure S5G). Thus, as with exogenously delivered miRNAs, APA explained some of the observed differential repression.

3′ UTR Heterogeneity Measured in One Cell Type Improves the Targeting Model for Other Cell Types

Despite clear examples of cell-type-specific 3′ UTR heterogeneity (Figures 2 and 4), AIRs were often similar in diverse cells or tissues, suggesting that for cells in which AIRs cannot be calculated (due to the lack of 3P-seq data), AIRs from other cell types of the same species might still improve the targeting model.
To test this idea, we evaluated wContext+ models based on AIRs from noncognate human and mouse cell types against the miRNA-induced expression changes observed in the cognate cells. Importantly, wContext+ models based on the other cell types still outperformed the previous model (Figures 6A and 6B), presumably because the advantage of considering constitutive isoform ratios more than offset any disadvantage of training on noncognate alternative ratios.

We then developed a murine wContext+ model, using AIRs calculated from 3P-seq analysis of mESCs and NIH 3T3 cells, and evaluated this model using data reporting mRNA changes after deleting either miR-223 or miR-155 (Guo et al., 2010; Johnnidis et al., 2008; Rodriguez et al., 2007). As observed for cognate cells, AIR and targeting efficacy were correlated such that sites with higher AIRs in mESCs or 3T3 cells were more derepressed in the knockout data sets (data not shown). Moreover, despite being based on noncognate AIRs from mESCs and NIH 3T3 cells, the wContext+ model outperformed context+ models for miR-155 and miR-223 targeting in different cell types (Figure 6C). These results extended our conclusions to additional instances of endogenous miRNA targeting. More importantly, they extended the practical utility of considering isoform heterogeneity, showing that by exploiting similarities of isoform ratios between different cell types, this approach can improve predictions of targeting efficacy, even in cell types for which detailed information on isoform heterogeneity has not yet been acquired (which is the vast majority of cell types).

This being said, wContext+ models performed best when tested on the cell type for which the isoform data had been acquired (Figures 6A and 6B), presumably because extrapolation of isoform information from one cell type to another fails to capture key instances in which differential APA causes cell-type-specific targeting. Indeed, when we repeated this comparison, but this time excluding all genes initially classified as differential targets, the cognate model still outperformed that based on other cell types (Figure S6). Thus, differential APA broadly underlies cell-type-specific targeting, affecting even those genes that were not identified in our initial analysis as being differentially regulated because the differences did not exceed our threshold for statistical significance.

Figure 5. Alternative 3′ UTR Isoform Usage Affects Targeting by Endogenous miRNAs
(A) Relationship between AIR and endogenous repression by miR-22. This panel is as in Figures 2D–2G, but comparing mRNA changes in mouse tissues (muscle, heart, liver, kidney, white adipose tissue [WAT], and lung) with and without miR-22.
(B) Improved performance of the wContext+ model for predicting endogenous miR-22 targeting in mice. Otherwise, this panel is as in Figure 3B.
(C) Improved performance of the wContext+ model for predicting endogenous miR-430 targeting in zebrafish embryos. This panel is as in Figure 3B, except analyzing predicted miR-430 targets in wild-type embryos and embryos that lack miR-430 (MZ-Dicer) at 9 hr postfertilization (hpf).

Figure 6. Considering Isoform Ratios Improves the Model of miRNA Targeting in Noncognate Cell Types
(A) The performance of non-cell-type-specific wContext+ models for exogenous miRNAs. A comparison of performance of the original context+ model (dark blue), the cell-type-specific wContext+ model (pink), and the wContext+ model based on 3P-seq from other cell types (gray; error bars, SD). Otherwise, this panel is as in Figure 3B.
(B) This panel is as in (A), but for endogenous targeting by murine miR-22.
(C) Non-cell-type-specific wContext+ model improves prediction of endogenous targeting mediated by miR-223 in neutrophils and miR-155 in B and Th1 cells. Otherwise, this panel is as in (A).

miRNA Targeting Can Affect the 3′ UTR Landscape

Having found that alternative isoform usage influenced miRNA targeting, we tested whether the reciprocal relationship could also be detected: does miRNA-mediated repression influence isoform usage? To examine the effects of miR-22 on the 3′ UTR landscape, we compared 3P-seq data sets generated from wild-type and miR-22 knockout mice for the five tissues in which significant miR-22 repression was observed (heart, kidney, liver, muscle, and WAT). For all of these tissues, predicted targets with sites in the variable region had longer weighted 3′ UTRs in the miR-22 knockout mice. This lengthening was significant in comparison to control sites (Figure 7; p = 0.0001–0.0096), consistent with a model in which the longer isoform(s) are specifically targeted and repressed in wild-type, but not mutant, cells. We obtained similar results when using 3P tags to quantify the preferential targeting of the longer isoform of genes containing a site in their variable region (Figures S7A and S7B).

We also examined the effects of miR-430 in zebrafish embryos, where it robustly represses its targets during the maternal-to-zygotic transition (Giraldez et al., 2006). Similar to that observed with murine miR-22, the 3′ UTR landscape was shaped by miR-430 (Figures S7C–S7E). Consistent with a model in which isoform usage has already been shaped by miR-430 repression by 6 hpf, wContext+ scores calculated with 2 hpf 3P-seq data were more predictive of miRNA-dependent expression changes than those calculated with 6 hpf 3P-seq data (Figure S7F).
Together, these results demonstrate that repression by miRNAs in the cytoplasm helps shape the relative expression of UTR isoforms and highlight the interplay between these two processes.

DISCUSSION

Differential expression of miRNAs and their mRNA targets clearly provides an important mechanism to influence the target repertoire of the miRNAs. Less clear has been the extent to which different cellular contexts additionally influence the targeting of coexpressed mRNAs by coexpressed miRNAs. For both endogenously and exogenously expressed miRNAs, we found relatively few site-containing, coexpressed genes with detectable cell-type-specific differences in their responses. When identifying a target as responding differently in two cellular contexts, we considered the variance as well as the magnitude of the difference in repression. One implication of this approach is that as the number or accuracy of those measurements increases, the lowered experimental uncertainty will enable additional differential targets to be identified. However, our result of an overall uniformity of target repression will not change, as most magnitudes of the newly detected differences will be smaller than those currently detected.

For those targets that responded differentially, one important mechanistic explanation is differential 3′ UTR isoform usage that influences either the inclusion of sites or their placement within more or less favorable contexts. Site-containing genes that were affected by differential 3′ UTR isoform usage were significantly enriched in the differentially repressed set. Furthermore, differential isoform usage presumably affects many additional genes that have differences too modest to be confidently identified in our initial analysis of differentially expressed genes.
Indeed, when comparing 3′ UTR isoforms observed in any two cell types, approximately 10% of predicted targets are likely to be affected by differential usage. Moreover, cognate wContext+ models outperformed models that considered constitutive isoform ratios (but not the cognate cell-type-specific ratios), which demonstrated the importance of cell-type-specific APA events on miRNA targeting, even for targets that were not originally identified as responding differentially (Figure S6).

More generally, despite known inter- and intracellular heterogeneity in the 3′ UTR landscape and the corresponding effects on regulatory site inclusion (Derti et al., 2012; Hoque et al., 2013; Mayr and Bartel, 2009; Sandberg et al., 2008; Smibert et al., 2012; Ulitsky et al., 2012), miRNA-target prediction has, until this study, largely ignored the effects of alternative isoform usage. With transcriptome-wide cell-type-specific 3′ UTR annotation becoming more common, wContext+ models might eventually be generated for each tissue or cell line of interest. In the meantime, for the many cell types for which such annotations are not yet available, predicting targets using isoform data from noncognate cell types still improves performance over previous algorithms because it enables consideration of constitutive isoform ratios. Accordingly, the next version of TargetScan will implement a non-cell-type-specific wContext+ model for human, mouse, and fish predictions.

Studies to understand the mechanisms underlying the definition of the 3′ UTR landscape have focused primarily on nuclear events (i.e., cleavage and polyadenylation), since these are the prime contributors in determining 3′ UTR isoform usage (Berg et al., 2012; Bhattacharjee and Bag, 2012; Lee et al., 2007). Nevertheless, we show that cytoplasmic events also shape this landscape by differentially modulating the stability of short and long isoforms.
Repression mediated by miR-22 had statistically significant effects on the 3′ UTR landscape in somatic tissues, but the effect of miRNA targeting was most apparent in zebrafish embryos, where targeting by miR-430 is especially robust. Perhaps the interplay between miRNA targeting and 3′ UTR isoform usage has the greatest biological impact during tightly regulated spatiotemporal processes, such as early embryonic development.

Figure 7. Repression by miR-22 Shapes the 3′ UTR Landscape
(A–E) Influence of miR-22 targeting on 3′ UTR isoform usage. Weighted 3′ UTR lengths were determined using 3P-seq data from heart (A), liver (B), muscle (C), kidney (D), and WAT (E). Plotted are the cumulative distributions of the differences in lengths (subtracting that of the wild-type tissue from that of the miR-22 knockout tissue) for genes with control sites in the variable region (gray) and those with miR-22 sites in the variable region (red). Significance was determined using the Kolmogorov-Smirnov test.

The other mechanisms that might account for cell-type-specific effects of the miRNA can be grouped into two categories: those involving actual differences in targeting itself and those mediated through secondary effects of introducing the miRNA. To distinguish between these two possibilities, we used luciferase assays to isolate miRNA-mediated repression from secondary effects, focusing on nine predicted targets that responded differently to the miRNA despite uniform AIRs in the two cell types. Only two of the nine retained differential targeting in the luciferase assay, suggesting that most differential effects not explained by alternative isoform ratios were the result of secondary effects.
These two genes, LPIN1 and LMBRD2, are interesting candidates for future work in understanding, at the molecular level, how differences in cellular context mediate differences in miRNA-target interactions. Nonetheless, our observation of so few instances in which differential targeting explained differential effects suggests that miRNA targeting is remarkably uniform between cell types and that a miRNA-target interaction identified in one cellular context will generally hold in other contexts in which the target site is present (i.e., has a high AIR) and the miRNA is expressed at a level sufficient to guide repression.

Perhaps some miRNAs have target repertoires more substantially affected by different cellular contexts, but we were unable to identify any in our study, although we examined exogenously and endogenously expressed miRNAs in a variety of tissues in three different vertebrates. Indeed, in light of our results, the initial example of differential targeting, that of Dnd1 modulating miR-430 repression (Kedde et al., 2007), is now all the more striking, as it appears to represent the exception rather than the rule. Perhaps cellular contexts affect other types of posttranscriptional pathways to a greater extent. Are other regulatory programs (such as that mediated by AU-rich elements) primarily modulated by APA, or are these primarily influenced by the expression of other 3′ UTR-binding proteins? These remain important and unanswered questions for our understanding and prediction of posttranscriptional regulation.

EXPERIMENTAL PROCEDURES

Cell Culture
HEK293 (ATCC), HeLa (ATCC), and Huh7 (Health Science Research Resource Bank) cells were cultured as recommended by the manufacturer in Dulbecco's modified Eagle's medium (DMEM) supplemented with 10% fetal bovine serum (Clontech) and penicillin/streptomycin.

Plasmids
Plasmids were constructed as described (Supplemental Information).
miRNA Transfections
Cells were transfected with Lipofectamine 2000 (Invitrogen) and 100 nM miRNA duplex or pUC19, as recommended by the manufacturer. After 24 hr, cells were harvested, and RNA was extracted using TRI Reagent (Life Technologies).

RNA-Seq Library Preparation
After RNA isolation, poly(A)+ RNA was selected using oligo(dT) beads (Invitrogen). Strand-specific RNA-seq libraries were prepared as previously described (Guo et al., 2010) or using a dUTP-based approach (Bioo Scientific) according to the manufacturer's directions.

3P-Seq Sample Preparation
RNA from wild-type and miR-22 knockout (Gurha et al., 2012) mouse tissues was isolated by adding a steel bead and 1 ml of TRI Reagent to tissues and then vortexing for 2 min in a TissueLyser II (QIAGEN) at 30 Hz twice. The homogenate was centrifuged for 8 min at 12,000 × g, and the supernatant was purified according to the manufacturer's protocol, with an additional phenol/chloroform extraction after phase separation. 3P-seq libraries were prepared from 75 μg of isolated RNA (mouse tissues, mESC, NIH 3T3, HeLa, HEK293, Huh7, IMR90 cells) as described previously (Jan et al., 2011) with modifications (see Supplemental Information).

Luciferase Assays
HEK293 and HeLa cells were plated in 24-well plates 24 hr prior to transfection. Cells were transfected using Lipofectamine 2000 and Opti-MEM with 100 ng of Renilla luciferase reporter plasmid and 20 ng of firefly luciferase control reporter plasmid pIS0 (Grimson et al., 2007) per well. Cells were harvested after 24 hr. Luciferase activities were measured using dual-luciferase assays, as described by the manufacturer (Promega). Three or four biological replicates, each with three technical replicates (i.e., three different wells transfected on the same day), were performed. Renilla activity was first normalized to firefly activity to control for transfection efficiency.
As described previously (Grimson et al., 2007), repression of the reporter with wild-type sites was then additionally normalized to that of a reporter in which the sites were mutated. Fold repression was calculated relative to that of the noncognate miRNA.

Mice
The mice harboring the null miR-22 mutant allele were described previously (Gurha et al., 2012). All animal procedures were approved by the Baylor College of Medicine Institutional Animal Care and Use Committee (Animal Protocol 4930). Microarrays were carried out using Illumina Mouse WG-6 v1.1 Whole-Genome Expression BeadChips on 9-week-old miR-22 null and wild-type mice as described previously (Gurha et al., 2012).

ACCESSION NUMBERS

The NCBI GEO accession number for the microarray data from wild-type and miR-155 knockout B cells reported in this paper is GSE52940. Transcript profiling by microarray for wild-type and miR-22 knockout mouse tissues is deposited in EBI ArrayExpress as E-MTAB-2038. The NCBI GEO accession number for the RNA-seq and 3P-seq data sets reported in this paper is GSE52531.

SUPPLEMENTAL INFORMATION

Supplemental Information includes Supplemental Experimental Procedures, seven figures, and five tables and can be found with this article online at http://dx.doi.org/10.1016/j.molcel.2014.02.013.

ACKNOWLEDGMENTS

We thank the WI genome technology core for sequencing and members of the Bartel and Nam labs for helpful comments and discussions. We also thank C. Shin and D. Baek for providing B cell microarray data. This work was supported by the KRIBB Research Initiative Program and the Basic Science Research Program through NRF, funded by the Ministry of Science, ICT & Future Planning, awarded to J.-W.N. (NRF-2013R1A1A1010185), grants from the NIH to D.P.B. and O.S.R. (RO1 GM067031 and K99 GM102319), and an NSF Graduate Research Fellowship to V.A. D.P.B. is an investigator of the Howard Hughes Medical Institute.
Received: November 4, 2013
Revised: January 27, 2014
Accepted: February 6, 2014
Published: March 13, 2014
Molecular Cell 53, 1031–1043, March 20, 2014. © 2014 Elsevier Inc.

REFERENCES

Baek, D., Villén, J., Shin, C., Camargo, F.D., Gygi, S.P., and Bartel, D.P. (2008). The impact of microRNAs on protein output. Nature 455, 64–71.
Bartel, D.P. (2009). MicroRNAs: target recognition and regulatory functions. Cell 136, 215–233.
Berg, M.G., Singh, L.N., Younis, I., Liu, Q., Pinto, A.M., Kaida, D., Zhang, Z., Cho, S., Sherrill-Mix, S., Wan, L., and Dreyfuss, G. (2012). U1 snRNP determines mRNA length and regulates isoform expression. Cell 150, 53–64.
Betel, D., Koppal, A., Agius, P., Sander, C., and Leslie, C. (2010). Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites. Genome Biol. 11, R90.
Bhattacharjee, R.B., and Bag, J. (2012). Depletion of nuclear poly(A) binding protein PABPN1 produces a compensatory response by cytoplasmic PABP4 and PABP5 in cultured human cells. PLoS ONE 7, e53036.
Chi, S.W., Hannon, G.J., and Darnell, R.B. (2012). An alternative mode of microRNA target recognition. Nat. Struct. Mol. Biol. 19, 321–327.
Cooper, S.J., Trinklein, N.D., Nguyen, L., and Myers, R.M. (2007). Serum response factor binding sites differ in three human cell types. Genome Res. 17, 136–144.
Derti, A., Garrett-Engele, P., Macisaac, K.D., Stevens, R.C., Sriram, S., Chen, R., Rohl, C.A., Johnson, J.M., and Babak, T. (2012). A quantitative atlas of polyadenylation in five mammals. Genome Res. 22, 1173–1183.
Farnham, P.J. (2009). Insights from genomic profiling of transcription factors. Nat. Rev. Genet. 10, 605–616.
Friedman, R.C., Farh, K.K.-H., Burge, C.B., and Bartel, D.P. (2009). Most mammalian mRNAs are conserved targets of microRNAs. Genome Res. 19, 92–105.
Garcia, D.M., Baek, D., Shin, C., Bell, G.W., Grimson, A., and Bartel, D.P. (2011). Weak seed-pairing stability and high target-site abundance decrease the proficiency of lsy-6 and other microRNAs. Nat. Struct. Mol. Biol. 18, 1139–1146.
Giraldez, A.J., Mishima, Y., Rihel, J., Grocock, R.J., Van Dongen, S., Inoue, K., Enright, A.J., and Schier, A.F. (2006). Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Science 312, 75–79.
Grimson, A., Farh, K.K.-H., Johnston, W.K., Garrett-Engele, P., Lim, L.P., and Bartel, D.P. (2007). MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol. Cell 27, 91–105.
Gu, S., Jin, L., Zhang, F., Sarnow, P., and Kay, M.A. (2009). Biological basis for restriction of microRNA targets to the 3′ untranslated region in mammalian mRNAs. Nat. Struct. Mol. Biol. 16, 144–150.
Guo, H., Ingolia, N.T., Weissman, J.S., and Bartel, D.P. (2010). Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature 466, 835–840.
Gurha, P., Abreu-Goodger, C., Wang, T., Ramirez, M.O., Drumond, A.L., van Dongen, S., Chen, Y., Bartonicek, N., Enright, A.J., Lee, B., et al. (2012). Targeted deletion of microRNA-22 promotes stress-induced cardiac dilation and contractile dysfunction. Circulation 125, 2751–2761.
Helwak, A., Kudla, G., Dudnakova, T., and Tollervey, D. (2013). Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding. Cell 153, 654–665.
Hendrickson, D.G., Hogan, D.J., McCullough, H.L., Myers, J.W., Herschlag, D., Ferrell, J.E., and Brown, P.O. (2009). Concordant regulation of translation and mRNA abundance for hundreds of targets of a human microRNA. PLoS Biol. 7, e1000238.
Hoque, M., Ji, Z., Zheng, D., Luo, W., Li, W., You, B., Park, J.Y., Yehia, G., and Tian, B. (2013). Analysis of alternative cleavage and polyadenylation by 3′ region extraction and deep sequencing. Nat. Methods 10, 133–139.
Jan, C.H., Friedman, R.C., Ruby, J.G., and Bartel, D.P. (2011). Formation, regulation and evolution of Caenorhabditis elegans 3′UTRs. Nature 469, 97–101.
Ji, Z., Lee, J.Y., Pan, Z., Jiang, B., and Tian, B. (2009). Progressive lengthening of 30 untranslated regions of mRNAs by alternative polyadenylation during mouse embryonic development. Proc. Natl. Acad. Sci. USA 106, 7028–7033. Ji, Z., Luo, W., Li, W., Hoque, M., Pan, Z., Zhao, Y., and Tian, B. (2011). Transcriptional activity regulates alternative cleavage and polyadenylation. Mol. Syst. Biol. 7, 534. Johnnidis, J.B., Harris, M.H., Wheeler, R.T., Stehling-Sun, S., Lam, M.H., Kirak, O., Brummelkamp, T.R., Fleming, M.D., and Camargo, F.D. (2008). Regulation of progenitor cell proliferation and granulocyte function by microRNA-223. Nature 451, 1125–1129. Kedde, M., Strasser, M.J., Boldajipour, B., Oude Vrielink, J.A.F., Slanchev, K., le Sage, C., Nagel, R., Voorhoeve, P.M., van Duijse, J., Ørom, U.A., et al. (2007). RNA-binding protein Dnd1 inhibits microRNA access to target mRNA. Cell 131, 1273–1286. Kedde, M., van Kouwenhove, M., Zwart, W., Oude Vrielink, J.A.F., Elkon, R., and Agami, R. (2010). A Pumilio-induced RNA structure switch in p27-30 UTR controls miR-221 and miR-222 accessibility. Nat. Cell Biol. 12, 1014– 1020. Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U., and Segal, E. (2007). The role of site accessibility in microRNA target recognition. Nat. Genet. 39, 1278– 1284. Khorshid, M., Hausser, J., Zavolan, M., and van Nimwegen, E. (2013). A bio- physical miRNA-mRNA interaction model infers canonical and noncanonical targets. Nat. Methods 10, 253–255. Landgraf, P., Rusu, M., Sheridan, R., Sewer, A., Iovino, N., Aravin, A., Pfeffer, S., Rice, A., Kamphorst, A.O., Landthaler, M., et al. (2007). A mammalian microRNA expression atlas based on small RNA library sequencing. Cell 129, 1401–1414. Lee, J.Y., Yeh, I., Park, J.Y., and Tian, B. (2007). PolyA_DB 2: mRNA polyade- nylation sites in vertebrate genes. Nucleic Acids Res. 35 (Database issue), D165–D168. Lewis, B.P., Burge, C.B., and Bartel, D.P. (2005). 
Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15–20.
Lianoglou, S., Garg, V., Yang, J.L., Leslie, C.S., and Mayr, C. (2013). Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression. Genes Dev. 27, 2380–2396.
Loeb, G.B., Khan, A.A., Canner, D., Hiatt, J.B., Shendure, J., Darnell, R.B., Leslie, C.S., and Rudensky, A.Y. (2012). Transcriptome-wide miR-155 binding map reveals widespread noncanonical microRNA targeting. Mol. Cell 48, 760–770.
Majoros, W.H., Lekprasert, P., Mukherjee, N., Skalsky, R.L., Corcoran, D.L., Cullen, B.R., and Ohler, U. (2013). MicroRNA target site identification by integrating sequence and binding information. Nat. Methods 10, 630–633.
Mayr, C., and Bartel, D.P. (2009). Widespread shortening of 3′UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 138, 673–684.
Miles, W.O., Tschöp, K., Herr, A., Ji, J.-Y., and Dyson, N.J. (2012). Pumilio facilitates miRNA regulation of the E2F3 oncogene. Genes Dev. 26, 356–368.
Miyamoto, S., Chiorini, J.A., Urcelay, E., and Safer, B. (1996). Regulation of gene expression for translation initiation factor eIF-2 alpha: importance of the 3′ untranslated region. Biochem. J. 315, 791–798.
Nielsen, C.B., Shomron, N., Sandberg, R., Hornstein, E., Kitzman, J., and Burge, C.B. (2007). Determinants of targeting by endogenous and exogenous microRNAs and siRNAs. RNA 13, 1894–1910.
Rodriguez, A., Vigorito, E., Clare, S., Warren, M.V., Couttet, P., Soond, D.R., van Dongen, S., Grocock, R.J., Das, P.P., Miska, E.A., et al. (2007). Requirement of bic/microRNA-155 for normal immune function. Science 316, 608–611.
Sandberg, R., Neilson, J.R., Sarma, A., Sharp, P.A., and Burge, C.B. (2008). Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science 320, 1643–1647.
Shepard, P.J., Choi, E.-A., Lu, J., Flanagan, L.A., Hertel, K.J., and Shi, Y. (2011). Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq. RNA 17, 761–772.
Shin, C., Nam, J.-W., Farh, K.K.-H., Chiang, H.R., Shkumatava, A., and Bartel, D.P. (2010). Expanding the microRNA targeting code: functional sites with centered pairing. Mol. Cell 38, 789–802.
Smibert, P., Miura, P., Westholm, J.O., Shenker, S., May, G., Duff, M.O., Zhang, D., Eads, B.D., Carlson, J., Brown, J.B., et al. (2012). Global patterns of tissue-specific alternative polyadenylation in Drosophila. Cell Rep 1, 277–289.
Spies, N., Burge, C.B., and Bartel, D.P. (2013). 3′ UTR-isoform choice has limited influence on the stability and translational efficiency of most mRNAs in mouse fibroblasts. Genome Res. 23, 2078–2090.
Tian, B., Hu, J., Zhang, H., and Lutz, C.S. (2005). A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 33, 201–212.
Tusher, V.G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98, 5116–5121.
Ulitsky, I., Shkumatava, A., Jan, C.H., Subtelny, A.O., Koppstein, D., Bell, G.W., Sive, H., and Bartel, D.P. (2012). Extensive alternative polyadenylation during zebrafish development. Genome Res. 22, 2054–2066.
van Dongen, S., Abreu-Goodger, C., and Enright, A.J. (2008). Detecting microRNA binding and siRNA off-target effects from expression data. Nat. Methods 5, 1023–1025.
Molecular Cell 53, 1031–1043, March 20, 2014 ©2014 Elsevier Inc.
Appendix 2.
Assessing the ceRNA hypothesis with quantitative measurements of miRNA and target abundance
Rémy Denzler1,2, Vikram Agarwal3,4,5, Joanna Stefano3,4, David P Bartel3,4, and Markus Stoffel1,2
1Institute of Molecular Health Sciences, ETH Zurich, Otto-Stern-Weg 7, HPL H36, 8093 Zurich, Switzerland
2Competence Center of Systems Physiology and Metabolic Disease, ETH Zurich, Otto-Stern-Weg 7, 8093 Zurich, Switzerland
3Howard Hughes Medical Institute and Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
4Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
5Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
V.A. performed computational analysis. R.D. performed experiments. J.S. generated RNA sequencing data. R.D. and M.S. designed the study. R.D., V.A., D.P.B., and M.S. wrote the manuscript.
Published as: Denzler R, Agarwal V, Stefano J, Bartel DP, Stoffel M. "Assessing the ceRNA hypothesis with quantitative measurements of miRNA and target abundance". 2014. Molecular Cell 54(5):766-776.
Molecular Cell Article
Assessing the ceRNA Hypothesis with Quantitative Measurements of miRNA and Target Abundance
Rémy Denzler,1,2 Vikram Agarwal,3,4,5 Joanna Stefano,3,4 David P. Bartel,3,4,* and Markus Stoffel1,2,*
1Institute of Molecular Health Sciences, ETH Zurich, Otto-Stern-Weg 7, HPL H36, 8093 Zurich, Switzerland
2Competence Center of Systems Physiology and Metabolic Disease, ETH Zurich, Otto-Stern-Weg 7, 8093 Zurich, Switzerland
3Howard Hughes Medical Institute and Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
4Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
5Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
*Correspondence: dbartel@wi.mit.edu (D.P.B.), stoffel@biol.ethz.ch (M.S.)
http://dx.doi.org/10.1016/j.molcel.2014.03.045
SUMMARY
Recent studies have reported that competitive endogenous RNAs (ceRNAs) can act as sponges for a microRNA (miRNA) through their binding sites and that changes in ceRNA abundances from individual genes can modulate the activity of miRNAs. Consideration of this hypothesis would benefit from knowing the quantitative relationship between a miRNA and its endogenous target sites. Here, we altered intracellular target site abundance through expression of an miR-122 target in hepatocytes and livers and analyzed the effects on miR-122 target genes. Target repression was released in a threshold-like manner at high target site abundance (≥1.5 × 10⁵ added target sites per cell), and this threshold was insensitive to the effective levels of the miRNA. Furthermore, in response to extreme metabolic liver disease models, global target site abundance of hepatocytes did not change sufficiently to affect miRNA-mediated repression. Thus, modulation of miRNA target abundance is unlikely to cause significant effects on gene expression and metabolism through a ceRNA effect.
INTRODUCTION
MicroRNAs (miRNAs) are an abundant class of small noncoding RNAs that regulate gene expression at the levels of mRNA stability and translation (Pillai et al., 2005; Eulalio et al., 2008; Guo et al., 2010). They pair to target sites (referred to as miRNA response elements [MREs]) within mRNAs to direct the posttranscriptional downregulation of these mRNA targets. The human genome has more than 500 miRNA genes, and miRNAs from individual gene families are able to target hundreds of different messenger RNAs (Baek et al., 2008; Friedman et al., 2009). Given that more than half of all human mRNAs are estimated to be conserved miRNA targets, miRNAs are thought to have widespread effects on gene regulation (Friedman et al., 2009).
Even though many miRNA knockout models show no apparent defect under normal conditions, they frequently exhibit miRNA-dependent phenotypes when specific stresses are applied (Li et al., 2009; Brenner et al., 2010). Therefore, miRNAs are proposed to be critical regulators in stress signal mediation and modulation, where inadequate miRNA levels and responses can cause or exacerbate disease (Mendell and Olson, 2012).
Highly expressed site-containing RNAs, either found naturally or delivered as research reagents, can act as "sponges" to titrate miRNAs away from other normal targets (Ebert et al., 2007; Franco-Zorrilla et al., 2007; Mukherji et al., 2011; Hansen et al., 2013; Memczak et al., 2013). Theoretical and experimental reports have claimed that crosstalk between site-containing RNAs extends far beyond a few highly expressed sponges. Analyses of high-throughput data sets indicate that the activity of a miRNA depends not only on its levels but also on its relative target site abundance (TA), defined as the relative number of sites within the transcriptome for that miRNA (Arvey et al., 2010; Garcia et al., 2011). One hypothesis suggests that this crosstalk has a widespread regulatory function, with the act of titrating miRNAs away from their other targets somehow explaining why so many target sites have been conserved in evolution (Seitz, 2009). This idea is extended to the notion that many miRNA targets act as competitive endogenous RNAs (ceRNAs) that modulate the repression of other targets as their expression increases or decreases (Salmena et al., 2011; Tay et al., 2011). Experimental evidence for such a ceRNA crosstalk was initially described for the tumor-suppressor gene PTEN, which appears to be regulated by the abundance of its pseudogene (PTENP1) in a DICER-dependent manner (Poliseno et al., 2010).
Recent studies have reported the potential physiological relevance of other ceRNAs, including a long noncoding RNA that regulates muscle differentiation (Cesana et al., 2011), an overexpressed 3′ untranslated region (3′ UTR) inducing cancer in transgenic mice (Fang et al., 2013), and a circular RNA (circRNA) regulating miR-7 activity in the CNS (Hansen et al., 2013; Memczak et al., 2013). However, such studies have used cancer cell lines with abnormal miRNA and ceRNA expression (Poliseno et al., 2010; Karreth et al., 2011), leaving their physiological relevance in primary cells unclear.
Molecular Cell 54, 766–776, June 5, 2014 ©2014 Elsevier Inc.
The ceRNA hypothesis is controversial because it is difficult to imagine how the change in expression of individual miRNA targets, which each typically contribute a minuscule fraction of the TA, could possibly influence enough miRNA molecules to affect regulation of other targets. Consideration of the ceRNA hypothesis would clearly benefit from quantitative knowledge of the intracellular relationship of miRNAs and their corresponding target sites. Although some attempts have been undertaken to evaluate this relationship, the data were typically acquired in silico (Ala et al., 2013; Figliuzzi et al., 2013), in vitro with purified components (Wee et al., 2012), or in experimental setups in which rapidly dividing cells were transfected with synthetic miRNAs, which complicate any interpretations more quantitative than relative comparisons (Arvey et al., 2010; Garcia et al., 2011; Tay et al., 2011). A more recent study not subject to these limitations reported that miRNA efficacy tended to be higher for miRNAs with lower predicted target:miRNA ratios but did not address the question of how much change in ceRNA might be required to detectably influence miRNA efficacy (Mullokandov et al., 2012).
In this study, we analyzed the stoichiometric relationship of miR-122 and its target sites by manipulating TA through controlled expression of a validated target of miR-122 in primary hepatocytes and livers. miR-122 has been linked to important human diseases, such as hepatitis C, liver cancer, and hypercholesterolemia, and its target genes have been well characterized (Jopling et al., 2005; Krützfeldt et al., 2005; Esau et al., 2006; Tsai et al., 2009). Our absolute quantification of relevant entities in primary cells and disease states provided insights on the relationship between miR-122 TA and miR-122 activity. These results will facilitate future studies predicting the biologically relevant range of TAs of other miRNAs and the magnitude of change in target abundance required to influence gene expression through a ceRNA mechanism.
RESULTS
miRNA Target Derepression Is Detected at a High Threshold of Added MREs
To assess the relationship between a miRNA and its MREs and the effect of this relationship on target gene regulation, we chose the highly expressed liver-specific miR-122 as a model system. We manipulated endogenous MREs in a controlled manner by overexpressing a full-length AldolaseA (AldoA) mRNA, a strong and validated target of miR-122 (Krützfeldt et al., 2005), using recombinant adenoviruses (Ad-AldoA) carrying either a mutated (Mut), one (1s), or three (3s) miR-122 binding site(s) (Figures 1A and S1A). To eliminate potential off-target effects mediated by the AldoA protein, we introduced a premature stop codon that prevented translation of AldoA protein (Figure S1B).
To assess the stoichiometric relationship of miR-122 and the added MREs in primary hepatocytes, we measured the absolute number of these entities per cell.
Quantitative RT-PCR measurements calibrated with an internal standard curve of synthetic miRNA revealed that miR-122 was expressed at 1.2 × 10⁵ molecules per cell (Figure 1B), which was comparable to levels previously reported (Bissels et al., 2009). As expected, miR-16 and miR-33 were each expressed at fewer copies per cell (1.1 × 10⁴ and 1.2 × 10³, respectively). Next, we measured the increased miR-122 target abundance after infecting hepatocytes with Ad-AldoA at three different multiplicities of infection (MOI; 2, 20, and 200) with our constructs that introduced zero, one, or three miR-122 MREs per AldoA transcript. Adenovirus constructs showed very high transduction efficiencies (Figure S2A), and a linear correlation was observed between viral dose and green fluorescent protein (GFP) mRNA, which was expressed from an independent promoter in the Ad-AldoA vector (Figure 1C). Similar results were observed when monitoring GFP protein levels (Figures S2B and S2C). At MOI 200, AldoA transcripts increased from 3.3 × 10³ (endogenous levels) to 0.8–1.1 × 10⁶ molecules per cell (Figure 1D), introducing up to 2.6 × 10⁶ AldoA MREs per cell (Figure 1E). The ratio of AldoA to GFP mRNA showed that the AldoA transcripts were repressed in an MRE-dependent manner at MOI 2 and 20, which confirmed that miR-122 was functionally engaging the MREs within these transcripts (Figure 1F). This regulation disappeared at MOI 200, suggesting that, at this very high MOI, AldoA transcript overwhelmed the regulatory capacity of miR-122 (Figure 1F). Quantification of miR-122 confirmed that the loss of regulation was not due to a loss in miR-122; even at very high levels, Ad-AldoA did not influence the levels of either miR-122 or two control miRNAs, although it did reduce miR-33 by 2-fold (Figure 1G).
Having observed a loss in AldoA repression at a high MOI, we reasoned that high levels of AldoA transcript could act as a sponge to also derepress cellular miR-122 targets.
Indeed, known miR-122 targets, but not a control transcript ApoM, increased at a high MOI (Figures 1H and S2D). Interestingly, this derepression was confidently detected only when AldoA MREs exceeded 1.5–2.7 × 10⁵ per cell. This threshold corresponded to 1.25–2.25 MREs per miR-122 molecule. Once this threshold was exceeded, additional AldoA MREs led to greater miR-122 target derepression, and the magnitude correlated with the number of miR-122 sites introduced by AldoA transcripts. Altogether, these data demonstrate that derepression mediated through increased expression of a miR-122 target can occur but can be detected only after exceeding a high threshold of added MREs.
The High Threshold Persists after Lowering miR-122 Activity
Two scenarios might explain the high threshold of added MREs required to observe endogenous target derepression. The "excess miRNA" scenario posits that very abundant miRNAs are present in excess over their targets, and thus competing MREs would need to titrate this excess binding capacity before they could exert an observable effect on endogenous target repression. Our case of miR-122 in hepatocytes would be one of the more attractive candidates for this scenario, given that miR-122 is the most abundant miRNA in hepatocytes (Landgraf et al., 2007). Indeed, its abundance of 1.2 × 10⁵ molecules per cell is among the highest reported for a miRNA in any mammalian system. The second scenario is the "high TA" scenario. In this scenario, the effective number of miRNA binding sites within cellular transcripts is so high that even highly expressed miRNAs are mostly bound to a site at any moment in time, and thus the number of competing MREs would need to approach this high effective number of sites before the competing MREs could exert an observable impact on endogenous target repression.
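The stoichiometry quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the constants are the copy numbers stated in the text, and it assumes that the maximum of 2.6 × 10⁶ added MREs per cell comes from the three-site (3s) construct.

```python
# Copy numbers per cell quoted in the text for primary hepatocytes.
MIR122_PER_CELL = 1.2e5          # miR-122 molecules per cell
THRESHOLD_MRES = (1.5e5, 2.7e5)  # added AldoA MREs per cell at which derepression is first detected
MAX_ADDED_MRES = 2.6e6           # largest number of added AldoA MREs per cell (MOI 200)


def added_mres(transcripts_per_cell: float, sites_per_transcript: int) -> float:
    """Added MREs per cell: transcript copies times miR-122 sites per transcript."""
    return transcripts_per_cell * sites_per_transcript


def mres_per_mirna(mres_per_cell: float, mirna_per_cell: float = MIR122_PER_CELL) -> float:
    """Ratio of added MREs to miRNA molecules in one cell."""
    return mres_per_cell / mirna_per_cell


# Threshold expressed as added MREs per miR-122 molecule.
threshold_ratio = tuple(mres_per_mirna(m) for m in THRESHOLD_MRES)

# Assumption: the 2.6e6 maximum was reached with the 3s construct (three sites per transcript).
implied_3s_transcripts = MAX_ADDED_MRES / 3

print(f"{threshold_ratio[0]:.2f} to {threshold_ratio[1]:.2f} added MREs per miR-122 molecule")
print(f"{implied_3s_transcripts:.2e} AldoA 3s transcripts per cell implied at MOI 200")
```

Under that assumption, the implied transcript number falls inside the reported range of 0.8–1.1 × 10⁶ AldoA molecules per cell, and the threshold ratios reproduce the 1.25–2.25 MREs per miR-122 molecule given in the text.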
The idea of many miRNA binding sites within cellular transcripts is supported by reports that many miRNAs have hundreds of conserved MREs (Friedman et al., 2009), that miRNAs also repress many additional mRNAs with nonconserved MREs (Farh et al., 2005; Krützfeldt et al., 2005; Giraldez et al., 2006; Baek et al., 2008), and that high-throughput crosslinking identifies many additional binding sites that would not be classified as MREs because they do not mediate detectable repression (including many sites within open reading frames and marginally effective sites elsewhere) but would nonetheless add to the effective number of binding sites (Hafner et al., 2010). These two scenarios predict two very different responses to miRNA reduction. In the excess miRNA scenario, miRNA reduction would lower the excess miRNA capacity and thereby lower the threshold of added MREs required to observe endogenous target derepression.
In the high TA scenario, the effective number of sites already exceeds the miRNA abundance, and, more importantly, the threshold relates to the effective number of binding sites and not the number of miRNA molecules. Thus, in this scenario, miRNA reduction would lower the degree to which targets are repressed, but it would not lower the threshold of added MREs required to observe derepression.
[Figure 1 graphic not reproduced. Recoverable axis labels: detected versus added synthetic miRNA (copies/cell) for miR-33, miR-16, and miR-122; fold change (GFP/36b4), copies per cell (AldoA), AldoA MRE per cell, fold change (AldoA/GFP), and fold change (miRNA/sno202) for miR-122, miR-33, miR-16, and miR-107, each versus MOI (2, 20, 200); fold change (Gene/36b4) versus AldoA MRE per cell for Gys1, Ndrg3, P4ha1, Slc7a1, Tmed3, Ccng1, and ApoM.]
Figure 1. miRNA Target Derepression Is Detected at a High Threshold of Added MREs
(A) Schematic overview of the different AldoA-expressing adenovirus constructs (Ad-AldoA) harboring either one (1s, blue) or three (3s, green) miR-122 binding sites or a mutated site (Mut, red). Ad-AldoA 3s contained three 8 nt seed matches of miR-122 separated by 17 nt spacers. See also Figure S1.
(B) Absolute miRNA quantification of primary hepatocyte cell lysates spiked with different amounts of synthetic miRNA. Solid lines represent linear regression data with respective 95% confidence intervals.
(C–H) Primary hepatocytes infected with different multiplicities of infection (MOI) of the Ad-AldoA constructs. Relative gene expression of GFP (C) and AldoA (F) and absolute copy numbers per cell of AldoA (D) and AldoA MRE (E). Relative expression of miRNAs (G) or miR-122 target genes and a control nontarget gene (ApoM) (H). See also Figure S2. GFP and miRNA expression are relative to Ad-AldoA Mut at MOI 2; AldoA, miR-122 target genes and the control gene are relative to the respective Ad-AldoA Mut at given MOI. Data represent mean ± SEM (n = 3) for all panels.
By analyzing whether a change in miRNA levels influences the threshold for the number of added AldoA MREs needed for derepression, we sought to experimentally evaluate which scenario applies. We injected three different amounts (low, intermediate, and high) of Antagomir-122 (Ant-122) into mice and found that miR-122 levels detected in the primary hepatocytes were reduced to 0.3, 0.08, and 0.01 of that observed in hepatocytes from mice injected with the mismatch Ant-122 control (Ant-122mm; Figure 2A) (Krützfeldt et al., 2005). Target gene derepression correlated with decreased miR-122 levels, which confirmed that our miR-122 quantification reflected miR-122 activity (Figure 2B). Next, we studied the effect of controlled overexpression of AldoA MRE on target gene derepression in hepatocytes with a modest 3-fold decrease in miR-122 levels. Interestingly, derepression was detected only when exceeding the threshold of 2 × 10⁵ AldoA MREs per cell (Figure 2C).
This threshold was comparable to that observed in cells without reduced miR-122 levels, which indicated that the reason for the threshold was not excess miR-122 binding capacity. Instead, high TA is the more likely reason that the amount of added MREs must exceed a very high level before exerting an observable effect.
[Figure 2 graphic not reproduced. Recoverable axis labels: copies per cell (miRNA) for miR-122 and miR-16 under Ant-122 low, intermediate, and high and Ant-122mm high; fold change (Gene/36b4) versus miR-122 (copies/cell) for Slc7a1, P4ha1, Gys1, AldoA, Ndrg3, Dyrk2, and ApoM; fold change (Gene/36b4) versus AldoA MRE per cell and versus miR-122 (copies/cell) for Gys1, Slc7a1, Ndrg3, Tmed3, Ccng1, and Snrk; copies per cell (AldoA) and AldoA MRE per cell versus miR-122 (copies/cell).]
Figure 2. The High Threshold Persists after Lowering miR-122 Activity
(A) Absolute miRNA copy numbers per cell or (B) relative expression of miR-122 target genes and control nontarget genes (Dyrk2 and ApoM) in primary hepatocytes from mice treated with Ant-122mm or different concentrations of Ant-122. Values for miR-122 target and control genes are normalized to that of the lowest miR-122 concentration.
(C) Relative expression of miR-122 target genes and a nontarget gene (Snrk) in primary hepatocytes with 3-fold decreased miR-122 levels shown in (A), infected with MOI 20 and 200 of Ad-AldoA Mut (red), 1s (blue), or 3s (green).
(D–F) Primary hepatocytes shown in (A) infected with MOI 200 of the three Ad-AldoA constructs. Absolute copy numbers per cell of AldoA (D) and AldoA MRE (E) in relation to miR-122 copy numbers. (F) Relative expression of miR-122 target genes and control nontarget gene (Snrk) normalized to Ad-AldoA Mut of the respective miR-122 condition. Absolute miRNA copy numbers were calculated by multiplying relative abundances (miRNA/snoRNA202), normalized to Ant-122mm, by the copy number evaluated in Figure 1B. Data represent mean ± SEM (n = 4) for all panels.
Some studies claiming ceRNA-mediated gene regulation focus on the number of sites to miRNA families that are shared between the ceRNAs without differentiating between those miRNAs that are expressed at a level sufficient to repress target genes and those that are not (Jeyapalan et al., 2011; Fang et al., 2013). To demonstrate that derepression can only occur in conditions in which target gene repression is happening, we infected hepatocytes harboring different miR-122 levels with Ad-AldoA at MOI 200 and measured target derepression. Levels of AldoA and respective AldoA MRE copy number per cell were comparable in all Ant-treated samples (Figures 2D and 2E). miR-122 target gene derepression was between 1.5- and 2.5-fold in hepatocytes with high miR-122 levels and below 1.5-fold in cells with intermediate miR-122 activity (Figure 2F). No target gene derepression was observed in hepatocytes with the lowest miR-122 levels. Altogether, these data demonstrate that miRNAs need to exceed an expression level sufficient to repress their targets in order for targets to be derepressed in a ceRNA-dependent manner.
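The contrast between the "excess miRNA" and "high TA" scenarios in this section can be illustrated with a deliberately simple toy model. This is not the authors' analysis. It assumes tight binding, equal sharing of the miRNA across all sites, and repression proportional to site occupancy, and the endogenous site numbers used below are hypothetical values chosen only to place the cell in one scenario or the other.

```python
def occupancy(mirna: float, endogenous_sites: float, added_sites: float = 0.0) -> float:
    """Fraction of sites bound, assuming tight binding and equal sharing across all sites."""
    return min(1.0, mirna / (endogenous_sites + added_sites))


def added_sites_for_relative_occupancy(mirna: float, endogenous_sites: float,
                                       fraction: float) -> float:
    """Added sites at which occupancy falls to `fraction` of its unperturbed value."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    baseline = occupancy(mirna, endogenous_sites)
    # Solve mirna / (endogenous_sites + added) == fraction * baseline for `added`.
    return mirna / (fraction * baseline) - endogenous_sites


MIRNA_NORMAL = 1.2e5              # miR-122 copies per cell, from the text
MIRNA_REDUCED = MIRNA_NORMAL / 3  # the modest 3-fold reduction described in the text

# Hypothetical effective numbers of endogenous sites per cell.
EXCESS_MIRNA_SITES = 2.0e4  # fewer sites than miRNA molecules
HIGH_TA_SITES = 4.0e5       # more sites than miRNA molecules

for label, sites in (("excess miRNA", EXCESS_MIRNA_SITES), ("high TA", HIGH_TA_SITES)):
    normal = added_sites_for_relative_occupancy(MIRNA_NORMAL, sites, 0.75)
    reduced = added_sites_for_relative_occupancy(MIRNA_REDUCED, sites, 0.75)
    print(f"{label}: {normal:.3g} added sites (normal miRNA), {reduced:.3g} (reduced miRNA)")
```

In this sketch, lowering the miRNA level shrinks the number of added sites needed in the excess miRNA case but leaves it unchanged in the high TA case, which is the qualitative distinction the antagomir experiment was designed to test.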
The Magnitude of Derepression Correlates with Predicted Site Efficacy and Number of Added AldoA MREs
Previous ceRNA studies have focused on only one or a few targets of a miRNA even though a ceRNA change that influences miRNA activity would be expected to affect more than a few targets. Because any perturbation of a cell might result in spurious expression changes in a few predicted targets, a transcriptome-wide analysis examining the preferential effect on predicted targets would more confidently detect the influence of a competing RNA. Therefore, we extended our quantitative analysis to the transcriptome and performed RNA sequencing (RNA-seq) on primary hepatocytes infected with different Ad-AldoA constructs at MOI 2, 20, and 200. Then, we analyzed the relationship between the derepression of predicted targets and their site number, site type (6, 7, and 8 nt sites), site position, and other determinants used by TargetScan to calculate total context+ scores of predicted miRNA targets (Lewis et al., 2005; Grimson et al., 2007; Garcia et al., 2011). When predicted targets of miR-122, miR-33, miR-16, or abundant miRNA families in liver (either let-7, miR-192, or a combination of the next four most abundant families) were distributed into ten context+ score bins and plotted against their median fold change, the effect of target derepression was evident for predicted targets of miR-122 but not for those of any of the other miRNA families (Figures 3A, S3A, S3B, and Table S1). As expected, the extent of target derepression correlated with the magnitude of the context+ score as well as with the number of added AldoA MREs. These correlations were also observed in the fold change distributions of miR-122 predicted targets (Figure 3B), and analogous results were obtained when stratifying predicted targets by site type (Figure S3C).
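The binning analysis described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the bin width, the helper names, and the gene values are all hypothetical, and the only features taken from the text are that predicted targets are grouped by context+ score and that each bin's median log2 fold change is normalized to that of genes without sites.

```python
import math
from collections import defaultdict
from statistics import median

NO_SITE = "no site"


def bin_center(score, width=0.1):
    """Center of the context+ score bin containing `score`; None marks a gene with no site."""
    if score is None:
        return NO_SITE
    index = math.floor(-score / width)  # context+ scores of predicted targets are <= 0
    return round(-(index + 0.5) * width, 2)


def median_fold_change_by_bin(genes, width=0.1):
    """Median log2 fold change per bin, normalized to the median of genes without sites."""
    groups = defaultdict(list)
    for score, log2_fold_change in genes:
        groups[bin_center(score, width)].append(log2_fold_change)
    baseline = median(groups[NO_SITE])  # requires at least one gene without a site
    return {key: median(values) - baseline for key, values in groups.items()}


# Hypothetical (context+ score, log2 fold change) pairs, for illustration only.
example_genes = [
    (None, 0.00), (None, 0.02), (None, -0.02),
    (-0.05, 0.05), (-0.05, 0.01),
    (-0.25, 0.20), (-0.22, 0.30),
    (-0.45, 0.60),
]

binned = median_fold_change_by_bin(example_genes)
for key, value in binned.items():
    print(key, round(value, 3))
```

A distribution-level comparison, such as the one-sided Kolmogorov-Smirnov test reported for Figure 3B, would be applied to the per-bin fold-change lists rather than to their medians.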
Regardless of how we grouped the predicted targets, the same threshold of ≥1.5 × 10⁵ added MREs per cell was required in order to observe miR-122 derepression. We also studied target gene derepression in primary hepatocytes treated with Ant-122 or the mismatch control Ant-122mm and found that the strongest predicted targets (e.g., those with context+ scores below –0.2) were significantly derepressed in the Ant-122-treated conditions (Figures S3D–S3G and Table S1).
Modest Changes in Target Abundance Are Induced by Metabolic Stress and Disease
Next, we sought to investigate the quantitative relationship between MREs added upon Ad-AldoA infection and those normally contributed by mRNAs of primary hepatocytes. First, we tested how transcript abundances, measured by RNA-seq in fragments per kilobase of transcript per million fragments mapped (FPKM), correlated with the absolute copy numbers determined by quantitative PCR. To this end, we compared the expression levels of four genes that are differentially expressed in primary hepatocyte and liver samples and found a linear relationship between FPKM and absolute copy numbers over several orders of magnitude (Figure 4A), which allowed us to transform RNA-seq data to absolute mRNA copies per cell. Then, we compared how AldoA transcript abundance corresponded to genome mRNA abundance at different MOIs of Ad-AldoA-infected hepatocytes. The AldoA contribution was 0.3%–0.8% at MOI 2, 6%–12% at MOI 20, and >50% of all mRNA at MOI 200 (Figure 4B). In contrast, the largest endogenous contributor to the transcriptome of primary hepatocytes was Transferrin (Trf), which made up only 1.6% of the mRNA (30,000 molecules per cell). Thus, the level of AldoA at the MOIs for which derepression was observed (MOI 20 and 200) was substantially higher than that of transcripts from any single cellular gene.
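The FPKM-to-copy-number transformation described above amounts to multiplying by a fitted slope. The sketch below assumes that the fitted line printed with Figure 4A (y = 3.83·x) has FPKM on the x axis and copies per cell on the y axis; that reading of the axes is an assumption, as are the function names and the example FPKM values.

```python
# Assumed slope of the linear fit in Figure 4A (copies per cell per FPKM unit).
COPIES_PER_FPKM = 3.83


def fpkm_to_copies(fpkm: float, slope: float = COPIES_PER_FPKM) -> float:
    """Convert an FPKM value to estimated mRNA copies per cell using a linear fit through the origin."""
    if fpkm < 0:
        raise ValueError("FPKM cannot be negative")
    return slope * fpkm


def mrna_shares(fpkm_by_gene: dict, slope: float = COPIES_PER_FPKM) -> dict:
    """Fraction of the cellular mRNA pool contributed by each gene."""
    copies = {gene: fpkm_to_copies(value, slope) for gene, value in fpkm_by_gene.items()}
    total = sum(copies.values())
    return {gene: count / total for gene, count in copies.items()}


# Hypothetical FPKM values for three made-up genes.
example_fpkm = {"gene_a": 100.0, "gene_b": 300.0, "gene_c": 600.0}

shares = mrna_shares(example_fpkm)
for gene, share in shares.items():
    print(f"{gene}: {fpkm_to_copies(example_fpkm[gene]):.0f} copies per cell, {share:.1%} of mRNA")
```

Because the fit passes through the origin, each gene's share of the mRNA pool does not depend on the slope; the slope matters only when absolute copy numbers, such as added MREs per cell, are compared with miRNA copy numbers.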
We also attempted to place the AldoA abundance within the context of the miR-122 TA within the hepatocyte transcriptome. A previous estimate of miRNA TA considers all of the 7 and 8 nt sites for that miRNA within expressed 3′ UTRs (Garcia et al., 2011). This TA might over- or underestimate the effective number of binding sites of the transcriptome, depending on the extent to which some of these sites are inaccessible (e.g., because they are occluded by mRNA secondary structure or RNA binding proteins) and the extent to which intracellular binding capacity is augmented by additional sites (e.g., 6 nt sites, other marginal sites, and nonconventional sites as well as sites in ORFs, 5′ UTRs, or noncoding RNAs), many of which might add to the effective number of binding sites without mediating repression. Despite these uncertainties, relative TA estimates for different miRNAs provide a useful basis for distinguishing the more effective miRNAs from the less effective ones (Garcia et al., 2011).
Our conclusion that competing MREs begin to exert their effects as they approach the miRNA binding capacity of the transcriptome provided the means to evaluate the relationship between the previous TA estimate and the apparent number of binding sites. When calculated as before (summing 7 and 8 nt sites in transcriptome 3′ UTRs), the miR-122 TA in hepatocytes at Ad-AldoA MOI 2 was 1.8 × 10⁵ sites per cell, which essentially matched the threshold of added MREs required to begin to observe derepression. The addition of 6 nt sites in the analysis increased the number to 4.4 × 10⁵ miR-122 sites per cell.
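The TA calculation described above is a copy-number-weighted sum of 3′ UTR sites over all expressed transcripts. The following sketch shows that bookkeeping under stated assumptions: the `Transcript` record, the function names, and the three example transcripts are hypothetical, and per-transcript site counts would in practice come from a target-site annotation such as TargetScan.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    copies_per_cell: float
    sites_8nt: int  # 8 nt sites in the 3' UTR
    sites_7nt: int  # 7 nt sites in the 3' UTR
    sites_6nt: int  # 6 nt sites in the 3' UTR


def site_count(t: Transcript, include_6nt: bool) -> int:
    """Number of counted sites on one transcript molecule."""
    return t.sites_8nt + t.sites_7nt + (t.sites_6nt if include_6nt else 0)


def target_abundance(transcripts, include_6nt: bool = False) -> float:
    """Sites per cell: sum over transcripts of copies per cell times sites per transcript."""
    return sum(t.copies_per_cell * site_count(t, include_6nt) for t in transcripts)


def contributions(transcripts, include_6nt: bool = False) -> dict:
    """Fraction of the site pool contributed by each transcript."""
    total = target_abundance(transcripts, include_6nt)
    return {t.name: t.copies_per_cell * site_count(t, include_6nt) / total
            for t in transcripts}


# Hypothetical transcripts, for illustration only.
example = [
    Transcript("X", 1000.0, sites_8nt=1, sites_7nt=0, sites_6nt=0),
    Transcript("Y", 500.0, sites_8nt=0, sites_7nt=2, sites_6nt=1),
    Transcript("Z", 2000.0, sites_8nt=0, sites_7nt=0, sites_6nt=1),
]

ta_7_8 = target_abundance(example)                        # 7 and 8 nt sites only
ta_6_7_8 = target_abundance(example, include_6nt=True)    # 6, 7, and 8 nt sites
shares_6_7_8 = contributions(example, include_6nt=True)
print(ta_7_8, ta_6_7_8, max(shares_6_7_8, key=shares_6_7_8.get))
```

In this made-up example, a highly expressed transcript carrying only a 6 nt site becomes the largest contributor once 6 nt sites are counted, which is why the choice of site types changes both the total and the ranking of contributors.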
Given that this was still below the number of added MREs required to observe half-maximal derepression, for all additional analyses, we considered this revised TA estimate (all 6, 7, and 8 nt sites within the transcriptome 3′ UTRs), which we define as the apparent TA (or TAapp), as a conservative estimate of the effective number of miRNA sites. Next, we calculated how AldoA MREs influenced the miR-122 TAapp (Figure 4C) and what fraction of the TAapp AldoA MREs contributed (Figure 4D).
Because only very highly expressed genes could reach the levels required to affect TA, we searched for endogenous transcripts that quantitatively contributed the largest percentage to transcriptome TAapp. Actinb (Actb), which contributed 5.5% of the TAapp, was the largest potential contributor to miR-122 site abundance in primary hepatocytes (Figure 4E), although this contribution was less than the 30% contribution required for AldoA to detectably modulate miR-122 repression (Figure 4D). When using the same approach to estimate TAapp for let-7, miR-16, miR-33, miR-192, or each of the next four most abundant miRNA families, the transcript with the largest contribution to any TAapp was Albumin (Alb), which contributed 3% of the miR-103 TAapp (Figure 4E).
As a major metabolic integrator of physiological processes, the liver exhibits profound changes of gene regulation in response to insulin signaling and cholesterol metabolism. To examine whether these changes might affect miRNA TAapp, we analyzed two models with severe pathological changes in cholesterol metabolism (LDLR-deficient mice, Ldlr–/–) (Ishibashi et al., 1993) and hepatic steatosis (high-fat diet [HFD] mice; Figures 4F, S4A, and Table S2) (Channon and Wilkinson, 1936).
We also examined livers that were perfused in the absence and presence of insulin, representing fasted and fed states, respectively (Figure S4B and Table S2). In all livers studied, Alb and Transthyretin (Ttr) contributed 10%–20% to TAapp. The only strong contributor that was differentially regulated in any model was major urinary protein 7 (Mup7), which essentially disappeared in livers of HFD mice, causing its contribution to TAapp to decrease from 11.6% in normal livers to 0.01% in HFD livers. Alb, the most highly expressed mRNA and the largest potential contributor of sites for the miR-103 family, had the potential to reduce TAapp by a maximum of only 20% when fully silenced. Conversely, a 30% increase in target abundance would require the most abundant liver transcript to increase 2.5-fold.

Because none of the small number of genes that alone could alter TAapp in a consequential way appeared to do so, we tested whether a substantial change could be achieved through collective changes of all mRNAs. Evaluation of TAapp changes for miR-122, the next ten most abundant miRNA families in liver, miR-33, and miR-16 revealed that TAapp values for these miRNAs were not altered more than 25% in any physiological or disease model, and most changes were below 10% (Figure 4G). We also calculated TAapp values for the liver samples and primary hepatocytes infected with Ad-AldoA at MOI 20, in which derepression was observed. Transcriptome TAapp values ranged from 2.5 × 10⁵ to 7.5 × 10⁵ sites per cell in liver models, and from 3.6 × 10⁵ to 13 × 10⁵ sites per cell in primary hepatocytes (Figure S4C).

[Figure 3 plots not reproduced. Panel A: median fold change (log2) versus bin center (context+ score) at MOI 200, 20, and 2 for the comparisons 3s/Mut, 1s/Mut, and 3s/1s. Panel B: cumulative fraction versus fold change (log2) for miR-122, with context+ score bins 0.00 to -0.15, -0.15 to -0.30, -0.30 to -0.45, below -0.45, and no site.]

Figure 3. The Magnitude of Derepression Correlates with Predicted Site Efficacy and Number of Added AldoA MREs
(A and B) RNA-seq results showing derepression of predicted targets from primary hepatocytes infected with MOI 200, 20, and 2 of Ad-AldoA Mut, 1s, or 3s shown in Figures 1C–1H.
(A) Predicted targets of miR-122 (red), miR-16 (blue), miR-33 (orange), let-7 (green), miR-192 (purple), or a combination of the next four most abundant liver miRNA families (black) were grouped into ten bins based on their context+ scores. For each miRNA family, the median log2 fold change is plotted for the predicted targets in each bin. Medians were normalized to that of the bin with genes without sites. Bins each had at least ten genes; see Figure S3B for group sizes.
(B) Cumulative distributions of mRNA changes for genes with no miR-122 site (black) or predicted target genes with the indicated context+ score bins (color). Number of genes per bin: black, 6,629; green, 1,693; orange, 434; red, 120; purple, 33. *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001, one-sided Kolmogorov-Smirnov (K-S) test.
See also Figure S3 and Table S1.

[Figure 4 plots not reproduced. Panel A: copy number per cell versus FPKM for AldoA, Crot, Chka, and ApoM, with regression line y = 3.83·x. Panels B–D: AldoA mRNA as a fraction of transcriptome mRNA, fold increase of miR-122 TAapp, and AldoA TA as a fraction of transcriptome TAapp, each versus MOI (0, 2, 20, 200) for Ant-122mm, Ad-AldoA Mut, 1s, and 3s. Panels E and F: maximal TA contribution of individual transcripts as a fraction of transcriptome TAapp versus bin center (context+ score) for primary hepatocytes (MOI 2, Ad-AldoA 1s) and for chow and HFD livers; labeled transcripts include AldoA, Alb, Ttr, Actb, Mup7, and Apoa2. Panel G: relative transcriptome TAapp (6, 7, and 8 nt sites) for miR-122, let-7, miR-192, miR-103, miR-29, miR-21, miR-101, miR-320, miR-26, miR-423, miR-1839, miR-16, and miR-33 in Ldlr–/– versus WT, HFD versus chow, and insulin versus PBS livers.]

Figure 4. Modest Changes in Target Abundance Are Induced by Metabolic Stress and Disease
(A) Relationship between FPKM from RNA-seq data and absolute quantification with qPCR. Represented are four genes quantified in all 11 primary hepatocyte samples plus wild-type and Ldlr–/– liver samples. Line represents linear regression of data points. Data represent mean ± 95% confidence intervals.
(B–D) RNA-seq data from primary hepatocytes infected with MOI 200, 20, and 2 of Ad-AldoA Mut, 1s, or 3s shown in Figures 1C–1H. Data represent mean ± SEM. (B) Contribution of AldoA mRNA to the sum of genome mRNA. Increase of transcriptome miR-122 TAapp (C) and the respective contribution of AldoA MRE (D) to total transcriptome miR-122 TAapp mediated by the different Ad-AldoA constructs and viral concentrations.
(E and F) Fractional contribution of the largest potential contributors to transcriptome TAapp in primary hepatocytes infected with MOI 2 of Ad-AldoA 1s (E) or in wild-type livers (F) originating from mice fed either normal chow or a high-fat diet (HFD). Potential contributors were binned by their context+ scores, and the top potential contributors are plotted within each bin.
(G) Relative target abundance of livers from models of physiological (insulin) or disease/stress states (Ldlr–/– and HFD).
See also Figure S4 and Table S2.

No ceRNA Effect Is Detected In Vivo

To examine the influence of AldoA MREs on target gene derepression and relevant physiological endpoints in vivo, we injected wild-type mice with 3 × 10⁹ plaque-forming units of Ad-AldoA and examined livers 5 days postinfection. Virally expressed GFP, and therefore adenovirus expression, was comparable in all conditions (Figure 5A). Ad-AldoA increased AldoA transcripts from 2.2 × 10² (endogenous levels) to 4.7 × 10³ copies per cell (Figure 5B), introducing between 2.6 × 10³ and 5.1 × 10³ miR-122 MREs per cell with Ad-AldoA 1s or 3s, respectively (Figure 5C). Overexpression of the Ad-AldoA constructs did not change levels of miR-122 or a control miRNA (Figure 5D).
No derepression of any miR-122 target or control gene (Snrk and Dyrk2) was observed (Figure 5E). Furthermore, we did not detect changes in serum cholesterol levels (Figure 5F), which decrease upon miR-122 inhibition by Ant-122 (Krützfeldt et al., 2005). As predicted from our studies of primary hepatocytes, these results showed that introduction of 5.1 × 10³ miR-122 MREs per cell was insufficient to induce either target derepression or downstream physiological responses.

DISCUSSION

Our results support a model in which the changes in ceRNAs must begin to approach the TA of a miRNA before they can exert a consequential effect on the repression of targets for that miRNA. For miR-122 in hepatocytes, derepression began to be observed at a threshold of 1.5 × 10⁵ added sites per cell, a value exceeding the physiological levels of any endogenous target as well as the aggregate change of all predicted targets in different disease states. Altogether, our data imply that a ceRNA effect mediated through a single miRNA family in a physiological or disease setting of the liver is unlikely. However, we cannot exclude the possibility that unidentified highly abundant and regulated noncoding RNAs (including circRNAs) might substantially contribute to the pool of transcriptome binding sites.

In stating that changes in endogenous targets are unlikely to mediate a ceRNA effect that is detectable, we do not mean to imply that there is absolutely no molecular consequence of changing the level of an endogenous target. Large changes in each of several dozen target genes could alter TA by 1% or sometimes more, which would influence the repression of other targets but not to an extent that would be detectable by our methods.
For example, an increase in TA by 5% is expected to decrease repression of other targets by approximately 5%, causing a target that was previously repressed by 30% to now be repressed by approximately 28.5%, a change too small to be detected and presumably too small to be of biological consequence.

Studying the stoichiometric relationship of an miRNA and its TA and assessing the effect of this relationship on target gene regulation has been challenging. Estimates of TA have proven to be particularly difficult, given that the extent to which ineffective or marginally effective binding sites contribute to TA has been unclear, and no experimentally determined TA values had been obtained. Our experiments indicate that the TAapp for miR-122 in the hepatocyte transcriptome is 4.4 × 10⁵ sites per cell. Although this estimate corresponds to the number of ≥6 nt seed-matched sites for miR-122 in the 3′ UTRs, we do not presume that all UTR sites mediate repression. Indeed, the TAapp is expected to exceed the number of miR-122 MREs, given that sites that bind the miRNA too transiently to exert repression (including most sites in ORFs) would nonetheless contribute to TAapp.

We qualify our TA estimate as an "apparent TA" for two reasons: first, our miR-122 TAapp is expected to be a function of the strength of the miR-122 site that was used in its determination. The AldoA site is relatively strong (context+ score of −0.4; Figure 4E). Had we empirically estimated the TA with a weaker miR-122 site, more of the added sites would have been required to approach half derepression, and thus the TAapp value would have been correspondingly higher.
Second, the endogenous sites contribute to TAapp in proportion to their ability to sequester the miRNA, and thus because many weak sites (ranging from those typically classified as nonspecific sites to those that might be more specific yet nonetheless ineffective or marginally effective) each make partial contributions to the TAapp, the actual number of sites that contributed is expected to greatly exceed the TAapp. When considering this second point, estimating a TAapp is of greater practical value than knowing the total number of endogenous sites that helped sequester the miRNA.

[Figure 5 plots not reproduced. Panel A: fold change (GFP/36b4). Panel B: copies per cell (AldoA) for Ad-Ctrl, Ad-AldoA Mut, 1s, and 3s. Panel C: AldoA MRE per cell. Panel D: fold change (miRNA/snoRNA202) for miR-16 and miR-122. Panel E: fold change (gene/36b4) for Gys1, Slc7a1, P4ha1, Ndrg3, Snrk, and Dyrk2. Panel F: cholesterol (mg/dl) versus days after injection.]

Figure 5. No ceRNA Effect Is Detected In Vivo
(A–E) Mice were injected with Ad-AldoA Mut (red, n = 6), 1s (blue, n = 6), or 3s (green, n = 5), and gene expression analysis was performed 5 days postinfection. Relative gene expression of GFP (A), absolute copy numbers per cell of AldoA (B), and added AldoA MREs (C). Relative expression of miRNAs (D) and miR-122 target genes or control nontarget genes (Snrk and Dyrk2) (E).
(F) Plasma cholesterol levels of Ad-AldoA-treated mice at days 1, 3, and 5. The Ad-AldoA used in this experiment expressed the full-length protein. Data represent mean ± SEM.
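The dilution reasoning used in this discussion (for instance, the example of a 5% increase in TA lowering a 30% repression to roughly 28.5%) can be checked with a short calculation. The sketch below is only an illustration: it assumes that repression scales in simple proportion to the share of the effective site pool made up of the original sites, which is a simplification rather than a fitted model, and the function name is hypothetical.

```python
def diluted_repression(baseline_repression, ta_app, added_sites):
    """Repression remaining after competing sites are added.

    Assumes repression falls in proportion to the dilution of the original
    effective site pool (ta_app) by the added sites.
    """
    if ta_app <= 0 or added_sites < 0:
        raise ValueError("ta_app must be positive and added_sites non-negative")
    return baseline_repression * ta_app / (ta_app + added_sites)


TA_APP = 4.4e5  # apparent miR-122 target abundance quoted in the text (sites per cell)

# A 5% increase in effective sites: 30% repression drops to about 28.6%.
five_percent = diluted_repression(0.30, TA_APP, 0.05 * TA_APP)

# Doubling the effective sites: repression is halved under this assumption.
doubled = diluted_repression(0.30, TA_APP, TA_APP)

print(round(five_percent, 4), round(doubled, 4))
```

Under this assumption, the number of added sites needed for half-maximal derepression equals the effective TA itself, which is why a half-derepression threshold can be read as an estimate of TAapp.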
Our miR-122 TAapp was empirically derived on the premise that using Ad-AldoA to double the effective miR-122 TA and thereby decrease the number of encounters between miR-122 and its endogenous targets by half would lead to a corresponding decrease in endogenous target repression. If the amount of miR-122-mediated repression is not a simple linear function of the number of encounters with its targets, then TAapp would need to be corrected accordingly. For other miRNAs, TAapp values were estimated starting with the miR-122 TAapp and assuming that relative values for different miRNAs would scale in proportion to their numbers of UTR sites, an assumption supported by studies showing that miRNA efficacy negatively correlates with the relative numbers of UTR sites (Arvey et al., 2010; Garcia et al., 2011; Mullokandov et al., 2012). Despite any uncertainty arising from these simplifying assumptions, our TAapp estimates have the unique benefit of being founded on intracellular experimental observations.

This experimental grounding produced TAapp values much higher than those previously assumed. For example, previous modeling of the quantitative relationships between miRNAs and their targets assumed that a typical miRNA had 500 target sites per cell (Wee et al., 2012). Modeling based on this low number of targets suggests that for moderately expressed miRNAs, adding only 500 sites through increased ceRNA expression could double the expression of a repressed mRNA, whereas for more highly expressed miRNAs, many more sites would be required to exert an effect (Mullokandov et al., 2012; Wee et al., 2012). Our results in hepatocytes indicate that TAapp values for the eleven most abundant miRNA families ranged from 2.5 × 10⁵ to 7.5 × 10⁵ sites, about 1,000 times greater than the value previously assumed. This substantially revised estimate of effective TA leads to a different and somewhat simplified picture of the potential for regulation through ceRNAs.
In our model, miRNA levels matter only insofar as the miRNA must reach a level sufficient to repress a target mRNA. For any miRNA exceeding this level, the potential for ceRNAs to influence repression is simply a matter of whether the ceRNAs add or subtract enough sites to meaningfully influence the TAapp. Because TAapp is a function of the number of seed-matched sites in the transcriptome and substantially exceeds the level of even the most highly expressed miRNA, the ceRNA difference required to achieve half-maximal effects is independent of the miRNA level. Thus, our insights and results indicate that repression by even moderately expressed miRNAs would be difficult to detectably change through a ceRNA effect.

Under extreme physiological and disease conditions, target abundances were not changed more than 10% for most miRNA families. The maximum change of 25% was observed for the let-7 miRNA family in mice fed an HFD versus a chow diet. Interestingly, in this condition, a single highly expressed gene (Mup7) accounted for 50% of the total decrease in let-7 target abundance. A recent phase I trial for RNAi therapy of Ttr amyloidosis reduced human TTR levels by >80% (Coelho et al., 2013). Such a strong reduction of the TTR transcript, which contributes 10% of the miR-192 TAapp in mouse livers, would account for a decrease in miR-192 target abundance analogous to that observed for Mup7 and let-7 in the HFD versus chow diet, a change not expected to detectably affect miRNA activity.

The conclusion that only large contributors to TAapp can detectably influence the miRNA activity agrees with our in vivo experiments; in normal liver, AldoA is expressed at 2.4 × 10² copies per cell and is among the thousand most highly expressed genes. Still, a 9-fold increase in transcript levels after Ad-AldoA 3s infection, which added 5 × 10³ MREs, increased miR-122 TAapp by only 2% and therefore imparted no detectable influence on target gene expression.
Mup7 and Ttr are among the thirty genes expressed in liver at copy numbers above 10⁴ copies per cell, and therefore approaching within an order of magnitude the estimated miRNA TAapp values. Hence, only these 30 genes have potential on their own to perceptibly influence a TAapp.

Our study focused on miR-122, an unusually highly expressed miRNA. Nonetheless, the same high threshold for detectable target derepression was observed when miR-122 activity was reduced, which indicated that our conclusions apply also to more moderately expressed miRNAs. A study reporting loss of miR-20 repression when adding high levels of target mRNA also observed a threshold at high target expression (Mukherji et al., 2011). As expected, their threshold disappeared when a miR-20 sponge was used to lower miRNA activity below detection. More interestingly, they found that transfecting an miR-20 mimic increased the threshold for derepression. A possible reason that they observed a change in threshold with a change in miRNA, whereas we did not, is that their miR-20 mimic might have added enough miRNA to exceed the miR-20 TAapp of their cells. Another difference between their experiments and ours is that their target contained bulged sites of a type that can induce miRNA degradation (Ameres et al., 2010), which might produce an apparent shift in the threshold.

Gene expression in the liver is profoundly regulated by circadian, hormonal, and nutritional states. Using livers of mice exposed to insulin signaling and to pathological conditions of cholesterol metabolism, we did not observe large changes in target abundance, raising the possibility that our findings can be generalized to other organs and disease states. Nonetheless, during cell differentiation and in the context of malignant transformation, expression of coding and noncoding RNA can change dramatically (Rhodes and Chinnaiyan, 2005; Lujambio and Lowe, 2012).
In such biological settings, conditions might arise in which TAapp is lower than in physiological settings and/or a single mRNA substantially contributes to target abundance. In principle, such alterations could make the system more amenable to ceRNA-mediated gene regulation.

EXPERIMENTAL PROCEDURES

Animal Experiments

Animals were maintained on a 12 hr light/12 hr dark cycle under a controlled environment in a pathogen-free facility at the Institute for Molecular Systems Biology, ETH Zürich. Mice were administered adenovirus through a single tail-vein injection of 3 × 10⁹ plaque-forming units in a final volume of 0.2 ml diluted in PBS and killed 5 days postinjection. Antagomir was administered through tail-vein injections on three consecutive days, and primary hepatocytes were isolated on day four. For high, intermediate, and low miR-122 inhibition, mice received 3 × 80, 40, and 20 mg/kg Ant-122, respectively. Ant-122mm (control) was used at the highest concentration. All animal experiments were approved by the ethics committee of the Kantonale Veterinäramt Zürich.

Primary Hepatocytes Isolation and Viral Infections

Primary hepatocytes of 8- to 12-week-old male C57BL/6N mice were isolated on the basis of the method described by Zhang et al. (2012). Hepatocytes were counted and plated at 300,000 cells per well in low-glucose Dulbecco's modified Eagle's medium; adenoviruses were added in Hepatozyme media 4–6 hr after plating, and cells were harvested 24 hr postinfection. All cells were incubated at 37°C in a humidified atmosphere containing 5% CO2.

Adenoviruses

Recombinant adenoviruses were generated as described in the Supplemental Experimental Procedures. All adenoviruses expressed GFP from an independent promoter. Ad-Ctrl was based on the same vector backbone (including GFP) but lacked the AldoA transgene.
Gene Expression Analysis

Total RNA (2 μg) was treated with the DNA-free Kit (Life Technologies) and reverse transcribed with the High Capacity cDNA Reverse Transcription Kit (Life Technologies). Quantitative PCR reactions were performed with the LightCycler 480 (Roche) employing KAPA SYBR FAST qPCR Master Mix (2×) for LightCycler 480 (Kapa Biosystems) and gene-specific primer pairs (Table S3). Relative gene expression was calculated with the ddCT method and mouse 36b4 (Rplp0) for normalization.

miRNA Expression Analysis

Total RNA (150 ng) was reverse-transcribed with the TaqMan MicroRNA Assays and Reverse Transcription Kits (Life Technologies). Quantitative PCR reactions were performed with the LightCycler 480 employing TaqMan Universal PCR Master Mix, No AmpErase UNG (Life Technologies), and TaqMan MicroRNA Assays (Life Technologies). Relative miRNA expression was calculated with the ddCT method and mouse snoRNA202 for normalization.

RNA-Seq

For single-end library construction, total RNA was depleted of rRNA with the Ribo-Zero rRNA Removal Kit (Epicenter). RNA libraries were prepared with the dUTP-based, Illumina-compatible NEXTflex Directional RNA-Seq Kit (Bioo Scientific). For paired-end library construction (performed by BGI), total RNA was enriched for poly(A) mRNA with oligo(dT) beads and treated with buffer in order to yield 200–700 nt fragments. First-strand cDNA was synthesized with random hexamer primers, and second-strand cDNA was synthesized with buffer, dNTPs, RNase H, and DNA polymerase I. cDNA was run on an agarose gel for suitable fragment size selection followed by purification, adaptor ligation, and PCR amplification. All libraries (both single- and paired-end) were sequenced with an Illumina HiSeq 2000 sequencing machine.

ACCESSION NUMBERS

The NCBI Gene Expression Omnibus accession number for the data reported in this paper is GSE52801.
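The ddCT method named in the expression analyses above is commonly computed as 2 raised to the negative ΔΔCt. The following is a minimal sketch of that standard calculation, assuming roughly 100% amplification efficiency (a doubling per cycle); the function name and the Ct values are hypothetical and are not data from this study.

```python
def relative_expression_ddct(ct_target_sample, ct_ref_sample,
                             ct_target_control, ct_ref_control,
                             efficiency=2.0):
    """Fold change of a target gene in a sample relative to a control.

    Each target Ct is first normalized to a reference gene (for example
    36b4 for mRNAs or snoRNA202 for miRNAs), then the sample is compared
    with the control condition.
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_sample - delta_control
    return efficiency ** (-delta_delta)


# Hypothetical Ct values: the target crosses threshold one cycle earlier
# in the sample than in the control, after reference normalization.
fold_change = relative_expression_ddct(24.0, 18.0, 25.0, 18.0)
print(fold_change)
```

With these made-up values the normalized difference is one cycle, so the estimated fold change is 2.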
SUPPLEMENTAL INFORMATION

Supplemental Information contains Supplemental Experimental Procedures, four figures, and three tables and can be found with this article online at http://dx.doi.org/10.1016/j.molcel.2014.03.045.

ACKNOWLEDGMENTS

We would like to thank M. Ravichandran and W. Johnston for technical assistance as well as D. Koppstein, V. Auyeung, M. Latreille, and members of the D.P.B. and M.S. labs for critically reviewing this manuscript. This material is based upon work supported under a National Science Foundation Graduate Research Fellowship (to V.A.), an ERC grant (Metabolomirs) and the NCCR (RNA and Biology; to M.S.), and NIH grant GM067031 (to D.P.B.). D.P.B. is a Howard Hughes Medical Institute Investigator. D.P.B. and M.S. are members of the scientific advisory boards of Alnylam Pharmaceuticals and Regulus Therapeutics.

Received: January 7, 2014
Revised: March 4, 2014
Accepted: March 19, 2014
Published: May 1, 2014

REFERENCES

Ala, U., Karreth, F.A., Bosia, C., Pagnani, A., Taulli, R., Léopold, V., Tay, Y., Provero, P., Zecchina, R., and Pandolfi, P.P. (2013). Integrated transcriptional and competitive endogenous RNA networks are cross-regulated in permissive molecular environments. Proc. Natl. Acad. Sci. USA 110, 7154–7159.
Ameres, S.L., Horwich, M.D., Hung, J.H., Xu, J., Ghildiyal, M., Weng, Z., and Zamore, P.D. (2010). Target RNA-directed trimming and tailing of small silencing RNAs. Science 328, 1534–1539.
Arvey, A., Larsson, E., Sander, C., Leslie, C.S., and Marks, D.S. (2010). Target mRNA abundance dilutes microRNA and siRNA activity. Mol. Syst. Biol. 6, 363.
Baek, D., Villén, J., Shin, C., Camargo, F.D., Gygi, S.P., and Bartel, D.P. (2008). The impact of microRNAs on protein output. Nature 455, 64–71.
Bissels, U., Wild, S., Tomiuk, S., Holste, A., Hafner, M., Tuschl, T., and Bosio, A. (2009). Absolute quantification of microRNAs by using a universal reference. RNA 15, 2375–2384.
Brenner, J.L., Jasiewicz, K.L., Fahley, A.F., Kemp, B.J., and Abbott, A.L. (2010). Loss of individual microRNAs causes mutant phenotypes in sensitized genetic backgrounds in C. elegans. Curr. Biol. 20, 1321–1325.
Cesana, M., Cacchiarelli, D., Legnini, I., Santini, T., Sthandier, O., Chinappi, M., Tramontano, A., and Bozzoni, I. (2011). A long noncoding RNA controls muscle differentiation by functioning as a competing endogenous RNA. Cell 147, 358–369.
Channon, H.J., and Wilkinson, H. (1936). The effect of various fats in the production of dietary fatty livers. Biochem. J. 30, 1033–1039.
Coelho, T., Adams, D., Silva, A., Lozeron, P., Hawkins, P.N., Mant, T., Perez, J., Chiesa, J., Warrington, S., Tranter, E., et al. (2013). Safety and efficacy of RNAi therapy for transthyretin amyloidosis. N. Engl. J. Med. 369, 819–829.
Ebert, M.S., Neilson, J.R., and Sharp, P.A. (2007). MicroRNA sponges: competitive inhibitors of small RNAs in mammalian cells. Nat. Methods 4, 721–726.
Esau, C., Davis, S., Murray, S.F., Yu, X.X., Pandey, S.K., Pear, M., Watts, L., Booten, S.L., Graham, M., McKay, R., et al. (2006). miR-122 regulation of lipid metabolism revealed by in vivo antisense targeting. Cell Metab. 3, 87–98.
Eulalio, A., Huntzinger, E., and Izaurralde, E. (2008). Getting to the root of miRNA-mediated gene silencing. Cell 132, 9–14.
Fang, L., Du, W.W., Yang, X., Chen, K., Ghanekar, A., Levy, G., Yang, W., Yee, A.J., Lu, W.Y., Xuan, J.W., et al. (2013). Versican 3′-untranslated region (3′-UTR) functions as a ceRNA in inducing the development of hepatocellular carcinoma by regulating miRNA activity. FASEB J. 27, 907–919.
Farh, K.K., Grimson, A., Jan, C., Lewis, B.P., Johnston, W.K., Lim, L.P., Burge, C.B., and Bartel, D.P. (2005). The widespread impact of mammalian MicroRNAs on mRNA repression and evolution. Science 310, 1817–1821.
Figliuzzi, M., Marinari, E., and De Martino, A. (2013). MicroRNAs as a selective channel of communication between competing RNAs: a steady-state theory. Biophys. J. 104, 1203–1213.
Franco-Zorrilla, J.M., Valli, A., Todesco, M., Mateos, I., Puga, M.I., Rubio-Somoza, I., Leyva, A., Weigel, D., García, J.A., and Paz-Ares, J. (2007). Target mimicry provides a new mechanism for regulation of microRNA activity. Nat. Genet. 39, 1033–1037.
Friedman, R.C., Farh, K.K.H., Burge, C.B., and Bartel, D.P. (2009). Most mammalian mRNAs are conserved targets of microRNAs. Genome Res. 19, 92–105.
Garcia, D.M., Baek, D., Shin, C., Bell, G.W., Grimson, A., and Bartel, D.P. (2011). Weak seed-pairing stability and high target-site abundance decrease the proficiency of lsy-6 and other microRNAs. Nat. Struct. Mol. Biol. 18, 1139–1146.
Giraldez, A.J., Mishima, Y., Rihel, J., Grocock, R.J., Van Dongen, S., Inoue, K., Enright, A.J., and Schier, A.F. (2006). Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Science 312, 75–79.
Grimson, A., Farh, K.K., Johnston, W.K., Garrett-Engele, P., Lim, L.P., and Bartel, D.P. (2007). MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol. Cell 27, 91–105.
Guo, H., Ingolia, N.T., Weissman, J.S., and Bartel, D.P. (2010). Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature 466, 835–840.
Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., Rothballer, A., Ascano, M., Jr., Jungkamp, A.C., Munschauer, M., et al. (2010). Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell 141, 129–141.
Hansen, T.B., Jensen, T.I., Clausen, B.H., Bramsen, J.B., Finsen, B., Damgaard, C.K., and Kjems, J. (2013). Natural RNA circles function as efficient microRNA sponges. Nature 495, 384–388.
Ishibashi, S., Brown, M.S., Goldstein, J.L., Gerard, R.D., Hammer, R.E., and Herz, J. (1993). Hypercholesterolemia in low density lipoprotein receptor knockout mice and its reversal by adenovirus-mediated gene delivery. J. Clin. Invest. 92, 883–893.
Jeyapalan, Z., Deng, Z., Shatseva, T., Fang, L., He, C., and Yang, B.B. (2011). Expression of CD44 3′-untranslated region regulates endogenous microRNA functions in tumorigenesis and angiogenesis. Nucleic Acids Res. 39, 3026–3041.
Jopling, C.L., Yi, M., Lancaster, A.M., Lemon, S.M., and Sarnow, P. (2005). Modulation of hepatitis C virus RNA abundance by a liver-specific MicroRNA. Science 309, 1577–1581.
Karreth, F.A., Tay, Y., Perna, D., Ala, U., Tan, S.M., Rust, A.G., DeNicola, G., Webster, K.A., Weiss, D., Perez-Mancera, P.A., et al. (2011). In vivo identification of tumor-suppressive PTEN ceRNAs in an oncogenic BRAF-induced mouse model of melanoma. Cell 147, 382–395.
Krützfeldt, J., Rajewsky, N., Braich, R., Rajeev, K.G., Tuschl, T., Manoharan, M., and Stoffel, M. (2005). Silencing of microRNAs in vivo with 'antagomirs'. Nature 438, 685–689.
Landgraf, P., Rusu, M., Sheridan, R., Sewer, A., Iovino, N., Aravin, A., Pfeffer, S., Rice, A., Kamphorst, A.O., Landthaler, M., et al. (2007). A mammalian microRNA expression atlas based on small RNA library sequencing. Cell 129, 1401–1414.
Lewis, B.P., Burge, C.B., and Bartel, D.P. (2005). Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15–20.
Li, X., Cassidy, J.J., Reinke, C.A., Fischboeck, S., and Carthew, R.W. (2009). A microRNA imparts robustness against environmental fluctuation during development. Cell 137, 273–282.
Lujambio, A., and Lowe, S.W. (2012). The microcosmos of cancer. Nature 482, 347–355.
Memczak, S., Jens, M., Elefsinioti, A., Torti, F., Krueger, J., Rybak, A., Maier, L., Mackowiak, S.D., Gregersen, L.H., Munschauer, M., et al. (2013). Circular RNAs are a large class of animal RNAs with regulatory potency. Nature 495, 333–338.
Mendell, J.T., and Olson, E.N. (2012). MicroRNAs in stress signaling and human disease. Cell 148, 1172–1187.
Mukherji, S., Ebert, M.S., Zheng, G.X., Tsang, J.S., Sharp, P.A., and van Oudenaarden, A. (2011). MicroRNAs can generate thresholds in target gene expression. Nat. Genet. 43, 854–859.
Mullokandov, G., Baccarini, A., Ruzo, A., Jayaprakash, A.D., Tung, N., Israelow, B., Evans, M.J., Sachidanandam, R., and Brown, B.D. (2012). High-throughput assessment of microRNA activity and function using microRNA sensor and decoy libraries. Nat. Methods 9, 840–846.
Pillai, R.S., Bhattacharyya, S.N., Artus, C.G., Zoller, T., Cougot, N., Basyuk, E., Bertrand, E., and Filipowicz, W. (2005). Inhibition of translational initiation by Let-7 MicroRNA in human cells. Science 309, 1573–1576.
Poliseno, L., Salmena, L., Zhang, J., Carver, B., Haveman, W.J., and Pandolfi, P.P. (2010). A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature 465, 1033–1038.
Rhodes, D.R., and Chinnaiyan, A.M. (2005). Integrative analysis of the cancer transcriptome. Nat. Genet. Suppl. 37, S31–S37.
Salmena, L., Poliseno, L., Tay, Y., Kats, L., and Pandolfi, P.P. (2011). A ceRNA hypothesis: the Rosetta Stone of a hidden RNA language? Cell 146, 353–358.
Seitz, H. (2009). Redefining microRNA targets. Curr. Biol. 19, 870–873.
Tay, Y., Kats, L., Salmena, L., Weiss, D., Tan, S.M., Ala, U., Karreth, F., Poliseno, L., Provero, P., Di Cunto, F., et al. (2011). Coding-independent regulation of the tumor suppressor PTEN by competing endogenous mRNAs. Cell 147, 344–357.
Tsai, W.C., Hsu, P.W., Lai, T.C., Chau, G.Y., Lin, C.W., Chen, C.M., Lin, C.D., Liao, Y.L., Wang, J.L., Chau, Y.P., et al. (2009). MicroRNA-122, a tumor suppressor microRNA that regulates intrahepatic metastasis of hepatocellular carcinoma. Hepatology 49, 1571–1582.
Wee, L.M., Flores-Jasso, C.F., Salomon, W.E., and Zamore, P.D. (2012). Argonaute divides its RNA guide into domains with distinct functions and RNA-binding properties. Cell 151, 1055–1067.
Zhang, W., Sargis, R.M., Volden, P.A., Carmean, C.M., Sun, X.J., and Brady, M.J. (2012). PCB 126 and other dioxin-like PCBs specifically suppress hepatic PEPCK expression via the aryl hydrocarbon receptor. PLoS ONE 7, e37103.

Appendix 3. Expanded identification and characterization of mammalian circular RNAs

Junjie U. Guo¹,²,³, Vikram Agarwal¹,²,³,⁴, Huili Guo¹,²,³,⁵ and David P. Bartel¹,²,³,⁶

¹Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
²Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
³Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
⁴Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
⁵Present address: Institute of Molecular and Cell Biology, 61 Biopolis Drive, Proteos, Singapore

V.A. performed computational analysis related to miRNA target site enrichment and helped devise the pipeline for circRNA identification. J.U.G. performed all other computational analyses. H.G. generated ribosomal footprinting data. J.U.G. and D.P.B. designed the study and wrote the manuscript.

Published as: Guo JU, Agarwal V, Guo H, Bartel DP. "Expanded identification and characterization of mammalian circular RNAs". 2014. Genome Biology 15(7):409. 1-14.

RESEARCH Open Access

Expanded identification and characterization of mammalian circular RNAs

Junjie U Guo¹,²,³, Vikram Agarwal¹,²,³,⁴, Huili Guo¹,²,³,⁵,⁶,⁷ and David P Bartel¹,²,³*

Abstract

Background: The recent reports of two circular RNAs (circRNAs) with strong potential to act as microRNA (miRNA) sponges suggest that circRNAs might play important roles in regulating gene expression.
However, the global properties of circRNAs are not well understood.

Results: We developed a computational pipeline to identify circRNAs and quantify their relative abundance from RNA-seq data. Applying this pipeline to a large set of non-poly(A)-selected RNA-seq data from the ENCODE project, we annotated 7,112 human circRNAs that were estimated to comprise at least 10% of the transcripts accumulating from their loci. Most circRNAs are expressed in only a few cell types and at low abundance, but they are no more cell-type-specific than are mRNAs with similar overall expression levels. Although most circRNAs overlap protein-coding sequences, ribosome profiling provides no evidence for their translation. We also annotated 635 mouse circRNAs, and although 20% of them are orthologous to human circRNAs, the sequence conservation of these circRNA orthologs is no higher than that of their neighboring linear exons. The previously proposed miR-7 sponge, CDR1as, is one of only two circRNAs with more miRNA sites than expected by chance, with the next best miRNA-sponge candidate deriving from a gene encoding a primate-specific zinc-finger protein, ZNF91.

Conclusions: Our results provide a new framework for future investigation of this intriguing topological isoform while raising doubts regarding a biological function of most circRNAs.

Background

Many classes of non-protein-coding RNAs (ncRNAs) exist in cells [1,2], and members of each class play important roles in either regulating gene expression or other biological processes [3-6]. For example, microRNAs (miRNAs) pair to sites within messenger RNAs (mRNAs) to target the mRNAs for translational repression and/or mRNA destabilization [7]. In an intriguing elaboration of this regulatory pathway, the activity of the mammalian miR-7 miRNA can be inhibited by CDR1as/ciRS-7, which is in turn targeted by another miRNA, miR-671, which shows near-perfect complementarity and triggers endonucleolytic cleavage of CDR1as [8-10].
CDR1as is a circular RNA (circRNA) deriving from an antisense transcript of the CDR1 protein-coding gene [10]. With >60 conserved sites for miR-7, CDR1as is thought to act as a sponge to titrate miR-7 from its other targets [8,9]. A second circRNA proposed to act as a sponge is the testis-specific transcript of the male sex-determining gene Sry, which contains 16 sites for miR-138 [9]. Because circRNAs lack poly(A) tails and 5′ termini, they would escape the deadenylation, decapping and degradation normally caused by miRNA association [11], an obvious advantage for an RNA acting as a miRNA sponge [8,9].

Thousands of additional circRNAs with unknown functions have been identified in various species [8,12-15]. These circRNAs are generated primarily through a type of alternative RNA splicing called ‘back-splicing’, in which a splice donor splices to an upstream acceptor rather than a downstream acceptor (Figure 1A) [8,12,14,16,17]. Based on several criteria, including their intriguing expression patterns, their apparently elevated sequence conservation and the compelling hypothesis that CDR1as acts as a miR-7 sponge, these circRNAs have been proposed to comprise a large class of post-transcriptional regulators. However, the number of additional circRNAs acting as natural miRNA sponges is currently unclear. Indeed, the extent to which these circular isoforms might act in any biological capacity is not known.

* Correspondence: dbartel@wi.mit.edu. Full list of author information is available at the end of the article. © 2014 Guo et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Guo et al. Genome Biology 2014, 15:409 http://genomebiology.com/2014/15/7/409

To begin to consider potential roles of circRNAs in post-transcriptional regulation, we developed a computational pipeline that identifies circRNAs from long-read RNA-seq data without relying on gene annotations. The pipeline resembled that reported previously [8], except that it quantified and considered the abundance of each circular isoform with respect to its alternative linear isoforms. Applying this pipeline to the non-poly(A)-selected RNA-seq data from the ENCODE project, we catalogued >7,000 human circRNAs and characterized their global properties, acquiring new insights regarding their biogenesis, the cell-type specificity of their expression, the extent to which they are conserved, the extent to which they are translated and their potential to act as miRNA sponges.

Results

Properties of human circRNAs

To identify circRNAs from RNA-seq data, we developed the following computational pipeline (Figure 1B). We first mapped all the RNA-seq reads to the genome using Bowtie in single-end mode, allowing ≤2 mismatches. Then we used BLAT to find partial alignments of the unmapped reads. Dual alignments of two read segments mapping to the genome in the reversed order were indicative of circRNAs. The circular fraction (that is, the fraction of the circular isoform relative to all transcripts from the same locus) was quantified for each circRNA candidate by counting relevant reads from the same sample.
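As a rough illustration of the two quantities just introduced, the following sketch flags a read whose two aligned segments map to the genome in reversed order and then computes a circular fraction from read counts. It is not the pipeline itself. The `Segment` record and the function names are hypothetical, and the estimator used for the circular fraction (circular junction reads divided by circular reads plus the mean of the linear junction reads at the donor and acceptor) is an assumption, because the text specifies only that relevant reads from the same sample were counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One aligned piece of a read (hypothetical record, 0-based half-open)."""
    chrom: str
    strand: str      # strand of the alignment, "+" or "-"
    read_start: int  # position of the piece within the read
    read_end: int
    ref_start: int   # position of the piece on the genome
    ref_end: int


def is_reversed_order(first: Segment, second: Segment) -> bool:
    """True if two adjacent read pieces map to the genome in reversed order.

    `first` must be the piece nearer the 5' end of the read.
    """
    if (first.chrom, first.strand) != (second.chrom, second.strand):
        return False
    if first.read_end != second.read_start:  # pieces must abut in the read
        return False
    if first.strand == "+":
        # A collinear read puts `first` upstream; a reversed one puts it downstream.
        return second.ref_end <= first.ref_start
    # On the minus strand the collinear order is mirrored.
    return first.ref_end <= second.ref_start


def circular_fraction(circular: int, linear_donor: int, linear_acceptor: int) -> float:
    """Assumed estimator: circular / (circular + mean linear junction reads)."""
    linear = (linear_donor + linear_acceptor) / 2
    total = circular + linear
    return circular / total if total > 0 else 0.0
```

Under this assumed estimator, a junction with 9 circular reads and 80 and 82 linear junction reads at the donor and acceptor would sit exactly at a circular fraction of 10%.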
We performed circRNA identification and quantification using all the currently available whole-cell non-poly(A)-selected RNA-seq data from the ENCODE project [1], which included a large variety of cultured cell types (Table S1A in Additional file 1). As in some previous studies [8,14], our pipeline used the assembled genome for sequence alignment but disregarded its annotations, and thus it was not affected by incomplete or inaccurate genome annotations and was not biased in favor of alternative isoforms of pre-mRNAs.

circRNAs produced from back-splicing would be expected to have splicing signals at their junctions. Introns spliced by the major spliceosome usually contain the GU dinucleotide at their 5′ end (the splice donor) and the AG dinucleotide at their 3′ end (the splice acceptor) [18]. Indeed, when we analyzed all the dinucleotide frequencies in 10-nucleotide genomic windows mapping to each observed circular junction, a vast majority of candidate circular junctions contained the GT dinucleotide within 5 nucleotides of the putative donor end and the AG dinucleotide within 5 nucleotides of the putative acceptor end (Figure 1C; Figure S1A in Additional file 2). Moreover, a search for motifs within 10-nucleotide genomic windows flanking the circular junctions recaptured the canonical sequence motifs of splice donors and acceptors (Figure S1B in Additional file 2). When considering the minority of candidates without GT-AG-flanking junctions, no pronounced dinucleotide enrichment or significant motif was observed (Figure S1A,B in Additional file 2).

Reasoning that for biological circRNAs a higher fraction of the transcript isoforms might be circular, as is the case for CDR1as, for which almost no linear isoform could be detected [8,9], we calculated for each candidate the fraction of its transcript isoforms that were circular and compared the circular fractions of groups of circRNA candidates with different flanking dinucleotide signatures.
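The grouping of candidates by flanking dinucleotides can be sketched as follows. This is a deliberately loose check that mirrors the wording above (a dinucleotide anywhere within 5 nucleotides of each junction end); the function name, the coordinate convention, and the restriction to forward-strand sequence are assumptions made for illustration, and a candidate on the minus strand would first need to be reverse-complemented.

```python
def flank_signal(genome: str, donor_end: int, acceptor_start: int, flank: int = 5) -> str:
    """Classify a candidate circular junction by its flanking dinucleotides.

    genome: forward-strand sequence of the locus (hypothetical input).
    donor_end: 0-based index of the first base after the donor exon.
    acceptor_start: 0-based index of the first base of the acceptor exon.
    Each window covers `flank` bases on either side of the junction end.
    """
    donor_win = genome[max(0, donor_end - flank):donor_end + flank].upper()
    acceptor_win = genome[max(0, acceptor_start - flank):acceptor_start + flank].upper()
    if "GT" in donor_win and "AG" in acceptor_win:
        return "GT-AG"   # major spliceosome signals
    if "AT" in donor_win and "AC" in acceptor_win:
        return "AT-AC"   # minor spliceosome signals
    return "other"
```

Because a given dinucleotide occurs by chance in a 10-nucleotide window fairly often, a check of this kind is only meaningful in aggregate, which is presumably why the analysis above compares enrichment across all candidates rather than relying on single junctions.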
The circular fractions of GT-AG-flanking candidates tended to be greater than those of the remaining candidates, with the circular fractions of most non-GT-AG-flanking candidates falling below 1% (Figure 1D). To test the extent to which the minor spliceosome might contribute to circRNA formation, we examined the distribution of circular fractions for AT-AC-flanking candidates, but observed no difference from the other non-GT-AG-flanking candidates (Figure 1D).

Collectively, these results indicated that back-splicing by the major spliceosome generates most, if not all, cellular circRNAs. Candidates without these splicing signals were more likely to have arisen from sequencing artifacts (such as chimeric RNA-seq reads resulting from template switching during reverse transcription or PCR), which justified the filter for GT-AG splicing signals imposed in previous pipelines [8]. To maximize the specificity of our pipeline, we carried forward only those candidates flanked by the GT-AG splicing signals, recognizing the possibility that a few candidates discarded by this filter might be authentic circRNAs generated by mechanisms that do not involve the spliceosome, as shown in Archaea [13].

Figure 1 Global identification of human circRNAs. (A) Schematic illustration of the alternative-splicing isoforms generated from linear splicing (left) and back splicing (right). Two-part alignments identified junction-spanning reads indicative of circRNAs (bottom left). Exons are colored, and donor (GU) and acceptor (AG) signals at splice sites are indicated. (B) The computational pipeline developed to identify and quantify circRNAs from long-read RNA-seq data. (C) Enrichment of donor GT and acceptor AG splicing signals in genomic windows flanking candidate circular junctions supported by ≥5 junction-spanning reads in the CD34 sample. Similar results were obtained from all other cell types. (D) Distribution of circular fractions for circRNA candidates in (C), grouped based on whether their circular junctions were flanked by splicing signals of the major or minor spliceosome (GT-AG- and AT-AC-flanking, respectively). (E) Distributions of exon numbers for circRNAs, mRNAs, and other annotated ncRNAs. (F) Annotations of genomic regions mapping to inferred circRNA exons. CDS, coding sequence; lincRNA, long intervening ncRNA; UTR, untranslated region. (G) Splicing within circRNAs of the CD34 sample. Mapped locations of the mates of junction-spanning reads were compared to the genomic annotations 200 nucleotides downstream and upstream of back-spliced acceptors and donors, respectively. Because the fragment size for the paired-end sequencing averaged 200 nucleotides, these genomic annotations resembled those expected if the introns within the circRNAs were retained.

As a second quality filter, we also required that each circRNA have a circular fraction ≥10% in two or more samples. This requirement filtered out about two-thirds of the circRNAs in each sample. With these filters, we annotated 7,112 circRNAs from 39 biological samples representing a large variety of human cell lines (Table S2A in Additional file 3).

Assuming that each circRNA had the same exon structure as the current GENCODE annotation at its locus, we found that most circRNAs spanned <5 exons (Figure 1E), with the distribution of exon abundance resembling that reported for the other GENCODE-annotated ncRNA genes in the human genome [2]. The distribution of circRNA exonic sequence lengths also resembled that of ncRNAs, with a median length of 547 nucleotides, compared with 566 and 2,149 nucleotides for ncRNAs and mRNAs, respectively (Additional file 4). More than half of the circRNAs consisted of only protein-coding exons (Figure 1F), whereas smaller fractions also contained 5′ untranslated regions (UTRs), 3′ UTRs, or both.
CDR1as was among the 68 circRNAs that mapped antisense to annotated protein-coding genes. Another 67 circRNAs mapped to annotated long intervening ncRNAs (lincRNAs) [19], and 342 mapped between annotated genes, with no sense or antisense overlap.

Because many circRNAs contained multiple exons (Figure 1E) and previous studies have noticed retained introns in a few circRNAs [10,15], we more systematically examined whether introns within circRNAs were efficiently removed. We started by mapping all the mate reads of the circular junction-spanning reads in the CD34+ hematopoietic progenitor cells sample. If intra-circular splicing did not occur, most of the mate reads would be expected to map to the first upstream or downstream intron from the back-spliced donor or acceptor, respectively (Figure 1G). We found that approximately 80% of the mate reads that did not map to the same exons as the circular junctions mapped to their neighboring exons, indicating that introns within circRNAs were usually spliced out, although a substantial fraction (approximately 20%) were retained (Figure 1G).

Comparison with previous circRNA catalogs

When comparing our circRNA catalog with those of previous studies, we found that most annotated circRNAs were present in only one catalog (Additional file 5), presumably because of differences in cell types, cutoffs and computational pipelines. A key difference between our catalog and those of others was our requirement that the circRNAs have a circular fraction ≥10%, which prompted us to examine the extent to which this filter explained the differences between our catalog and those of others. For each catalog, we randomly selected one cell type used to build the catalog and quantified the circular fraction of the circRNAs identified in that cell type by the corresponding study, using non-poly(A)-selected RNA-seq data of that cell type.
Due to our circular-fraction filter, all the circRNAs from our study had circular fractions of ≥10% (Additional file 5). About half of the circRNAs identified by the Memczak et al. study [8] had circular fractions of ≥10%, whereas less than 10% of the circRNAs from the other two studies, which used either RNase R-treated [14] or poly(A)-depleted RNA-seq data [15] to enrich for circRNAs, had circular fractions ≥10%.

Trans-splicing rarely contributed to back-spliced junctions

Trans-splicing between pre-mRNAs can also give rise to the appearance of shuffled exons [20,21], many of which would produce sequencing reads indistinguishable from those that we and others [8] attributed to back-spliced products (Figure 2A). To distinguish between back-splicing and trans-splicing, we used the approach used previously on a smaller set of circRNAs [12]. This approach took advantage of the paired-end RNA-seq data and examined the mate reads of the junction-spanning reads, which for some trans-spliced products would map beyond the genomic regions spanning the acceptors and donors of the junction-spanning reads (Figure 2A). Out of >6,000 mates of junction-spanning reads mapped in the CD34+ hematopoietic progenitor cells sample, only four (all from the ANKRD28 locus) mapped upstream of the back-spliced acceptors, and only one (from the ATF7IP locus) mapped downstream of the back-spliced donors (Figure 2B,C).

Although analysis of mate reads would have identified more trans-spliced products if many members of our catalog were in fact trans-spliced and not circular, this analysis presumably missed evidence of trans-splicing in cases for which the exonic distance between the trans-spliced acceptor and donor was too large to exclude any mate reads, which was the case for most circRNAs (Additional file 4). As an orthogonal approach for discriminating between back-spliced and trans-spliced products we considered their polyadenylation status [12].
Poly(A) selection should deplete circRNAs but not trans-spliced products, which are linear and thus expected to have poly(A) tails (Figure 2A). Indeed, using data from U2OS cells, which were independent of the data we used for circRNA discovery, we found that of the 598 members of our catalog detected through junction-spanning reads in non-poly(A)-selected RNA-seq data, only 20 were detected in poly(A)-selected RNA-seq data, as indicated by circular fractions exceeding zero for only these 20 members in the poly(A)-selected data (Figure 2D). Moreover, only six members of our catalog were detected in the poly(A)-selected data but not the non-poly(A)-selected data. The 20 detected in both datasets presumably include both trans-spliced products and circRNAs from loci that also produce trans-spliced isoforms. These observations, in conjunction with the lack of translation across the circular junctions (see below), indicated that trans-splicing contributed very few (<5%) false positives in our circRNA catalog, despite a previous study reporting that shuffled splice isoforms are predominantly trans-spliced products [20]. We attribute our high specificity to our use of non-poly(A)-selected samples for circRNA identification (whereas the previous report started with poly(A)-selected samples) and our requirement that the circular fraction exceeded 10% in at least two samples. These results are consistent with previous studies showing that circRNAs are non-polyadenylated [12] or RNase R-resistant [8,14].

Figure 2 Trans-splicing rarely contributed to back-spliced junctions. (A) Schematic illustration of the analysis of paired-end reads used to distinguish trans-spliced products from circRNAs. Depending on the insert size, mate reads of trans-spliced but not back-spliced junction-spanning reads could potentially map to adjacent linear exons. Based on the insert sizes of the ENCODE paired-end RNA-seq libraries, we only considered circRNAs that were <400 nucleotides. (B) Distances of all mapped mate reads from the acceptors (left) and donors (right). Two possible trans-spliced events are indicated. (C) The identified trans-spliced event from the ANKRD28 locus. (D) Circular fractions of 598 circRNAs detected in non-poly(A)-selected RNA-seq data from U2OS cells, analyzed using non-poly(A)-selected RNA-seq data (Ribo-Zero) and poly(A)-selected RNA-seq data (poly(A)+).

Expression of circRNAs

To act as miRNA sponges or perform other non-catalytic cellular functions, the circRNAs would need to be expressed at consequential levels within the cell. To infer the abundance of each circRNA we multiplied its circular fraction by the density of RNA-seq reads arising from the cognate gene locus (measured in fragments per kilobase of transcript per million fragments sequenced, or FPKM). As observed for all protein-coding genes with FPKM ≥0.1, approximately 40% of all circRNAs annotated from each cell type had an inferred FPKM ≥1, as illustrated for the CD34+ hematopoietic progenitor cells sample (Figure 3A). However, the abundances of circRNAs tailed off much more quickly than did those of mRNAs. For example, when considering the 562 circRNAs with inferred FPKM ≥1.0, only 37 had FPKM ≥10 and none had FPKM ≥100. As a result, our circRNAs comprised a small fraction of the transcriptome of each sample, accounting for an estimated 0.2 to 0.9% of all the exon-mapping reads (Figure 3B). This range is slightly lower than a recent estimate of 1% [15], presumably because most low circular-fraction circRNAs were discarded in our analysis.

We next examined the cell type specificity of circRNA expression. The 39 biological samples varied in the number of detectable circRNAs (Figure 3C).
Although 1,500 to 3,000 circRNAs passed our cutoffs in most cell types, some cell types (for example, HFDPCs (follicle dermal papilla cells)) had approximately three times more circRNAs in the final catalog than others (for example, HAoECs (thoracic aortic endothelial cells)) (Figure 3C). This variation could not be explained by the differences in sequencing depths (Additional file 6). Although some circRNAs (including CDR1as) were more ubiquitously expressed, most were found in only a few cell types (Figure 3D). To assess whether circRNAs were any more cell type specific than their linear counterparts, we compared the Jensen-Shannon specificity scores [19] of circRNAs with those of a cohort of linearly spliced exon pairs with the same distribution of expression levels (that is, the same distribution of total junction-spanning reads) as the circRNA set. The expression of circular junctions was not more cell type-specific than that of the control cohort of linear junctions (Figure 3E), and the expression of both was less cell type-specific than that of lincRNAs [19]. To test whether the efficiency of circularization might be regulated in a cell-type-specific manner, we examined the circular fractions of 1,299 circRNAs for which the donor and the acceptor sites were each supported by ≥5 reads in all 39 samples. The circular fractions of these circRNAs were nearly as correlated between cell types (median Spearman’s ρ = 0.60 to 0.75) (Figure 3F) as they were between biological replicates (median Spearman’s ρ = 0.75). Taken together, our results suggested that circRNA expression is not any more regulated than expected from the availability of the primary transcripts. We compiled a list of 57 circRNAs, including CDR1as, for which the circular fraction was ≥50% in most cell types in which transcript isoforms were detected (Table S2B in Additional file 3).
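For readers unfamiliar with the specificity score used above, a minimal sketch follows. It assumes the commonly used definition of the Jensen-Shannon specificity score, namely one minus the square root of the Jensen-Shannon divergence between the normalized expression profile and a profile concentrated entirely in one cell type, maximized over cell types. The function names are hypothetical, and details such as pseudocounts or log transformation of the input, which the text does not specify, are omitted.

```python
import math


def _entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_specificity(expression):
    """Assumed Jensen-Shannon specificity score of one expression profile.

    expression: non-negative expression values, one per cell type.
    Returns a value in [0, 1]; 1 means expression in a single cell type.
    """
    total = sum(expression)
    if total <= 0:
        return 0.0
    p = [x / total for x in expression]
    best = 0.0
    for t in range(len(p)):
        # Idealized profile: all expression in cell type t.
        q = [1.0 if i == t else 0.0 for i in range(len(p))]
        m = [(a + b) / 2 for a, b in zip(p, q)]
        jsd = _entropy(m) - (_entropy(p) + _entropy(q)) / 2
        best = max(best, 1.0 - math.sqrt(max(jsd, 0.0)))
    return best
```

Because the score depends on how many reads support a junction, comparing circular junctions against linear junctions matched for total junction-spanning reads, as done above, guards against low-count junctions appearing spuriously specific.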
To examine their subcellular localization, we quantified the circular fractions of circRNAs in each of the subcellularly fractionated K562 samples, focusing on the 514 circRNAs detected in the K562 whole-cell samples (Additional file 7). Consistent with previous results on a few circRNAs [12,14], most of these circRNAs were predominantly in the poly(A)-depleted cytoplasmic samples.

Conservation of circRNAs between human and mouse

Using the non-poly(A)-selected RNA-seq data from mouse ENCODE cell lines and some other available non-poly(A)-selected RNA-seq datasets (Table S1B in Additional file 1), we also identified and quantified 635 robustly detectable mouse circRNAs (Additional file 8). When analyzing human and mouse genes with clear one-to-one orthologs, we observed that if the mouse gene had a circRNA in our dataset, its human ortholog was likely to also have one (66%), whereas if the mouse gene did not have a circRNA in our dataset, the human gene was less likely to have one (19%) (Figure 4A). The overlap of human and mouse circRNA genes was not simply due to similarity in exon numbers between orthologs because the enrichment was still observed within subsets of mouse genes grouped by exon numbers (Additional file 9).

To test whether human and mouse circRNAs arose from orthologous exons, we used whole-genome alignments to identify the regions of the mouse genome that corresponded to the human circRNAs (no longer limiting the analysis to one-to-one orthologs) and quantified the degree to which our mouse circRNAs overlapped these regions. Among the 350 mouse circRNAs for which the aligned human gene orthologs also had circRNAs, about a third used the orthologous splice sites of human circRNAs (a higher rate than that previously reported [14]), whereas the remaining two-thirds either partially overlapped (32%) or did not overlap (31%) with aligned human circRNA loci (Figure 4B,C).
These results indicated that human and mouse circRNAs were often generated not only from orthologous genes but also from orthologous exons. The circular fractions of mouse circRNAs (averaged across all cell types in which the transcript was represented by both donor- and acceptor-matching reads) were weakly yet significantly correlated with those of their human orthologs (Spearman’s ρ = 0.30; Figure 4D), which was lower than the correlations between any two human cell types (typically 0.60 to 0.75).

Figure 3 Expression of human circRNAs. (A) Levels of circRNAs in CD34+ hematopoietic progenitor cells. The expression level was estimated for each circRNA (using its circular fraction and the FPKM of the corresponding gene, which included both circular and linear isoforms) and the cumulative distribution of levels is plotted. For comparison, the levels of mRNAs with FPKM ≥0.1 are also plotted. (B) Fractions of mRNA-mapping reads estimated to derive from circRNAs. Reads derived from each circRNA were estimated as the product of the circular fraction, the gene FPKM and the length of the circRNA exonic sequence. The fraction was estimated for each sample, and the distribution of fractions is plotted. (C) Numbers of circRNAs identified in each biological sample. The number of circRNAs was tallied for each sample, and the distribution of values is plotted. (D) Numbers of samples in which ≥10% circular fraction was observed. The number of samples with ≥10% circular fraction was tallied for each circRNA, and the distribution of values is plotted. (E) Cumulative distribution of cell-type-specificity scores of circRNAs compared to mRNAs with similar overall expression levels (linear controls). (F) Unsupervised hierarchical clustering of the circular fractions of 1,299 circRNAs for which the donor and the acceptor sites were each supported by ≥5 reads in all 39 samples.

The derivation of most circRNAs from coding exons complicates analysis of sequence conservation that might provide evidence for sequence-dependent biological function of the circular isoforms. A previous analysis of 223 circRNAs that both derive from coding exons and have orthologous circular isoforms in mouse reported elevated conservation levels in the third nucleotide positions of codons when compared to a control cohort of linear coding exons that were chosen to match the conservation levels at the first and second codon positions [8]. We were able to reproduce these results using the previous list of circRNAs and found that the elevated conservation at the third codon positions was robust when compared with 1,000 different control cohorts (Figure S7A in Additional file 10). Applying this analysis to our list of 130 human circRNAs with mouse orthologs also indicated elevated conservation of the third codon positions (Figure S7A in Additional file 10). Following up on this result, we compared the nucleotide conservation of coding exons within circRNAs to their neighboring linear coding exons, reasoning that the neighboring linear exons would better control for transcript expression levels as well as other unanticipated factors that might correlate with circRNA identification. When using these alternative controls, we did not detect significantly elevated conservation in the third codon positions for either the previous list of circRNAs (Figure S7B in Additional file 10) or our new list (Figure 4E), which argued against the notion that sequence-dependent noncoding functions are enriched within circRNAs.
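The comparison above rests on averaging a per-base conservation score separately at each of the three codon positions of an exon. A minimal sketch of that bookkeeping is given below; the function name and the way the reading frame is passed in are assumptions for illustration, and the input would be per-base scores (such as phyloP values) obtained elsewhere.

```python
def mean_by_codon_position(scores, first_base_position=1):
    """Average per-base scores separately for codon positions 1, 2 and 3.

    scores: per-base conservation scores of a coding exon, in transcript order.
    first_base_position: codon position (1, 2 or 3) of the exon's first base,
        since an exon need not start at the first base of a codon.
    Returns [mean at position 1, mean at position 2, mean at position 3].
    """
    sums = [0.0, 0.0, 0.0]
    counts = [0, 0, 0]
    for offset, score in enumerate(scores):
        position = (first_base_position - 1 + offset) % 3
        sums[position] += score
        counts[position] += 1
    # A position with no bases yields NaN rather than a misleading zero.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]
```

Computing these three means for each circular exon and for its neighboring linear exon gives the paired values that a paired test across circRNAs can then compare, position by position.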
Figure 4 Conservation between human and mouse circRNAs. (A) Analysis of enrichment in circRNAs from human orthologs of mouse genes for which circRNAs were found. Only the mouse genes that had one-to-one human orthologs were considered. (B) Extent to which mouse circRNAs align with human circRNA loci. (C) An example of conserved circRNAs, which derives from human PHF21A and mouse Phf21a loci. (D) Relationship between average circular fractions observed for circRNAs conserved in human and mouse (n = 130). Spearman’s rank correlation coefficient is shown. (E) Sequence conservation for the conserved circRNAs, compared with that of their neighboring exons. Distributions are of average mammalian phyloP scores for each of the three codon positions in circular exons and their neighboring linear exons. No significant difference was observed at any of the three positions (P > 0.1, paired Mann-Whitney test).

No evidence for translation of circRNAs

The observation that most circRNAs are cytosolic [12] and originate from protein-coding sequences raised the question of whether they could be loaded into the ribosome and be translated into polypeptides. Although circRNAs are devoid of the structures typically required for efficient translation initiation, that is, a 5′ cap and 3′ poly(A) tail, cap-independent translation has been reported for many linear mRNAs [22], and translation can proceed on circRNAs once initiated from an internal ribosome entry site [23]. A few abundant circRNAs have been previously shown to be untranslated [14]. To search systematically for evidence of circRNA translation, we examined both ribosome footprinting data and non-poly(A)-selected RNA-seq data for human U2OS cells. Of the 717 circRNAs with RNA-seq reads spanning their circular junctions, 236 had ribosome protected fragments (RPFs) spanning the RefSeq-annotated linear junctions at both splice sites. Strikingly, after excluding the false-positive junction-spanning reads arising from adjacent paralogous genes (12 instances), no RPF reads could be found spanning any of the remaining 224 circRNA junctions (Figure 5A), which led to uniformly zero circular fractions; that is, every informative RPF corresponded to the linear isoforms (Figure 5B). Making the reasonable assumption that translation in alternative frames (which might terminate prior to reaching the circular junction) is rare, our results showed that, compared with their linear isoforms, most circular isoforms are translated far less efficiently if at all in human U2OS cells. Moreover, because trans-splicing is unlikely to affect translational initiation, the absence of RPFs mapping across the junctions that we classified as circular provided additional evidence that these junctions were indeed circular and not generated by trans-splicing. As more ribosome profiling data become available, it will be interesting to re-visit the question of whether some circRNAs might be translated in other cell types or species.

The potential of other circRNAs to act as miRNA sponges

To search for additional miRNA sponges that resemble CDR1as, we considered several expected properties of strong miRNA sponges. First, miRNA sponges would be expected to bind many miRNA-loaded Argonaute proteins. Using data from high-throughput in vivo crosslinking experiments, which identified clusters of AGO2-crosslinking sites that indicated AGO2 binding [24-26], we compared the density of AGO2-crosslinking clusters within exons that can form circRNAs to the density within their neighboring linear exons. Exons that can form circRNAs did not exhibit greater cluster densities for AGO2, with results resembling those for another RNA-binding protein, IGF2BP1 (insulin-like growth factor 2 mRNA-binding protein 1) (Figure 6A).
Similar analyses on 20 additional RNA-binding proteins showed that circular exons generally had slightly higher cluster densities than their neighboring exons (Additional file 11), which could be due to either the circRNAs providing binding sites in addition to those provided by the same exons in linear isoforms, or the lack of translation of circular exons, which would prevent proteins from being displaced by the translocating ribosome. Strikingly, when counting the clusters of AGO2 crosslinks mapping to each circRNA [27], CDR1as had 26 clusters corresponding to miR-7 sites, which was by far the most mapping to any circRNA for any miRNA family (Figure 6B). No other circRNA stood out as a candidate to act as a strong sponge for any of the other RNA-binding proteins.

Figure 5 No evidence for translation of human circRNAs. (A) Numbers of RNA-seq and RPF reads that spanned the linear junction at the donor end, the circular junction, and the linear junction at the acceptor end of 224 circRNAs that contained RPF reads corresponding to both linear junctions in U2OS cells. (B) Circular fractions of 224 of the circRNAs of (A), calculated using either RNA-seq or RPF reads.

Figure 6 A search for additional circRNAs with the expected properties of miRNA sponges. (A) Frequency of AGO2-crosslinking clusters observed in circRNAs compared with that of clusters observed in their neighboring exons (left). See Figure 4E for color keys. For comparison, the analysis was repeated for a negative control, IGF2BP1 (right). No significant difference was observed between circular exons and their neighboring exons (P > 0.1, paired Mann-Whitney test). (B) Numbers of AGO2-crosslinking clusters assigned to individual miRNA families. The number of crosslinking clusters was tallied for each circRNA-miRNA pair, and the distribution of values is plotted. The outlying CDR1as-miR-7 pair is indicated. (C) Numbers of 7- and 8-nucleotide sites for individual miRNA families found within each circRNA. The number of sites was tallied for each circRNA-miRNA pair, and the distribution of values is plotted. The black curve indicates the averaged results when repeating the analysis 1,000 times using different permutations of the site sequences. The two outlying pairs are indicated. (D) Numbers of miRNA target sites in CDR1as and top-ranking ZNF circRNAs. (E) Part of the ZNF91 locus containing the circRNA. miR-23 and miR-296 seed matches are indicated.

Because the AGO2-crosslinking sites were determined in HEK293 cells, circRNAs and miRNAs not expressed in HEK293 cells were missed by this analysis. We thus concatenated the annotated exons within each circRNA, and counted the number of canonical 7- and 8-nucleotide target sites [7] for each of the 87 miRNA families conserved across vertebrates. Again, CDR1as ranked on top, containing 71 miR-7 sites (Figure 6C). CDR1as-miR-7 was also the only circRNA-miRNA pair that exceeded the upper limit of results from the negative control, in which the analysis was repeated with permutated miRNA sequences (Figure 6C). We conclude that among the human circRNAs, CDR1as stands alone as the most compelling miRNA sponge for any conserved miRNA seed family.

Our analysis of miRNA site number also pointed to circRNAs from the repeat-rich C2H2 zinc finger (ZNF) gene family (Figure 6D). In particular, a circRNA generated from the ZNF91 locus (circRNA-ZNF91) contains 24 miR-23 sites (Figure 6E), 19 of which were 8-nucleotide sites.
These numbers exceeded that of the other proposed miRNA sponge, mouse Sry, which has 16 miR-138 sites [9]. ZNF91 belongs to a C2H2 zinc finger (ZNF) gene family that is greatly expanded in the primate lineage and known to contain exceptionally abundant target sites for several miRNA families, including miR-23, miR-181 and miR-199 [28]. The next nine ZNF circRNAs ranked by the total number of sites for these three miRNA families had 7 to 15 sites to one of the 3 families (Figure 6D). Expanding our miRNA site search beyond the 87 miRNA families conserved beyond mammals to the 66 miRNA families conserved only within the mammalian lineage (Figure S9A in Additional file 12), we found that circRNA-ZNF91 had 39 additional sites for miR-296 (Figure 6E). CDR1as also had 22 sites for the miR-876-5p/3167 family (Figure S9B in Additional file 12), although they were not as conserved as the miR-7 sites.

Discussion

Because molecular studies of eukaryotic RNA typically begin with poly(A)-selection, circRNAs have often escaped detection and consideration. Our study adds to previous circRNA annotation efforts [8,12,14,15] to yield an expanded catalog of circRNAs robustly detected from a large variety of human cell types. Our circRNA identification method resembles that previously used [8,14], except we focused our analyses on the circRNA loci with circular fractions ≥10%. Other recent studies take a more targeted approach and search for back-spliced junctions from annotated splice sites [12,15] and therefore miss the unannotated genes and exons, especially those that have particularly high circular fractions and are rarely found in the poly(A)+ RNA-seq data, such as CDR1as. Moreover, unlike previous studies that identify circRNAs from poly(A)-depleted RNA-seq data [14,15], we applied our pipeline to non-poly(A)-selected RNA-seq data, which were neither depleted nor enriched in circRNAs or their linear isoforms.
An advantage of using these datasets is that we could directly estimate circular fractions without experimental calibration [15].

With this catalog of 7,112 human circRNAs in hand, the key question is whether they comprise an underappreciated class of molecules with cellular functions, or whether they are largely inert side-products of imperfect pre-mRNA splicing. The circRNA with the most compelling evidence for a biological function is the miR-7 sponge, CDR1as. Although a biological context has not yet been identified in which CDR1as loss-of-function influences miR-7 activity, this circRNA has >60 conserved sites to miR-7 and a developmental phenotype following its ectopic delivery [8,9]. The other circRNA proposed to act as a miRNA sponge, mouse Sry [9], has only one miR-138 site in its human homolog, which indicates that the proposed sponge function is not conserved in mammals. What about the functional potential of the other 7,000-plus circRNAs? By characterizing the molecular abundance and translation of circRNAs and providing an updated perspective on their sequence conservation and potential to act as miRNA sponges, our analyses can speak to this question. Although we found thousands of circRNAs in each cell type, only approximately 2% (20 to 60, depending on the cell type) had circular fractions exceeding 50%, which indicates that most were minor alternative isoforms of their respective primary transcripts. Moreover, fewer than 10% had FPKMs ≥10 in any of the 39 samples examined. Considering that in homogeneous cell types one molecule per cell usually corresponds to an FPKM of 1 to 4 [29], most circRNAs only accumulated to a few molecules per cell. This generally low circular fraction and weak accumulation was observed despite the expectation that each circRNA, by virtue of its exonuclease insusceptibility, might persist in the cell much longer than its linear alternative isoforms.
Such low accumulation would not be expected of molecules that titrate miRNAs or other abundant regulators away from their regulatory targets. Indeed, we find few circRNAs with the properties expected of miRNA sponges. When circRNAs are experimentally enriched by either poly(A)-depletion [15] or RNase R digestion [14], tens of thousands more circRNAs are found, even when limiting the search to only those that use annotated splice sites. Many of these low-abundance circRNAs had zero junction-spanning reads when we searched in the non-poly(A)-selected RNA-seq data, in which circRNAs were neither enriched nor depleted (Additional file 5). Perhaps it is not too far-fetched to speculate that all multi-exon genes generate one or more circular isoforms at low frequencies, whereas circularization of CDR1as is specific and efficient in all cell types in which it is expressed.

To have a physiological effect at such low levels, circRNAs would need to either participate in a catalytic process or interact very specifically with other molecules that have important functions when present at very low cellular levels. For example, mRNAs have physiological effects when present at only a few molecules per cell because they participate in the catalytic process of translation, which can produce many protein molecules from each mRNA molecule. However, we found that circRNAs are rarely translated. Some linear lincRNAs are proposed to interact with and modulate the output of a single genomic locus, which would explain their physiological effect despite their relatively low cellular abundance [5]. Likewise, a rare circRNA could conceivably recognize and regulate a rare mRNA.
However, a specific, high-affinity interaction with an mRNA or other rare cellular component would presumably rely on the circRNA sequence, which would need to be conserved to retain its function over evolutionary time, yet we found no evidence for circRNA sequence conservation beyond that observed for neighboring linear exons.

We suspect that CDR1as is not the only circRNA with an evolutionarily conserved biological function. This being said, our observations that most circRNAs 1) are inefficiently produced relative to their linear alternative isoforms, 2) accumulate to only low levels in the cell, and 3) are no more conserved than their neighboring linear exons, when considered together, suggest that most circRNAs may be inconsequential side-products of imperfect pre-mRNA splicing. For linear alternatively spliced isoforms, preferential production of orthologous isoforms in the same tissues of different species is considered evidence of function [30,31]. For circular isoforms, this type of analysis would require non-poly(A)-selected datasets from the same tissues of different species, which unfortunately are not yet available. For now, the only observation consistent with the idea that many circRNAs could be functional is our finding that the loci that produce circRNAs in mouse also tend to do so in humans. However, retention of circRNA production since the last common ancestor of mouse and human could have other causes apart from selection for circRNA function. For example, slowed splicing at the circRNA acceptor would presumably favor circRNA production because it would allow for transcription of the downstream donor, and if this slowed splicing is conserved for reasons other than circRNA function, then the production of circRNAs might nonetheless be conserved.
Therefore, considering the conserved production of circRNAs as evidence against the idea that the vast majority of circRNAs are inert splicing side-products would require a more thorough understanding of the determinants of circRNA biogenesis.

Conclusions

Mammalian cells produce a large number of circRNAs, which have captured the interest of many biologists, particularly after the description of CDR1as and its many conserved sites to miR-7. Our work identifies thousands of additional circRNAs and focuses on those that have circular fractions ≥10%. Unlike CDR1as, most of the previously and newly identified mammalian circRNAs represent alternatively spliced, low-abundance isoforms of protein-coding genes. Expression of circRNAs is generally not more cell-type-specific than mRNAs with similar overall expression levels. Although orthologous circRNAs were found between mouse and human, their sequence conservation is no higher than that of their neighboring linear exons, and no other identified circRNA is expected to function as a miRNA sponge nearly as effectively as CDR1as. Although some circRNAs with biological functions might exist, our results suggest that a large majority of circRNAs are inconsequential side-products of pre-mRNA splicing.

Materials and methods

circRNA identification and quantification

Human and mouse Ribo-Zero RNA-seq data were downloaded from either the ENCODE project or Gene Expression Omnibus (GEO). For each sample, FASTQ reads were first mapped to the hg19 or mm9 genome by Bowtie, allowing 2 mismatches. After removing PCR-duplicated reads with the FASTX toolkit, all the unmapped reads were then aligned by BLAT (no mismatch or gap allowed). Dual alignments of two complementary segments within a single read mapping to two regions on the same chromosome in the reverse order and no more than 100 kb away from each other were selected as circular-junction candidates.
Next, GT and AG dinucleotides were searched for within 10-nucleotide genomic windows flanking the donor and acceptor end of each junction, respectively. Candidates with GT-AG-flanking junctions were carried forward, and the GT-AG dinucleotides were used to identify the precise splice sites. For human circRNAs, each junction required support from at least two independent reads within the sample.

To quantify the relative ratio of circular and linear isoforms, we focused on the two segments (20 nucleotides upstream from the donor and 20 nucleotides downstream from the acceptor) flanking the circular junction. Because many linear isoforms may exist for a given splice site, we took an inclusive approach and simply counted the reads that contained either of these two sequences and had enough sequence space for the other sequence ($n_{\mathrm{donor}}$ and $n_{\mathrm{acceptor}}$), and the reads that spanned the circular junction and contained both sequences ($n_{\mathrm{junction}}$). The circular fraction is calculated as $n_{\mathrm{junction}} / (n_{\mathrm{donor}} + n_{\mathrm{acceptor}} - n_{\mathrm{junction}} + 1)$. To be accepted into the final circRNA catalog, a circRNA candidate must have a circular fraction ≥ 10% in at least two samples.

Conservation analyses

One-to-one gene ortholog tables for gene-level analysis were downloaded from Ensembl [32]. For exon-level analysis, human circRNA junction coordinates were converted to mouse (mm9) genome coordinates using the UCSC liftOver tool, then intersected with mouse circRNA junctions using BEDTools. To calculate the correlation of average circular fractions of circRNA orthologs, circular fractions of each circRNA in all cell types wherein it was expressed (≥1 read for each of the donor and acceptor ends) were averaged. Spearman’s rank correlation test was performed.
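
The circular-fraction calculation and the catalog filter described under "circRNA identification and quantification" can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names `circular_fraction` and `passes_catalog_filter` and the example read counts are hypothetical, the three read counts are assumed to have been tallied per sample already, and junction-spanning reads are assumed to be included in both flank counts (which is what the subtraction in the formula suggests).

```python
def circular_fraction(n_junction, n_donor, n_acceptor):
    """Circular fraction as given in the text:
    n_junction / (n_donor + n_acceptor - n_junction + 1).

    Assumption (not stated explicitly in the source): reads spanning the
    circular junction contain both flanking sequences, so they are also
    counted in n_donor and n_acceptor.
    """
    if min(n_junction, n_donor, n_acceptor) < 0:
        raise ValueError("read counts must be non-negative")
    if n_junction > min(n_donor, n_acceptor):
        raise ValueError("junction reads should be included in both flank counts")
    return n_junction / (n_donor + n_acceptor - n_junction + 1)


def passes_catalog_filter(fractions, min_fraction=0.10, min_samples=2):
    """True if the circular fraction is >= min_fraction in at least min_samples samples."""
    return sum(f >= min_fraction for f in fractions) >= min_samples


# Hypothetical (n_junction, n_donor, n_acceptor) counts for one candidate in three samples.
counts = [(5, 20, 22), (1, 30, 28), (4, 12, 15)]
fractions = [circular_fraction(*c) for c in counts]
print([round(f, 3) for f in fractions])   # [0.132, 0.017, 0.167]
print(passes_catalog_filter(fractions))   # True
```

In this made-up example the candidate reaches the 10% threshold in two of the three samples, so it would be retained by the filter as sketched here.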
Analysis of translation

Twenty-nucleotide sequences were taken from circular junctions and each of the two linear junctions overlapping the circular junctions (10 nucleotides from each side of each junction). Numbers of reads containing each of these sequences, as well as the circular fractions for each circRNA, were compared using RNA-seq and RPF data from human U2OS cells.

miRNA and protein binding sites

PAR-CLIP data were downloaded from the GEO. After read alignment by Bowtie, binding clusters were identified using PARalyzer with default settings [24]. Cluster densities of all circular exons were calculated and compared to those of their linear neighboring exons. To avoid biases, only coding exons were considered. To quantify miRNA target sites, exonic segments within each circRNA were concatenated using the transcript models built from all ENCODE cytosolic RNA-seq data, and numbers of canonical miRNA sites (7mer-A1, 7mer-m8, and 8mer sites) [7] for the 87 miRNA families conserved across vertebrates and 66 miRNA families conserved across mammals were quantified for each circRNA. To estimate the distribution of sites expected by chance, the procedure was repeated using 1,000 cohorts consisting of 87 or 66 control k-mers. To select a control k-mer, each 8mer site was randomly permuted to preserve its mononucleotide composition. Permutated sequences were chosen if they preserved the CG dinucleotide number and possessed an A at the 3′-most nucleotide. Collectively, these constraints served to select control k-mers with similar expected genome-wide abundance.

Data availability

RNA-seq and RPF data of human U2OS cells have been deposited in GEO under accession number GSE51584.

Additional files

Additional file 1: Table S1. Non-poly(A)-selected RNA-seq data used in this study.
Additional file 2: Figure S1. Sequence characteristics of circular junctions.
Additional file 3: Table S2. Human circRNA catalog.
Additional file 4: Figure S2. Length distribution of circRNAs.
Additional file 5: Figure S3. Comparison between circRNA annotations.
Additional file 6: Figure S4. Relationship between number of circRNAs detected in each sample and sequencing depth.
Additional file 7: Figure S5. Subcellular localization of circRNAs in K562 cells.
Additional file 8: Table S3. Mouse circRNA catalog.
Additional file 9: Figure S6. Enrichment in circRNAs from human orthologs of mouse genes for which circRNAs were found.
Additional file 10: Figure S7. Protein-coding-independent conservation of circRNAs.
Additional file 11: Figure S8. Frequency of crosslinking clusters observed in circRNAs compared to that of clusters observed in their neighboring exons.
Additional file 12: Figure S9. Sites for mammal-specific miRNA families found within each circRNA.

Abbreviations

circRNA: circular RNA; FPKM: fragments per kilobase of transcript per million fragments sequenced; GEO: Gene Expression Omnibus; lincRNA: long intervening non-coding RNA; miRNA: microRNA; ncRNA: non-protein-coding RNA; RPF: ribosome protected fragment; UTR: untranslated region; ZNF: zinc finger.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JUG led the project and performed most of the analyses. VA contributed to project design and performed miRNA site analyses. HG collected the U2OS RNA-seq and RPF data. DPB supervised the project. JUG, VA and DPB wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank C. Burge, S. Eichhorn, I. Ulitsky and O. Rissland for helpful discussions and suggestions. This work was supported by NIH grant GM067031 (D.P.B.), and a National Science Foundation Graduate Research Fellowship (V.A.). J.U.G. is a Damon Runyon Fellow supported by the Damon Runyon Cancer Research Foundation (DRG-2152-13). H.G. was supported by the Agency for Science, Technology and Research, Singapore. D.P.B.
is an investigator of the Howard Hughes Medical Institute.

Author details

1 Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA.
2 Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA.
3 Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
4 Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
5 Current address: Institute of Molecular and Cell Biology, Singapore 138673, Singapore.
6 Current address: Department of Biological Sciences, National University of Singapore, Singapore 117543, Singapore.
7 Current address: Lee Kong Chian School of Medicine, Nanyang Technological University-Imperial College, Singapore 639798, Singapore.

Received: 9 April 2014 Accepted: 29 July 2014 Published: 29 July 2014

References

1. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J: Landscape of transcription in human cells. Nature 2012, 489:101–108.
2. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, Knowles DG, Lagarde J, Veeravalli L, Ruan X, Ruan Y, Lassmann T, Carninci P, Brown JB, Lipovich L, Gonzalez JM, Thomas M, Davis CA, Shiekhattar R, Gingeras TR, Hubbard TJ, Notredame C, Harrow J, Guigó R: The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res 2012, 22:1775–1789.
3. Sabin LR, Delas MJ, Hannon GJ: Dogma derailed: the many influences of RNA on the genome. Mol Cell 2013, 49:783–794.
4. Guttman M, Rinn JL: Modular regulatory principles of large non-coding RNAs. Nature 2012, 482:339–346.
5.
Ulitsky I, Bartel DP: lincRNAs: genomics, evolution, and mechanisms. Cell 2013, 154:26–46.
6. Batista PJ, Chang HY: Long noncoding RNAs: cellular address codes in development and disease. Cell 2013, 152:1298–1307.
7. Bartel DP: MicroRNAs: target recognition and regulatory functions. Cell 2009, 136:215–233.
8. Memczak S, Jens M, Elefsinioti A, Torti F, Krueger J, Rybak A, Maier L, Mackowiak SD, Gregersen LH, Munschauer M, Loewer A, Ziebold U, Landthaler M, Kocks C, le Noble F, Rajewsky N: Circular RNAs are a large class of animal RNAs with regulatory potency. Nature 2013, 495:333–338.
9. Hansen TB, Jensen TI, Clausen BH, Bramsen JB, Finsen B, Damgaard CK, Kjems J: Natural RNA circles function as efficient microRNA sponges. Nature 2013, 495:384–388.
10. Hansen TB, Wiklund ED, Bramsen JB, Villadsen SB, Statham AL, Clark SJ, Kjems J: miRNA-dependent gene silencing involving Ago2-mediated cleavage of a circular antisense RNA. EMBO J 2011, 30:4414–4422.
11. Huntzinger E, Izaurralde E: Gene silencing by microRNAs: contributions of translational repression and mRNA decay. Nat Rev Genet 2011, 12:99–110.
12. Salzman J, Gawad C, Wang PL, Lacayo N, Brown PO: Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS One 2012, 7:e30733.
13. Danan M, Schwartz S, Edelheit S, Sorek R: Transcriptome-wide discovery of circular RNAs in Archaea. Nucleic Acids Res 2012, 40:3131–3142.
14. Jeck WR, Sorrentino JA, Wang K, Slevin MK, Burd CE, Liu J, Marzluff WF, Sharpless NE: Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA 2013, 19:141–157.
15. Salzman J, Chen RE, Olsen MN, Wang PL, Brown PO: Cell-type specific features of circular RNA expression. PLoS Genet 2013, 9:e1003777.
16. Capel B, Swain A, Nicolis S, Hacker A, Walter M, Koopman P, Goodfellow P, Lovell-Badge R: Circular transcripts of the testis-determining gene Sry in adult mouse testis. Cell 1993, 73:1019–1030.
17.
Nigro JM, Cho KR, Fearon ER, Kern SE, Ruppert JM, Oliner JD, Kinzler KW, Vogelstein B: Scrambled exons. Cell 1991, 64:607–613.
18. Sharp PA, Burge CB: Classification of introns: U2-type or U12-type. Cell 1997, 91:875–879.
19. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL: Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev 2011, 25:1915–1927.
20. Al-Balool HH, Weber D, Liu Y, Wade M, Guleria K, Nam PL, Clayton J, Rowe W, Coxhead J, Irving J, Elliott DJ, Hall AG, Santibanez-Koref M, Jackson MS: Post-transcriptional exon shuffling events in humans can be evolutionarily conserved and abundant. Genome Res 2011, 21:1788–1799.
21. Caudevilla C, Serra D, Miliar A, Codony C, Asins G, Bach M, Hegardt FG: Natural trans-splicing in carnitine octanoyltransferase pre-mRNAs in rat liver. Proc Natl Acad Sci U S A 1998, 95:12185–12190.
22. Gilbert WV: Alternative ways to think about cellular internal ribosome entry. J Biol Chem 2010, 285:29033–29038.
23. Chen CY, Sarnow P: Initiation of protein synthesis by the eukaryotic translational apparatus on circular RNAs. Science 1995, 268:415–417.
24. Corcoran DL, Georgiev S, Mukherjee N, Gottwein E, Skalsky RL, Keene JD, Ohler U: PARalyzer: definition of RNA binding sites from PAR-CLIP short-read sequence data. Genome Biol 2011, 12:R79.
25. Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano M Jr, Jungkamp AC, Munschauer M, Ulrich A, Wardle GS, Dewell S, Zavolan M, Tuschl T: Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell 2010, 141:129–141.
26. Kishore S, Jaskiewicz L, Burger L, Hausser J, Khorshid M, Zavolan M: A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nat Methods 2011, 8:559–564.
27. Hafner M, Lianoglou S, Tuschl T, Betel D: Genome-wide identification of miRNA targets by PAR-CLIP.
Methods 2012, 58:94–105.
28. Schnall-Levin M, Rissland OS, Johnston WK, Perrimon N, Bartel DP, Berger B: Unusually effective microRNA targeting within repeat-rich coding regions of mammalian mRNAs. Genome Res 2011, 21:1395–1403.
29. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5:621–628.
30. Merkin J, Russell C, Chen P, Burge CB: Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science 2012, 338:1593–1599.
31. Barbosa-Morais NL, Irimia M, Pan Q, Xiong HY, Gueroussov S, Lee LJ, Slobodeniuc V, Kutter C, Watt S, Colak R, Kim T, Misguitta-Ali CM, Wilson MD, Kim PM, Odom DT, Frey BJ, Blencowe BJ: The evolutionary landscape of alternative splicing in vertebrate species. Science 2012, 338:1587–1593.
32. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E: EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res 2009, 19:327–335.

doi:10.1186/s13059-014-0409-z
Cite this article as: Guo et al.: Expanded identification and characterization of mammalian circular RNAs. Genome Biology 2014 15:409.

Curriculum Vitae

Vikram Agarwal

Education:

Massachusetts Institute of Technology, Cambridge, MA, 2009 – 2015
Ph.D. in Computational and Systems Biology
Advisor: David P. Bartel

University of Texas at Austin, Austin, TX, 2005 – 2009
B.S. in Biology: Honors

Research Experiences:

University of Texas at Austin, Austin, TX, 2006 – 2009
Advisor: Z. Jeffrey Chen
Computational characterization of miRNAs and their targets in developing cotton fibers

University of Texas at Austin, Austin, TX, 2007 – 2008
Advisor: John Wallingford
RNA structural elements guide mRNA localization in Xenopus laevis

Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 2007
Advisor: Lincoln Stein
Characterizing coverage and chromosomal rearrangement in the Watson genome

Biomedical Research Institute/LSU Health Sciences Center, Shreveport, LA, 2005
Advisor: Steven Alexander
Immune roles in Alzheimer’s and inflammatory bowel disease

Biomedical Research Institute/LSU Health Sciences Center, Shreveport, LA, 2004
Advisor: Anping Chen
Mechanism of curcumin in sensitizing human colon cancer cells

Biomedical Research Institute/LSU Health Sciences Center, Shreveport, LA, 2003
Advisor: Adrian Dunn
Impact of cytokines on motor activity and appetite in mice

Teaching Experience:

Teaching Assistant, Foundations of Computational & Systems Biology (7.91), Spring 2012, Massachusetts Institute of Technology

Teaching Assistant, MIT Quantitative Biology Workshop, Independent Activities Period (IAP), Jan 2012 and Jan 2013, Massachusetts Institute of Technology

Publications:

Agarwal V, Subtelny AO, Jan CH, Ulitsky I, Bell GW, Bartel DP. "Evolutionary and quantitative models of Drosophila microRNA targeting". (In preparation).

Wong SFL*, Agarwal V*, Mansfield JH, Denans N, Schwartz MG, Prosser HM, Pourquié O, Bartel DP, Tabin CJ, McGlinn E. "Independent regulation of vertebral number and vertebral identity by microRNA-196 paralogs". 2015. Proceedings of the National Academy of Sciences USA. doi: 10.1073/pnas.1512655112.

Agarwal V, Bell GW, Nam J-W, Bartel DP. "Predicting effective microRNA target sites in mammalian mRNAs". 2015. eLife 4:e05005. 1-38.

Guo JU, Agarwal V, Guo H, Bartel DP. "Expanded identification and characterization of mammalian circular RNAs". 2014. Genome Biology 15(7):409. 1-14.

Denzler R, Agarwal V, Stefano J, Bartel DP, Stoffel M. "Assessing the ceRNA hypothesis with quantitative measurements of miRNA and target abundance". 2014. Molecular Cell 54(5):766-776.

Nam J-W, Rissland OS, Koppstein D, Abreu-Goodger C, Jan CH, Agarwal V, Yildirim MA, Rodriguez A, Bartel DP. "Global analyses of the effect of different cellular contexts on microRNA targeting". 2014. Molecular Cell 53(6):1031-43.

Pang M*, Woodward AW*, Agarwal V*, Guan X, Ha M, Ramachandran V, Chen X, Triplett BA, Stelly DM, Chen ZJ. "Genome-wide analysis reveals rapid and dynamic changes in miRNA and siRNA sequence and expression during ovule and fiber development in allotetraploid cotton (Gossypium hirsutum L)". 2009. Genome Biology 10(11):R122. 1-21.

Ha M, Pang M, Agarwal V, Chen ZJ. "Interspecies regulation of microRNAs and their targets". 2008. Biochim Biophys Acta 1779(11):735-742.

*These authors contributed equally to the work and are shared co-first authors

Selected Talks:

"Independent Regulation of Vertebral Number and Vertebral Identity by microRNA-196 Paralogs". Jul 2014. Society for Developmental Biology 73rd Annual Meeting, University of Washington. Seattle, WA.

"Predicting effective microRNA target sites in mammalian mRNAs". May 2014. 9th Microsymposium on Small RNAs, Institute of Molecular Biotechnology. Vienna, Austria.

"Quantitative Models of Vertebrate and Drosophila MicroRNA Targeting". Oct 2013. Institute of Molecular Health Sciences, Swiss Federal Institute of Technology (ETH Zürich). Zürich, Switzerland.

Awards/Achievements/Memberships:

2009 – NSF Graduate Research Fellowship (GRFP)
2008 – Barry M. Goldwater Scholarship
2008 – University of Texas Distinguished Scholar
2008 – Unrestricted Endowed Presidential Scholarship, UT Austin