ClusterMap: multi-scale clustering analysis of spatial gene expression

Quantifying RNAs in their spatial context is crucial to understanding gene expression and regulation in complex tissues. In situ transcriptomic methods generate spatially resolved RNA profiles in intact tissues. However, there is a lack of a unified computational framework for integrative analysis of in situ transcriptomic data. Here, we present an unsupervised and annotation-free framework, termed ClusterMap, which incorporates physical proximity and gene identity of RNAs, formulates the task as a point pattern analysis problem, and thus defines biologically meaningful structures and groups. Specifically, ClusterMap precisely clusters RNAs into subcellular structures, cell bodies, and tissue regions in both two- and three-dimensional space, and consistently performs on diverse tissue types, including mouse brain, placenta, gut, and human cardiac organoids. We demonstrate ClusterMap to be broadly applicable to various in situ transcriptomic measurements to uncover gene expression patterns, cell-cell interactions, and tissue organization principles from high-dimensional transcriptomic images.

However, it is challenging to directly extract low-dimensional representations of biological patterns from high-dimensional spatial transcriptomic data.
One main challenge is to achieve accurate and automatic cell segmentation that accurately assigns RNAs into individual cells for single-cell analysis. The most common cell segmentation strategy is labeling cell nuclei or cell bodies by fluorescent staining 9-11 (e.g., DAPI, Nissl, WGA, etc.) and then segmenting the continuous fluorescent signals by conventional or machine learning (ML)-based methods 12 . However, conventional methods, such as distance-transformed watershed 13 , require manual curation to achieve optimal segmentation. On the other hand, while ML-based methods 14,15 can automatically detect the targets (cells) in fluorescent stainings, they still require manually annotated datasets for model training and have poor generalization ability to other datasets.
In order to address these challenges, a fundamentally different approach that bypasses auxiliary cell staining, hyperparameter tuning, and manual labeling is warranted. Here, instead of using fluorescent staining, we directly utilized the patterns of spatially resolved RNAs that intrinsically encode high-dimensional gene expression information for subcellular and cellular segmentation, followed by cell-type spatial mapping. To leverage the spatial heterogeneity of RNA-defined cell types, we applied the same strategy to cluster discrete cells into tissue regions. Together, we demonstrated that this computational framework (termed ClusterMap) can identify subcellular structures, cells, and tissue regions (Fig. 1).

ClusterMap integrates spatial and gene expression analyses
ClusterMap is based on two key biological phenomena. First, the density of RNA molecules is higher inside cells than outside cells; second, cellular RNAs encoded by different genes are enriched at different subcellular locations, cell types, and tissue regions 16,17 . Thus, we reasoned that we could identify biologically meaningful patterns and structures directly from in situ transcriptomic data by joint clustering the physical density and gene identity of RNAs.
Subsequently, the spatial clusters were interpreted based on the gene identity and spatial scales to represent subcellular localization, cell segmentation, and region identification.
ClusterMap started with pre-processed imaging-based in situ transcriptomic data (Methods), where raw fluorescent images were converted into discrete RNA spots with a physical 3D location and a gene identity (i.e. mRNA spot matrix, Fig. 1a). We reasoned that spatial clusters can be distinguished based on the gene expression in the local neighborhood of each RNA spot.
To quantify this, we introduced multidimensional coordinates, termed neighborhood gene composition (NGC) coordinates, which were computed by considering gene expression profiles in a circular window over each RNA spot ( Fig. 1b (I), Methods). ClusterMap is capable of analysis on different spatial resolutions by changing the size of the window. The size of the window is specifically chosen for the same dataset to match the average size of organelles or cells for subcellular or single-cell analysis, respectively (Methods). The NGC coordinates and physical coordinates of each RNA spot are then computationally integrated into joint physical and NGC (P-NGC) coordinates over each spot.
Next, we aimed to cluster the RNAs in the P-NGC coordinates for downstream segmentation.
Out of numerous clustering algorithms, density peak clustering (DPC) 18 , a type of density-based clustering method, was chosen for its versatility in extracting biological features in data and its compatibility with clusters of various shapes and dimensionalities automatically. DPC identifies cluster centers with a higher density than the surrounding regions as well as a relatively large distance from points with higher densities. We applied DPC to compute two variables 18 : local density ρ and distance δ for each spot in the joint P-NGC space. For each spot, ρ value represents the density of its closely surrounded spots, and δ value represents the minimal distance to spots with higher ρ values. Spots with both high ρ and δ values are highly likely to be cluster centers.
We then ranked the product of these two variables, γ, in decreasing order to find genuine clusters with orders of magnitude higher γ values (Methods). For example, in Fig. 1b, the two spots with the γ values that are orders of magnitude higher than other spots are chosen as cell centers (labeled by a red star and a cyan hexagon, Fig. 1b (II)). After selecting the two cluster centers, the remaining spots are assigned to one of the clusters respectively in a descending order of ρ value. Each is assigned to the same cluster as its nearest cluster-assigned neighbor 14 , and each cluster of spots is taken to represent an individual cell ( Fig. 1b (III)). Outliers that were falsely assigned among cells can be filtered out using noise detection in DPC 18 or cell stains. which can be analyzed downstream for purposes such as cell typing, etc. (Fig. 1b (IV)). To illustrate this framework, we applied ClusterMap to representative in situ transcriptomics data from mouse brain tissue collected by the STARmap method 6 (Fig. 1c).
Next, we examined and validated the performance of ClusterMap in diverse biological samples at different spatial scales in both 2D and 3D (Fig. 1d). First, based on the assumption that cellular RNAs have a different distribution in the nucleus or cytoplasm 19 , we used ClusterMap to cluster mRNAs within one cell to define the nuclear boundary. Here, RNA spots with both highly correlated neighboring composition and close spatial distances were merged into a single signature ( Supplementary Fig. 1a, Methods). Then, a convex hull was constructed from the nucleus spots, denoting the nuclear boundary. The patterns of ClusterMap-constructed nuclear boundaries were highly correlated with DAPI stains, confirming the power of ClusterMap for segmentation at the subcellular resolution ( Fig. 1d (I)). Second, we compared cell segmentation results by ClusterMap with conventional watershed 13 segmentation (Methods) on the same mouse cortex cells. Comparing with the conventional watershed method, ClusterMap accurately identified cells, more precisely outlined cell boundary and illustrated cell morphology (Fig. 1d (II)). Last, we extended ClusterMap to diverse tissue types at different scales in both 2D and 3D, where dense heterogeneous populations of cells with arbitrary shapes exist. Cell identification results for the mouse cerebellum, the ileum, and the cortex are shown in Fig. 1d (III)-(V).

Spatial clustering analysis in mouse brain
We first demonstrated ClusterMap on the mouse primary visual cortex from the STARmap mouse primary cortex (V1) 1020-gene dataset 6 (Supplementary Table 1). When sequenced transcripts were more likely to populate the cytoplasm, sparsely sampled spots based on DAPI signals were combined with RNAs to compensate for the lack of signals in the centers of cells, and they were together processed with modified ClusterMap procedures (Fig. 2a, Methods). The results show clear cell segmentation even for strongly crowded mouse V1 cortex cells (Fig. 2b).
Additionally, we evaluated whether ClusterMap-identified cell center coordinates were within corresponding expert-labeled cell regions on eight STARmap mouse V1 datasets to validate its accuracy ( Supplementary Fig. 1b, c). Notably, ClusterMap cell labeling reached accuracy levels of 80-90% (Methods). Furthermore, we integrated gene expression in either ClusterMapidentified or expert-labeled cells with scRNA-seq data and compared their correlation 20,21 ( Supplementary Fig. 2d, Methods). Again, ClusterMap exhibited a comparable performance with expert-annotated segmentation. In the mouse V1 cortex dataset, ClusterMap identified cell The next challenge was to apply ClusterMap on the cell-typing map to identify tissue regions. In this case, ClusterMap further clustered cells based on their physical and cell-type identity, providing similar clustering analyses of physical and high-dimensional cell-type information.
ClusterMap computed a neighborhood cell-type composition (NCC) coordinates of each cell 23 and then clustered joint physical and NCC coordinates of cells ( Supplementary Fig. 1d, Methods). As a result, cells with both highly correlated neighboring cell-type composition and close spatial distances are clustered into a single tissue region signature. The results showed that ClusterMap accurately detected cortical layering, which allows for the quantification of cell-type composition of each cortical layer (Fig. 2b-e). The distinct region-specific distribution of excitatory neurons can be observed in the L2/3, L4, L5, and L6 canonical layers, while oligodendrocytes were significantly distributed within the corpus callosum layer. In summary, ClusterMap can effectively, accurately, and automatically conduct cell segmentation, cell typing, and tissue region identification.

ClusterMap enables spatial clustering and cell-cell interaction analyses in mouse placenta
To further demonstrate the generality of ClusterMap, especially its applicability to tissues with high cell density and variable nuclear/cytosolic distribution of RNAs, we applied ClusterMap to the STARmap mouse placenta 903-gene dataset (Fig. 3a,   The discovery of the interwovenness of different tissue regions in placenta samples suggests the rich patterns of cell-cell interactions. We further sought to use ClusterMap results to characterize the near-range cell interaction networks by generating a mesh graph via Delaunay triangulation of cells and modeling the cellular relationships based on the i-niche concept 24 . In this way, we identified the nearest neighbors of each cell which were directly contacting each other (Fig. 4a-c) and quantified the average number of cells per cell-type among the first-tier neighbors (Fig. 4e), which could reveal crucial information about the affinity and communication among different cell types. Through this methodology, we discovered cell-type-specific cellular interactions: MD-1, MD-2, trophoblast giant-1 (TG-1), and NK cells mainly self-aggregate; glandular trophoblast-2 (GT-2), TG-2, TG-3, and endothelial/stromal cells widely connect with four types of cells; and Spongiotrophoblast -1 and Spongiotrophoblast -2 cells have a high affinity to each other. We envision that identifying cell-cell interaction facilitates future in-depth studies of the biological mechanisms and physiology of placenta and placenta-related diseases.

ClusterMap is applicable across various in situ transcriptomic methods
Beyond just STARmap 6 , we further applied ClusterMap to analyze mouse brain tissue from three other in situ transcriptomics methods. Analyses of the imaged transcripts in the hypothalamic preoptic region by MERFISH 3 , the isocortex region by pciSeq 4 , and the somatosensory cortex by In conclusion, we analyzed mouse brain data from four representative in situ transcriptomic methods and validated the utility and universality of ClusterMap under different experimental methods. ClusterMap successfully produced comparable results across different methods with negligible modification applied.

3D ClusterMap analyses in thick tissue blocks
3D in situ transcriptomics data analysis is considered even more challenging because it is generally infeasible by manual labeling. However, 3D volumetric imaging and analysis are required to understand the structural and functional organization of complex organs. In this regard, exploring ClusterMap's ability to analyze 3D in situ transcriptomics is particularly desired. We applied ClusterMap to two 3D thick-tissue samples: STARmap cardiac organoid 8gene dataset 25 and STARmap mouse V1 28-gene dataset 6 (Supplementary Table 1). We analyzed the 3D data following the sample protocol described in Fig. 1b Fig. 5a-c). The 100-μm-thick sample of mouse V1 includes all six cortical layers and the corpus callosum, in which up to 24,000 cells were identified and 3D clustered into eleven cell types (Fig. 6d, e, Supplementary Fig. 5d-g). Our results showed similar spatial distribution with previously published results, which used the conventional fluorescence image segmentation: excitatory neurons exhibited a gradient distribution, with the spatial density of each subtype gradually decaying to adjacent layers across the entire 3D space; inhibitory neurons showed a more dispersed distribution; and non-neuronal cells were largely located in the white matter and layer 1 (Fig. 6e). We can determine seven 3D tissue regions based on their corresponding cell-type compositions (Fig.6 f, g). We further characterized 3D cell-cell interactions in the mouse V1 and computed the average compositional neighboring cell types ( Fig. 6h-k). In the minority inhibitory neurons, we observed a similar self-associative pattern as in previously published findings 6 : the nearest neighbor of any inhibitory neuron tends to be its own subtype. Three examples of inhibitory neuronal types (Pv, Sst, Vip) interactions are presented in Fig. 6h-j, respectively.

Discussion
Spatial RNA localization intrinsically contains information related to biological structures and cell functions, which are yet to be effectively retrieved. ClusterMap exemplifies a computational framework that combines spatial and high-dimensional transcriptomic information from in situ single-cell transcriptomics to identify subcellular, cellular, and tissue structures in both 2D and 3D space. It is widely applicable to various experimental methods including, but not limited to, Furthermore, ClusterMap is easy to scale up to a large dataset covering large-volume and organlevel imaging data. Beyond spatial transcriptomic data, ClusterMap can be generalized and applied to other 2D and 3D mapped high-dimensional discrete signals (e.g., proteins or signaling molecules imaging data) 27 . In the future, we envision that ClusterMap can also be extended by combining other types of biological features (e.g., subcellular organelles, cell shapes, etc.) to uncover the basic principles of how gene expression shapes cellular architecture and tissue morphology 28 .

Methods
Data pre-processing

Thin-section STARmap Data Processing
All image processing steps were implemented using MATLAB R2019b and related open-source packages in Python 3.6 according to Wang et al., 2018 6 .

Image Preprocessing
For better unity of the illuminance and contrast level of the fluorescence raw image, a multidimensional histogram matching was performed on each image, which used the image of the first color channel in the first sequencing round as a reference.

Image Registration
Global image registration for aligning spatial position of all amplicons in each round of STARmap imaging was accomplished using a three-dimensional Fast Fourier transform (FFT) to compute the cross-correlation between two image volumes at all translational offsets. The position of the maximal correlation coefficient was identified and used to transform image volumes to compensate for the offset.

Spot Finding
After registration, individual spots were identified separately in each color channel on the first round of sequencing. For this experiment, spots of approximately 6 voxels in diameter were identified by finding local maxima in 3D. After identifying each spot, the dominant color for that spot across all four channels was determined on each round in a 5*5*3 voxel volume surrounding the spot location.

Spots and Barcode Filtering
Spots were first filtered based on fluorescence quality score. Fluorescence quality score is the ratio of targeted single-color channel to all color channels, which quantified the extent to which each spot on each sequencing round came from one color rather than a mixture of colors. Each spot is assigned with a barcode representing a specific kind of gene. The barcode codebook that contains all gene barcodes was converted into color space, based on the expected color sequence following 2-base encoding of the barcode DNA sequence 6 . Spot color sequences that passed the quality threshold and matched sequences in the codebook were kept and identified with the specific gene that that barcode represented; all other spots were rejected. The high-quality spots and associated gene identities in the codebook were then saved out for downstream analysis.

2D Cell Segmentation
Two different methods were used to identify cell boundaries. First, the manually labeled in 2D were then assigned to that cell, to compute a per-cell gene expression matrix.

3D Image Registration
The displacement field of each imaging round was first acquired by registering the DAPI channel of each round to first-round globally by 3D FFT. Each sequencing image was applied with the corresponding transform of its round.

Spot Finding
After registration, individual spots were identified separately in each color channel on each round of sequencing. The extended local maxima in 3D were treated as an amplicon location.
After identifying each spot, the dominant color for that spot across all four channels was determined on each round in a 3*3*3 voxel volume surrounding the spot location.

Computation of Neighborhood Gene Composition (NGC)
To compute the neighborhood gene expression composition of each spot, we considered a spatially circular (2D) or spherical (3D) window over every spot (S) and counted the number of each gene-type among in the window. The raw count of each window was normalized to percentage for downstream analysis. The radius of the window R can be chosen either manually or by statistics to be close to the averaged size of organelles and cells for subcellular and singlecell analysis, respectively.
For a dataset with T kinds of sequenced gene, the definition of an NGC vector to the measured spot i is composed of the number of each gene-type windowed by radius R to the measured spot i.

Density peak clustering (DPC)
Based on the original DPC algorithm 18 , we first computed the two quantities: local density ρ and distance δ of every spot. We estimated the density by a Gaussian kernel with variance -. The varianceis supposed to be close to the averaged radius R of cells for cellular segmentation.
We can use R as -. The definition of local density ρ and distance δ for spot i is: Note that ( ) = 1 < 0, ( ) = 0, and .) is the distance between spot i and j. The optional parameter ,/0 is a striction on the maximum radius of the cell. For the point with the highest density, based on principles of DPC 18 , we took its distance value to the highest δ value.
Note that for large data sets, the analysis is insensitive to the choice ofand results are robust and consistent.
After computing these two quantities of every spot, we generated a multiplication decision graph by computing γ, the product of ρ and δ and plotting every spot's γ value in decreasing order.
Since the cell centers have both high local density and much higher distance at the same time, we chose the points with distinguishably higher γ values as cluster centers. We chose the 'elbow point' as the cutoff point in the multiplication decision graph where its γ value becomes no longer high and its change tends to be flat. T number of clusters N is equal to the number of points prior to the elbow point.
Next, we assigned each remaining point to one of the N clusters respectively in a descending order of ρ value in a single step manner. Each remaining spot was assigned to the same cluster as its nearest cluster-assigned neighbor. Each cluster was regarded as one cell. Finally, we filtered cells by limiting the minimum number of spots and genes expressed in one cell.

Integration of the physical and NGC coordinate
The physical coordinates denote the spatial location of spots and the NGC coordinates denote the gene location of spots in a high-dimensional NGC space. For spot i, its physical and NGC coordinate are: We used inversed Spearman correlation coefficient to measure the distance between two NGCs.
Integration of these two coordinates can be distance-level, clustering-level, and guidedinformation based.

Distance-level integration
We computationally integrated the NGC and physical coordinates into the joint P-NGC coordinate over each spot. Specifically, we combined the physical and NGC distances information between i and its neighboring spots and used the joint distance as the metric to measure relationships between spots. Mathematically, the parameter .) used in the calculation of ρ and δ in DPC is: Then we performed the DPC algorithm and found the cells. This integration method is universal to any datasets. We used distance-level integration for MERFISH mouse POA 3 , pciSeq mouse isocortex 4 , osmFISH mouse SSp 5 , STARmap cardiac organoid 8-gene and STARmap mouse V1 28-gene dataset 6 .

Clustering-level integration
Since data points can be clustered by DPC using physical coordinates and NGC coordinates respectively, we can then do integration on the clustering level. To take these two variables into consideration, joint clustering methods can be explored. To take the correlations between variables into account, we can also optimize a pre-specified objective function. Here we don't apply clustering-level integration to datasets presented in the manuscript.

Guided information-based integration
We first separated spots into clusters with physical coordinates and then corrected the clustering with guided information extracted from the NGC coordinates. To extract the guided information, we identified the neighbors of spot i that were at the distance of R-2R to spot i in the physical space. Then we computed these spots' NGC distances to spot i. If the maximum of the NGC distances from spot k was higher than a threshold, we evaluated if spot k and spot i belong to the same cluster. If so, as they were both distant from spot i in physical and NGC spaces, this indicated the cell which spot i belongs to may be under-clustered. We counted the overall probability of each cell being missed and re-clustered the potentially incorrect cells with more than 50% missing probability. This integration method is best performed with datasets with DAPI stains. We used guided-information based integration for STARmap mouse V1 1020gene 6 and STARmap mouse placenta 903-gene dataset.

Subcellular segmentation
To perform subcellular segmentation and construct nuclear boundaries we first computed the quantity NGC over each spot in an individual cell. The difference between NGC for subcellular segmentation and that for cellular segmentation is the radius of the window R. R should be either chosen manually or by statistics to be close to the averaged size of organelles. In addition, when the number of sequenced genes is limited, we can compute the NGC using a mesh graph by Delaunay triangulation of spots that models the relationship between RNA spots in the cell. A ring of spots that are neighbors of the central spot in the mesh graph is considered to locate most closely around the central cell. For a dataset with TR kinds of gene the definition of an NGC vector to the measured spot i is the composition of gene-types in its closest neighbors: Then, similar to distance-level integration, we generate a joint P-NGC coordinate from the normalized NGC and physical coordinates over each spot: Here the optional parameter λ can control the influence of physical coordinates, depending on conditions. We then used K-means clustering to cluster spots into two regions with one for nucleus and one for cytoplasm. Finally, we constructed a convex hull based on the nucleus spots, denoting the nuclear boundary.

Cell type classification
A two-level clustering strategy was applied to identify both major and sub-level cell types in the dataset. Processing steps in this section were implemented using Scanpy v1. 6 For a dataset with TC kinds of gene, the definition of an NCC vector of the measured cell i was the composition of cell-types in the defined window that had radius RC to the measured cell i.

K-means clustering
Tissue region signatures were identified using information from both NCC and physical locations of cells. Then we generated a joint P-NCC coordinate from normalized NCC and physical coordinates over each cell: Here the optional parameter λ can control the influence of physical coordinates based on conditions. We then used K-means clustering on these high dimensional P-NCC coordinates to cluster cells into a pre-defined number of regions. Finally, we projected spatially back onto the cell-type map.

Compare with expert-annotated labels
We evaluated the accuracy of cell identification by ClusterMap with corresponding eight expert annotated STARmap 6 datasets ( Supplementary Fig. 1c

Integration with scRNA-seq
The cell identification performance was validated by performing a leave-one-out benchmark.
Before integration 20,21 , the scRNA-seq and in situ sequencing data were preprocessed using the Seurat package.
1. Log-normalization: Divide the gene counts for each cell by the total counts for that cell and multiply by the scale. Factor = 10,000. Then perform natural-log transformation using log1p.
2. Scaling the data: Subtract the average expression for each gene and divide the centered gene expression profiles by their standard deviation.
For a shared gene list of scRNA-seq and in situ sequencing data with n genes, one non-repeating gene was left out in each round, and the rest n-1 genes were used for integration with scRNA-seq data and then the prediction of the left-out gene's expression profile. The integration and prediction steps were performed using FindTransferAnchors and TransferData functions in Seurat, which identified anchors between the reference (scRNA-seq) and query (in situ sequencing) dataset in reduced dimensions (reduction = 'cca') using mutual nearest neighbors and used these anchors to predict the left-out gene expression.
Next, the Pearson correlation of measured and predicted profile was calculated as the benchmark metrics. Finally, we compared the correlation between ClusterMap or manual annotation with scRNA-seq and gave quantitative analyses using violin plot, which showed the distribution of correlation for different annotation methods, and scatter plot, which represented the correlation values of these two methods for each gene.

Label transfer
Cell type labels from scRNA-Seq dataset were projected onto spatially resolved cells from STARmap dataset by using the Seurat v3 integration method according to Stuart et al. 2019 20 .
First, both datasets were preprocessed (normalization & scaling) and a subset of features (e.g., genes) exhibiting high variability was extracted. For STARmap dataset, all genes profiled were used whereas in scRNA-Seq dataset, the top 2,000 most variable genes identified by "FindVariableFeatures" function were used in downstream integration. Then "FindTransferAnchors" (reduction = "cca") and Transfer Data functions were used to map the labels onto spatially resolved cells from the STARmap dataset. After label transferring, 6,672 out of 7,224 cells were observed with high-confidence cell type predictions (prediction score > 0.5), 8 out of 10 cell types labels were resolved.

Data availability
The STARmap mouse V1 1020-gene and 160-gene sets 6 , MERFISH mouse POA set 3 , pciSeq mouse isocortex set 4 , osmFISH mouse SSp set 5 , and STARmap mouse V1 28-gene set 6 are available as supplementary information file of the orginal manuscript. The data that support the findings of this study are available from the corresponding author upon reasonable request.  The colors correspond to the cell type legend in (c). Cell types in (f, h) are color-coded as in (e).