Algorithm Principles

Overview

CytoSPACER implements the CytoSPACE algorithm (Vahid et al., Nature Biotechnology, 2023), which formulates the problem of mapping single cells to spatial locations as a Linear Assignment Problem (LAP). This vignette provides a detailed explanation of the underlying mathematical principles.

The Spatial Mapping Problem

Problem Statement

Given:

scRNA-seq data: A reference single-cell dataset with $n$ cells and known cell type annotations
Spatial transcriptomics data: $m$ spatial spots with gene expression profiles
Estimated cell counts: The number of cells expected at each spot

Goal: Find the optimal assignment of single cells to spatial spots that minimizes the total expression dissimilarity.

Mathematical Formulation

Let:

$C \in \mathbb{R}^{n \times m}$ be the cost matrix where $C_{ij}$ represents the dissimilarity between cell $i$ and spot $j$
$X \in \{0, 1\}^{n \times m}$ be the assignment matrix where $X_{ij} = 1$ if cell $i$ is assigned to spot $j$

The optimization problem is:

$\min_{X} \sum_{i=1}^{n} \sum_{j=1}^{m} C_{ij} X_{ij}$

Subject to:

$\sum_{j=1}^{m} X_{ij} = 1 \quad \forall i \quad \text{(each cell assigned to exactly one spot)}$

$\sum_{i=1}^{n} X_{ij} = k_j \quad \forall j \quad \text{(each spot receives } k_j \text{ cells)}$

Algorithm Pipeline

CytoSPACER proceeds through five main steps:

library(CytoSPACER)

Step 1: Cell Type Fraction Estimation

First, we estimate the cell type composition at each spatial spot using reference-based deconvolution:

# Estimate cell type fractions using Seurat's TransferData
fractions <- estimate_fractions(
  sc_data = sc_expression,
  cell_types = cell_labels,
  st_data = st_expression
)

This step uses anchor-based integration to transfer cell type labels from the scRNA-seq reference to the spatial data.

Step 2: Cell Count Estimation

The number of cells per spot is estimated based on total RNA content:

# Simulate ST data for demonstration
set.seed(42)
st_demo <- matrix(rpois(100 * 20, lambda = 100), nrow = 100, ncol = 20)
rownames(st_demo) <- paste0("Gene", 1:100)
colnames(st_demo) <- paste0("Spot", 1:20)

# Estimate cells per spot
cells_per_spot <- estimate_cells_per_spot(
  st_data = st_demo,
  mean_cells = 5,
  min_cells = 1
)

# View distribution
summary(cells_per_spot)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       1       3       5       5       7       8

The estimation assumes a linear relationship between total normalized expression and cell count, anchored at the minimum and mean values.

Step 3: Reference Cell Sampling

Single cells are sampled from the scRNA-seq reference to match the estimated spatial composition:

# Create demo scRNA-seq data
sc_demo <- matrix(rpois(100 * 200, lambda = 10), nrow = 100, ncol = 200)
rownames(sc_demo) <- paste0("Gene", 1:100)
colnames(sc_demo) <- paste0("Cell", 1:200)

cell_types_demo <- rep(c("TypeA", "TypeB", "TypeC", "TypeD"), each = 50)
names(cell_types_demo) <- colnames(sc_demo)

# Define target counts
target_counts <- c(TypeA = 30, TypeB = 40, TypeC = 20, TypeD = 10)

# Sample cells
sampled <- sample_cells(
  sc_data = sc_demo,
  cell_types = cell_types_demo,
  target_counts = target_counts,
  method = "duplicates",
  seed = 42
)

cat("Sampled", ncol(sampled$expression), "cells\n")
#> Sampled 100 cells
table(sampled$cell_types)
#> 
#> TypeA TypeB TypeC TypeD 
#>    30    40    20    10

Two sampling methods are available:

duplicates: Allows reusing cells when the reference is insufficient
synthetic: Generates synthetic cells by sampling gene expression values

Step 4: Cost Matrix Construction

The cost matrix quantifies dissimilarity between each cell-spot pair:

# Normalize data
sc_norm <- normalize_expression(sc_demo[, 1:50])
st_norm <- normalize_expression(st_demo)

# Compute cost matrix using Pearson correlation
cost <- compute_cost_matrix(
  sc_data = sc_norm,
  st_data = st_norm,
  method = "pearson"
)

cat("Cost matrix dimensions:", nrow(cost), "x", ncol(cost), "\n")
#> Cost matrix dimensions: 50 x 20
cat("Cost range:", round(min(cost), 3), "to", round(max(cost), 3), "\n")
#> Cost range: -0.337 to 0.281

For correlation-based methods, the cost is computed as:

$C_{ij} = 1 - \rho(x_i, y_j)$

where $\rho$ is the Pearson or Spearman correlation coefficient.

Step 5: Linear Assignment Problem Solving

The core optimization uses the Jonker-Volgenant algorithm, an efficient $O(n^3)$ method for solving the LAP:

# Create a simple cost matrix
cost_simple <- matrix(c(
  1, 2, 3,
  2, 4, 6,
  3, 6, 9
), nrow = 3, byrow = TRUE)

# Solve LAP
assignment <- solve_lap(cost_simple)
cat("Optimal assignment:", assignment, "\n")
#> Optimal assignment: 3 1 2

# Compute total cost
total_cost <- compute_assignment_cost(cost_simple, assignment)
cat("Total cost:", total_cost, "\n")
#> Total cost: 11

The Jonker-Volgenant Algorithm

The Jonker-Volgenant algorithm (1987) is a shortest augmenting path algorithm that efficiently solves the LAP. Key features:

Column Reduction: Initialize dual variables by column minima
Reduction Transfer: Update dual variables for assigned rows
Augmenting Row Reduction: Find initial feasible assignments
Augmentation: Use Dijkstra-like shortest path to find augmenting paths

Implementation Details

CytoSPACER provides a high-performance C++ implementation:

# The C++ implementation handles both square and rectangular matrices
# Square matrix
cost_square <- matrix(runif(100), nrow = 10, ncol = 10)
assignment_square <- solve_lap(cost_square)

# Rectangular matrix (rows <= cols)
cost_rect <- matrix(runif(50), nrow = 5, ncol = 10)
assignment_rect <- solve_lap(cost_rect)

cat("Square assignment:", assignment_square, "\n")
#> Square assignment: 1 6 8 7 4 10 2 9 3 5
cat("Rectangular assignment:", assignment_rect, "\n")
#> Rectangular assignment: 9 1 7 3 5

Distance Metrics

CytoSPACER supports three distance metrics:

Pearson Correlation

$\rho_{Pearson}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}$

cost_pearson <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "pearson")
cat("Pearson cost range:", round(range(cost_pearson), 3), "\n")
#> Pearson cost range: -0.183 0.229

Spearman Correlation

Uses rank-transformed data:

$\rho_{Spearman}(x, y) = \rho_{Pearson}(rank(x), rank(y))$

cost_spearman <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "spearman")
cat("Spearman cost range:", round(range(cost_spearman), 3), "\n")
#> Spearman cost range: -0.232 0.194

Euclidean Distance

$d_{Euclidean}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

cost_euclidean <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "euclidean")
cat("Euclidean cost range:", round(range(cost_euclidean), 3), "\n")
#> Euclidean cost range: 4.412 6.788

Normalization

Expression data is normalized using TPM + log2 transformation:

$x'_{ij} = \log_2\left(\frac{x_{ij}}{\sum_i x_{ij}} \times 10^6 + 1\right)$

# Raw counts
raw_counts <- matrix(rpois(100, lambda = 50), nrow = 10, ncol = 10)

# Normalize
normalized <- normalize_expression(raw_counts)

cat("Raw column sums:", head(colSums(raw_counts)), "\n")
#> Raw column sums: 490 520 501 498 473 492
cat("Normalized column sums:", round(head(colSums(normalized)), 2), "\n")
#> Normalized column sums: 165.97 165.99 165.91 166 165.9 165.97

Handling Large Datasets

For large datasets, CytoSPACER uses chunked processing:

Partitioning: Divide cells into chunks based on chunk_size
Parallel Solving: Solve subproblems independently
Aggregation: Combine results from all chunks

# Parallel processing with multiple workers
results <- run_cytospace(
  sc_data = sc_data,
  cell_types = cell_types,
  st_data = st_data,
  coordinates = coordinates,
  chunk_size = 5000,
  n_workers = 4
)

Tie-Breaking

To ensure deterministic results when multiple optimal solutions exist, small random noise is added to the cost matrix:

cost_with_noise <- add_noise_to_cost(cost_simple, scale = 1e-10, seed = 42)
cat("Original cost:\n")
#> Original cost:
print(cost_simple)
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    2    4    6
#> [3,]    3    6    9
cat("\nWith noise (difference):\n")
#> 
#> With noise (difference):
print(round(cost_with_noise - cost_simple, 12))
#>         [,1]    [,2]    [,3]
#> [1,] 9.1e-11 8.3e-11 7.4e-11
#> [2,] 9.4e-11 6.4e-11 1.3e-11
#> [3,] 2.9e-11 5.2e-11 6.6e-11

References

Vahid MR, et al. (2023). High-resolution alignment of single-cell and spatial transcriptomes with CytoSPACE. Nature Biotechnology 41, 1543–1548.
Jonker R, Volgenant A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38(4):325-340.
Kuhn HW. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2):83-97.

Session Info

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] CytoSPACER_1.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.7-4        gtable_0.3.6        future.apply_1.20.1
#>  [4] jsonlite_2.0.0      dplyr_1.1.4         compiler_4.4.0     
#>  [7] Rcpp_1.1.1          tidyselect_1.2.1    parallel_4.4.0     
#> [10] dichromat_2.0-0.1   jquerylib_0.1.4     globals_0.18.0     
#> [13] systemfonts_1.3.1   scales_1.4.0        textshaping_1.0.4  
#> [16] yaml_2.3.12         fastmap_1.2.0       lattice_0.22-7     
#> [19] ggplot2_4.0.1       R6_2.6.1            generics_0.1.4     
#> [22] knitr_1.51          htmlwidgets_1.6.4   tibble_3.3.1       
#> [25] future_1.69.0       desc_1.4.3          pillar_1.11.1      
#> [28] bslib_0.9.0         RColorBrewer_1.1-3  rlang_1.1.7        
#> [31] cachem_1.1.0        xfun_0.56           S7_0.2.1           
#> [34] fs_1.6.6            sass_0.4.10         otel_0.2.0         
#> [37] cli_3.6.5           progressr_0.18.0    magrittr_2.0.4     
#> [40] pkgdown_2.1.3       digest_0.6.39       grid_4.4.0         
#> [43] lifecycle_1.0.5     vctrs_0.7.1         evaluate_1.0.5     
#> [46] glue_1.8.0          data.table_1.18.0   farver_2.1.2       
#> [49] listenv_0.10.0      codetools_0.2-20    ragg_1.5.0         
#> [52] parallelly_1.46.1   rmarkdown_2.30      pkgconfig_2.0.3    
#> [55] tools_4.4.0         htmltools_0.5.9

Zaoqu Liu

2026-01-25