Overview
CytoSPACER implements the CytoSPACE algorithm (Vahid et al., Nature Biotechnology, 2023), which formulates the problem of mapping single cells to spatial locations as a Linear Assignment Problem (LAP). This vignette provides a detailed explanation of the underlying mathematical principles.
The Spatial Mapping Problem
Problem Statement
Given:
- scRNA-seq data: A reference single-cell dataset with cells and known cell type annotations
- Spatial transcriptomics data: spatial spots with gene expression profiles
- Estimated cell counts: The number of cells expected at each spot
Goal: Find the optimal assignment of single cells to spatial spots that minimizes the total expression dissimilarity.
Algorithm Pipeline
CytoSPACER proceeds through five main steps:
Step 1: Cell Type Fraction Estimation
First, we estimate the cell type composition at each spatial spot using reference-based deconvolution:
# Estimate cell type fractions using Seurat's TransferData
fractions <- estimate_fractions(
sc_data = sc_expression,
cell_types = cell_labels,
st_data = st_expression
)This step uses anchor-based integration to transfer cell type labels from the scRNA-seq reference to the spatial data.
Step 2: Cell Count Estimation
The number of cells per spot is estimated based on total RNA content:
# Simulate ST data for demonstration
set.seed(42)
st_demo <- matrix(rpois(100 * 20, lambda = 100), nrow = 100, ncol = 20)
rownames(st_demo) <- paste0("Gene", 1:100)
colnames(st_demo) <- paste0("Spot", 1:20)
# Estimate cells per spot
cells_per_spot <- estimate_cells_per_spot(
st_data = st_demo,
mean_cells = 5,
min_cells = 1
)
# View distribution
summary(cells_per_spot)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1 3 5 5 7 8The estimation assumes a linear relationship between total normalized expression and cell count, anchored at the minimum and mean values.
Step 3: Reference Cell Sampling
Single cells are sampled from the scRNA-seq reference to match the estimated spatial composition:
# Create demo scRNA-seq data
sc_demo <- matrix(rpois(100 * 200, lambda = 10), nrow = 100, ncol = 200)
rownames(sc_demo) <- paste0("Gene", 1:100)
colnames(sc_demo) <- paste0("Cell", 1:200)
cell_types_demo <- rep(c("TypeA", "TypeB", "TypeC", "TypeD"), each = 50)
names(cell_types_demo) <- colnames(sc_demo)
# Define target counts
target_counts <- c(TypeA = 30, TypeB = 40, TypeC = 20, TypeD = 10)
# Sample cells
sampled <- sample_cells(
sc_data = sc_demo,
cell_types = cell_types_demo,
target_counts = target_counts,
method = "duplicates",
seed = 42
)
cat("Sampled", ncol(sampled$expression), "cells\n")
#> Sampled 100 cells
table(sampled$cell_types)
#>
#> TypeA TypeB TypeC TypeD
#> 30 40 20 10Two sampling methods are available:
- duplicates: Allows reusing cells when the reference is insufficient
- synthetic: Generates synthetic cells by sampling gene expression values
Step 4: Cost Matrix Construction
The cost matrix quantifies dissimilarity between each cell-spot pair:
# Normalize data
sc_norm <- normalize_expression(sc_demo[, 1:50])
st_norm <- normalize_expression(st_demo)
# Compute cost matrix using Pearson correlation
cost <- compute_cost_matrix(
sc_data = sc_norm,
st_data = st_norm,
method = "pearson"
)
cat("Cost matrix dimensions:", nrow(cost), "x", ncol(cost), "\n")
#> Cost matrix dimensions: 50 x 20
cat("Cost range:", round(min(cost), 3), "to", round(max(cost), 3), "\n")
#> Cost range: -0.337 to 0.281For correlation-based methods, the cost is computed as:
where is the Pearson or Spearman correlation coefficient.
Step 5: Linear Assignment Problem Solving
The core optimization uses the Jonker-Volgenant algorithm, an efficient method for solving the LAP:
# Create a simple cost matrix
cost_simple <- matrix(c(
1, 2, 3,
2, 4, 6,
3, 6, 9
), nrow = 3, byrow = TRUE)
# Solve LAP
assignment <- solve_lap(cost_simple)
cat("Optimal assignment:", assignment, "\n")
#> Optimal assignment: 3 1 2
# Compute total cost
total_cost <- compute_assignment_cost(cost_simple, assignment)
cat("Total cost:", total_cost, "\n")
#> Total cost: 11The Jonker-Volgenant Algorithm
The Jonker-Volgenant algorithm (1987) is a shortest augmenting path algorithm that efficiently solves the LAP. Key features:
- Column Reduction: Initialize dual variables by column minima
- Reduction Transfer: Update dual variables for assigned rows
- Augmenting Row Reduction: Find initial feasible assignments
- Augmentation: Use Dijkstra-like shortest path to find augmenting paths
Implementation Details
CytoSPACER provides a high-performance C++ implementation:
# The C++ implementation handles both square and rectangular matrices
# Square matrix
cost_square <- matrix(runif(100), nrow = 10, ncol = 10)
assignment_square <- solve_lap(cost_square)
# Rectangular matrix (rows <= cols)
cost_rect <- matrix(runif(50), nrow = 5, ncol = 10)
assignment_rect <- solve_lap(cost_rect)
cat("Square assignment:", assignment_square, "\n")
#> Square assignment: 1 6 8 7 4 10 2 9 3 5
cat("Rectangular assignment:", assignment_rect, "\n")
#> Rectangular assignment: 9 1 7 3 5Distance Metrics
CytoSPACER supports three distance metrics:
Pearson Correlation
cost_pearson <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "pearson")
cat("Pearson cost range:", round(range(cost_pearson), 3), "\n")
#> Pearson cost range: -0.183 0.229Spearman Correlation
Uses rank-transformed data:
cost_spearman <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "spearman")
cat("Spearman cost range:", round(range(cost_spearman), 3), "\n")
#> Spearman cost range: -0.232 0.194Euclidean Distance
cost_euclidean <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "euclidean")
cat("Euclidean cost range:", round(range(cost_euclidean), 3), "\n")
#> Euclidean cost range: 4.412 6.788Normalization
Expression data is normalized using TPM + log2 transformation:
# Raw counts
raw_counts <- matrix(rpois(100, lambda = 50), nrow = 10, ncol = 10)
# Normalize
normalized <- normalize_expression(raw_counts)
cat("Raw column sums:", head(colSums(raw_counts)), "\n")
#> Raw column sums: 490 520 501 498 473 492
cat("Normalized column sums:", round(head(colSums(normalized)), 2), "\n")
#> Normalized column sums: 165.97 165.99 165.91 166 165.9 165.97Handling Large Datasets
For large datasets, CytoSPACER uses chunked processing:
-
Partitioning: Divide cells into chunks based on
chunk_size - Parallel Solving: Solve subproblems independently
- Aggregation: Combine results from all chunks
# Parallel processing with multiple workers
results <- run_cytospace(
sc_data = sc_data,
cell_types = cell_types,
st_data = st_data,
coordinates = coordinates,
chunk_size = 5000,
n_workers = 4
)Tie-Breaking
To ensure deterministic results when multiple optimal solutions exist, small random noise is added to the cost matrix:
cost_with_noise <- add_noise_to_cost(cost_simple, scale = 1e-10, seed = 42)
cat("Original cost:\n")
#> Original cost:
print(cost_simple)
#> [,1] [,2] [,3]
#> [1,] 1 2 3
#> [2,] 2 4 6
#> [3,] 3 6 9
cat("\nWith noise (difference):\n")
#>
#> With noise (difference):
print(round(cost_with_noise - cost_simple, 12))
#> [,1] [,2] [,3]
#> [1,] 9.1e-11 8.3e-11 7.4e-11
#> [2,] 9.4e-11 6.4e-11 1.3e-11
#> [3,] 2.9e-11 5.2e-11 6.6e-11References
Vahid MR, et al. (2023). High-resolution alignment of single-cell and spatial transcriptomes with CytoSPACE. Nature Biotechnology 41, 1543–1548.
Jonker R, Volgenant A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38(4):325-340.
Kuhn HW. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2):83-97.
Session Info
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] CytoSPACER_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.7-4 gtable_0.3.6 future.apply_1.20.1
#> [4] jsonlite_2.0.0 dplyr_1.1.4 compiler_4.4.0
#> [7] Rcpp_1.1.1 tidyselect_1.2.1 parallel_4.4.0
#> [10] dichromat_2.0-0.1 jquerylib_0.1.4 globals_0.18.0
#> [13] systemfonts_1.3.1 scales_1.4.0 textshaping_1.0.4
#> [16] yaml_2.3.12 fastmap_1.2.0 lattice_0.22-7
#> [19] ggplot2_4.0.1 R6_2.6.1 generics_0.1.4
#> [22] knitr_1.51 htmlwidgets_1.6.4 tibble_3.3.1
#> [25] future_1.69.0 desc_1.4.3 pillar_1.11.1
#> [28] bslib_0.9.0 RColorBrewer_1.1-3 rlang_1.1.7
#> [31] cachem_1.1.0 xfun_0.56 S7_0.2.1
#> [34] fs_1.6.6 sass_0.4.10 otel_0.2.0
#> [37] cli_3.6.5 progressr_0.18.0 magrittr_2.0.4
#> [40] pkgdown_2.1.3 digest_0.6.39 grid_4.4.0
#> [43] lifecycle_1.0.5 vctrs_0.7.1 evaluate_1.0.5
#> [46] glue_1.8.0 data.table_1.18.0 farver_2.1.2
#> [49] listenv_0.10.0 codetools_0.2-20 ragg_1.5.0
#> [52] parallelly_1.46.1 rmarkdown_2.30 pkgconfig_2.0.3
#> [55] tools_4.4.0 htmltools_0.5.9