Algorithm and Mathematical Framework
Zaoqu Liu
2026-01-24
Source: vignettes/algorithm.Rmd
Overview
scGate implements a marker-based cell type purification algorithm that combines:
- UCell scoring for robust signature quantification
- k-Nearest Neighbor (kNN) smoothing for noise reduction
- Hierarchical decision trees for multi-level gating
This document provides a detailed explanation of the mathematical framework behind scGate.
Algorithm Pipeline
Step 1: Signature Scoring with UCell
scGate uses the UCell algorithm for computing signature scores. UCell is a rank-based method that is robust to technical variation and batch effects.
Mathematical Formulation
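For a cell $c$ and a gene signature $S$ containing $n$ genes:
- Rank genes by expression in cell $c$: $r_{g,c}$ is the rank of gene $g$ (rank 1 for the most highly expressed gene)
- Compute UCell score:
$$U_c(S) = 1 - \frac{\sum_{g \in S} r_{g,c} - \frac{n(n+1)}{2}}{n \cdot r_{\max}}$$
Where $r_{\max}$ is the maximum rank considered (default: 1500); ranks above $r_{\max}$ are set to $r_{\max} + 1$.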
Key Properties
- Range: UCell scores range from 0 to 1
- Robustness: Rank-based scoring is robust to outliers
- Interpretability: Higher scores indicate stronger signature expression
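The rank-based score can be illustrated on a toy cell in base R. This is a minimal sketch, not the UCell implementation: `ucell_score` is a hypothetical helper, and it assumes the score is one minus the rank sum of the signature genes (shifted by n(n+1)/2 and divided by n times the maximum rank), with ranks beyond the maximum rank capped at maximum rank + 1.

```r
# Hypothetical helper (not part of scGate or UCell): UCell-style score for one cell.
ucell_score <- function(expr, signature, max_rank = 1500) {
  # Rank genes so that the most highly expressed gene gets rank 1
  ranks <- rank(-expr, ties.method = "average")
  r <- ranks[signature]
  # Assumption: ranks beyond max_rank are capped at max_rank + 1
  r[r > max_rank] <- max_rank + 1
  n <- length(signature)
  # Rank sum shifted so that a perfectly top-ranked signature gives u = 0
  u <- sum(r) - n * (n + 1) / 2
  1 - u / (n * max_rank)
}

# Toy cell with 10 genes; g1 is the most highly expressed, g10 the least
expr <- setNames(c(9, 8, 7, 6, 5, 4, 3, 2, 1, 0), paste0("g", 1:10))

top    <- ucell_score(expr, c("g1", "g2", "g3"), max_rank = 5)   # signature at the top
bottom <- ucell_score(expr, c("g8", "g9", "g10"), max_rank = 5)  # signature beyond max_rank
```

A signature whose genes occupy the top ranks scores 1. With the small `max_rank` used here, a signature entirely beyond the cutoff scores 0.2 rather than exactly 0; with the default of 1500 and a short signature, that floor is close to 0.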
# Load scGate and the plotting packages used below
library(scGate)
library(Seurat)
library(ggplot2)
library(patchwork)
# Load example data
data(query.seurat)
# Create a simple model
model <- gating_model(name = "Tcell", signature = c("CD3D", "CD3E", "CD2"))
# Apply scGate (this computes UCell scores internally)
query.seurat <- scGate(query.seurat, model = model, reduction = "pca")
# Visualize UCell scores
p1 <- FeaturePlot(query.seurat, features = "Tcell_UCell",
                  cols = c("gray90", "darkblue")) +
  ggtitle("T cell Signature Score (UCell)")
# Visualize the resulting Pure/Impure classification
p2 <- DimPlot(query.seurat, group.by = "is.pure",
              cols = c("Pure" = "#00ae60", "Impure" = "gray80")) +
  ggtitle("scGate Classification")
p1 + p2
Step 2: kNN Smoothing
Single-cell data is inherently sparse. scGate applies kNN smoothing to reduce noise and improve classification accuracy.
Mathematical Formulation
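For each cell $c$, let $N_k(c)$ be its k-nearest neighbors in the reduced dimensional space. The smoothed score is:
$$\tilde{U}_c = \frac{\sum_{j \in N_k(c)} w_j \, U_j}{\sum_{j \in N_k(c)} w_j}$$
Where the weights are computed using exponential decay:
$$w_j = (1 - \lambda)^{r_j}$$
- $\lambda$ is the decay parameter (default: 0.1)
- $r_j$ is the rank of neighbor $j$ (1 for nearest, k for furthest)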
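A rank-weighted average of this kind can be sketched in a few lines of base R. The helper `smooth_score` is hypothetical and only illustrates the idea; it assumes weights of the form (1 - decay)^rank and takes the neighbor scores already ordered from nearest to furthest.

```r
# Hypothetical helper (not the scGate or UCell API): rank-weighted smoothing for one cell.
smooth_score <- function(neighbor_scores, decay = 0.1) {
  # neighbor_scores must be ordered from nearest (rank 1) to furthest (rank k)
  ranks <- seq_along(neighbor_scores)
  # Assumed weight form: geometric decay with neighbor rank
  w <- (1 - decay)^ranks
  sum(w * neighbor_scores) / sum(w)
}

# Three neighbors: the two nearest score high, the furthest scores low
scores <- c(0.8, 0.6, 0.1)

smoothed  <- smooth_score(scores, decay = 0.1)  # nearer neighbors count more
unweighted <- smooth_score(scores, decay = 0)   # decay = 0 reduces to the plain mean
```

With `decay = 0` every neighbor has weight 1 and the result is the ordinary mean (0.5 here); any positive decay pulls the smoothed value towards the scores of the nearest neighbors.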
Effect of Smoothing
# Inspect the score distribution by classification
# The smoothed scores are already computed by scGate
score_data <- data.frame(
  score = query.seurat$Tcell_UCell,
  classification = query.seurat$is.pure
)
# Histogram of scores, colored by classification
p1 <- ggplot(score_data, aes(x = score, fill = classification)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
  scale_fill_manual(values = c("Pure" = "#00ae60", "Impure" = "gray60")) +
  labs(title = "Distribution of T cell Scores",
       x = "UCell Score (kNN-smoothed)",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "bottom")
# Violin and box plots of scores per class
p2 <- ggplot(score_data, aes(x = classification, y = score, fill = classification)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.8) +
  scale_fill_manual(values = c("Pure" = "#00ae60", "Impure" = "gray60")) +
  labs(title = "Score Distribution by Class",
       x = "Classification",
       y = "UCell Score") +
  theme_minimal() +
  theme(legend.position = "none")
p1 + p2
Step 3: Hierarchical Decision Trees
scGate models can have multiple levels, forming a hierarchical decision tree similar to flow cytometry gating strategies.
Gating Logic
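At each level $l$, cells are classified based on:
- Positive signatures $S_l^{+}$: Must exceed threshold $\theta^{+}$
- Negative signatures $S_l^{-}$: Must be below threshold $\theta^{-}$
A cell $c$ passes level $l$ if:
$$\max_{s \in S_l^{+}} \tilde{U}_c(s) > \theta^{+} \quad \text{and} \quad \max_{s \in S_l^{-}} \tilde{U}_c(s) < \theta^{-}$$
where $\tilde{U}_c(s)$ is the smoothed score of signature $s$ in cell $c$. In words, at least one positive signature exceeds $\theta^{+}$ and every negative signature stays below $\theta^{-}$.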
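The gating rule for a single level can be sketched as a logical test on score matrices. The helper `passes_level` is hypothetical, not the scGate API. It assumes a cell passes when its highest positive-signature score exceeds the positive threshold and its highest negative-signature score stays below the negative threshold; the thresholds of 0.2 are illustrative.

```r
# Hypothetical helper (not the scGate API): one-level gate on score matrices.
# pos_scores and neg_scores are cells x signatures matrices of smoothed scores.
passes_level <- function(pos_scores, neg_scores, pos_thr = 0.2, neg_thr = 0.2) {
  # Assumption: one positive signature above threshold is enough
  pos_ok <- apply(pos_scores, 1, max) > pos_thr
  # Assumption: every negative signature must stay below threshold
  neg_ok <- if (ncol(neg_scores) > 0) {
    apply(neg_scores, 1, max) < neg_thr
  } else {
    rep(TRUE, nrow(pos_scores))
  }
  pos_ok & neg_ok
}

cells <- c("c1", "c2", "c3")
pos <- matrix(c(0.5, 0.1, 0.6), ncol = 1, dimnames = list(cells, "Tcell"))
neg <- matrix(c(0.0, 0.0, 0.4), ncol = 1, dimnames = list(cells, "Bcell"))

# c1 passes; c2 fails the positive gate; c3 fails the negative gate
pure <- passes_level(pos, neg)
```

In a multi-level model, only the cells that pass one level would be carried forward to the next, which mirrors sequential gates in flow cytometry.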
Parameter Decay
For multi-level models, parameters decay at each level to handle progressively smaller cell populations:
$$k_l = k_0 \cdot (1 - \delta)^{l-1}, \qquad n_l = n_0 \cdot (1 - \delta)^{l-1}$$
Where:
- $k_0$ is the initial number of neighbors
- $n_0$ is the initial number of features
- $\delta$ is the decay parameter (default: 0.25)
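A short calculation shows how quickly the parameters shrink across levels. This sketch assumes a geometric decay of the form initial value times (1 - decay)^(level - 1); `decay_param` is a hypothetical helper, and the starting values of 30 neighbors and 2000 features are illustrative choices rather than values taken from this document.

```r
# Hypothetical helper: parameter value at a given level under geometric decay.
decay_param <- function(initial, level, param_decay = 0.25) {
  initial * (1 - param_decay)^(level - 1)
}

levels <- 1:3
# Illustrative starting values (assumptions): 30 neighbors, 2000 features
k_by_level     <- decay_param(30, levels)
nfeat_by_level <- decay_param(2000, levels)

# One row per level; in practice these values would be rounded to integers
data.frame(level = levels, k = k_by_level, nfeatures = nfeat_by_level)
```

With a decay of 0.25, each level uses three quarters of the previous level's value, so the third level works with roughly 56% of the initial neighborhood size and feature count.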
# Create a hierarchical model
hierarchical_model <- gating_model(
  level = 1,
  name = "Immune",
  signature = c("PTPRC")  # CD45
)
hierarchical_model <- gating_model(
  model = hierarchical_model,
  level = 2,
  name = "Tcell",
  signature = c("CD3D", "CD3E")
)
# View model structure
print(hierarchical_model)
#>   levels   use_as   name signature
#> 1 level1 positive Immune     PTPRC
#> 2 level2 positive  Tcell CD3D;CD3E
Performance Metrics
scGate provides functions to evaluate classification performance.
Matthews Correlation Coefficient (MCC)
MCC is a balanced measure that works well even with imbalanced classes:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
Where:
- TP = True Positives
- TN = True Negatives
- FP = False Positives
- FN = False Negatives
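The coefficient can be computed directly from the four confusion-matrix counts. The function below is a minimal sketch using the standard MCC definition, not scGate's internal code; the counts are a hypothetical example, and returning 0 when the denominator is zero is a common convention assumed here.

```r
# Standard MCC from confusion-matrix counts (a sketch, not scGate's implementation).
mcc <- function(tp, tn, fp, fn) {
  # Convert to double so products of large integer counts cannot overflow
  tp <- as.numeric(tp); tn <- as.numeric(tn)
  fp <- as.numeric(fp); fn <- as.numeric(fn)
  num <- tp * tn - fp * fn
  den <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  # Assumed convention: MCC is 0 when a row or column of the confusion matrix is empty
  if (den == 0) return(0)
  num / den
}

# Hypothetical counts: 100 cells, 34 true positives in the ground truth
tp <- 29; tn <- 63; fp <- 3; fn <- 5

precision <- tp / (tp + fp)   # 29 / 32
recall    <- tp / (tp + fn)   # 29 / 34
mcc_value <- mcc(tp, tn, fp, fn)
```

Because the numerator uses all four counts, MCC only approaches 1 when both the positive and the negative class are predicted well, which is why it is less flattering than precision or recall alone on imbalanced data.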
Numerical Stability
scGate implements numerically stable calculations:
# Simulate classification results
set.seed(42)
actual <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.7, 0.3))
predicted <- actual
# Add some noise
predicted[sample(which(actual == 1), 5)] <- 0
predicted[sample(which(actual == 0), 3)] <- 1
# Calculate performance metrics
metrics <- performance.metrics(actual, predicted)
print(metrics)
#>      PREC       REC       MCC
#> 0.9062500 0.8529412 0.8200066
Computational Considerations
Parallel Processing
For multi-model classification, scGate supports parallel processing:
library(BiocParallel)
# seurat_obj is a Seurat object and model_list a list of scGate models (placeholders)
# Use multiple cores
result <- scGate(
  data = seurat_obj,
  model = model_list,
  ncores = 4  # Use 4 cores
)
# Or specify custom BPPARAM
result <- scGate(
  data = seurat_obj,
  model = model_list,
  BPPARAM = MulticoreParam(workers = 4)
)
Summary
The scGate algorithm combines three key components:
| Component | Purpose | Key Parameter |
|---|---|---|
| UCell scoring | Robust signature quantification | maxRank |
| kNN smoothing | Noise reduction | k.param, smooth.decay |
| Hierarchical gating | Multi-level classification | param_decay |
This combination enables accurate, interpretable cell type purification without requiring reference datasets.
References
Andreatta M, Berenstein AJ, Carmona SJ. scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets. Bioinformatics. 2022;38(9):2642-2644.
Andreatta M, Carmona SJ. UCell: Robust and scalable single-cell gene signature scoring. Computational and Structural Biotechnology Journal. 2021;19:3796-3798.
Session Info
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] patchwork_1.3.2 ggplot2_4.0.1 SeuratObject_4.1.4 Seurat_4.4.0
#> [5] scGate_1.7.2
#>
#> loaded via a namespace (and not attached):
#> [1] deldir_2.0-4 pbapply_1.7-4 gridExtra_2.3
#> [4] rlang_1.1.7 magrittr_2.0.4 RcppAnnoy_0.0.23
#> [7] otel_0.2.0 spatstat.geom_3.7-0 matrixStats_1.5.0
#> [10] ggridges_0.5.7 compiler_4.4.0 png_0.1-8
#> [13] systemfonts_1.3.1 vctrs_0.7.1 reshape2_1.4.5
#> [16] stringr_1.6.0 pkgconfig_2.0.3 fastmap_1.2.0
#> [19] promises_1.5.0 rmarkdown_2.30 ragg_1.5.0
#> [22] purrr_1.2.1 xfun_0.56 cachem_1.1.0
#> [25] jsonlite_2.0.0 goftest_1.2-3 later_1.4.5
#> [28] BiocParallel_1.40.2 spatstat.utils_3.2-1 irlba_2.3.5.1
#> [31] parallel_4.4.0 cluster_2.1.8.1 R6_2.6.1
#> [34] ica_1.0-3 spatstat.data_3.1-9 stringi_1.8.7
#> [37] bslib_0.9.0 RColorBrewer_1.1-3 reticulate_1.44.1
#> [40] spatstat.univar_3.1-6 parallelly_1.46.1 lmtest_0.9-40
#> [43] jquerylib_0.1.4 scattermore_1.2 Rcpp_1.1.1
#> [46] knitr_1.51 tensor_1.5.1 future.apply_1.20.1
#> [49] zoo_1.8-15 sctransform_0.4.3 httpuv_1.6.16
#> [52] Matrix_1.7-4 splines_4.4.0 igraph_2.2.1
#> [55] tidyselect_1.2.1 abind_1.4-8 dichromat_2.0-0.1
#> [58] yaml_2.3.12 spatstat.random_3.4-4 spatstat.explore_3.7-0
#> [61] codetools_0.2-20 miniUI_0.1.2 listenv_0.10.0
#> [64] plyr_1.8.9 lattice_0.22-7 tibble_3.3.1
#> [67] withr_3.0.2 shiny_1.12.1 S7_0.2.1
#> [70] ROCR_1.0-12 evaluate_1.0.5 Rtsne_0.17
#> [73] future_1.69.0 desc_1.4.3 survival_3.8-3
#> [76] polyclip_1.10-7 fitdistrplus_1.2-5 pillar_1.11.1
#> [79] KernSmooth_2.23-26 plotly_4.11.0 generics_0.1.4
#> [82] sp_2.2-0 scales_1.4.0 globals_0.18.0
#> [85] xtable_1.8-4 glue_1.8.0 lazyeval_0.2.2
#> [88] tools_4.4.0 BiocNeighbors_2.0.1 data.table_1.18.0
#> [91] RANN_2.6.2 dotCall64_1.2 fs_1.6.6
#> [94] leiden_0.4.3.1 cowplot_1.2.0 grid_4.4.0
#> [97] tidyr_1.3.2 colorspace_2.1-2 nlme_3.1-168
#> [100] cli_3.6.5 spatstat.sparse_3.1-0 textshaping_1.0.4
#> [103] spam_2.11-3 viridisLite_0.4.2 dplyr_1.1.4
#> [106] uwot_0.2.4 gtable_0.3.6 sass_0.4.10
#> [109] digest_0.6.39 progressr_0.18.0 ggrepel_0.9.6
#> [112] htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.9
#> [115] pkgdown_2.1.3 lifecycle_1.0.5 httr_1.4.7
#> [118] mime_0.13 MASS_7.3-65