Skip to contents

Overview

scGate implements a marker-based cell type purification algorithm that combines:

  1. UCell scoring for robust signature quantification
  2. k-Nearest Neighbor (kNN) smoothing for noise reduction
  3. Hierarchical decision trees for multi-level gating

This document provides a detailed explanation of the mathematical framework behind scGate.

Algorithm Pipeline

Step 1: Signature Scoring with UCell

scGate uses the UCell algorithm for computing signature scores. UCell is a rank-based method that is robust to technical variation and batch effects.

Mathematical Formulation

For a cell cc and a gene signature S={g1,g2,...,gn}S = \{g_1, g_2, ..., g_n\}:

  1. Rank genes by expression in cell cc: Rc(g)R_c(g) is the rank of gene gg
  2. Compute UCell score:

Uc(S)=1gSmin(Rc(g),Rmax)|S|RmaxU_c(S) = 1 - \frac{\sum_{g \in S} \min(R_c(g), R_{max})}{|S| \cdot R_{max}}

Where RmaxR_{max} is the maximum rank considered (default: 1500).

Key Properties

  • Range: UCell scores range from 0 to 1
  • Robustness: Rank-based scoring is robust to outliers
  • Interpretability: Higher scores indicate stronger signature expression
# Load example data
data(query.seurat)

# Create a simple model
model <- gating_model(name = "Tcell", signature = c("CD3D", "CD3E", "CD2"))

# Apply scGate (this computes UCell scores internally)
query.seurat <- scGate(query.seurat, model = model, reduction = "pca")

# Visualize UCell scores
p1 <- FeaturePlot(query.seurat, features = "Tcell_UCell", 
                  cols = c("gray90", "darkblue")) +
  ggtitle("T cell Signature Score (UCell)")

p2 <- DimPlot(query.seurat, group.by = "is.pure",
              cols = c("Pure" = "#00ae60", "Impure" = "gray80")) +
  ggtitle("scGate Classification")

p1 + p2

Step 2: kNN Smoothing

Single-cell data is inherently sparse. scGate applies kNN smoothing to reduce noise and improve classification accuracy.

Mathematical Formulation

For each cell cc, let Nk(c)N_k(c) be its k-nearest neighbors in the reduced dimensional space. The smoothed score is:

Ũc(S)=cNk(c)wccUc(S)\tilde{U}_c(S) = \sum_{c' \in N_k(c)} w_{cc'} \cdot U_{c'}(S)

Where the weights wccw_{cc'} are computed using exponential decay:

wcc=(1α)d(c,c)w_{cc'} = (1 - \alpha)^{d(c, c')}

  • α\alpha is the decay parameter (default: 0.1)
  • d(c,c)d(c, c') is the rank of neighbor cc' (1 for nearest, k for furthest)

Effect of Smoothing

# Compare raw vs smoothed scores
# The smoothed scores are already computed by scGate

# Create histogram of scores
score_data <- data.frame(
  score = query.seurat$Tcell_UCell,
  classification = query.seurat$is.pure
)

p1 <- ggplot(score_data, aes(x = score, fill = classification)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
  scale_fill_manual(values = c("Pure" = "#00ae60", "Impure" = "gray60")) +
  labs(title = "Distribution of T cell Scores",
       x = "UCell Score (kNN-smoothed)",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "bottom")

p2 <- ggplot(score_data, aes(x = classification, y = score, fill = classification)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.8) +
  scale_fill_manual(values = c("Pure" = "#00ae60", "Impure" = "gray60")) +
  labs(title = "Score Distribution by Class",
       x = "Classification",
       y = "UCell Score") +
  theme_minimal() +
  theme(legend.position = "none")

p1 + p2

Step 3: Hierarchical Decision Trees

scGate models can have multiple levels, forming a hierarchical decision tree similar to flow cytometry gating strategies.

Gating Logic

At each level ll, cells are classified based on:

  1. Positive signatures S+S^+: Must exceed threshold θ+\theta^+
  2. Negative signatures SS^-: Must be below threshold θ\theta^-

A cell passes level ll if:

maxsS+Ũc(s)>θ+ANDmaxsSŨc(s)<θ\max_{s \in S^+} \tilde{U}_c(s) > \theta^+ \quad \text{AND} \quad \max_{s \in S^-} \tilde{U}_c(s) < \theta^-

Parameter Decay

For multi-level models, parameters decay at each level to handle progressively smaller cell populations:

kl=k0(1δ)l1k_l = k_0 \cdot (1 - \delta)^{l-1}nl=n0(1δ)l1n_l = n_0 \cdot (1 - \delta)^{l-1}

Where: - k0k_0 is the initial number of neighbors - n0n_0 is the initial number of features - δ\delta is the decay parameter (default: 0.25)

# Create a hierarchical model
hierarchical_model <- gating_model(
  level = 1, 
  name = "Immune", 
  signature = c("PTPRC")  # CD45
)
hierarchical_model <- gating_model(
  model = hierarchical_model,
  level = 2,
  name = "Tcell",
  signature = c("CD3D", "CD3E")
)

# View model structure
print(hierarchical_model)
#>   levels   use_as   name signature
#> 1 level1 positive Immune     PTPRC
#> 2 level2 positive  Tcell CD3D;CD3E

Performance Metrics

scGate provides functions to evaluate classification performance.

Matthews Correlation Coefficient (MCC)

MCC is a balanced measure that works well even with imbalanced classes:

MCC=TPTNFPFN(TP+FP)(TP+FN)(TN+FP)(TN+FN)MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

Where: - TP = True Positives - TN = True Negatives
- FP = False Positives - FN = False Negatives

Numerical Stability

scGate implements numerically stable calculations:

# Simulate classification results
set.seed(42)
actual <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.7, 0.3))
predicted <- actual
# Add some noise
predicted[sample(which(actual == 1), 5)] <- 0
predicted[sample(which(actual == 0), 3)] <- 1

# Calculate performance metrics
metrics <- performance.metrics(actual, predicted)
print(metrics)
#>      PREC       REC       MCC 
#> 0.9062500 0.8529412 0.8200066

Computational Considerations

Vectorized Operations

scGate uses vectorized operations for efficiency:

# Example: rowSums is much faster than apply
n <- 1000
m <- 100
mat <- matrix(runif(n * m), nrow = n)

# Timing comparison (conceptual)
# apply(mat, 1, sum)  # Slower
# rowSums(mat)        # Much faster - used by scGate

Parallel Processing

For multi-model classification, scGate supports parallel processing:

library(BiocParallel)

# Use multiple cores
result <- scGate(
  data = seurat_obj,
  model = model_list,
  ncores = 4  # Use 4 cores
)

# Or specify custom BPPARAM
result <- scGate(
  data = seurat_obj,
  model = model_list,
  BPPARAM = MulticoreParam(workers = 4)
)

Summary

The scGate algorithm combines three key components:

Component Purpose Key Parameter
UCell scoring Robust signature quantification maxRank
kNN smoothing Noise reduction k.param, smooth.decay
Hierarchical gating Multi-level classification param_decay

This combination enables accurate, interpretable cell type purification without requiring reference datasets.

References

  1. Andreatta M, Berenstein AJ, Carmona SJ. scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets. Bioinformatics. 2022;38(9):2642-2644.

  2. Andreatta M, Carmona SJ. UCell: Robust and scalable single-cell gene signature scoring. Computational and Structural Biotechnology Journal. 2021;19:3796-3798.

Session Info

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] patchwork_1.3.2    ggplot2_4.0.1      SeuratObject_4.1.4 Seurat_4.4.0      
#> [5] scGate_1.7.2      
#> 
#> loaded via a namespace (and not attached):
#>   [1] deldir_2.0-4           pbapply_1.7-4          gridExtra_2.3         
#>   [4] rlang_1.1.7            magrittr_2.0.4         RcppAnnoy_0.0.23      
#>   [7] otel_0.2.0             spatstat.geom_3.7-0    matrixStats_1.5.0     
#>  [10] ggridges_0.5.7         compiler_4.4.0         png_0.1-8             
#>  [13] systemfonts_1.3.1      vctrs_0.7.1            reshape2_1.4.5        
#>  [16] stringr_1.6.0          pkgconfig_2.0.3        fastmap_1.2.0         
#>  [19] promises_1.5.0         rmarkdown_2.30         ragg_1.5.0            
#>  [22] purrr_1.2.1            xfun_0.56              cachem_1.1.0          
#>  [25] jsonlite_2.0.0         goftest_1.2-3          later_1.4.5           
#>  [28] BiocParallel_1.40.2    spatstat.utils_3.2-1   irlba_2.3.5.1         
#>  [31] parallel_4.4.0         cluster_2.1.8.1        R6_2.6.1              
#>  [34] ica_1.0-3              spatstat.data_3.1-9    stringi_1.8.7         
#>  [37] bslib_0.9.0            RColorBrewer_1.1-3     reticulate_1.44.1     
#>  [40] spatstat.univar_3.1-6  parallelly_1.46.1      lmtest_0.9-40         
#>  [43] jquerylib_0.1.4        scattermore_1.2        Rcpp_1.1.1            
#>  [46] knitr_1.51             tensor_1.5.1           future.apply_1.20.1   
#>  [49] zoo_1.8-15             sctransform_0.4.3      httpuv_1.6.16         
#>  [52] Matrix_1.7-4           splines_4.4.0          igraph_2.2.1          
#>  [55] tidyselect_1.2.1       abind_1.4-8            dichromat_2.0-0.1     
#>  [58] yaml_2.3.12            spatstat.random_3.4-4  spatstat.explore_3.7-0
#>  [61] codetools_0.2-20       miniUI_0.1.2           listenv_0.10.0        
#>  [64] plyr_1.8.9             lattice_0.22-7         tibble_3.3.1          
#>  [67] withr_3.0.2            shiny_1.12.1           S7_0.2.1              
#>  [70] ROCR_1.0-12            evaluate_1.0.5         Rtsne_0.17            
#>  [73] future_1.69.0          desc_1.4.3             survival_3.8-3        
#>  [76] polyclip_1.10-7        fitdistrplus_1.2-5     pillar_1.11.1         
#>  [79] KernSmooth_2.23-26     plotly_4.11.0          generics_0.1.4        
#>  [82] sp_2.2-0               scales_1.4.0           globals_0.18.0        
#>  [85] xtable_1.8-4           glue_1.8.0             lazyeval_0.2.2        
#>  [88] tools_4.4.0            BiocNeighbors_2.0.1    data.table_1.18.0     
#>  [91] RANN_2.6.2             dotCall64_1.2          fs_1.6.6              
#>  [94] leiden_0.4.3.1         cowplot_1.2.0          grid_4.4.0            
#>  [97] tidyr_1.3.2            colorspace_2.1-2       nlme_3.1-168          
#> [100] cli_3.6.5              spatstat.sparse_3.1-0  textshaping_1.0.4     
#> [103] spam_2.11-3            viridisLite_0.4.2      dplyr_1.1.4           
#> [106] uwot_0.2.4             gtable_0.3.6           sass_0.4.10           
#> [109] digest_0.6.39          progressr_0.18.0       ggrepel_0.9.6         
#> [112] htmlwidgets_1.6.4      farver_2.1.2           htmltools_0.5.9       
#> [115] pkgdown_2.1.3          lifecycle_1.0.5        httr_1.4.7            
#> [118] mime_0.13              MASS_7.3-65