Overview

scClustEval (Single Cell Clustering Evaluation) is an R package for evaluating and optimizing single-cell RNA-seq clustering results by self-projection: a classifier is trained on the cluster labels, and its cross-validated accuracy measures how well the clusters can be told apart.

The package implements an iterative optimization strategy that:

  1. Trains a classifier to distinguish between clusters
  2. Evaluates prediction accuracy via cross-validation
  3. Identifies cluster pairs that are difficult to discriminate
  4. Merges confused clusters iteratively until target accuracy is reached
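To make steps 1 and 2 concrete, here is a minimal base-R sketch of cross-validated self-projection. It is not the package's implementation: `self_project_cv()` is a hypothetical helper, and a nearest-centroid classifier stands in for the classifiers the package offers, purely to keep the example dependency-free.

```r
# Hypothetical helper: k-fold self-projection with a nearest-centroid
# classifier (a stand-in for the package's classifiers).
self_project_cv <- function(X, labels, k = 5, seed = 1) {
  set.seed(seed)
  labels <- factor(labels)
  classes <- levels(labels)
  # Assign every cell to one of k folds at random
  fold <- sample(rep_len(seq_len(k), nrow(X)))
  pred <- character(nrow(X))

  for (f in seq_len(k)) {
    train <- fold != f
    test <- which(!train)
    # One centroid per cluster, computed from the training cells only
    centroids <- t(sapply(classes, function(cl) {
      colMeans(X[train & labels == cl, , drop = FALSE])
    }))
    # Squared Euclidean distance from each held-out cell to each centroid
    d <- sapply(seq_along(classes), function(j) {
      rowSums(sweep(X[test, , drop = FALSE], 2, centroids[j, ])^2)
    })
    d <- matrix(d, nrow = length(test))
    # Predict the cluster whose centroid is closest
    pred[test] <- classes[max.col(-d, ties.method = "first")]
  }

  confusion <- table(
    truth = labels,
    predicted = factor(pred, levels = classes)
  )
  list(accuracy = mean(pred == as.character(labels)), confusion = confusion)
}

# Toy data: three well-separated Gaussian clusters
set.seed(42)
X_toy <- rbind(
  matrix(rnorm(60 * 10, mean = 0), nrow = 60),
  matrix(rnorm(60 * 10, mean = 3), nrow = 60),
  matrix(rnorm(60 * 10, mean = 6), nrow = 60)
)
labels_toy <- rep(c("A", "B", "C"), each = 60)

res_toy <- self_project_cv(X_toy, labels_toy, k = 5)
res_toy$accuracy   # overall cross-validated accuracy
res_toy$confusion  # rows: true cluster, columns: predicted cluster
```

Off-diagonal entries of the confusion table are the raw material for step 3: a pair of clusters with many cells assigned to each other is a candidate for merging.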

Installation

# From GitHub
devtools::install_github("Zaoqu-Liu/scClustEval")

Quick Start

Loading the package

library(scClustEval)

Basic Assessment with Matrix Input

# Create example data
set.seed(42)
n_cells <- 500
n_features <- 100
n_clusters <- 5

# Generate expression matrix with cluster structure
X <- matrix(0, nrow = n_cells, ncol = n_features)
labels <- character(n_cells)

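cells_per_cluster <- n_cells / n_clusters  # 100 cells in each cluster

for (i in seq_len(n_clusters)) {
  idx <- ((i - 1) * cells_per_cluster + 1):(i * cells_per_cluster)
  # Shift the mean of each cluster so the clusters are separable
  X[idx, ] <- matrix(
    rnorm(cells_per_cluster * n_features, mean = i),
    nrow = cells_per_cluster
  )
  labels[idx] <- paste0("Cluster_", i)
}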

# Run assessment
result <- sc_assessment(
  X = X,
  labels = labels,
  classifier = "LR",
  n_per_class = 50,
  cv = 5
)

# Print result
print(result)

With Seurat Objects

library(Seurat)

# Load your Seurat object
seurat_obj <- readRDS("your_data.rds")

# Run assessment on existing clustering
result <- RunAssessment(
  seurat_obj,
  cluster_col = "seurat_clusters",
  use = "pca",
  dims = 1:30
)

# View results
print(result)

# Plot ROC curves
plot_roc(result)

Clustering Optimization

The Optimization Process

The optimization process works as follows:

  1. Start with an over-clustered result (high resolution)
  2. Assess the clustering using self-projection
  3. Build a confusion matrix to identify confused cluster pairs
  4. Merge clusters that cannot be well discriminated
  5. Repeat until target accuracy is reached

# Start with over-clustering
seurat_obj <- FindClusters(seurat_obj, resolution = 2.0)

# Run optimization
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "seurat_clusters",
  min_accuracy = 0.9,
  result_col = "optimized_clusters"
)

# Compare before and after
DimPlot(seurat_obj, group.by = c("seurat_clusters", "optimized_clusters"))
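The merge loop can be illustrated on a fixed confusion matrix in base R. This is a sketch under assumptions, not the package's code: `merge_most_confused()` is a hypothetical helper, the confusion rate used here (misassigned cells divided by correctly assigned cells, symmetrized with the maximum) is one plausible normalization, and the real procedure retrains the classifier after every merge rather than collapsing a single matrix.

```r
# Toy confusion matrix: rows are true clusters, columns are predictions
conf <- matrix(
  c(90,  8,  2,  0,
     9, 88,  3,  0,
     1,  2, 95,  2,
     0,  0,  4, 96),
  nrow = 4, byrow = TRUE,
  dimnames = list(truth = paste0("C", 1:4), predicted = paste0("C", 1:4))
)

accuracy <- function(conf) sum(diag(conf)) / sum(conf)

# Hypothetical helper: merge the most-confused pair of clusters by summing
# their rows and columns.
merge_most_confused <- function(conf) {
  # Assumed confusion rate: entry [i, j] divided by the diagonal of row i,
  # then symmetrized by taking the larger of the two directions
  rate <- conf / diag(conf)
  rate <- pmax(rate, t(rate))
  diag(rate) <- 0

  pair <- which(rate == max(rate), arr.ind = TRUE)[1, ]
  i <- min(pair)
  j <- max(pair)
  merged_name <- paste(rownames(conf)[c(i, j)], collapse = "+")

  # Fold cluster j into cluster i, then drop cluster j
  conf[i, ] <- conf[i, ] + conf[j, ]
  conf[, i] <- conf[, i] + conf[, j]
  conf <- conf[-j, -j, drop = FALSE]
  rownames(conf)[i] <- merged_name
  colnames(conf)[i] <- merged_name
  conf
}

min_accuracy <- 0.95
history <- accuracy(conf)
while (accuracy(conf) < min_accuracy && nrow(conf) > 2) {
  conf <- merge_most_confused(conf)
  history <- c(history, accuracy(conf))
}

history         # accuracy before and after each merge
rownames(conf)  # surviving clusters
```

On this toy matrix a single merge of C1 and C2 raises the accuracy from 0.9225 to 0.965, which clears the 0.95 target, so the loop stops with three clusters.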

Visualization Functions

ROC Curves

# Plot ROC and Precision-Recall curves
plot_roc(result, plot_type = "both")

# ROC only
plot_roc(result, plot_type = "roc")

Confusion Matrix Heatmap

# Raw confusion matrix
plot_confusion_heatmap(result, normalized = "raw")

# R1-normalized (used for merging decisions)
plot_confusion_heatmap(result, normalized = "R1")

Optimization History

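# Run optimization with matrix input
# initial_labels: starting cluster labels, one per row of X
# (here the labels from the Quick Start example; ideally an over-clustering)
initial_labels <- labels
optim_result <- sc_optimize_all(
  X = X,
  labels = initial_labels,
  min_accuracy = 0.9
)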

# Plot optimization progress
plot_optimization_history(optim_result)

Classifier Options

The package supports multiple classifiers:

Classifier           Code      Description
Logistic Regression  "LR"      L1/L2 regularized (default)
Random Forest        "RF"      Using randomForest package
Ranger               "RANGER"  Fast random forest
SVM                  "SVM"     Support Vector Machine
Naive Bayes          "NB"      Gaussian Naive Bayes
Decision Tree        "DT"      Using rpart
XGBoost              "XGB"     Gradient boosting

# Using different classifiers
result_lr <- sc_assessment(X, labels, classifier = "LR")
result_rf <- sc_assessment(X, labels, classifier = "RF")
result_svm <- sc_assessment(X, labels, classifier = "SVM")

Advanced Usage

Using Constraints

You can constrain the optimization with an under-clustering (a low-resolution clustering) that serves as a boundary on how far clusters may be merged:

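# Create low and high resolution clusterings
# FindClusters() writes its result to "seurat_clusters", so copy each
# clustering into its own metadata column before running the next one
seurat_obj <- FindClusters(seurat_obj, resolution = 0.2)
seurat_obj$low_res <- seurat_obj$seurat_clusters
seurat_obj <- FindClusters(seurat_obj, resolution = 2.0)
seurat_obj$high_res <- seurat_obj$seurat_clusters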

# Optimize with constraint
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "high_res",
  under_cluster_col = "low_res",  # Constraint
  min_accuracy = 0.95
)
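One way to picture the constraint is as a mask over cluster pairs. The base-R sketch below assumes that two high-resolution clusters may be merged only if they fall inside the same low-resolution cluster; that reading, and the helper `allowed_merges()`, are illustrative assumptions and not taken from the package source.

```r
# Hypothetical helper: TRUE where two high-resolution clusters share the same
# low-resolution parent and may therefore be merged.
allowed_merges <- function(high_res, low_res) {
  tab <- table(high_res, low_res)
  # Parent of a high-res cluster: the low-res cluster holding most of its cells
  parent <- colnames(tab)[apply(tab, 1, which.max)]
  names(parent) <- rownames(tab)

  allowed <- outer(parent, parent, "==")
  diag(allowed) <- FALSE  # a cluster is never merged with itself
  dimnames(allowed) <- list(rownames(tab), rownames(tab))
  allowed
}

# Toy labels: h1 and h2 sit inside L1, h3 and h4 sit inside L2
high_res <- rep(c("h1", "h2", "h3", "h4"), each = 10)
low_res <- rep(c("L1", "L2"), each = 20)

mask <- allowed_merges(high_res, low_res)
mask
```

Multiplying a confusion-rate matrix by such a mask before picking the most-confused pair would keep merges from crossing a low-resolution boundary.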

Parallel Processing

# Assessment uses parallel processing automatically
# Control with n_cores parameter
result <- sc_assessment(
  X, labels,
  n_cores = 4  # Use 4 cores
)
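A reasonable value for `n_cores` can be derived from the machine instead of hard-coded. The sketch below uses only the base `parallel` package; leaving one core free is a common convention, not a requirement of scClustEval.

```r
# detectCores() can return NA on some platforms, so fall back to a single core
available <- parallel::detectCores(logical = FALSE)
if (is.na(available)) available <- 1L

# Leave one core free for the operating system and the R session itself
n_cores <- max(1L, available - 1L)
n_cores
```

The resulting value can then be passed as `n_cores = n_cores` in the call above.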

Session Info

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] scClustEval_1.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] shape_1.4.6.1        gtable_0.3.6         xfun_0.56           
#>  [4] bslib_0.9.0          ggplot2_4.0.1        htmlwidgets_1.6.4   
#>  [7] recipes_1.3.1        lattice_0.22-7       vctrs_0.7.1         
#> [10] tools_4.4.0          generics_0.1.4       stats4_4.4.0        
#> [13] parallel_4.4.0       tibble_3.3.1         ModelMetrics_1.2.2.2
#> [16] pkgconfig_2.0.3      Matrix_1.7-4         data.table_1.18.0   
#> [19] RColorBrewer_1.1-3   S7_0.2.1             desc_1.4.3          
#> [22] lifecycle_1.0.5      stringr_1.6.0        compiler_4.4.0      
#> [25] farver_2.1.2         textshaping_1.0.4    codetools_0.2-20    
#> [28] htmltools_0.5.9      class_7.3-23         sass_0.4.10         
#> [31] glmnet_4.1-10        yaml_2.3.12          prodlim_2025.04.28  
#> [34] pillar_1.11.1        pkgdown_2.1.3        jquerylib_0.1.4     
#> [37] MASS_7.3-65          cachem_1.1.0         gower_1.0.2         
#> [40] iterators_1.0.14     rpart_4.1.24         foreach_1.5.2       
#> [43] nlme_3.1-168         parallelly_1.46.1    lava_1.8.2          
#> [46] tidyselect_1.2.1     digest_0.6.39        stringi_1.8.7       
#> [49] future_1.69.0        reshape2_1.4.5       purrr_1.2.1         
#> [52] dplyr_1.1.4          listenv_0.10.0       splines_4.4.0       
#> [55] fastmap_1.2.0        grid_4.4.0           cli_3.6.5           
#> [58] magrittr_2.0.4       dichromat_2.0-0.1    survival_3.8-3      
#> [61] future.apply_1.20.1  withr_3.0.2          scales_1.4.0        
#> [64] lubridate_1.9.4      timechange_0.3.0     rmarkdown_2.30      
#> [67] globals_0.18.0       igraph_2.2.1         otel_0.2.0          
#> [70] nnet_7.3-20          timeDate_4051.111    ragg_1.5.0          
#> [73] evaluate_1.0.5       knitr_1.51           hardhat_1.4.2       
#> [76] caret_7.0-1          rlang_1.1.7          Rcpp_1.1.1          
#> [79] glue_1.8.0           pROC_1.19.0.1        ipred_0.9-15        
#> [82] jsonlite_2.0.0       R6_2.6.1             plyr_1.8.9          
#> [85] systemfonts_1.3.1    fs_1.6.6

References

This package is an R implementation inspired by the SCCAF Python package:

  • Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods.
  • SCCAF GitHub: https://github.com/SCCAF/sccaf