Introduction to scClustEval

Overview

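scClustEval (Single Cell Clustering Evaluation) is an R package for evaluating and optimizing single-cell RNA-seq clustering results by self-projection: a classifier is trained on the cluster labels of a subset of cells and then predicts the labels of held-out cells, so prediction accuracy measures how well the clusters can be told apart.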

The package implements an iterative optimization strategy that:

1. Trains a classifier to distinguish between clusters
2. Evaluates prediction accuracy via cross-validation
3. Identifies cluster pairs that are difficult to discriminate
4. Merges confused clusters iteratively until target accuracy is reached
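
The first two steps can be illustrated with a minimal base-R sketch. It is not the package's implementation: a nearest-centroid rule stands in for the real classifiers, a single hold-out split replaces cross-validation, and the helper `nearest_centroid_predict()` and the toy data are hypothetical names introduced only for this example.

```r
# Self-projection sketch: train on a subset of cells, predict the held-out
# cells, and report accuracy. The nearest-centroid rule is a stand-in
# (assumption), not one of the classifiers shipped with the package.
set.seed(1)

# Toy data: 3 clusters, 60 cells each, 20 features
n_per <- 60
centers <- c(0, 2, 4)
X_toy <- do.call(rbind, lapply(centers, function(m) {
  matrix(rnorm(n_per * 20, mean = m), nrow = n_per)
}))
y_toy <- rep(paste0("C", seq_along(centers)), each = n_per)

# Hypothetical helper: fit class centroids, assign each test cell to the nearest
nearest_centroid_predict <- function(X_train, y_train, X_test) {
  classes <- sort(unique(y_train))
  centroids <- t(sapply(classes, function(k) {
    colMeans(X_train[y_train == k, , drop = FALSE])
  }))
  # Squared Euclidean distance from every test cell to every centroid
  d <- sapply(seq_along(classes), function(j) {
    rowSums(sweep(X_test, 2, centroids[j, ])^2)
  })
  classes[max.col(-d, ties.method = "first")]
}

# Hold out half of the cells, then project the trained model onto them
train_idx <- sample(nrow(X_toy), size = nrow(X_toy) / 2)
pred <- nearest_centroid_predict(
  X_toy[train_idx, ], y_toy[train_idx], X_toy[-train_idx, ]
)
accuracy <- mean(pred == y_toy[-train_idx])
accuracy
```

Well-separated clusters give an accuracy close to 1; clusters that split one population arbitrarily pull it down, which is the signal the later merging steps rely on.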

Installation

# From GitHub
devtools::install_github("Zaoqu-Liu/scClustEval")

Quick Start

Loading the package

library(scClustEval)

Basic Assessment with Matrix Input

# Create example data
set.seed(42)
n_cells <- 500
n_features <- 100
n_clusters <- 5

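# Generate expression matrix with cluster structure
cells_per_cluster <- n_cells / n_clusters
X <- matrix(0, nrow = n_cells, ncol = n_features)
labels <- character(n_cells)

for (i in seq_len(n_clusters)) {
  # Rows belonging to cluster i; each cluster is drawn around mean i
  idx <- ((i - 1) * cells_per_cluster + 1):(i * cells_per_cluster)
  X[idx, ] <- matrix(rnorm(cells_per_cluster * n_features, mean = i),
                     nrow = cells_per_cluster)
  labels[idx] <- paste0("Cluster_", i)
}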

# Run assessment
result <- sc_assessment(
  X = X,
  labels = labels,
  classifier = "LR",
  n_per_class = 50,
  cv = 5
)

# Print result
print(result)
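
This vignette does not spell out what `n_per_class` and `cv` do. Assuming `n_per_class` caps the number of cells drawn from each cluster and `cv` is the number of cross-validation folds, the sampling could look roughly like the base-R sketch below. The helper `sample_per_class()` and the demo labels are hypothetical and not taken from the package source.

```r
# Sketch of per-class subsampling and fold assignment (assumed semantics of
# n_per_class and cv; not taken from the package source).
set.seed(42)
labels_demo <- rep(paste0("Cluster_", 1:5), times = c(100, 100, 100, 100, 30))

# Hypothetical helper: draw at most n_per_class cell indices from each cluster
sample_per_class <- function(labels, n_per_class) {
  idx_by_class <- split(seq_along(labels), labels)
  unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), min(length(idx), n_per_class))]
  }), use.names = FALSE)
}

train_idx <- sample_per_class(labels_demo, n_per_class = 50)
table(labels_demo[train_idx])

# Assign the sampled cells to cv folds, balanced within each cluster
cv <- 5
fold <- integer(length(train_idx))
for (k in unique(labels_demo[train_idx])) {
  in_k <- which(labels_demo[train_idx] == k)
  fold[in_k] <- sample(rep_len(seq_len(cv), length(in_k)))
}
table(labels_demo[train_idx], fold)
```

Capping the sample per cluster keeps large clusters from dominating the classifier, while small clusters (here the one with 30 cells) contribute every cell they have.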

With Seurat Objects

library(Seurat)

# Load your Seurat object
seurat_obj <- readRDS("your_data.rds")

# Run assessment on existing clustering
result <- RunAssessment(
  seurat_obj,
  cluster_col = "seurat_clusters",
  use = "pca",
  dims = 1:30
)

# View results
print(result)

# Plot ROC curves
plot_roc(result)

Clustering Optimization

The Optimization Process

The optimization process works as follows:

1. Start with an over-clustered result (high resolution)
2. Assess the clustering using self-projection
3. Build a confusion matrix to identify confused cluster pairs
4. Merge clusters that cannot be well discriminated
5. Repeat until target accuracy is reached

# Start with over-clustering
seurat_obj <- FindClusters(seurat_obj, resolution = 2.0)

# Run optimization
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "seurat_clusters",
  min_accuracy = 0.9,
  result_col = "optimized_clusters"
)

# Compare before and after
DimPlot(seurat_obj, group.by = c("seurat_clusters", "optimized_clusters"))
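
The assess-and-merge loop described above can be sketched in base R on toy data. This is an illustration under stated assumptions, not what `RunOptimization()` does internally: the classifier is a nearest-centroid stand-in, half of each cluster is held out instead of running cross-validation, and the most confused pair is chosen by a simple symmetrised error rate. `assess()` and all toy objects are hypothetical names.

```r
set.seed(7)

# Toy over-clustering: "A1" and "A2" are drawn from the same distribution
means <- c(A1 = 0, A2 = 0, B = 3, C = 6)
X_toy <- do.call(rbind, lapply(unname(means), function(m) {
  matrix(rnorm(50 * 10, mean = m), nrow = 50)
}))
labels_toy <- rep(names(means), each = 50)

# Stand-in assessment (assumption): nearest centroid, half of each cluster held out
assess <- function(X, labels) {
  train <- unlist(lapply(split(seq_along(labels), labels), function(i) {
    sample(i, length(i) %/% 2)
  }), use.names = FALSE)
  test <- setdiff(seq_along(labels), train)
  classes <- sort(unique(labels))
  X_tr <- X[train, , drop = FALSE]
  cent <- t(sapply(classes, function(k) {
    colMeans(X_tr[labels[train] == k, , drop = FALSE])
  }))
  d <- sapply(seq_along(classes), function(j) {
    rowSums(sweep(X[test, , drop = FALSE], 2, cent[j, ])^2)
  })
  pred <- classes[max.col(-d, ties.method = "first")]
  truth <- labels[test]
  list(
    accuracy = mean(pred == truth),
    confusion = unclass(table(factor(truth, classes), factor(pred, classes)))
  )
}

min_accuracy <- 0.9
history <- data.frame()
round_i <- 0
repeat {
  round_i <- round_i + 1
  res <- assess(X_toy, labels_toy)
  history <- rbind(history, data.frame(
    round = round_i,
    n_clusters = length(unique(labels_toy)),
    accuracy = res$accuracy
  ))
  if (res$accuracy >= min_accuracy || length(unique(labels_toy)) <= 2) break

  # Symmetrised misclassification rate between every pair of clusters
  rate <- res$confusion / rowSums(res$confusion)
  sym <- (rate + t(rate)) / 2
  diag(sym) <- 0
  pair <- which(sym == max(sym), arr.ind = TRUE)[1, ]

  # Merge the most confused pair into a single label
  to_merge <- sort(rownames(sym)[pair])
  labels_toy[labels_toy %in% to_merge] <- paste(to_merge, collapse = "+")
}
history
```

In this toy run the two halves of the same population are the pair the classifier cannot separate, so they are merged first, after which the remaining clusters are easy to discriminate and the loop stops.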

Visualization Functions

ROC Curves

# Plot ROC and Precision-Recall curves
plot_roc(result, plot_type = "both")

# ROC only
plot_roc(result, plot_type = "roc")
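
For readers who want to see what a one-vs-rest ROC curve is built from, here is a short base-R sketch. The scores are simulated rather than taken from an assessment result, and nothing here reflects how `plot_roc()` is implemented.

```r
# One-vs-rest ROC sketch in base R; scores are simulated, not package output
set.seed(3)
is_cluster <- rep(c(TRUE, FALSE), times = c(40, 160))  # cells in the target cluster
score <- ifelse(is_cluster, rnorm(200, mean = 2), rnorm(200, mean = 0))

# Sweep the decision threshold from the highest score to the lowest
ord <- order(score, decreasing = TRUE)
tpr <- c(0, cumsum(is_cluster[ord]) / sum(is_cluster))
fpr <- c(0, cumsum(!is_cluster[ord]) / sum(!is_cluster))

# Area under the ROC curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Precision and recall at the same thresholds
precision <- cumsum(is_cluster[ord]) / seq_along(ord)
recall <- tpr[-1]

auc
```

Calling `plot(fpr, tpr, type = "l")` draws the ROC curve and `plot(recall, precision, type = "l")` the precision-recall curve for this one cluster.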

Confusion Matrix Heatmap

# Raw confusion matrix
plot_confusion_heatmap(result, normalized = "raw")

# R1-normalized (used for merging decisions)
plot_confusion_heatmap(result, normalized = "R1")
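
The vignette does not define the R1 normalization. As an assumption for illustration, the sketch below divides each row of the raw confusion matrix by its diagonal (correctly classified cells) and keeps the larger of the two directions for every cluster pair. The package's actual formula may differ.

```r
# Toy truth and predictions for three clusters
truth <- factor(c(rep("A", 40), rep("B", 40), rep("C", 20)))
pred <- truth
pred[1:8] <- "B"    # 8 cells of A predicted as B
pred[41:44] <- "A"  # 4 cells of B predicted as A
pred[81] <- "B"     # 1 cell of C predicted as B

# Raw confusion matrix: rows are true clusters, columns are predictions
raw <- unclass(table(truth, pred))
raw

# Assumed normalization: errors relative to the correctly classified cells
ratio <- raw / diag(raw)   # row i is divided by raw[i, i]
diag(ratio) <- 0

# Confusion between i and j is the larger of the two directions
conf <- pmax(ratio, t(ratio))
conf
```

A symmetric matrix of this kind gives one score per cluster pair, which is convenient when deciding which pair to merge next.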

Optimization History

# Run optimization with matrix input
optim_result <- sc_optimize_all(
  X = X,
  labels = labels,  # cluster labels from the matrix example above
  min_accuracy = 0.9
)

# Plot optimization progress
plot_optimization_history(optim_result)

Classifier Options

The package supports multiple classifiers:

Classifier           Code        Description
Logistic Regression  `"LR"`      L1/L2 regularized (default)
Random Forest        `"RF"`      Using randomForest package
Ranger               `"RANGER"`  Fast random forest
SVM                  `"SVM"`     Support Vector Machine
Naive Bayes          `"NB"`      Gaussian Naive Bayes
Decision Tree        `"DT"`      Using rpart
XGBoost              `"XGB"`     Gradient boosting

# Using different classifiers
result_lr <- sc_assessment(X, labels, classifier = "LR")
result_rf <- sc_assessment(X, labels, classifier = "RF")
result_svm <- sc_assessment(X, labels, classifier = "SVM")

Advanced Usage

Using Constraints

You can constrain the optimization process using an under-clustering as a boundary:

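# Create low and high resolution clusterings.
# FindClusters() writes its result to "seurat_clusters", so copy each run
# into its own metadata column before running the next one.
seurat_obj <- FindClusters(seurat_obj, resolution = 0.2)
seurat_obj$low_res <- seurat_obj$seurat_clusters
seurat_obj <- FindClusters(seurat_obj, resolution = 2.0)
seurat_obj$high_res <- seurat_obj$seurat_clusters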

# Optimize with constraint
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "high_res",
  under_cluster_col = "low_res",  # Constraint
  min_accuracy = 0.95
)
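
One way to read the constraint is that two high-resolution clusters may only be merged when they fall inside the same low-resolution cluster. The base-R sketch below works through that interpretation on toy labels. It is an assumption about the intended behaviour, and `can_merge()` is a hypothetical helper, not a package function.

```r
# Toy labels: four fine clusters nested (almost) inside two coarse ones
high_res <- c(rep("h1", 30), rep("h2", 30), rep("h3", 40), rep("h4", 20))
low_res <- c(rep("L1", 58), rep("L2", 62))  # two cells of h2 fall into L2

# Parent of each fine cluster: the coarse cluster holding most of its cells
tab <- unclass(table(high_res, low_res))
parent <- colnames(tab)[max.col(tab, ties.method = "first")]
names(parent) <- rownames(tab)
parent

# Pairs of fine clusters that share a parent are the only merge candidates
allowed <- outer(parent, parent, "==")
diag(allowed) <- FALSE

# Hypothetical helper: may clusters a and b be merged under the constraint?
can_merge <- function(a, b) allowed[a, b]

can_merge("h1", "h2")
can_merge("h2", "h3")
```

Under this reading the under-clustering acts as a boundary: merging can undo over-splitting within a coarse cluster but never joins cells across coarse clusters.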

Parallel Processing

# Assessment uses parallel processing automatically
# Control with n_cores parameter
result <- sc_assessment(
  X, labels,
  n_cores = 4  # Use 4 cores
)
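
How the package distributes work across cores is not described here. A common pattern, shown as a sketch below, is to score cross-validation folds in parallel with `parallel::mclapply()`. The toy data, the nearest-centroid stand-in and `score_fold()` are all assumptions made for the example.

```r
library(parallel)
set.seed(11)

# Toy data: two clusters of 100 cells, 5 features
X_toy <- rbind(
  matrix(rnorm(100 * 5, mean = 0), nrow = 100),
  matrix(rnorm(100 * 5, mean = 3), nrow = 100)
)
y_toy <- rep(c("a", "b"), each = 100)
fold <- sample(rep_len(1:5, nrow(X_toy)))

# Hypothetical per-fold scorer: nearest-centroid stand-in classifier
score_fold <- function(k) {
  tr <- fold != k
  cent_a <- colMeans(X_toy[tr & y_toy == "a", , drop = FALSE])
  cent_b <- colMeans(X_toy[tr & y_toy == "b", , drop = FALSE])
  te <- X_toy[!tr, , drop = FALSE]
  d_a <- rowSums(sweep(te, 2, cent_a)^2)
  d_b <- rowSums(sweep(te, 2, cent_b)^2)
  mean(ifelse(d_a < d_b, "a", "b") == y_toy[!tr])
}

# mclapply() forks, which is unavailable on Windows, so fall back to one core
n_cores <- if (.Platform$OS.type == "windows") {
  1L
} else {
  max(1L, min(2L, detectCores(), na.rm = TRUE))
}
acc <- unlist(mclapply(1:5, score_fold, mc.cores = n_cores))
mean(acc)
```

Folds are independent of one another, so they parallelize cleanly; the speed-up is bounded by the number of folds and by the cost of copying data to the workers.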

Session Info

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] scClustEval_1.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] shape_1.4.6.1        gtable_0.3.6         xfun_0.56           
#>  [4] bslib_0.9.0          ggplot2_4.0.1        htmlwidgets_1.6.4   
#>  [7] recipes_1.3.1        lattice_0.22-7       vctrs_0.7.1         
#> [10] tools_4.4.0          generics_0.1.4       stats4_4.4.0        
#> [13] parallel_4.4.0       tibble_3.3.1         ModelMetrics_1.2.2.2
#> [16] pkgconfig_2.0.3      Matrix_1.7-4         data.table_1.18.0   
#> [19] RColorBrewer_1.1-3   S7_0.2.1             desc_1.4.3          
#> [22] lifecycle_1.0.5      stringr_1.6.0        compiler_4.4.0      
#> [25] farver_2.1.2         textshaping_1.0.4    codetools_0.2-20    
#> [28] htmltools_0.5.9      class_7.3-23         sass_0.4.10         
#> [31] glmnet_4.1-10        yaml_2.3.12          prodlim_2025.04.28  
#> [34] pillar_1.11.1        pkgdown_2.1.3        jquerylib_0.1.4     
#> [37] MASS_7.3-65          cachem_1.1.0         gower_1.0.2         
#> [40] iterators_1.0.14     rpart_4.1.24         foreach_1.5.2       
#> [43] nlme_3.1-168         parallelly_1.46.1    lava_1.8.2          
#> [46] tidyselect_1.2.1     digest_0.6.39        stringi_1.8.7       
#> [49] future_1.69.0        reshape2_1.4.5       purrr_1.2.1         
#> [52] dplyr_1.1.4          listenv_0.10.0       splines_4.4.0       
#> [55] fastmap_1.2.0        grid_4.4.0           cli_3.6.5           
#> [58] magrittr_2.0.4       dichromat_2.0-0.1    survival_3.8-3      
#> [61] future.apply_1.20.1  withr_3.0.2          scales_1.4.0        
#> [64] lubridate_1.9.4      timechange_0.3.0     rmarkdown_2.30      
#> [67] globals_0.18.0       igraph_2.2.1         otel_0.2.0          
#> [70] nnet_7.3-20          timeDate_4051.111    ragg_1.5.0          
#> [73] evaluate_1.0.5       knitr_1.51           hardhat_1.4.2       
#> [76] caret_7.0-1          rlang_1.1.7          Rcpp_1.1.1          
#> [79] glue_1.8.0           pROC_1.19.0.1        ipred_0.9-15        
#> [82] jsonlite_2.0.0       R6_2.6.1             plyr_1.8.9          
#> [85] systemfonts_1.3.1    fs_1.6.6

References

This package is an R implementation inspired by the SCCAF Python package:

Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods.
SCCAF GitHub: https://github.com/SCCAF/sccaf

Zaoqu Liu

2026-01-26