Introduction to scClustEval
Zaoqu Liu
2026-01-26
Overview
scClustEval (Single Cell Clustering Evaluation) is an R package for evaluating and optimizing single-cell RNA-seq clustering results using self-projection machine learning approaches.
The package implements an iterative optimization strategy that:
- Trains a classifier to distinguish between clusters
- Evaluates prediction accuracy via cross-validation
- Identifies cluster pairs that are difficult to discriminate
- Merges confused clusters iteratively until target accuracy is reached
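The first two steps (train a classifier, then score it by cross-validation) can be sketched in base R. This is an illustration only, not the scClustEval implementation: a nearest-centroid classifier is swapped in for the package's classifiers, and the names `X_toy`, `labels_toy`, `predict_nearest_centroid`, and `cross_validated_accuracy` are hypothetical.

```r
# Self-projection sketch (base R only; NOT the scClustEval implementation).
# Assumption: a nearest-centroid classifier stands in for the package's models.
set.seed(1)

# Toy data: 3 clusters of 60 cells, 20 features, with shifted means
n_per <- 60
n_feat <- 20
X_toy <- rbind(
  matrix(rnorm(n_per * n_feat, mean = 0), nrow = n_per),
  matrix(rnorm(n_per * n_feat, mean = 1), nrow = n_per),
  matrix(rnorm(n_per * n_feat, mean = 2), nrow = n_per)
)
labels_toy <- rep(c("A", "B", "C"), each = n_per)

# Assign each test cell to the cluster whose training centroid is closest
predict_nearest_centroid <- function(X_train, y_train, X_test) {
  classes <- sort(unique(y_train))
  # centroids: one column per class (features x classes)
  centroids <- vapply(
    classes,
    function(cl) colMeans(X_train[y_train == cl, , drop = FALSE]),
    numeric(ncol(X_train))
  )
  # Squared Euclidean distance from every test cell to every centroid
  d <- vapply(
    seq_along(classes),
    function(k) rowSums(sweep(X_test, 2, centroids[, k])^2),
    numeric(nrow(X_test))
  )
  d <- matrix(d, nrow = nrow(X_test))
  classes[max.col(-d, ties.method = "first")]
}

# k-fold cross-validation: every cell is predicted by a model that never saw it
cross_validated_accuracy <- function(X, labels, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(X)))
  correct <- logical(nrow(X))
  for (f in seq_len(k)) {
    test <- folds == f
    pred <- predict_nearest_centroid(
      X[!test, , drop = FALSE], labels[!test], X[test, , drop = FALSE]
    )
    correct[test] <- pred == labels[test]
  }
  mean(correct)
}

acc <- cross_validated_accuracy(X_toy, labels_toy, k = 5)
```

Here `acc` is the fraction of cells whose cluster label is recovered on held-out folds. A low value suggests that some clusters are not separable in expression space.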
Installation
# From GitHub
devtools::install_github("Zaoqu-Liu/scClustEval")
Quick Start
Basic Assessment with Matrix Input
library(scClustEval)
# Create example data
set.seed(42)
n_cells <- 500
n_features <- 100
n_clusters <- 5
# Generate expression matrix with cluster structure
X <- matrix(0, nrow = n_cells, ncol = n_features)
labels <- character(n_cells)
for (i in 1:n_clusters) {
  idx <- ((i - 1) * 100 + 1):(i * 100)
  X[idx, ] <- matrix(rnorm(100 * n_features, mean = i), nrow = 100)
  labels[idx] <- paste0("Cluster_", i)
}
# Run assessment
result <- sc_assessment(
  X = X,
  labels = labels,
  classifier = "LR",
  n_per_class = 50,
  cv = 5
)
# Print result
print(result)
Clustering Optimization
The Optimization Process
The optimization process works as follows:
- Start with an over-clustered result (high resolution)
- Assess the clustering using self-projection
- Build a confusion matrix to identify confused cluster pairs
- Merge clusters that cannot be well discriminated
- Repeat until target accuracy is reached
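The confusion-matrix and merge steps above can be illustrated with a small base R sketch. It is not the package's implementation: the labels are hand-written, the pairwise confusion rate (the larger of the two row-normalized misassignment fractions) and the `merge_threshold` value are assumptions made for this example, and a real run would retrain the classifier after each merge rather than simply relabel the predictions.

```r
# One merge round on hand-written labels (illustrative; NOT the package internals).
true_labels <- rep(c("c1", "c2", "c3"), times = c(10, 10, 10))
pred_labels <- true_labels
pred_labels[1:4] <- "c2"    # 4 cells of c1 predicted as c2
pred_labels[11:13] <- "c1"  # 3 cells of c2 predicted as c1
pred_labels[21] <- "c2"     # 1 cell of c3 predicted as c2

# Confusion matrix: rows = true cluster, columns = predicted cluster
lv <- c("c1", "c2", "c3")
cm <- unclass(table(
  true = factor(true_labels, levels = lv),
  predicted = factor(pred_labels, levels = lv)
))

# Row-normalize, then take the larger of the two directions for each pair
# (assumed definition of a pairwise confusion rate for this sketch)
rate <- cm / rowSums(cm)
pair_rate <- pmax(rate, t(rate))
diag(pair_rate) <- 0

# Most confused pair of clusters
worst <- which(pair_rate == max(pair_rate), arr.ind = TRUE)[1, ]
pair <- unname(lv[sort(worst)])

# Merge the pair if its confusion rate exceeds a (hypothetical) threshold
merge_threshold <- 0.2
merge_pair <- function(x, pair) ifelse(x %in% pair, pair[1], x)
if (max(pair_rate) > merge_threshold) {
  merged_true <- merge_pair(true_labels, pair)
  merged_pred <- merge_pair(pred_labels, pair)
} else {
  merged_true <- true_labels
  merged_pred <- pred_labels
}

acc_before <- mean(pred_labels == true_labels)
acc_after <- mean(merged_pred == merged_true)
```

In this toy case `c1` and `c2` form the most confused pair, and merging them raises the agreement between true and predicted labels. The loop would stop once that agreement reaches the target accuracy.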
library(Seurat)
# Start with over-clustering
# (seurat_obj is assumed to be a preprocessed Seurat object with a neighbor graph)
seurat_obj <- FindClusters(seurat_obj, resolution = 2.0)
# Run optimization
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "seurat_clusters",
  min_accuracy = 0.9,
  result_col = "optimized_clusters"
)
# Compare before and after
DimPlot(seurat_obj, group.by = c("seurat_clusters", "optimized_clusters"))
Visualization Functions
Confusion Matrix Heatmap
# Raw confusion matrix
plot_confusion_heatmap(result, normalized = "raw")
# R1-normalized (used for merging decisions)
plot_confusion_heatmap(result, normalized = "R1")
Optimization History
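# Run optimization with matrix input
# initial_labels holds the starting cluster labels; here, the example labels from above
initial_labels <- labels
optim_result <- sc_optimize_all(
  X = X,
  labels = initial_labels,
  min_accuracy = 0.9
)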
# Plot optimization progress
plot_optimization_history(optim_result)
Classifier Options
The package supports multiple classifiers:
| Classifier | Code | Description |
|---|---|---|
| Logistic Regression | "LR" | L1/L2 regularized (default) |
| Random Forest | "RF" | Using randomForest package |
| Ranger | "RANGER" | Fast random forest |
| SVM | "SVM" | Support Vector Machine |
| Naive Bayes | "NB" | Gaussian Naive Bayes |
| Decision Tree | "DT" | Using rpart |
| XGBoost | "XGB" | Gradient boosting |
# Using different classifiers
result_lr <- sc_assessment(X, labels, classifier = "LR")
result_rf <- sc_assessment(X, labels, classifier = "RF")
result_svm <- sc_assessment(X, labels, classifier = "SVM")
Advanced Usage
Using Constraints
You can constrain the optimization process using an under-clustering as a boundary:
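# Create low and high resolution clusterings
# (FindClusters writes to "seurat_clusters", so copy each result to its own column)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.2)
seurat_obj$low_res <- seurat_obj$seurat_clusters
seurat_obj <- FindClusters(seurat_obj, resolution = 2.0)
seurat_obj$high_res <- seurat_obj$seurat_clusters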
# Optimize with constraint
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "high_res",
  under_cluster_col = "low_res",  # Constraint
  min_accuracy = 0.95
)
Parallel Processing
# Assessment uses parallel processing automatically
# Control with n_cores parameter
result <- sc_assessment(
  X, labels,
  n_cores = 4  # Use 4 cores
)
Session Info
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] scClustEval_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] shape_1.4.6.1 gtable_0.3.6 xfun_0.56
#> [4] bslib_0.9.0 ggplot2_4.0.1 htmlwidgets_1.6.4
#> [7] recipes_1.3.1 lattice_0.22-7 vctrs_0.7.1
#> [10] tools_4.4.0 generics_0.1.4 stats4_4.4.0
#> [13] parallel_4.4.0 tibble_3.3.1 ModelMetrics_1.2.2.2
#> [16] pkgconfig_2.0.3 Matrix_1.7-4 data.table_1.18.0
#> [19] RColorBrewer_1.1-3 S7_0.2.1 desc_1.4.3
#> [22] lifecycle_1.0.5 stringr_1.6.0 compiler_4.4.0
#> [25] farver_2.1.2 textshaping_1.0.4 codetools_0.2-20
#> [28] htmltools_0.5.9 class_7.3-23 sass_0.4.10
#> [31] glmnet_4.1-10 yaml_2.3.12 prodlim_2025.04.28
#> [34] pillar_1.11.1 pkgdown_2.1.3 jquerylib_0.1.4
#> [37] MASS_7.3-65 cachem_1.1.0 gower_1.0.2
#> [40] iterators_1.0.14 rpart_4.1.24 foreach_1.5.2
#> [43] nlme_3.1-168 parallelly_1.46.1 lava_1.8.2
#> [46] tidyselect_1.2.1 digest_0.6.39 stringi_1.8.7
#> [49] future_1.69.0 reshape2_1.4.5 purrr_1.2.1
#> [52] dplyr_1.1.4 listenv_0.10.0 splines_4.4.0
#> [55] fastmap_1.2.0 grid_4.4.0 cli_3.6.5
#> [58] magrittr_2.0.4 dichromat_2.0-0.1 survival_3.8-3
#> [61] future.apply_1.20.1 withr_3.0.2 scales_1.4.0
#> [64] lubridate_1.9.4 timechange_0.3.0 rmarkdown_2.30
#> [67] globals_0.18.0 igraph_2.2.1 otel_0.2.0
#> [70] nnet_7.3-20 timeDate_4051.111 ragg_1.5.0
#> [73] evaluate_1.0.5 knitr_1.51 hardhat_1.4.2
#> [76] caret_7.0-1 rlang_1.1.7 Rcpp_1.1.1
#> [79] glue_1.8.0 pROC_1.19.0.1 ipred_0.9-15
#> [82] jsonlite_2.0.0 R6_2.6.1 plyr_1.8.9
#> [85] systemfonts_1.3.1 fs_1.6.6
References
This package is an R implementation inspired by the SCCAF Python package:
- Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods.
- SCCAF GitHub: https://github.com/SCCAF/sccaf