Algorithm Principles and Mathematical Foundation
Zaoqu Liu
2026-01-26
Source:vignettes/algorithm.Rmd
algorithm.RmdOverview
scClustEval implements a self-projection framework for evaluating and optimizing single-cell RNA-seq clustering. This vignette provides a comprehensive explanation of the underlying algorithms and mathematical foundations.
The Challenge of Cell Clustering Evaluation
In single-cell RNA-seq analysis, clustering algorithms aim to group cells with similar transcriptomic profiles. However, several challenges arise:
- Parameter Sensitivity: Clustering results are highly sensitive to resolution parameters
- Over-clustering: High resolution may split biologically homogeneous populations
- Under-clustering: Low resolution may merge distinct cell types
- Lack of Ground Truth: True cell type labels are rarely available
scClustEval addresses these challenges through a quantitative, data-driven approach.
Self-Projection Framework
Core Concept
The self-projection method treats cluster evaluation as a classification problem:
“If a clustering is biologically meaningful, a classifier should be able to accurately distinguish cells from different clusters based on their gene expression profiles.”
Mathematical Formulation
1. Data Partitioning
Given expression matrix with cells and features, and cluster labels :
Stratified sampling ensures each cluster contributes proportionally:
Confusion Matrix Normalization
R1 Normalization
R1 normalization measures pairwise confusion rate relative to correct classifications:
Interpretation: - : Perfect discrimination - : More misclassifications than correct classifications (severe confusion) - High R1 indicates clusters should potentially be merged
library(scClustEval)
# Example confusion matrix
cmat <- matrix(c(
90, 5, 3, # Cluster A: 90 correct, 5 misclassified as B, 3 as C
8, 85, 7, # Cluster B: 8 misclassified as A, 85 correct, 7 as C
2, 12, 86 # Cluster C: 2 as A, 12 as B, 86 correct
), nrow = 3, byrow = TRUE)
rownames(cmat) <- colnames(cmat) <- c("A", "B", "C")
# Compute R1 normalization
r1_mat <- normalize_confmat_r1(cmat)
print(round(r1_mat, 3))R2 Normalization
R2 normalization measures overall confusion proportion in the dataset:
Where .
Interpretation: - R2 represents the fraction of total cells confused between clusters and - Values range from 0 to 1 - Useful for identifying globally important confusions
# Compute R2 normalization
r2_mat <- normalize_confmat_r2(cmat)
print(round(r2_mat, 4))Visualization of Confusion Analysis
library(ggplot2)
library(gridExtra)
# Create data for visualization
df_r1 <- as.data.frame(as.table(r1_mat))
colnames(df_r1) <- c("True", "Predicted", "R1")
df_r2 <- as.data.frame(as.table(r2_mat))
colnames(df_r2) <- c("True", "Predicted", "R2")
p1 <- ggplot(df_r1, aes(x = Predicted, y = True, fill = R1)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.2f", R1)), color = "white", size = 5) +
scale_fill_gradient(low = "gray90", high = "#d62728") +
labs(title = "R1 Normalization", subtitle = "Pairwise confusion rate") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
p2 <- ggplot(df_r2, aes(x = Predicted, y = True, fill = R2)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.3f", R2)), color = "white", size = 5) +
scale_fill_gradient(low = "gray90", high = "#1f77b4") +
labs(title = "R2 Normalization", subtitle = "Global confusion proportion") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
grid.arrange(p1, p2, ncol = 2)Cluster Merging Strategy
Community Detection
The Louvain algorithm identifies communities (groups of confused clusters) by maximizing modularity:
Where: - = total number of edges - = degree of node - = community assignment of node - = Kronecker delta
# Demonstrate adjacency matrix clustering
adj <- matrix(c(
0, 0.3, 0.02, 0.01,
0.3, 0, 0.25, 0.02,
0.02, 0.25, 0, 0.01,
0.01, 0.02, 0.01, 0
), nrow = 4, byrow = TRUE)
rownames(adj) <- colnames(adj) <- c("C1", "C2", "C3", "C4")
# Cluster at threshold 0.1
groups <- cluster_adjacency_matrix(adj, cutoff = 0.1, resolution = 1.0)
names(groups) <- c("C1", "C2", "C3", "C4")
print(groups)
# Visualize
adj_df <- as.data.frame(as.table(adj))
colnames(adj_df) <- c("From", "To", "Weight")
ggplot(adj_df, aes(x = To, y = From, fill = Weight)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.2f", Weight)), size = 4) +
scale_fill_gradient2(low = "white", mid = "#ffbb78", high = "#d62728", midpoint = 0.15) +
geom_hline(yintercept = 2.5, linetype = "dashed", color = "black", size = 1) +
geom_vline(xintercept = 2.5, linetype = "dashed", color = "black", size = 1) +
labs(title = "Confusion-Based Adjacency Matrix",
subtitle = "Dashed lines show detected communities (C1,C2 | C3,C4)",
fill = "R1 Value") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))Iterative Optimization
Classifier Comparison
Supported Algorithms
| Algorithm | Strengths | Best For |
|---|---|---|
| Logistic Regression (L1) | Sparse coefficients, interpretable | High-dimensional data, marker identification |
| Random Forest | Non-linear, feature importance | Complex boundaries |
| SVM (RBF) | Effective in high dimensions | Well-separated clusters |
| XGBoost | High accuracy, handles imbalance | Large datasets |
Summary
The scClustEval framework provides:
- Quantitative Assessment: Objective metrics for clustering quality
- Principled Optimization: Data-driven cluster merging
- Flexibility: Multiple classifiers and customizable parameters
- Interpretability: Confusion analysis reveals biological relationships
References
Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods, 17, 621-628.
Blondel, V. D., et al. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics, P10008.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.
Author: Zaoqu Liu (liuzaoqu@163.com)
Package: scClustEval v1.0.0