Algorithm Principles and Mathematical Foundation

Overview

scClustEval implements a self-projection framework for evaluating and optimizing single-cell RNA-seq clustering. This vignette provides a comprehensive explanation of the underlying algorithms and mathematical foundations.

The Challenge of Cell Clustering Evaluation

In single-cell RNA-seq analysis, clustering algorithms aim to group cells with similar transcriptomic profiles. However, several challenges arise:

Parameter Sensitivity: Clustering results are highly sensitive to resolution parameters
Over-clustering: High resolution may split biologically homogeneous populations
Under-clustering: Low resolution may merge distinct cell types
Lack of Ground Truth: True cell type labels are rarely available

scClustEval addresses these challenges through a quantitative, data-driven approach.

Self-Projection Framework

Core Concept

The self-projection method treats cluster evaluation as a classification problem:

“If a clustering is biologically meaningful, a classifier should be able to accurately distinguish cells from different clusters based on their gene expression profiles.”

Algorithm Steps

Mathematical Formulation

1. Data Partitioning

Given expression matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ with $n$ cells and $p$ features, and cluster labels $\mathbf{y} \in \{1, ..., K\}^n$ :

$\mathbf{X} = \mathbf{X}_{train} \cup \mathbf{X}_{test}$

Stratified sampling ensures each cluster $k$ contributes proportionally:

$\frac{n_k^{train}}{n^{train}} \approx \frac{n_k^{test}}{n^{test}} \approx \frac{n_k}{n}$

2. Classification Model

We train a multi-class classifier $f: \mathbb{R}^p \rightarrow \{1, ..., K\}$ that predicts cluster membership. The default is L1-regularized logistic regression:

$\min_{\boldsymbol{\beta}} -\sum_{i=1}^{n_{train}} \log P(y_i | \mathbf{x}_i, \boldsymbol{\beta}) + \lambda ||\boldsymbol{\beta}||_1$

3. Confusion Matrix

The confusion matrix $\mathbf{C} \in \mathbb{R}^{K \times K}$ is computed on the test set:

$C_{ij} = |\{x : y_{true}(x) = i \land y_{pred}(x) = j\}|$

Where: - $C_{ii}$ (diagonal): correctly classified cells from cluster $i$ - $C_{ij}$ (off-diagonal): cells from cluster $i$ misclassified as cluster $j$

Confusion Matrix Normalization

R1 Normalization

R1 normalization measures pairwise confusion rate relative to correct classifications:

$R1(i, j) = \max\left(\frac{C_{ij}}{C_{jj}}, \frac{C_{ji}}{C_{ii}}\right)$

Interpretation: - $R1 = 0$ : Perfect discrimination - $R1 > 1$ : More misclassifications than correct classifications (severe confusion) - High R1 indicates clusters should potentially be merged

library(scClustEval)

# Example confusion matrix
cmat <- matrix(c(
  90, 5, 3,    # Cluster A: 90 correct, 5 misclassified as B, 3 as C
  8, 85, 7,   # Cluster B: 8 misclassified as A, 85 correct, 7 as C  
  2, 12, 86   # Cluster C: 2 as A, 12 as B, 86 correct
), nrow = 3, byrow = TRUE)
rownames(cmat) <- colnames(cmat) <- c("A", "B", "C")

# Compute R1 normalization
r1_mat <- normalize_confmat_r1(cmat)
print(round(r1_mat, 3))

R2 Normalization

R2 normalization measures overall confusion proportion in the dataset:

$R2(i, j) = \frac{C_{ij} + C_{ji}}{N_{total}}$

Where $N_{total} = \sum_{i,j} C_{ij}$ .

Interpretation: - R2 represents the fraction of total cells confused between clusters $i$ and $j$ - Values range from 0 to 1 - Useful for identifying globally important confusions

# Compute R2 normalization
r2_mat <- normalize_confmat_r2(cmat)
print(round(r2_mat, 4))

Visualization of Confusion Analysis

library(ggplot2)
library(gridExtra)

# Create data for visualization
df_r1 <- as.data.frame(as.table(r1_mat))
colnames(df_r1) <- c("True", "Predicted", "R1")

df_r2 <- as.data.frame(as.table(r2_mat))
colnames(df_r2) <- c("True", "Predicted", "R2")

p1 <- ggplot(df_r1, aes(x = Predicted, y = True, fill = R1)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2f", R1)), color = "white", size = 5) +
  scale_fill_gradient(low = "gray90", high = "#d62728") +
  labs(title = "R1 Normalization", subtitle = "Pairwise confusion rate") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

p2 <- ggplot(df_r2, aes(x = Predicted, y = True, fill = R2)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.3f", R2)), color = "white", size = 5) +
  scale_fill_gradient(low = "gray90", high = "#1f77b4") +
  labs(title = "R2 Normalization", subtitle = "Global confusion proportion") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

grid.arrange(p1, p2, ncol = 2)

Cluster Merging Strategy

Graph-Based Approach

Clusters exceeding confusion thresholds are connected in an adjacency graph:

$A_{ij} = \begin{cases} 1 & \text{if } R1(i,j) > \theta_{R1} \text{ or } R2(i,j) > \theta_{R2} \\ 0 & \text{otherwise} \end{cases}$

Community Detection

The Louvain algorithm identifies communities (groups of confused clusters) by maximizing modularity:

$Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$

Where: - $m$ = total number of edges - $k_i$ = degree of node $i$ - $c_i$ = community assignment of node $i$ - $\delta$ = Kronecker delta

# Demonstrate adjacency matrix clustering
adj <- matrix(c(
  0, 0.3, 0.02, 0.01,
  0.3, 0, 0.25, 0.02,
  0.02, 0.25, 0, 0.01,
  0.01, 0.02, 0.01, 0
), nrow = 4, byrow = TRUE)
rownames(adj) <- colnames(adj) <- c("C1", "C2", "C3", "C4")

# Cluster at threshold 0.1
groups <- cluster_adjacency_matrix(adj, cutoff = 0.1, resolution = 1.0)
names(groups) <- c("C1", "C2", "C3", "C4")
print(groups)

# Visualize
adj_df <- as.data.frame(as.table(adj))
colnames(adj_df) <- c("From", "To", "Weight")

ggplot(adj_df, aes(x = To, y = From, fill = Weight)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2f", Weight)), size = 4) +
  scale_fill_gradient2(low = "white", mid = "#ffbb78", high = "#d62728", midpoint = 0.15) +
  geom_hline(yintercept = 2.5, linetype = "dashed", color = "black", size = 1) +
  geom_vline(xintercept = 2.5, linetype = "dashed", color = "black", size = 1) +
  labs(title = "Confusion-Based Adjacency Matrix",
       subtitle = "Dashed lines show detected communities (C1,C2 | C3,C4)",
       fill = "R1 Value") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

Iterative Optimization

Convergence Criteria

The optimization continues until:

Target accuracy reached: $Accuracy \geq \theta_{acc}$
Maximum rounds exceeded: $round > max\_rounds$
Convergence: No clusters merged in current round

Cutoff Adaptation

Cutoffs are progressively relaxed each outer iteration:

$\theta_{R1}^{(t+1)} = \theta_{R1}^{(t)} - \Delta_{R1}$ $\theta_{R2}^{(t+1)} = \theta_{R2}^{(t)} - \Delta_{R2}$

Classifier Comparison

Supported Algorithms

Algorithm	Strengths	Best For
Logistic Regression (L1)	Sparse coefficients, interpretable	High-dimensional data, marker identification
Random Forest	Non-linear, feature importance	Complex boundaries
SVM (RBF)	Effective in high dimensions	Well-separated clusters
XGBoost	High accuracy, handles imbalance	Large datasets

Mathematical Details

Logistic Regression (multinomial with L1): $P(y=k|x) = \frac{\exp(\beta_k^T x)}{\sum_{j=1}^{K}\exp(\beta_j^T x)}$

Random Forest prediction: $\hat{y} = \text{mode}\{h_1(x), h_2(x), ..., h_B(x)\}$

Where $h_b$ are individual decision trees.

Performance Metrics

Per-Cluster Accuracy

$Accuracy_k = \frac{C_{kk}}{\sum_{j=1}^{K} C_{kj}}$

Overall Accuracy

$Accuracy = \frac{\sum_{k=1}^{K} C_{kk}}{N_{total}}$

Summary

The scClustEval framework provides:

Quantitative Assessment: Objective metrics for clustering quality
Principled Optimization: Data-driven cluster merging
Flexibility: Multiple classifiers and customizable parameters
Interpretability: Confusion analysis reveals biological relationships

References

Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods, 17, 621-628.
Blondel, V. D., et al. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics, P10008.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.

Author: Zaoqu Liu (liuzaoqu@163.com)
Package: scClustEval v1.0.0

Zaoqu Liu

2026-01-26