Algorithm and Mathematical Framework • hdWGCNA

Introduction

hdWGCNA (High-Dimensional Weighted Gene Co-expression Network Analysis) extends the classical WGCNA framework to handle the unique challenges posed by single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics data. This vignette provides a comprehensive overview of the underlying algorithms and mathematical frameworks.

The Challenge of Single-Cell Data

Traditional WGCNA was designed for bulk RNA-seq data, where each sample provides a robust estimate of gene expression. Single-cell data presents unique challenges:

Sparsity: Many genes have zero counts in individual cells due to dropout
High dimensionality: Thousands of cells with thousands of genes
Biological heterogeneity: Multiple cell types with distinct expression programs
Technical noise: Cell-to-cell variability from technical factors

hdWGCNA addresses these challenges through metacell aggregation and cell-type-specific network construction.

Part 1: Metacell Aggregation

Motivation

The sparsity of single-cell data ( $>90\%$ zeros in many datasets) makes direct correlation computation unreliable. Metacells aggregate similar cells to create “pseudo-samples” with more robust expression estimates.

Algorithm

Step 1: K-Nearest Neighbor (KNN) Graph Construction

For each cell $i$ , we identify its $k$ nearest neighbors in a reduced dimensional space (typically PCA or UMAP embeddings):

$\text{KNN}(i) = \{j_1, j_2, \ldots, j_k\} \quad \text{where} \quad d(i, j_1) \leq d(i, j_2) \leq \ldots \leq d(i, j_k)$

where $d(i,j)$ is the Euclidean distance in the embedding space.

Step 2: Metacell Seed Selection

We use an iterative bootstrapping algorithm to select metacell seeds that maximize coverage while minimizing overlap:

Algorithm: MetacellSeedSelection
Input: Cell set C, KNN graph, max_shared threshold τ
Output: Metacell seeds S

1. Initialize: S = {}, available = C
2. While |available| > 0 and |S| < target:
   a. Randomly sample seed cell s from available
   b. Get neighbors: N_s = KNN(s) ∪ {s}
   c. For each existing seed t ∈ S:
      - Compute overlap: O_st = |N_s ∩ N_t|
      - If O_st > τ: reject s, goto step 2a
   d. Accept s: S = S ∪ {s}
   e. Update available (optional)
3. Return S

Step 3: Expression Aggregation

For each metacell $m$ with constituent cells $C_m$ , the aggregated expression is:

$E_{gm} = \frac{1}{|C_m|} \sum_{i \in C_m} X_{gi}$

where $X_{gi}$ is the expression of gene $g$ in cell $i$ .

Metacell Quality Metrics

Overlap coefficient: $\text{Overlap}(m_1, m_2) = \frac{|C_{m_1} \cap C_{m_2}|}{k}$
Density improvement: $\rho_{metacell} / \rho_{single-cell}$

Part 2: Weighted Gene Co-expression Network Construction

Correlation Matrix

The gene-gene correlation matrix $S$ is computed from the metacell expression matrix:

$S_{ij} = \text{cor}(E_{i\cdot}, E_{j\cdot})$

where $E_{i\cdot}$ represents the expression profile of gene $i$ across all metacells.

hdWGCNA supports multiple correlation methods: - Pearson correlation: Standard linear correlation - Bicor (biweight midcorrelation): Robust to outliers

Soft Power Thresholding

The adjacency matrix $A$ is computed using a soft power transformation:

$A_{ij} = |S_{ij}|^\beta$

where $\beta$ is the soft power threshold. This transformation emphasizes strong correlations while preserving weak ones.

Scale-Free Topology Criterion

The optimal $\beta$ is selected to approximate scale-free network topology. For a scale-free network, the connectivity distribution follows:

$P(k) \propto k^{-\gamma}$

We assess scale-free topology fit using:

$R^2 = \text{corr}(\log(k), \log(P(k)))^2$

A value of $R^2 \geq 0.8$ indicates good scale-free topology fit.

# Example: Testing soft powers
library(hdWGCNA)
seurat_obj <- TestSoftPowers(seurat_obj, powers = c(1:10, seq(12, 30, by=2)))

Topological Overlap Matrix (TOM)

The TOM measures interconnectedness considering shared neighbors:

$\text{TOM}_{ij} = \frac{\sum_u A_{iu} A_{uj} + A_{ij}}{\min(k_i, k_j) + 1 - A_{ij}}$

where $k_i = \sum_u A_{iu}$ is the connectivity of node $i$ .

The TOM-based dissimilarity is:

$d_{ij} = 1 - \text{TOM}_{ij}$

Part 3: Module Detection

Hierarchical Clustering

Genes are clustered using average linkage hierarchical clustering on the TOM-based dissimilarity matrix.

Dynamic Tree Cut

The Dynamic Tree Cut algorithm identifies modules by adaptively cutting the dendrogram. Key parameters:

Parameter	Description	Default
`deepSplit`	Controls module granularity (0-4)	4
`minModuleSize`	Minimum genes per module	50
`mergeCutHeight`	Threshold for merging similar modules	0.2

Module Merging

Modules with highly correlated eigengenes are merged:

$\text{merge}(M_1, M_2) \quad \text{if} \quad \text{cor}(E_{M_1}, E_{M_2}) > 1 - h$

where $h$ is the mergeCutHeight.

Part 4: Module Eigengenes

Definition

The module eigengene $E_M$ is the first principal component of the module’s expression matrix:

$E_M = \mathbf{u}_1^T \cdot X_M$

where $\mathbf{u}_1$ is the first eigenvector of the covariance matrix of $X_M$ .

Module Expression Score (kME)

The eigengene-based connectivity measures how well each gene correlates with the module eigengene:

$\text{kME}_{g,M} = \text{cor}(X_g, E_M)$

Genes with high kME values are hub genes - central regulators of the module.

Part 5: Transcription Factor Network Analysis

XGBoost-Based Regulatory Inference

hdWGCNA uses gradient boosting (XGBoost) to infer TF-gene regulatory relationships:

$\hat{y}_g = \sum_{m=1}^{M} f_m(\mathbf{x}_{TF})$

where $\hat{y}_g$ is the predicted expression of target gene $g$ , and $f_m$ are regression trees trained on TF expression profiles.

Feature Importance

The importance of TF $t$ for predicting gene $g$ is computed as the average gain across all trees:

$\text{Gain}(t \rightarrow g) = \frac{1}{M} \sum_{m=1}^{M} \text{Gain}_m(t)$

Regulon Definition

A regulon is the set of target genes for a TF. Three strategies are available:

Strategy	Description
A	Top $n$ TFs for each gene (gene-centric)
B	Top $n$ genes for each TF (TF-centric)
C	All pairs above threshold (permissive)

Part 6: Statistical Framework

Module Preservation

Module preservation statistics assess whether modules identified in one dataset are reproducible in another:

Zsummary Statistic

$Z_{summary} = \frac{Z_{density} + Z_{connectivity}}{2}$

Interpretation: - $Z_{summary} < 2$ : No evidence of preservation - $2 \leq Z_{summary} < 10$ : Weak to moderate preservation - $Z_{summary} \geq 10$ : Strong preservation

Differential Module Eigengene (DME) Analysis

DME analysis compares module activity between conditions using standard statistical tests (Wilcoxon rank-sum, t-test, etc.):

$H_0: \mu_{E_M|A} = \mu_{E_M|B}$

Module-Trait Correlation

The correlation between module eigengenes and sample traits:

$r_{M,T} = \text{cor}(E_M, T)$

with significance assessed via permutation or asymptotic t-test.

Computational Considerations

Memory Management

For large datasets, hdWGCNA uses: - Block-wise network construction - Disk caching of TOM matrices - Sparse matrix operations where possible

Parallelization

Key functions support parallel execution via: - BiocParallel for cross-platform parallelization - Internal chunking for memory-efficient processing

References

Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics (2008).
Morabito S, et al. hdWGCNA identifies co-expression networks in high-dimensional transcriptomics data. Cell Reports Methods (2023).
Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology (2005).
Langfelder P, et al. Is My Network Module Preserved and Reproducible? PLoS Computational Biology (2011).

Session Information

sessionInfo()

## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] C
## 
## time zone: Asia/Shanghai
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.39     desc_1.4.3        R6_2.6.1          fastmap_1.2.0    
##  [5] xfun_0.56         cachem_1.1.0      knitr_1.51        htmltools_0.5.9  
##  [9] rmarkdown_2.30    lifecycle_1.0.5   cli_3.6.5         sass_0.4.10      
## [13] pkgdown_2.1.3     textshaping_1.0.4 jquerylib_0.1.4   systemfonts_1.3.1
## [17] compiler_4.4.0    tools_4.4.0       ragg_1.5.0        bslib_0.9.0      
## [21] evaluate_1.0.5    yaml_2.3.12       otel_0.2.0        jsonlite_2.0.0   
## [25] rlang_1.1.7       fs_1.6.6          htmlwidgets_1.6.4