Skip to contents

Introduction

hdWGCNA (High-Dimensional Weighted Gene Co-expression Network Analysis) extends the classical WGCNA framework to handle the unique challenges posed by single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics data. This vignette provides a comprehensive overview of the underlying algorithms and mathematical frameworks.

The Challenge of Single-Cell Data

Traditional WGCNA was designed for bulk RNA-seq data, where each sample provides a robust estimate of gene expression. Single-cell data presents unique challenges:

  • Sparsity: Many genes have zero counts in individual cells due to dropout
  • High dimensionality: Thousands of cells with thousands of genes
  • Biological heterogeneity: Multiple cell types with distinct expression programs
  • Technical noise: Cell-to-cell variability from technical factors

hdWGCNA addresses these challenges through metacell aggregation and cell-type-specific network construction.


Part 1: Metacell Aggregation

Motivation

The sparsity of single-cell data (>90%>90\% zeros in many datasets) makes direct correlation computation unreliable. Metacells aggregate similar cells to create “pseudo-samples” with more robust expression estimates.

Algorithm

Step 1: K-Nearest Neighbor (KNN) Graph Construction

For each cell ii, we identify its kk nearest neighbors in a reduced dimensional space (typically PCA or UMAP embeddings):

KNN(i)={j1,j2,,jk}whered(i,j1)d(i,j2)d(i,jk) \text{KNN}(i) = \{j_1, j_2, \ldots, j_k\} \quad \text{where} \quad d(i, j_1) \leq d(i, j_2) \leq \ldots \leq d(i, j_k)

where d(i,j)d(i,j) is the Euclidean distance in the embedding space.

Step 2: Metacell Seed Selection

We use an iterative bootstrapping algorithm to select metacell seeds that maximize coverage while minimizing overlap:

Algorithm: MetacellSeedSelection
Input: Cell set C, KNN graph, max_shared threshold τ
Output: Metacell seeds S

1. Initialize: S = {}, available = C
2. While |available| > 0 and |S| < target:
   a. Randomly sample seed cell s from available
   b. Get neighbors: N_s = KNN(s) ∪ {s}
   c. For each existing seed t ∈ S:
      - Compute overlap: O_st = |N_s ∩ N_t|
      - If O_st > τ: reject s, goto step 2a
   d. Accept s: S = S ∪ {s}
   e. Update available (optional)
3. Return S

Step 3: Expression Aggregation

For each metacell mm with constituent cells CmC_m, the aggregated expression is:

Egm=1|Cm|iCmXgi E_{gm} = \frac{1}{|C_m|} \sum_{i \in C_m} X_{gi}

where XgiX_{gi} is the expression of gene gg in cell ii.

Metacell Quality Metrics

  • Overlap coefficient: Overlap(m1,m2)=|Cm1Cm2|k\text{Overlap}(m_1, m_2) = \frac{|C_{m_1} \cap C_{m_2}|}{k}
  • Density improvement: ρmetacell/ρsinglecell\rho_{metacell} / \rho_{single-cell}

Part 2: Weighted Gene Co-expression Network Construction

Correlation Matrix

The gene-gene correlation matrix SS is computed from the metacell expression matrix:

Sij=cor(Ei,Ej) S_{ij} = \text{cor}(E_{i\cdot}, E_{j\cdot})

where EiE_{i\cdot} represents the expression profile of gene ii across all metacells.

hdWGCNA supports multiple correlation methods: - Pearson correlation: Standard linear correlation - Bicor (biweight midcorrelation): Robust to outliers

Soft Power Thresholding

The adjacency matrix AA is computed using a soft power transformation:

Aij=|Sij|β A_{ij} = |S_{ij}|^\beta

where β\beta is the soft power threshold. This transformation emphasizes strong correlations while preserving weak ones.

Scale-Free Topology Criterion

The optimal β\beta is selected to approximate scale-free network topology. For a scale-free network, the connectivity distribution follows:

P(k)kγ P(k) \propto k^{-\gamma}

We assess scale-free topology fit using:

R2=corr(log(k),log(P(k)))2 R^2 = \text{corr}(\log(k), \log(P(k)))^2

A value of R20.8R^2 \geq 0.8 indicates good scale-free topology fit.

# Example: Testing soft powers
library(hdWGCNA)
seurat_obj <- TestSoftPowers(seurat_obj, powers = c(1:10, seq(12, 30, by=2)))

Topological Overlap Matrix (TOM)

The TOM measures interconnectedness considering shared neighbors:

TOMij=uAiuAuj+Aijmin(ki,kj)+1Aij \text{TOM}_{ij} = \frac{\sum_u A_{iu} A_{uj} + A_{ij}}{\min(k_i, k_j) + 1 - A_{ij}}

where ki=uAiuk_i = \sum_u A_{iu} is the connectivity of node ii.

The TOM-based dissimilarity is:

dij=1TOMij d_{ij} = 1 - \text{TOM}_{ij}


Part 3: Module Detection

Hierarchical Clustering

Genes are clustered using average linkage hierarchical clustering on the TOM-based dissimilarity matrix.

Dynamic Tree Cut

The Dynamic Tree Cut algorithm identifies modules by adaptively cutting the dendrogram. Key parameters:

Parameter Description Default
deepSplit Controls module granularity (0-4) 4
minModuleSize Minimum genes per module 50
mergeCutHeight Threshold for merging similar modules 0.2

Module Merging

Modules with highly correlated eigengenes are merged:

merge(M1,M2)ifcor(EM1,EM2)>1h \text{merge}(M_1, M_2) \quad \text{if} \quad \text{cor}(E_{M_1}, E_{M_2}) > 1 - h

where hh is the mergeCutHeight.


Part 4: Module Eigengenes

Definition

The module eigengene EME_M is the first principal component of the module’s expression matrix:

EM=𝐮1TXM E_M = \mathbf{u}_1^T \cdot X_M

where 𝐮1\mathbf{u}_1 is the first eigenvector of the covariance matrix of XMX_M.

Module Expression Score (kME)

The eigengene-based connectivity measures how well each gene correlates with the module eigengene:

kMEg,M=cor(Xg,EM) \text{kME}_{g,M} = \text{cor}(X_g, E_M)

Genes with high kME values are hub genes - central regulators of the module.


Part 5: Transcription Factor Network Analysis

XGBoost-Based Regulatory Inference

hdWGCNA uses gradient boosting (XGBoost) to infer TF-gene regulatory relationships:

ŷg=m=1Mfm(𝐱TF) \hat{y}_g = \sum_{m=1}^{M} f_m(\mathbf{x}_{TF})

where ŷg\hat{y}_g is the predicted expression of target gene gg, and fmf_m are regression trees trained on TF expression profiles.

Feature Importance

The importance of TF tt for predicting gene gg is computed as the average gain across all trees:

Gain(tg)=1Mm=1MGainm(t) \text{Gain}(t \rightarrow g) = \frac{1}{M} \sum_{m=1}^{M} \text{Gain}_m(t)

Regulon Definition

A regulon is the set of target genes for a TF. Three strategies are available:

Strategy Description
A Top nn TFs for each gene (gene-centric)
B Top nn genes for each TF (TF-centric)
C All pairs above threshold (permissive)

Part 6: Statistical Framework

Module Preservation

Module preservation statistics assess whether modules identified in one dataset are reproducible in another:

Zsummary Statistic

Zsummary=Zdensity+Zconnectivity2 Z_{summary} = \frac{Z_{density} + Z_{connectivity}}{2}

Interpretation: - Zsummary<2Z_{summary} < 2: No evidence of preservation - 2Zsummary<102 \leq Z_{summary} < 10: Weak to moderate preservation - Zsummary10Z_{summary} \geq 10: Strong preservation

Differential Module Eigengene (DME) Analysis

DME analysis compares module activity between conditions using standard statistical tests (Wilcoxon rank-sum, t-test, etc.):

H0:μEM|A=μEM|B H_0: \mu_{E_M|A} = \mu_{E_M|B}

Module-Trait Correlation

The correlation between module eigengenes and sample traits:

rM,T=cor(EM,T) r_{M,T} = \text{cor}(E_M, T)

with significance assessed via permutation or asymptotic t-test.


Computational Considerations

Memory Management

For large datasets, hdWGCNA uses: - Block-wise network construction - Disk caching of TOM matrices - Sparse matrix operations where possible

Parallelization

Key functions support parallel execution via: - BiocParallel for cross-platform parallelization - Internal chunking for memory-efficient processing


References

  1. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics (2008).

  2. Morabito S, et al. hdWGCNA identifies co-expression networks in high-dimensional transcriptomics data. Cell Reports Methods (2023).

  3. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology (2005).

  4. Langfelder P, et al. Is My Network Module Preserved and Reproducible? PLoS Computational Biology (2011).


Session Information

## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] C
## 
## time zone: Asia/Shanghai
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.39     desc_1.4.3        R6_2.6.1          fastmap_1.2.0    
##  [5] xfun_0.56         cachem_1.1.0      knitr_1.51        htmltools_0.5.9  
##  [9] rmarkdown_2.30    lifecycle_1.0.5   cli_3.6.5         sass_0.4.10      
## [13] pkgdown_2.1.3     textshaping_1.0.4 jquerylib_0.1.4   systemfonts_1.3.1
## [17] compiler_4.4.0    tools_4.4.0       ragg_1.5.0        bslib_0.9.0      
## [21] evaluate_1.0.5    yaml_2.3.12       otel_0.2.0        jsonlite_2.0.0   
## [25] rlang_1.1.7       fs_1.6.6          htmlwidgets_1.6.4