
Introduction

TorchDecon implements a deep learning-based approach for cell type deconvolution, originally proposed by Menden et al. (2020) in their Scaden algorithm. This vignette provides a comprehensive overview of the mathematical framework and algorithmic principles underlying TorchDecon.

Author: Zaoqu Liu

Problem Formulation

The Deconvolution Problem

Cell type deconvolution aims to estimate the cellular composition of bulk tissue samples. Given a bulk expression profile $\mathbf{x} \in \mathbb{R}^G$ (where $G$ is the number of genes), we seek to estimate the cell type fraction vector $\mathbf{f} \in \mathbb{R}^K$ (where $K$ is the number of cell types) such that:

$$\sum_{k=1}^{K} f_k = 1, \qquad f_k \geq 0 \quad \forall k$$

Traditional Approaches vs. Deep Learning

Traditional deconvolution methods (e.g., CIBERSORT, MuSiC) rely on:

  1. Signature matrices: Pre-defined gene expression signatures for each cell type
  2. Linear mixing models: Assumption that bulk expression is a linear combination of cell type signatures

𝐱=π’β‹…πŸ+π›œ\mathbf{x} = \mathbf{S} \cdot \mathbf{f} + \boldsymbol{\epsilon}

where π’βˆˆβ„GΓ—K\mathbf{S} \in \mathbb{R}^{G \times K} is the signature matrix.

TorchDecon’s deep learning approach instead learns a non-linear mapping directly from bulk expression to cell type fractions:

$$\mathbf{f} = \mathcal{F}_\theta(\mathbf{x})$$

where $\mathcal{F}_\theta$ is a neural network with parameters $\theta$.

Algorithmic Pipeline

Overview

┌─────────────────────────────────────────────────────────────────┐
│                    TorchDecon Pipeline                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │  scRNA-seq   │───▶│    Bulk      │───▶│   Training   │       │
│  │  Reference   │    │  Simulation  │    │     Data     │       │
│  └──────────────┘    └──────────────┘    └──────────────┘       │
│                                                 │               │
│                                                 ▼               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │  Cell Type   │◀───│   Trained    │◀───│   Training   │       │
│  │  Fractions   │    │   Ensemble   │    │    Loop      │       │
│  └──────────────┘    └──────────────┘    └──────────────┘       │
│         ▲                                                       │
│         │                                                       │
│  ┌──────────────┐                                               │
│  │  Real Bulk   │                                               │
│  │    Data      │                                               │
│  └──────────────┘                                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Step 1: Bulk RNA-seq Simulation

Mathematical Formulation

For each simulated sample ss:

  1. Generate random fractions: Sample $\mathbf{f}^{(s)} \sim \text{Dirichlet}(\boldsymbol{\alpha})$, or draw uniform variates and normalize:

     $$f_k^{(s)} = \frac{u_k}{\sum_{j=1}^K u_j}, \quad u_k \sim \text{Uniform}(0, 1)$$

  2. Sample cells: For each cell type $k$, sample $n_k = \lfloor f_k^{(s)} \cdot N \rfloor$ cells (with replacement), where $N$ is the total number of cells per sample.

  3. Aggregate expression: Sum the expression across sampled cells:

     $$x_g^{(s)} = \sum_{k=1}^{K} \sum_{i \in C_k^{(s)}} c_{gi}$$

     where $C_k^{(s)}$ is the set of sampled cells of type $k$, and $c_{gi}$ is the count for gene $g$ in cell $i$.
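The three simulation steps can be sketched in NumPy. The function and argument names here (`simulate_bulk`, `counts`, `cell_types`) are illustrative, not TorchDecon's actual API:

```python
import numpy as np

def simulate_bulk(counts, cell_types, types, n_cells=500, rng=None):
    """Simulate one pseudo-bulk sample from a scRNA-seq count matrix.

    counts: (n_ref_cells, G) array of per-cell counts
    cell_types: length-n_ref_cells array of type labels
    types: ordered list of the K cell type names
    """
    rng = rng or np.random.default_rng()
    # Step 1: random fractions via normalized uniform variates.
    u = rng.uniform(0.0, 1.0, size=len(types))
    f = u / u.sum()
    # Steps 2-3: sample floor(f_k * N) cells per type with replacement, sum counts.
    x = np.zeros(counts.shape[1])
    for k, t in enumerate(types):
        idx = np.flatnonzero(cell_types == t)
        n_k = int(f[k] * n_cells)
        if n_k > 0:
            chosen = rng.choice(idx, size=n_k, replace=True)
            x += counts[chosen].sum(axis=0)
    return x, f
```

Repeating this for many samples yields the paired training data $(\mathbf{x}^{(s)}, \mathbf{f}^{(s)})$; the floor in step 2 means the realized fractions can deviate slightly from the drawn ones for small $N$.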

Sparse Sample Generation

To improve model generalization, TorchDecon generates β€œsparse” samples where some cell types are absent:

$$f_k^{(s)} = \begin{cases} \tilde{f}_k / \sum_{j \in \mathcal{A}} \tilde{f}_j & \text{if } k \in \mathcal{A} \\ 0 & \text{otherwise} \end{cases}$$

where $\mathcal{A} \subset \{1, \ldots, K\}$ is a randomly selected subset of cell types.

Step 2: Data Preprocessing

Log Transformation

$$\tilde{x}_g = \log_2(x_g + 1)$$

This transformation:
- Reduces the dynamic range of expression values
- Stabilizes variance
- Makes the data more normally distributed

Sample-wise Min-Max Normalization

For each sample $s$:

$$\hat{x}_g^{(s)} = \frac{\tilde{x}_g^{(s)} - \min_j \tilde{x}_j^{(s)}}{\max_j \tilde{x}_j^{(s)} - \min_j \tilde{x}_j^{(s)}}$$

This ensures:
- All features lie in $[0, 1]$
- Sample-specific technical variation is mitigated
- Neural network training is more stable
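Both preprocessing steps fit in one short NumPy function; the epsilon in the denominator, added here as a defensive assumption, guards against constant rows:

```python
import numpy as np

def preprocess(X):
    """log2(x + 1) transform, then per-sample min-max scaling to [0, 1].

    X: (n_samples, G) matrix of raw counts; rows are samples.
    """
    X = np.log2(X + 1.0)
    mn = X.min(axis=1, keepdims=True)
    mx = X.max(axis=1, keepdims=True)
    return (X - mn) / (mx - mn + 1e-8)  # epsilon avoids division by zero
```

Note the scaling is per sample (axis 1), not per gene: each bulk profile is normalized against its own range, which is what makes the model less sensitive to sample-level library size and platform effects.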

Gene Filtering

Genes are filtered based on variance across samples:

$$\text{Var}(x_g) = \frac{1}{n-1} \sum_{s=1}^{n} \left(x_g^{(s)} - \bar{x}_g\right)^2 > \tau$$

where $\tau$ is the variance threshold (default: 0.1).
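The filter is a one-liner; `ddof=1` matches the $\frac{1}{n-1}$ (unbiased) estimator above. `filter_genes` is an illustrative name, not necessarily the package's:

```python
import numpy as np

def filter_genes(X, tau=0.1):
    """Keep genes whose across-sample variance exceeds tau (default 0.1).

    X: (n_samples, G) matrix. Returns the filtered matrix and the keep mask.
    """
    var = X.var(axis=0, ddof=1)  # ddof=1 gives the 1/(n-1) estimator
    keep = var > tau
    return X[:, keep], keep
```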

Step 3: Neural Network Architecture

Fully Connected Network

Each model in the ensemble is a fully connected neural network:

$$\mathbf{h}^{(l)} = \sigma\left(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)$$

where:
- $\mathbf{h}^{(l)}$ is the hidden representation at layer $l$
- $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are learnable weights and biases
- $\sigma(\cdot)$ is the ReLU activation, $\sigma(z) = \max(0, z)$

Output Layer (Softmax)

The final layer applies a softmax to ensure the output is a valid probability distribution:

$$f_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}$$

This guarantees:
- $f_k \in (0, 1)$ for all $k$
- $\sum_k f_k = 1$

Dropout Regularization

During training, dropout randomly zeroes elements:

$$\tilde{h}_i = \begin{cases} h_i / (1-p) & \text{with probability } 1-p \\ 0 & \text{with probability } p \end{cases}$$
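The three building blocks of this section (ReLU layers, softmax output, inverted dropout) combine into a short forward pass. This is a framework-agnostic NumPy sketch of the computation, not TorchDecon's actual PyTorch module:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dropout(h, p, rng, train=True):
    """Inverted dropout: zero with probability p, scale survivors by 1/(1-p)."""
    if not train or p == 0.0:
        return h
    mask = rng.uniform(size=h.shape) >= p
    return h * mask / (1.0 - p)

def forward(x, weights, biases, drop_rates, rng, train=False):
    """Fully connected net: ReLU hidden layers with dropout, softmax output."""
    h = x
    for l, (W, b) in enumerate(zip(weights[:-1], biases[:-1])):
        h = relu(h @ W + b)
        h = dropout(h, drop_rates[l], rng, train)
    z = h @ weights[-1] + biases[-1]  # final linear layer, no ReLU
    return softmax(z)
```

At inference time (`train=False`) dropout is disabled and no rescaling is needed, because the $1/(1-p)$ factor was already applied during training.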

Step 4: Training

Loss Function

Mean Squared Error (MSE) between predicted and true fractions:

$$\mathcal{L}(\theta) = \frac{1}{n \cdot K} \sum_{s=1}^{n} \sum_{k=1}^{K} \left(\hat{f}_k^{(s)} - f_k^{(s)}\right)^2$$

Adam Optimizer

Parameter updates using Adam:

$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\ \hat{m}_t &= m_t / (1 - \beta_1^t) \\ \hat{v}_t &= v_t / (1 - \beta_2^t) \\ \theta_{t+1} &= \theta_t - \eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) \end{aligned}$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 10^{-4}$.
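The MSE loss and the Adam update translate directly into NumPy, term by term. This sketch shows a single update step, not the package's full training loop:

```python
import numpy as np

def mse(f_hat, f_true):
    """Mean squared error averaged over all samples and cell types."""
    return np.mean((f_hat - f_true) ** 2)

def adam_step(theta, g, m, v, t, eta=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following the five equations above.

    theta: parameters; g: gradient; m, v: first/second moment estimates;
    t: 1-based step count (needed for bias correction).
    """
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)       # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)       # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Note that at $t = 1$ the bias correction makes $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the very first step has magnitude approximately $\eta$ in each coordinate, regardless of gradient scale.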

Step 5: Ensemble Prediction

The final prediction is the arithmetic mean across the three models:

$$\hat{\mathbf{f}} = \frac{1}{3} \sum_{m \in \{256, 512, 1024\}} \mathcal{F}_{\theta_m}(\mathbf{x})$$

Ensemble benefits:
- Reduced variance in predictions
- Improved robustness to initialization
- Better generalization
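Averaging is the whole of the ensemble step; a sketch, assuming each member is a callable returning a fraction vector (hypothetical interface):

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the fraction predictions of all ensemble members."""
    return np.mean([model(x) for model in models], axis=0)
```

Because each member's softmax output already lies on the probability simplex, the arithmetic mean does too, so no renormalization is needed after averaging.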

Architecture Specifications

Model   Layer Dimensions                 Dropout Rates      Total Parameters*
M256    G → 256 → 128 → 64 → 32 → K     0, 0, 0, 0         ~G×256 + 50K
M512    G → 512 → 256 → 128 → 64 → K    0, 0.3, 0.2, 0.1   ~G×512 + 200K
M1024   G → 1024 → 512 → 256 → 128 → K  0, 0.6, 0.3, 0.1   ~G×1024 + 800K

*Approximate; depends on number of genes (G) and cell types (K)
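The exact counts behind these approximations follow from the dense-layer formula (in × out weights plus out biases per layer). `count_params` is an illustrative helper, with G = 10,000 and K = 10 chosen as example values:

```python
def count_params(dims):
    """Exact parameter count for a stack of dense layers.

    dims: layer widths, e.g. [G, 256, 128, 64, 32, K].
    Each layer contributes in*out weights plus out biases.
    """
    return sum(i * o + o for i, o in zip(dims[:-1], dims[1:]))

# e.g. the M256 variant with G = 10,000 genes and K = 10 cell types
m256 = count_params([10_000, 256, 128, 64, 32, 10])
```

For these example values the first layer dominates (G×256 = 2,560,000 weights), with the remaining layers adding roughly 44K more, consistent with the "~G×256 + 50K" estimate in the table.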

Theoretical Considerations

Universal Approximation

Deep neural networks are universal function approximators (Hornik, 1991). Given sufficient capacity, TorchDecon can theoretically approximate any continuous mapping from expression space to fraction space.

Advantages over Linear Models

  1. Non-linear relationships: Can capture complex gene-cell type associations
  2. Feature learning: Automatically learns relevant features from data
  3. Scalability: Handles high-dimensional data efficiently
  4. No signature matrix: Eliminates bias from pre-defined signatures

Limitations

  1. Training data dependency: Performance depends on quality of simulated training data
  2. Batch effects: May be sensitive to technical differences between reference and target data
  3. Novel cell types: Cannot predict cell types not present in training data

References

  1. Menden, K., et al. (2020). Deep learning-based cell composition analysis from tissue expression profiles. Science Advances, 6(30), eaba2619.

  2. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251-257.

  3. Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR.


Package Author: Zaoqu Liu
Contact:
GitHub: https://github.com/Zaoqu-Liu/TorchDecon