Skip to contents

Introduction

CellOracleR implements a mathematical framework for predicting cell state transitions following transcription factor (TF) perturbations. This vignette describes the core algorithms underlying the package.

1. Gene Regulatory Network (GRN) Inference

1.1 Mathematical Formulation

For each target gene gg, we model its expression as a linear function of its regulators:

Xg=rRgβr,gXr+ϵg X_g = \sum_{r \in R_g} \beta_{r,g} \cdot X_r + \epsilon_g

where:

  • XgX_g is the expression of target gene gg
  • RgR_g is the set of regulators for gene gg
  • βr,g\beta_{r,g} is the regulatory coefficient from regulator rr to target gg
  • ϵg\epsilon_g is the residual error

1.2 Ridge Regression

We use Ridge regression (L2 regularization) to estimate β\beta coefficients:

β̂=argminβ{XgXRβ22+αβ22} \hat{\beta} = \arg\min_\beta \left\{ \|X_g - X_R \beta\|_2^2 + \alpha \|\beta\|_2^2 \right\}

The closed-form solution is:

β̂=(XRTXR+αI)1XRTXg \hat{\beta} = (X_R^T X_R + \alpha I)^{-1} X_R^T X_g

Why Ridge Regression?

  • Prevents overfitting: Regularization shrinks coefficients toward zero
  • Handles multicollinearity: Correlated regulators don’t cause instability
  • Numerical stability: Always has a unique solution
# Ridge regression in CellOracleR
coef_matrix <- fit_grn_coef_matrix(
  gem = expression_matrix,
  TFdict = tf_target_dict,
  alpha = 10  # regularization strength
)

1.3 Bootstrap Aggregation (Bagging)

To improve robustness, we employ bagging:

  1. Sample 80% of cells (without replacement)
  2. Fit Ridge regression on the subsample
  3. Repeat BB times (typically 200)
  4. Use median of coefficients as final estimate

β̂final=median(β̂1,β̂2,,β̂B) \hat{\beta}_{final} = \text{median}(\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_B)

Advantages:

  • Reduces variance of coefficient estimates
  • More robust to outlier cells
  • Provides coefficient stability measures

2. Signal Propagation Simulation

2.1 Perturbation Model

Given a perturbation condition (e.g., TF knockout), we simulate the downstream effects:

ΔX(0)=XperturbedXoriginal \Delta X^{(0)} = X_{perturbed} - X_{original}

2.2 Iterative Propagation

The signal propagates through the GRN iteratively:

ΔX(t+1)=ΔX(t)W \Delta X^{(t+1)} = \Delta X^{(t)} \cdot W

where WW is the coefficient matrix (genes × genes).

Key constraint: Perturbed gene values are maintained at each step:

ΔXp(t+1)=ΔXp(0)for all perturbed genes p \Delta X^{(t+1)}_p = \Delta X^{(0)}_p \quad \text{for all perturbed genes } p

2.3 Non-negativity Constraint

Gene expression cannot be negative:

Xsimulated=max(0,Xoriginal+ΔX) X_{simulated} = \max(0, X_{original} + \Delta X)

Signal propagation through GRN

Signal propagation through GRN

3. Transition Probability Estimation

3.1 Correlation-Based Approach

We estimate the probability of cell ii transitioning to cell jj based on:

ρij=cor(ΔXi,XjXi) \rho_{ij} = \text{cor}(\Delta X_i, X_j - X_i)

This measures how well the predicted expression change aligns with the direction toward cell jj.

3.2 Probability Conversion

Correlations are converted to probabilities using an exponential kernel:

Pij=exp(ρij/σ)kNiexp(ρik/σ) P_{ij} = \frac{\exp(\rho_{ij} / \sigma)}{\sum_{k \in N_i} \exp(\rho_{ik} / \sigma)}

where:

  • NiN_i is the set of k-nearest neighbors of cell ii
  • σ\sigma is a bandwidth parameter (default: 0.05)

3.3 Embedding Shift Calculation

The expected movement in embedding space:

ΔEi=jNiPijd̂ij \Delta E_i = \sum_{j \in N_i} P_{ij} \cdot \hat{d}_{ij}

where d̂ij\hat{d}_{ij} is the unit vector from cell ii to cell jj in embedding space.

4. Markov Chain Simulation

4.1 State Transition Model

Cell fate is modeled as a discrete-time Markov chain:

P(St+1=j|St=i)=Pij P(S_{t+1} = j | S_t = i) = P_{ij}

4.2 Trajectory Simulation

For each starting cell, we simulate NN steps:

Algorithm: Markov Walk
Input: transition_prob P, start_cell s, n_steps T
Output: trajectory [s_0, s_1, ..., s_T]

s_0 ← s
for t = 1 to T:
    sample s_t from Categorical(P[s_{t-1}, :])
return [s_0, s_1, ..., s_T]

4.3 Fate Probability

For absorbing Markov chains, we can compute the probability of reaching each terminal state:

F=(IQ)1R F = (I - Q)^{-1} R

where QQ is the transient-to-transient transition matrix and RR is the transient-to-absorbing matrix.

5. Network Analysis Metrics

5.1 Centrality Measures

Degree centrality: CD(v)=deg(v)=|N(v)| C_D(v) = \text{deg}(v) = |N(v)|

Betweenness centrality: CB(v)=svtσst(v)σst C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}

Eigenvector centrality: CE(v)=1λuN(v)CE(u) C_E(v) = \frac{1}{\lambda} \sum_{u \in N(v)} C_E(u)

5.2 Network Entropy

Degree distribution entropy:

H=kp(k)log2p(k) H = -\sum_{k} p(k) \log_2 p(k)

Higher entropy indicates more heterogeneous connectivity.

Summary

CellOracleR combines:

  1. Ridge regression for robust GRN inference
  2. Signal propagation for perturbation simulation
  3. Correlation-based transition probabilities for fate prediction
  4. Markov chains for trajectory modeling
  5. Graph theory for network analysis

These mathematical foundations enable accurate prediction of cellular responses to genetic perturbations.

References

  1. Kamimoto, K., et al. (2023). CellOracle: Dissecting cell identity via network inference and in silico gene perturbation. Molecular Systems Biology.

  2. Hoerl, A.E. & Kennard, R.W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics.

  3. Breiman, L. (1996). Bagging Predictors. Machine Learning.

Session Info

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_4.0.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6       jsonlite_2.0.0     dplyr_1.1.4        compiler_4.4.0    
#>  [5] tidyselect_1.2.1   dichromat_2.0-0.1  jquerylib_0.1.4    systemfonts_1.3.1 
#>  [9] scales_1.4.0       textshaping_1.0.4  yaml_2.3.12        fastmap_1.2.0     
#> [13] R6_2.6.1           labeling_0.4.3     generics_0.1.4     knitr_1.51        
#> [17] htmlwidgets_1.6.4  tibble_3.3.1       desc_1.4.3         bslib_0.9.0       
#> [21] pillar_1.11.1      RColorBrewer_1.1-3 rlang_1.1.7        cachem_1.1.0      
#> [25] xfun_0.56          fs_1.6.6           sass_0.4.10        S7_0.2.1          
#> [29] otel_0.2.0         cli_3.6.5          pkgdown_2.1.3      withr_3.0.2       
#> [33] magrittr_2.0.4     digest_0.6.39      grid_4.4.0         lifecycle_1.0.5   
#> [37] vctrs_0.7.1        evaluate_1.0.5     glue_1.8.0         farver_2.1.2      
#> [41] ragg_1.5.0         rmarkdown_2.30     tools_4.4.0        pkgconfig_2.0.3   
#> [45] htmltools_0.5.9