Statistical Framework and Algorithm Details

Introduction

This vignette provides a detailed explanation of the statistical framework underlying FastCCCR, developed by Zaoqu Liu. Understanding these principles will help users interpret results and make informed methodological choices.

library(FastCCCR)
library(data.table)

The Cell-Cell Communication Problem

Mathematical Formulation

Given single-cell RNA-seq data with:

$N$ cells with expression profiles $\mathbf{x}_1, \ldots, \mathbf{x}_N$
$K$ cell types with $n_k$ cells in cell type $k$
$M$ ligand-receptor interaction pairs $(L_m, R_m)$

The goal is to identify significant interactions between cell type pairs $(i, j)$ based on ligand expression in $i$ and receptor expression in $j$ .

Interaction Score

For a cell type pair $(i, j)$ and interaction $(L, R)$ , the interaction score is defined as:

$S_{ij} = \frac{\bar{X}_L^{(i)} + \bar{X}_R^{(j)}}{2} \cdot \mathbf{1}[\bar{X}_L^{(i)} > 0] \cdot \mathbf{1}[\bar{X}_R^{(j)} > 0]$

where $\bar{X}_L^{(i)}$ is the mean expression of ligand $L$ in cell type $i$ .

Null Distribution Computation

The Key Innovation

Traditional methods use permutation tests, which are computationally expensive. FastCCCR computes exact null distributions using the following key insight:

Under the null hypothesis that cell type labels are random, the cluster mean follows the distribution of the average of $n$ i.i.d. random variables drawn from the empirical single-cell distribution.

FFT Convolution

For a cluster with $n$ cells, let $F$ be the empirical PMF of single-cell expression. The PMF of the cluster sum is:

$P(S_n = k) = \underbrace{F * F * \cdots * F}_{n \text{ times}}(k)$

where $*$ denotes convolution. This is computed efficiently using the Fast Fourier Transform:

$F^{*n} = \mathcal{F}^{-1}\left[\mathcal{F}(F)^n\right]$

Implementation Example

# Demonstrate convolution for cluster mean distribution
set.seed(42)

# Simulate single-cell expression (digitized 0-50)
n_cells <- 1000
single_cell_expr <- sample(0:50, n_cells, replace = TRUE, 
                           prob = c(0.3, rep(0.7/50, 50)))

# Create base distribution
base_dist <- Distribution$new(
  dtype = "other",
  samples = single_cell_expr,
  mode = "digit"
)

cat("Base distribution:\n")
#> Base distribution:
cat("  Mean:", base_dist$get_mean(), "\n")
#>   Mean: 0.17476
cat("  SD:", base_dist$get_std(), "\n")
#>   SD: 0.1694944

# Compute mean distribution for n=10 cells
n <- 10L
sum_dist <- base_dist$power(n)
mean_dist <- sum_dist$divide(n)

cat("\nMean distribution (n=10):\n")
#> 
#> Mean distribution (n=10):
cat("  Mean:", mean_dist$get_mean(), "\n")
#>   Mean: 0.17926
cat("  SD:", mean_dist$get_std(), "\n")
#>   SD: 0.05367583
cat("  Theoretical SD:", base_dist$get_std() / sqrt(n), "\n")
#>   Theoretical SD: 0.05359883

Central Limit Theorem Approximation

For large $n$ , the cluster mean approximately follows a normal distribution:

$\bar{X}_n \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$

FastCCCR automatically switches to CLT approximation when $n \geq 100$ for computational efficiency.

# CLT kicks in for large n
large_n <- 200L
clt_dist <- base_dist$power(large_n)$divide(large_n)

cat("CLT approximation (n=200):\n")
#> CLT approximation (n=200):
cat("  Is normal type:", clt_dist$is_normal_type, "\n")
#>   Is normal type: TRUE
cat("  Mean:", clt_dist$get_mean(), "\n")
#>   Mean: 0.17476
cat("  SD:", clt_dist$get_std(), "\n")
#>   SD: 0.01198506

Protein Complex Handling

Minimum Aggregation

For a protein complex with subunits $P_1, P_2, \ldots, P_k$ , the complex expression is:

$X_{\text{complex}} = \min(X_{P_1}, X_{P_2}, \ldots, X_{P_k})$

Distribution of the Minimum

Given independent random variables $X_1, \ldots, X_k$ with CDFs $F_1, \ldots, F_k$ :

$F_{\min}(x) = 1 - \prod_{i=1}^{k}(1 - F_i(x))$

# Demonstrate minimum distribution
d1 <- Distribution$new(dtype = "normal", loc = 10, scale = 2)
d2 <- Distribution$new(dtype = "normal", loc = 15, scale = 3)
d3 <- Distribution$new(dtype = "normal", loc = 12, scale = 2.5)

min_dist <- get_minimum_distribution(list(d1, d2, d3))

cat("Minimum of three normal distributions:\n")
#> Minimum of three normal distributions:
cat("  Input means: 10, 15, 12\n")
#>   Input means: 10, 15, 12
cat("  Minimum mean:", round(min_dist$get_mean(), 2), "\n")
#>   Minimum mean: 9.39
cat("  (Should be close to or less than 10)\n")
#>   (Should be close to or less than 10)

The AND Operation for L-R Combination

Mathematical Definition

The interaction requires both ligand and receptor to be expressed. The AND operation combines two distributions as:

$P(X \land Y = 0) = P(X = 0) + P(Y = 0) - P(X = 0)P(Y = 0)$

For non-zero values:

$P(X \land Y = k) = (f_X[1:] * f_Y[1:])(k-2), \quad k \geq 2$

# Demonstrate AND operation
d1 <- Distribution$new(
  dtype = "other",
  pmf_array = c(0.3, 0.2, 0.2, 0.15, 0.15),
  is_align = TRUE,
  mode = "digit"
)

d2 <- Distribution$new(
  dtype = "other",
  pmf_array = c(0.2, 0.3, 0.25, 0.15, 0.1),
  is_align = TRUE,
  mode = "digit"
)

d_and <- d1$and_op(d2)

cat("AND operation:\n")
#> AND operation:
cat("  P(X=0):", 0.3, "\n")
#>   P(X=0): 0.3
cat("  P(Y=0):", 0.2, "\n")
#>   P(Y=0): 0.2
cat("  P(X∧Y=0):", d_and$get_pmf_array()[1], "\n")
#>   P(X<U+2227>Y=0): 0.44
cat("  Theoretical:", 0.3 + 0.2 - 0.3*0.2, "\n")
#>   Theoretical: 0.44

Cauchy Combination Test

Motivation

Different statistical configurations (e.g., Mean vs Quantile, Minimum vs Average) may yield different p-values. The Cauchy combination test provides a principled way to combine these.

Mathematical Formulation

Given p-values $p_1, \ldots, p_K$ from $K$ methods with weights $w_1, \ldots, w_K$ :

$T = \sum_{i=1}^{K} w_i \tan\left(\pi(0.5 - p_i)\right)$

The combined p-value is:

$p_{\text{combined}} = 0.5 - \frac{\arctan(T)}{\pi}$

Properties

Robust to dependence: Works well even when p-values are correlated
Heavy-tailed: More sensitive to small p-values than Fisher’s method
Computationally efficient: O(K) complexity

# Demonstrate Cauchy combination
# Simulate p-values from 4 methods for 5 interactions
set.seed(42)

pval_methods <- lapply(1:4, function(i) {
  data.table(
    pair = paste0("CT", rep(1:3, each = 3), "|CT", rep(1:3, 3)),
    int1 = runif(9, 0, 0.1),
    int2 = runif(9, 0, 0.3),
    int3 = runif(9, 0.05, 0.5)
  )
})

combined <- cauchy_combine(pval_methods)

cat("Cauchy combination example:\n")
#> Cauchy combination example:
cat("Input p-values (method 1, int1):", round(pval_methods[[1]]$int1[1:3], 4), "\n")
#> Input p-values (method 1, int1): 0.0915 0.0937 0.0286
cat("Input p-values (method 2, int1):", round(pval_methods[[2]]$int1[1:3], 4), "\n")
#> Input p-values (method 2, int1): 0.0906 0.0447 0.0836
cat("Combined p-values (int1):", round(combined$int1[1:3], 4), "\n")
#> Combined p-values (int1): 0.0117 0.054 0.0519

P-value Computation

From Distribution to P-value

Given an observed interaction score $s$ and null distribution $F_0$ :

$p = 1 - F_0(s) = P(X > s | H_0)$

For analytic distributions (normal), this is computed exactly. For PMF-based distributions, it uses the CDF.

# P-value computation examples
null_dist <- Distribution$new(dtype = "normal", loc = 5, scale = 2)

observed_values <- c(5, 7, 9, 11)
for (v in observed_values) {
  pval <- get_pvalue(v, null_dist)
  cat(sprintf("Observed: %.1f, P-value: %.4f\n", v, pval))
}
#> Observed: 5.0, P-value: 0.5000
#> Observed: 7.0, P-value: 0.1587
#> Observed: 9.0, P-value: 0.0228
#> Observed: 11.0, P-value: 0.0013

Computational Complexity

Operation	Complexity	Notes
FFT convolution	$O(n \log n)$	For PMF of length $n$
Sum of $k$ distributions	$O(k \cdot n \log n)$	Using repeated convolution
Minimum of $k$ distributions	$O(k \cdot n)$	Element-wise CDF operations
Cauchy combination	$O(K)$	For $K$ p-values
Full analysis	$O(G \cdot C^2 \cdot I)$	$G$ genes, $C$ cell types, $I$ interactions

Summary of Statistical Methods

FastCCCR supports multiple configurations:

Component	Options	Default
Single-unit summary	Mean, Median, Q3, Quantile_q	Mean
Complex aggregation	Minimum, Average	Minimum
L-R combination	Arithmetic, Geometric	Arithmetic
P-value combination	Cauchy	-

The Cauchy combination mode runs all 8 combinations (2×2×2) and combines the results.

References

Liu, Y., et al. (2022). “A Cauchy combination test for aggregating p-values.” Biometrics.
Efremova, M., et al. (2020). “CellPhoneDB: inferring cell–cell communication from combined expression of multi-subunit ligand–receptor complexes.” Nature Protocols.

Session Information

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.18.0 FastCCCR_1.0.0   
#> 
#> loaded via a namespace (and not attached):
#>  [1] cli_3.6.5         knitr_1.51        rlang_1.1.7       xfun_0.56        
#>  [5] otel_0.2.0        textshaping_1.0.4 jsonlite_2.0.0    listenv_0.10.0   
#>  [9] htmltools_0.5.9   ragg_1.5.0        sass_0.4.10       rmarkdown_2.30   
#> [13] grid_4.4.0        evaluate_1.0.5    jquerylib_0.1.4   fastmap_1.2.0    
#> [17] yaml_2.3.12       lifecycle_1.0.5   compiler_4.4.0    codetools_0.2-20 
#> [21] fs_1.6.6          Rcpp_1.1.1        htmlwidgets_1.6.4 future_1.69.0    
#> [25] lattice_0.22-7    systemfonts_1.3.1 digest_0.6.39     R6_2.6.1         
#> [29] parallelly_1.46.1 parallel_4.4.0    bslib_0.9.0       Matrix_1.7-4     
#> [33] tools_4.4.0       globals_0.18.0    pkgdown_2.1.3     cachem_1.1.0     
#> [37] desc_1.4.3

Zaoqu Liu

2026-01-26

Introduction

The Cell-Cell Communication Problem

Mathematical Formulation

Interaction Score

Null Distribution Computation

The Key Innovation

FFT Convolution

Implementation Example

Central Limit Theorem Approximation

Protein Complex Handling

Minimum Aggregation

Distribution of the Minimum

The AND Operation for L-R Combination

Mathematical Definition

Cauchy Combination Test

Motivation

Mathematical Formulation

Properties

P-value Computation

From Distribution to P-value

Computational Complexity

Summary of Statistical Methods

References

Session Information