Statistical Framework and Algorithm Details
Zaoqu Liu
2026-01-26
Source:vignettes/algorithm.Rmd
algorithm.RmdIntroduction
This vignette provides a detailed explanation of the statistical framework underlying FastCCCR, developed by Zaoqu Liu. Understanding these principles will help users interpret results and make informed methodological choices.
The Cell-Cell Communication Problem
Null Distribution Computation
The Key Innovation
Traditional methods use permutation tests, which are computationally expensive. FastCCCR computes exact null distributions using the following key insight:
Under the null hypothesis that cell type labels are random, the cluster mean follows the distribution of the average of i.i.d. random variables drawn from the empirical single-cell distribution.
FFT Convolution
For a cluster with cells, let be the empirical PMF of single-cell expression. The PMF of the cluster sum is:
where denotes convolution. This is computed efficiently using the Fast Fourier Transform:
Implementation Example
# Demonstrate convolution for cluster mean distribution
set.seed(42)
# Simulate single-cell expression (digitized 0-50)
n_cells <- 1000
single_cell_expr <- sample(0:50, n_cells, replace = TRUE,
prob = c(0.3, rep(0.7/50, 50)))
# Create base distribution
base_dist <- Distribution$new(
dtype = "other",
samples = single_cell_expr,
mode = "digit"
)
cat("Base distribution:\n")
#> Base distribution:
cat(" Mean:", base_dist$get_mean(), "\n")
#> Mean: 0.17476
cat(" SD:", base_dist$get_std(), "\n")
#> SD: 0.1694944
# Compute mean distribution for n=10 cells
n <- 10L
sum_dist <- base_dist$power(n)
mean_dist <- sum_dist$divide(n)
cat("\nMean distribution (n=10):\n")
#>
#> Mean distribution (n=10):
cat(" Mean:", mean_dist$get_mean(), "\n")
#> Mean: 0.17926
cat(" SD:", mean_dist$get_std(), "\n")
#> SD: 0.05367583
cat(" Theoretical SD:", base_dist$get_std() / sqrt(n), "\n")
#> Theoretical SD: 0.05359883Central Limit Theorem Approximation
For large , the cluster mean approximately follows a normal distribution:
FastCCCR automatically switches to CLT approximation when for computational efficiency.
# CLT kicks in for large n
large_n <- 200L
clt_dist <- base_dist$power(large_n)$divide(large_n)
cat("CLT approximation (n=200):\n")
#> CLT approximation (n=200):
cat(" Is normal type:", clt_dist$is_normal_type, "\n")
#> Is normal type: TRUE
cat(" Mean:", clt_dist$get_mean(), "\n")
#> Mean: 0.17476
cat(" SD:", clt_dist$get_std(), "\n")
#> SD: 0.01198506Protein Complex Handling
Distribution of the Minimum
Given independent random variables with CDFs :
# Demonstrate minimum distribution
d1 <- Distribution$new(dtype = "normal", loc = 10, scale = 2)
d2 <- Distribution$new(dtype = "normal", loc = 15, scale = 3)
d3 <- Distribution$new(dtype = "normal", loc = 12, scale = 2.5)
min_dist <- get_minimum_distribution(list(d1, d2, d3))
cat("Minimum of three normal distributions:\n")
#> Minimum of three normal distributions:
cat(" Input means: 10, 15, 12\n")
#> Input means: 10, 15, 12
cat(" Minimum mean:", round(min_dist$get_mean(), 2), "\n")
#> Minimum mean: 9.39
cat(" (Should be close to or less than 10)\n")
#> (Should be close to or less than 10)The AND Operation for L-R Combination
Mathematical Definition
The interaction requires both ligand and receptor to be expressed. The AND operation combines two distributions as:
For non-zero values:
# Demonstrate AND operation
d1 <- Distribution$new(
dtype = "other",
pmf_array = c(0.3, 0.2, 0.2, 0.15, 0.15),
is_align = TRUE,
mode = "digit"
)
d2 <- Distribution$new(
dtype = "other",
pmf_array = c(0.2, 0.3, 0.25, 0.15, 0.1),
is_align = TRUE,
mode = "digit"
)
d_and <- d1$and_op(d2)
cat("AND operation:\n")
#> AND operation:
cat(" P(X=0):", 0.3, "\n")
#> P(X=0): 0.3
cat(" P(Y=0):", 0.2, "\n")
#> P(Y=0): 0.2
cat(" P(X∧Y=0):", d_and$get_pmf_array()[1], "\n")
#> P(X<U+2227>Y=0): 0.44
cat(" Theoretical:", 0.3 + 0.2 - 0.3*0.2, "\n")
#> Theoretical: 0.44Cauchy Combination Test
Motivation
Different statistical configurations (e.g., Mean vs Quantile, Minimum vs Average) may yield different p-values. The Cauchy combination test provides a principled way to combine these.
Properties
- Robust to dependence: Works well even when p-values are correlated
- Heavy-tailed: More sensitive to small p-values than Fisher’s method
- Computationally efficient: O(K) complexity
# Demonstrate Cauchy combination
# Simulate p-values from 4 methods for 5 interactions
set.seed(42)
pval_methods <- lapply(1:4, function(i) {
data.table(
pair = paste0("CT", rep(1:3, each = 3), "|CT", rep(1:3, 3)),
int1 = runif(9, 0, 0.1),
int2 = runif(9, 0, 0.3),
int3 = runif(9, 0.05, 0.5)
)
})
combined <- cauchy_combine(pval_methods)
cat("Cauchy combination example:\n")
#> Cauchy combination example:
cat("Input p-values (method 1, int1):", round(pval_methods[[1]]$int1[1:3], 4), "\n")
#> Input p-values (method 1, int1): 0.0915 0.0937 0.0286
cat("Input p-values (method 2, int1):", round(pval_methods[[2]]$int1[1:3], 4), "\n")
#> Input p-values (method 2, int1): 0.0906 0.0447 0.0836
cat("Combined p-values (int1):", round(combined$int1[1:3], 4), "\n")
#> Combined p-values (int1): 0.0117 0.054 0.0519P-value Computation
From Distribution to P-value
Given an observed interaction score and null distribution :
For analytic distributions (normal), this is computed exactly. For PMF-based distributions, it uses the CDF.
# P-value computation examples
null_dist <- Distribution$new(dtype = "normal", loc = 5, scale = 2)
observed_values <- c(5, 7, 9, 11)
for (v in observed_values) {
pval <- get_pvalue(v, null_dist)
cat(sprintf("Observed: %.1f, P-value: %.4f\n", v, pval))
}
#> Observed: 5.0, P-value: 0.5000
#> Observed: 7.0, P-value: 0.1587
#> Observed: 9.0, P-value: 0.0228
#> Observed: 11.0, P-value: 0.0013Computational Complexity
| Operation | Complexity | Notes |
|---|---|---|
| FFT convolution | For PMF of length | |
| Sum of distributions | Using repeated convolution | |
| Minimum of distributions | Element-wise CDF operations | |
| Cauchy combination | For p-values | |
| Full analysis | genes, cell types, interactions |
Summary of Statistical Methods
FastCCCR supports multiple configurations:
| Component | Options | Default |
|---|---|---|
| Single-unit summary | Mean, Median, Q3, Quantile_q | Mean |
| Complex aggregation | Minimum, Average | Minimum |
| L-R combination | Arithmetic, Geometric | Arithmetic |
| P-value combination | Cauchy | - |
The Cauchy combination mode runs all 8 combinations (2×2×2) and combines the results.
References
- Liu, Y., et al. (2022). “A Cauchy combination test for aggregating p-values.” Biometrics.
- Efremova, M., et al. (2020). “CellPhoneDB: inferring cell–cell communication from combined expression of multi-subunit ligand–receptor complexes.” Nature Protocols.
Session Information
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.18.0 FastCCCR_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] cli_3.6.5 knitr_1.51 rlang_1.1.7 xfun_0.56
#> [5] otel_0.2.0 textshaping_1.0.4 jsonlite_2.0.0 listenv_0.10.0
#> [9] htmltools_0.5.9 ragg_1.5.0 sass_0.4.10 rmarkdown_2.30
#> [13] grid_4.4.0 evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
#> [17] yaml_2.3.12 lifecycle_1.0.5 compiler_4.4.0 codetools_0.2-20
#> [21] fs_1.6.6 Rcpp_1.1.1 htmlwidgets_1.6.4 future_1.69.0
#> [25] lattice_0.22-7 systemfonts_1.3.1 digest_0.6.39 R6_2.6.1
#> [29] parallelly_1.46.1 parallel_4.4.0 bslib_0.9.0 Matrix_1.7-4
#> [33] tools_4.4.0 globals_0.18.0 pkgdown_2.1.3 cachem_1.1.0
#> [37] desc_1.4.3