Skip to contents

Introduction

This vignette provides a detailed explanation of the statistical framework underlying FastCCCR, developed by Zaoqu Liu. Understanding these principles will help users interpret results and make informed methodological choices.

The Cell-Cell Communication Problem

Mathematical Formulation

Given single-cell RNA-seq data with:

  • NN cells with expression profiles 𝐱1,,𝐱N\mathbf{x}_1, \ldots, \mathbf{x}_N
  • KK cell types with nkn_k cells in cell type kk
  • MM ligand-receptor interaction pairs (Lm,Rm)(L_m, R_m)

The goal is to identify significant interactions between cell type pairs (i,j)(i, j) based on ligand expression in ii and receptor expression in jj.

Interaction Score

For a cell type pair (i,j)(i, j) and interaction (L,R)(L, R), the interaction score is defined as:

Sij=XL(i)+XR(j)2𝟏[XL(i)>0]𝟏[XR(j)>0]S_{ij} = \frac{\bar{X}_L^{(i)} + \bar{X}_R^{(j)}}{2} \cdot \mathbf{1}[\bar{X}_L^{(i)} > 0] \cdot \mathbf{1}[\bar{X}_R^{(j)} > 0]

where XL(i)\bar{X}_L^{(i)} is the mean expression of ligand LL in cell type ii.

Null Distribution Computation

The Key Innovation

Traditional methods use permutation tests, which are computationally expensive. FastCCCR computes exact null distributions using the following key insight:

Under the null hypothesis that cell type labels are random, the cluster mean follows the distribution of the average of nn i.i.d. random variables drawn from the empirical single-cell distribution.

FFT Convolution

For a cluster with nn cells, let FF be the empirical PMF of single-cell expression. The PMF of the cluster sum is:

P(Sn=k)=F*F**Fn times(k)P(S_n = k) = \underbrace{F * F * \cdots * F}_{n \text{ times}}(k)

where ** denotes convolution. This is computed efficiently using the Fast Fourier Transform:

F*n=1[(F)n]F^{*n} = \mathcal{F}^{-1}\left[\mathcal{F}(F)^n\right]

Implementation Example

# Demonstrate convolution for cluster mean distribution
set.seed(42)

# Simulate single-cell expression (digitized 0-50)
n_cells <- 1000
single_cell_expr <- sample(0:50, n_cells, replace = TRUE, 
                           prob = c(0.3, rep(0.7/50, 50)))

# Create base distribution
base_dist <- Distribution$new(
  dtype = "other",
  samples = single_cell_expr,
  mode = "digit"
)

cat("Base distribution:\n")
#> Base distribution:
cat("  Mean:", base_dist$get_mean(), "\n")
#>   Mean: 0.17476
cat("  SD:", base_dist$get_std(), "\n")
#>   SD: 0.1694944

# Compute mean distribution for n=10 cells
n <- 10L
sum_dist <- base_dist$power(n)
mean_dist <- sum_dist$divide(n)

cat("\nMean distribution (n=10):\n")
#> 
#> Mean distribution (n=10):
cat("  Mean:", mean_dist$get_mean(), "\n")
#>   Mean: 0.17926
cat("  SD:", mean_dist$get_std(), "\n")
#>   SD: 0.05367583
cat("  Theoretical SD:", base_dist$get_std() / sqrt(n), "\n")
#>   Theoretical SD: 0.05359883

Central Limit Theorem Approximation

For large nn, the cluster mean approximately follows a normal distribution:

Xn𝒩(μ,σ2n)\bar{X}_n \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)

FastCCCR automatically switches to CLT approximation when n100n \geq 100 for computational efficiency.

# CLT kicks in for large n
large_n <- 200L
clt_dist <- base_dist$power(large_n)$divide(large_n)

cat("CLT approximation (n=200):\n")
#> CLT approximation (n=200):
cat("  Is normal type:", clt_dist$is_normal_type, "\n")
#>   Is normal type: TRUE
cat("  Mean:", clt_dist$get_mean(), "\n")
#>   Mean: 0.17476
cat("  SD:", clt_dist$get_std(), "\n")
#>   SD: 0.01198506

Protein Complex Handling

Minimum Aggregation

For a protein complex with subunits P1,P2,,PkP_1, P_2, \ldots, P_k, the complex expression is:

Xcomplex=min(XP1,XP2,,XPk)X_{\text{complex}} = \min(X_{P_1}, X_{P_2}, \ldots, X_{P_k})

Distribution of the Minimum

Given independent random variables X1,,XkX_1, \ldots, X_k with CDFs F1,,FkF_1, \ldots, F_k:

Fmin(x)=1i=1k(1Fi(x))F_{\min}(x) = 1 - \prod_{i=1}^{k}(1 - F_i(x))

# Demonstrate minimum distribution
d1 <- Distribution$new(dtype = "normal", loc = 10, scale = 2)
d2 <- Distribution$new(dtype = "normal", loc = 15, scale = 3)
d3 <- Distribution$new(dtype = "normal", loc = 12, scale = 2.5)

min_dist <- get_minimum_distribution(list(d1, d2, d3))

cat("Minimum of three normal distributions:\n")
#> Minimum of three normal distributions:
cat("  Input means: 10, 15, 12\n")
#>   Input means: 10, 15, 12
cat("  Minimum mean:", round(min_dist$get_mean(), 2), "\n")
#>   Minimum mean: 9.39
cat("  (Should be close to or less than 10)\n")
#>   (Should be close to or less than 10)

The AND Operation for L-R Combination

Mathematical Definition

The interaction requires both ligand and receptor to be expressed. The AND operation combines two distributions as:

P(XY=0)=P(X=0)+P(Y=0)P(X=0)P(Y=0)P(X \land Y = 0) = P(X = 0) + P(Y = 0) - P(X = 0)P(Y = 0)

For non-zero values:

P(XY=k)=(fX[1:]*fY[1:])(k2),k2P(X \land Y = k) = (f_X[1:] * f_Y[1:])(k-2), \quad k \geq 2

# Demonstrate AND operation
d1 <- Distribution$new(
  dtype = "other",
  pmf_array = c(0.3, 0.2, 0.2, 0.15, 0.15),
  is_align = TRUE,
  mode = "digit"
)

d2 <- Distribution$new(
  dtype = "other",
  pmf_array = c(0.2, 0.3, 0.25, 0.15, 0.1),
  is_align = TRUE,
  mode = "digit"
)

d_and <- d1$and_op(d2)

cat("AND operation:\n")
#> AND operation:
cat("  P(X=0):", 0.3, "\n")
#>   P(X=0): 0.3
cat("  P(Y=0):", 0.2, "\n")
#>   P(Y=0): 0.2
cat("  P(X∧Y=0):", d_and$get_pmf_array()[1], "\n")
#>   P(X<U+2227>Y=0): 0.44
cat("  Theoretical:", 0.3 + 0.2 - 0.3*0.2, "\n")
#>   Theoretical: 0.44

Cauchy Combination Test

Motivation

Different statistical configurations (e.g., Mean vs Quantile, Minimum vs Average) may yield different p-values. The Cauchy combination test provides a principled way to combine these.

Mathematical Formulation

Given p-values p1,,pKp_1, \ldots, p_K from KK methods with weights w1,,wKw_1, \ldots, w_K:

T=i=1Kwitan(π(0.5pi))T = \sum_{i=1}^{K} w_i \tan\left(\pi(0.5 - p_i)\right)

The combined p-value is:

pcombined=0.5arctan(T)πp_{\text{combined}} = 0.5 - \frac{\arctan(T)}{\pi}

Properties

  1. Robust to dependence: Works well even when p-values are correlated
  2. Heavy-tailed: More sensitive to small p-values than Fisher’s method
  3. Computationally efficient: O(K) complexity
# Demonstrate Cauchy combination
# Simulate p-values from 4 methods for 5 interactions
set.seed(42)

pval_methods <- lapply(1:4, function(i) {
  data.table(
    pair = paste0("CT", rep(1:3, each = 3), "|CT", rep(1:3, 3)),
    int1 = runif(9, 0, 0.1),
    int2 = runif(9, 0, 0.3),
    int3 = runif(9, 0.05, 0.5)
  )
})

combined <- cauchy_combine(pval_methods)

cat("Cauchy combination example:\n")
#> Cauchy combination example:
cat("Input p-values (method 1, int1):", round(pval_methods[[1]]$int1[1:3], 4), "\n")
#> Input p-values (method 1, int1): 0.0915 0.0937 0.0286
cat("Input p-values (method 2, int1):", round(pval_methods[[2]]$int1[1:3], 4), "\n")
#> Input p-values (method 2, int1): 0.0906 0.0447 0.0836
cat("Combined p-values (int1):", round(combined$int1[1:3], 4), "\n")
#> Combined p-values (int1): 0.0117 0.054 0.0519

P-value Computation

From Distribution to P-value

Given an observed interaction score ss and null distribution F0F_0:

p=1F0(s)=P(X>s|H0)p = 1 - F_0(s) = P(X > s | H_0)

For analytic distributions (normal), this is computed exactly. For PMF-based distributions, it uses the CDF.

# P-value computation examples
null_dist <- Distribution$new(dtype = "normal", loc = 5, scale = 2)

observed_values <- c(5, 7, 9, 11)
for (v in observed_values) {
  pval <- get_pvalue(v, null_dist)
  cat(sprintf("Observed: %.1f, P-value: %.4f\n", v, pval))
}
#> Observed: 5.0, P-value: 0.5000
#> Observed: 7.0, P-value: 0.1587
#> Observed: 9.0, P-value: 0.0228
#> Observed: 11.0, P-value: 0.0013

Computational Complexity

Operation Complexity Notes
FFT convolution O(nlogn)O(n \log n) For PMF of length nn
Sum of kk distributions O(knlogn)O(k \cdot n \log n) Using repeated convolution
Minimum of kk distributions O(kn)O(k \cdot n) Element-wise CDF operations
Cauchy combination O(K)O(K) For KK p-values
Full analysis O(GC2I)O(G \cdot C^2 \cdot I) GG genes, CC cell types, II interactions

Summary of Statistical Methods

FastCCCR supports multiple configurations:

Component Options Default
Single-unit summary Mean, Median, Q3, Quantile_q Mean
Complex aggregation Minimum, Average Minimum
L-R combination Arithmetic, Geometric Arithmetic
P-value combination Cauchy -

The Cauchy combination mode runs all 8 combinations (2×2×2) and combines the results.

References

  1. Liu, Y., et al. (2022). “A Cauchy combination test for aggregating p-values.” Biometrics.
  2. Efremova, M., et al. (2020). “CellPhoneDB: inferring cell–cell communication from combined expression of multi-subunit ligand–receptor complexes.” Nature Protocols.

Session Information

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.18.0 FastCCCR_1.0.0   
#> 
#> loaded via a namespace (and not attached):
#>  [1] cli_3.6.5         knitr_1.51        rlang_1.1.7       xfun_0.56        
#>  [5] otel_0.2.0        textshaping_1.0.4 jsonlite_2.0.0    listenv_0.10.0   
#>  [9] htmltools_0.5.9   ragg_1.5.0        sass_0.4.10       rmarkdown_2.30   
#> [13] grid_4.4.0        evaluate_1.0.5    jquerylib_0.1.4   fastmap_1.2.0    
#> [17] yaml_2.3.12       lifecycle_1.0.5   compiler_4.4.0    codetools_0.2-20 
#> [21] fs_1.6.6          Rcpp_1.1.1        htmlwidgets_1.6.4 future_1.69.0    
#> [25] lattice_0.22-7    systemfonts_1.3.1 digest_0.6.39     R6_2.6.1         
#> [29] parallelly_1.46.1 parallel_4.4.0    bslib_0.9.0       Matrix_1.7-4     
#> [33] tools_4.4.0       globals_0.18.0    pkgdown_2.1.3     cachem_1.1.0     
#> [37] desc_1.4.3