Reference-Based Inference

Introduction

Reference-based inference is a feature of FastCCCR (developed by Zaoqu Liu) that speeds up cell-cell communication (CCC) analysis by reusing reference panels pre-computed from large-scale atlas datasets, so null distributions do not have to be rebuilt for every query sample. This approach is particularly useful for:

Consistency: Comparing query samples against a common reference
Speed: Rapid inference without recomputing null distributions
Biological interpretation: Identifying interactions gained or lost relative to the reference

library(FastCCCR)
library(data.table)

Conceptual Overview

The Reference-Based Approach

┌─────────────────────────────────────────────────────────────────┐
│                  Reference-Based Inference                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐                                           │
│  │  Atlas Data     │──────► Build Reference                     │
│  │  (Large-scale)  │        • Pre-compute null distributions    │
│  └─────────────────┘        • Store gene statistics             │
│                             • Record cell type info             │
│           │                                                     │
│           ▼                                                     │
│  ┌─────────────────┐                                           │
│  │  Reference      │                                           │
│  │  Panel          │◄─────────────────────────────────┐        │
│  └─────────────────┘                                  │        │
│           │                                           │        │
│           ▼                                           │        │
│  ┌─────────────────┐      ┌─────────────────┐        │        │
│  │  Query Data     │──────►│  Infer Query    │────────┘        │
│  │  (Your sample)  │       │  • Fast inference                 │
│  └─────────────────┘       │  • Comparison with reference      │
│                            └─────────────────┘                 │
│                                    │                            │
│                                    ▼                            │
│                            ┌─────────────────┐                 │
│                            │  Results        │                 │
│                            │  • Significant CCC                │
│                            │  • Up/Down vs Ref                 │
│                            └─────────────────┘                 │
└─────────────────────────────────────────────────────────────────┘

Reference Panel Structure

A reference panel contains the following pre-computed information:

File	Content
`config.toml`	Reference metadata and settings
`basic_info_dict.pkl`	Gene-level statistics (mean, SD, PMF)
`ref_gene_pmf_dict.pkl`	Pre-computed sum distributions for n=1..99
`ref_mean_counts.pkl`	Cluster mean expression
`ref_percents.pkl`	Cluster expression percentages
`complex_table.pkl`	Protein complex composition
`interactions.pkl`	Ligand-receptor interactions
`ref_hk.txt`	Housekeeping gene expression for calibration
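
Before running inference it can help to confirm that a panel directory is complete. The following is a minimal base-R sketch: the file names are taken from the table above, while `check_reference_panel()` is a hypothetical helper (not part of FastCCCR), and the assumption that all files sit directly inside the panel directory is mine.

```r
# File names taken from the table above; the flat directory layout is an assumption.
panel_files <- c(
  "config.toml", "basic_info_dict.pkl", "ref_gene_pmf_dict.pkl",
  "ref_mean_counts.pkl", "ref_percents.pkl", "complex_table.pkl",
  "interactions.pkl", "ref_hk.txt"
)

# Hypothetical helper: returns a named logical vector, TRUE where the file exists.
check_reference_panel <- function(ref_dir, files = panel_files) {
  present <- file.exists(file.path(ref_dir, files))
  names(present) <- files
  present
}

# Demonstration on a temporary directory that contains only two of the files.
demo_dir <- file.path(tempdir(), "demo_panel")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("config.toml", "ref_hk.txt"))))

status <- check_reference_panel(demo_dir)
missing_files <- names(status)[!status]
print(missing_files)
```

Pointing the helper at a real panel directory, such as the one written by build_reference(), would list any files that are absent.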

Building a Reference Panel

Workflow

# Build reference from atlas Seurat object
build_reference(
  seurat_obj = atlas_seurat,
  reference_name = "my_tissue_atlas",
  database = "CPDBv5.0.0",
  celltype_col = "cell_type",
  save_path = "./reference/",
  min_percentile = 0.1,
  min_genes_per_cell = 50L
)

Key Steps

Quality Control: Filter cells with fewer than min_genes_per_cell expressed genes
Digitization: Rank-based transformation of expression values
Null Distribution: Pre-compute distributions for cluster sizes 1-99
Statistics Storage: Save mean counts, percentages, and PMFs
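
Two of these steps can be sketched in base R. This is an illustration under stated assumptions, not the code inside build_reference(): the count matrix and the gene PMF are toy data, `conv_pmf()` is a hypothetical helper, and the null distribution is modelled as the sum of n independent draws from a single gene's PMF over digitized values.

```r
set.seed(1)

# Toy count matrix (genes x cells); a stand-in for real atlas data.
counts <- matrix(rpois(200 * 30, lambda = 0.5), nrow = 200, ncol = 30)

# Quality control: keep cells with at least `min_genes_per_cell`
# expressed genes, i.e. genes with a non-zero count.
min_genes_per_cell <- 50L
genes_per_cell <- colSums(counts > 0)
keep <- genes_per_cell >= min_genes_per_cell
counts_qc <- counts[, keep, drop = FALSE]

# Null distribution: convolving two PMFs gives the PMF of the sum of two
# independent variables (index 1 corresponds to the value 0).
conv_pmf <- function(p, q) {
  out <- numeric(length(p) + length(q) - 1L)
  for (i in seq_along(p)) {
    idx <- i:(i + length(q) - 1L)
    out[idx] <- out[idx] + p[i] * q
  }
  out
}

gene_pmf <- c(0.6, 0.2, 0.1, 0.1)   # toy PMF over digitized values 0, 1, 2, 3
max_n <- 99L                        # cluster sizes 1..99, as in the text

# sum_pmfs[[n]] is the PMF of the sum over n independent cells.
sum_pmfs <- Reduce(conv_pmf, rep(list(gene_pmf), max_n), accumulate = TRUE)

length(sum_pmfs)
sum(sum_pmfs[[10]])
```

Because the distributions depend only on the gene PMF and on n, they can be computed once and reused for any query cluster of size 1 to 99.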

Digitization Transform

FastCCCR uses a rank-based digitization to normalize expression across datasets:

# Simulate digitization process
set.seed(42)
raw_expr <- rlnorm(100, meanlog = 2, sdlog = 1)

# Digitize into rank-based bins: n_bins - 1 = 49 quantile breakpoints give digits 1-49
n_bins <- 50L
bins <- quantile(raw_expr, probs = seq(0, 1, length.out = n_bins - 1), type = 1)
bins <- unique(bins)
digits <- findInterval(raw_expr, bins, left.open = FALSE)

cat("Original expression range:", round(range(raw_expr), 2), "\n")
#> Original expression range: 0.37 72.72
cat("Digitized range:", range(digits), "\n")
#> Digitized range: 1 49
cat("Digitized distribution:\n")
#> Digitized distribution:
print(table(digits)[1:10])
#> digits
#>  1  2  3  4  5  6  7  8  9 10 
#>  2  2  2  2  2  2  2  2  2  2
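
The reason a rank-based transform helps across datasets is that it depends only on the ordering of the values. The sketch below wraps the steps shown above in a hypothetical helper, `digitize()` (the name is mine, not a FastCCCR function), and checks that rescaling or log-transforming the input leaves the digits unchanged.

```r
# Hypothetical helper wrapping the digitization steps shown above.
digitize <- function(x, n_bins = 50L) {
  bins <- quantile(x, probs = seq(0, 1, length.out = n_bins - 1), type = 1)
  bins <- unique(bins)
  findInterval(x, bins, left.open = FALSE)
}

set.seed(42)
raw_expr <- rlnorm(100, meanlog = 2, sdlog = 1)

digits_raw    <- digitize(raw_expr)
digits_scaled <- digitize(raw_expr * 1000)   # e.g. a different overall scale
digits_log    <- digitize(log1p(raw_expr))   # e.g. log-transformed values

identical(digits_raw, digits_scaled)
identical(digits_raw, digits_log)
```

Both comparisons should return TRUE: with `type = 1`, the quantile breakpoints are order statistics of the data themselves, so any strictly increasing transformation moves the values and the breakpoints together.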

Inference with Reference

Basic Usage

# Infer CCC for query data using reference
results <- infer_query(
  seurat_obj = query_seurat,
  reference_path = "./reference/my_tissue_atlas",
  database = "CPDBv5.0.0",
  celltype_col = "cell_type"
)

Cell Type Mapping

When query cell type labels do not exactly match the reference labels, supply a celltype_mapping list:

results <- infer_query(
  seurat_obj = query_seurat,
  reference_path = "./reference/my_tissue_atlas",
  database = "CPDBv5.0.0",
  celltype_col = "cell_type",
  celltype_mapping = list(
    "Ref_Tcell" = "Query_CD8T",
    "Ref_Tcell" = "Query_CD4T",  # Multiple query → one reference
    "Ref_Bcell" = "Query_Bcell",
    "Ref_Macro" = "Query_Macrophage"
  )
)
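
The mapping above repeats the name "Ref_Tcell". Base R allows this, but name-based extraction returns only the first match, which is worth knowing when inspecting such a list by hand. The sketch below uses base R only; how infer_query() consumes the list internally is not shown in this document, and the `query_to_ref` lookup is a hypothetical convenience, not a package feature.

```r
mapping <- list(
  "Ref_Tcell" = "Query_CD8T",
  "Ref_Tcell" = "Query_CD4T",
  "Ref_Bcell" = "Query_Bcell",
  "Ref_Macro" = "Query_Macrophage"
)

# Name-based extraction returns only the first element with that name.
mapping[["Ref_Tcell"]]

# Both entries are still stored and can be recovered by matching on names().
tcell_queries <- unlist(mapping[names(mapping) == "Ref_Tcell"], use.names = FALSE)
tcell_queries

# Hypothetical query -> reference lookup built from the same list;
# its names (the query labels) are unique, so lookups are unambiguous.
query_to_ref <- setNames(names(mapping), unlist(mapping, use.names = FALSE))
query_to_ref[["Query_CD4T"]]
```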

Calibration Factor (k)

Purpose

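The calibration factor k adjusts for technical variability between the reference and query datasets. It is calculated from housekeeping (HK) gene expression, as the mean reference expression divided by the standard deviation of the query-minus-reference differences: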

$k = \frac{\bar{X}_{\text{HK}}^{\text{ref}}}{\text{SD}(X_{\text{HK}}^{\text{query}} - X_{\text{HK}}^{\text{ref}})}$
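
The formula can be evaluated directly in base R. The sketch below is an illustration on simulated data, assuming that the housekeeping expression of each dataset is a numeric vector with one value per gene, in the same gene order and on the same scale; `calibration_k()` is a hypothetical helper, and how FastCCCR reads `ref_hk.txt` is not shown here.

```r
set.seed(123)

# Simulated housekeeping-gene expression (one value per gene); illustrative only.
n_hk <- 40
hk_ref   <- runif(n_hk, min = 20, max = 40)
hk_query <- hk_ref + rnorm(n_hk, mean = 0, sd = 4)   # reference plus technical noise

# k = mean(reference HK expression) / SD(query - reference), as in the formula above.
calibration_k <- function(hk_query, hk_ref) {
  stopifnot(length(hk_query) == length(hk_ref))
  mean(hk_ref) / sd(hk_query - hk_ref)
}

k <- calibration_k(hk_query, hk_ref)

# A noisier query (larger technical differences) yields a smaller k.
k_noisy <- calibration_k(hk_ref + rnorm(n_hk, mean = 0, sd = 12), hk_ref)

c(k = k, k_noisy = k_noisy)
```

The larger the spread of the query-minus-reference differences, the smaller k becomes, which in turn widens the threshold interval.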

Confidence Interval

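The significance threshold $\theta$ is widened into an interval whose half-width, $1.96\,\theta/k$, shrinks as k grows (1.96 is the two-sided 95% quantile of the standard normal distribution):

$\text{threshold}_{\text{upper}} = \theta + \frac{\theta}{k} \times 1.96$

$\text{threshold}_{\text{lower}} = \theta - \frac{\theta}{k} \times 1.96$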

Interpretation

k value	Interpretation
k > 5	High concordance between datasets
3 < k < 5	Moderate concordance
k < 3	Low concordance, wider confidence intervals

# Demonstrate threshold adjustment
base_threshold <- 0.15
k_values <- c(2, 3, 5, 10)

for (k in k_values) {
  upper <- base_threshold + base_threshold / k * 1.96
  lower <- base_threshold - base_threshold / k * 1.96
  cat(sprintf("k = %2d: threshold range [%.4f, %.4f]\n", k, lower, upper))
}
#> k =  2: threshold range [0.0030, 0.2970]
#> k =  3: threshold range [0.0520, 0.2480]
#> k =  5: threshold range [0.0912, 0.2088]
#> k = 10: threshold range [0.1206, 0.1794]
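
Factoring the two bounds makes the role of k explicit. The short derivation below only rearranges the formulas from the Confidence Interval section; it introduces no new quantities.

$\text{threshold}_{\text{upper,lower}} = \theta \left(1 \pm \frac{1.96}{k}\right)$

$\text{threshold}_{\text{upper}} - \text{threshold}_{\text{lower}} = \frac{3.92\,\theta}{k}$

$\text{threshold}_{\text{lower}} > 0 \iff k > 1.96$

For k = 2 the lower bound is 0.15 × (1 − 0.98) = 0.003, matching the first line of the output above. For k ≤ 1.96 the lower bound is zero or negative, so, assuming scores are non-negative, no score can fall below it.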

Decision Logic

The observed score (IS) is compared against the lower and upper threshold bounds in a multi-level decision tree:

                        ┌──────────────────┐
                        │  Observed Score  │
                        │       (IS)       │
                        └────────┬─────────┘
                                 │
            ┌────────────────────┼────────────────────┐
            │                    │                    │
            ▼                    ▼                    ▼
    ┌───────────────┐   ┌───────────────┐   ┌───────────────┐
    │ IS > upper    │   │ lower < IS    │   │ IS < lower    │
    │               │   │     < upper   │   │               │
    └───────┬───────┘   └───────┬───────┘   └───────┬───────┘
            │                   │                   │
            ▼                   │                   ▼
    ┌───────────────┐           │           ┌───────────────┐
    │  SIGNIFICANT  │           │           │NOT SIGNIFICANT│
    │  (est_by_ref) │           │           │  (est_by_ref) │
    └───────────────┘           │           └───────────────┘
                                ▼
                    ┌───────────────────────┐
                    │  Check Reference      │
                    │  Significance         │
                    └───────────┬───────────┘
                                │
              ┌─────────────────┴─────────────────┐
              │                                   │
              ▼                                   ▼
      ┌───────────────┐                   ┌───────────────┐
      │ Ref Sig = TRUE│                   │Ref Sig = FALSE│
      └───────┬───────┘                   └───────┬───────┘
              │                                   │
              ▼                                   ▼
    Compare L & R with              Compare L & R with
    reference bounds                reference bounds
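
The first level of the tree can be sketched as a small vectorized function. This is a hedged illustration, not the package's implementation: `classify_score()` is a hypothetical helper, scores exactly equal to a bound are sent to the middle branch (the diagram does not specify this), and the second-level comparison against the reference bounds is left as a label because its details are not given here.

```r
# Hypothetical helper covering only the first level of the decision tree.
classify_score <- function(is_score, lower, upper) {
  ifelse(is_score > upper, "significant",
         ifelse(is_score < lower, "not_significant",
                "check_reference"))   # middle branch: defer to the reference
}

# Threshold bounds for theta = 0.15 and k = 5, using the formulas above.
theta <- 0.15
k <- 5
upper <- theta + theta / k * 1.96
lower <- theta - theta / k * 1.96

scores <- c(0.30, 0.15, 0.05)
calls <- classify_score(scores, lower, upper)
calls
#> [1] "significant"     "check_reference" "not_significant"
```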

Output Interpretation

Result Columns

Column	Description
`pair`	Cell type pair (sender|receiver)
`LRI_ID`	Ligand-receptor interaction ID
`ligand`, `receptor`	Gene symbols
`comm_score`	Interaction score in query
`null_score`	Background interaction score
`threshold_range`	Confidence interval for threshold
`is_significant`	Final significance call
`in_reference`	Whether pair exists in reference
`ref_significant`	Reference significance status
`comparison`	Differential status (Up/Down/Both_Sig/Both_NS)

Comparison Categories

Comparison	Query	Reference	Interpretation
Both_Sig	✓	✓	Consistent active interaction
Both_NS	✗	✗	Consistent inactive interaction
Up	✓	✗	Gained interaction in query
Down	✗	✓	Lost interaction in query
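
The table is a direct mapping from two logical flags to a label. The sketch below reproduces it in base R; `comparison_label()` is a hypothetical helper written for illustration, and the package may derive the column differently.

```r
# Hypothetical helper: derive the comparison label from the two significance flags.
comparison_label <- function(query_sig, ref_sig) {
  stopifnot(is.logical(query_sig), is.logical(ref_sig),
            length(query_sig) == length(ref_sig))
  ifelse(query_sig & ref_sig, "Both_Sig",
         ifelse(!query_sig & !ref_sig, "Both_NS",
                ifelse(query_sig, "Up", "Down")))
}

labels <- comparison_label(
  query_sig = c(TRUE, FALSE, TRUE, FALSE),
  ref_sig   = c(TRUE, FALSE, FALSE, TRUE)
)
labels
#> [1] "Both_Sig" "Both_NS"  "Up"       "Down"
```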

Use Cases

1. Disease vs. Normal Comparison

# Build reference from healthy tissue atlas
build_reference(
  seurat_obj = healthy_atlas,
  reference_name = "healthy_tissue",
  ...
)

# Analyze disease samples
disease_results <- infer_query(
  seurat_obj = disease_sample,
  reference_path = "./reference/healthy_tissue",
  ...
)

# Find gained interactions
gained <- disease_results$results[comparison == "Up"]

2. Treatment Response

# Reference: pre-treatment samples
# Query: post-treatment samples
# Find lost interactions (potentially beneficial)
lost <- results$results[comparison == "Down"]

3. Cross-Study Validation

# Reference: published atlas
# Query: your data
# Find consistent interactions
consistent <- results$results[comparison == "Both_Sig"]
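
The filters in these use cases rely on data.table syntax, so they assume that `results$results` is a data.table. The sketch below shows the same pattern on a toy table whose column names follow the Result Columns table; the rows are made up for illustration and are not real output.

```r
library(data.table)

# Toy stand-in for `results$results`; the values are made up for illustration.
toy_results <- data.table(
  pair       = c("T|B", "T|Macro", "B|T", "Macro|T", "B|Macro"),
  LRI_ID     = c("LRI_1", "LRI_2", "LRI_1", "LRI_3", "LRI_2"),
  comparison = c("Up", "Down", "Both_Sig", "Up", "Both_NS")
)

# Count interactions per comparison category, most frequent first.
category_counts <- toy_results[, .N, by = comparison][order(-N)]
category_counts

# Gained interactions, summarized per cell type pair.
gained <- toy_results[comparison == "Up"]
gained_by_pair <- gained[, .(n_gained = .N), by = pair]
gained_by_pair
```

The same two steps (filter on `comparison`, then group by `pair` or `LRI_ID`) apply unchanged to the "Down" and "Both_Sig" subsets.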

Best Practices

Reference Quality: Use high-quality atlas data with sufficient cells per type
Cell Type Matching: Ensure consistent cell type annotations
Calibration Check: Monitor the k value for dataset compatibility
Multiple References: Consider building tissue-specific references

Session Information

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.18.0 FastCCCR_1.0.0   
#> 
#> loaded via a namespace (and not attached):
#>  [1] cli_3.6.5         knitr_1.51        rlang_1.1.7       xfun_0.56        
#>  [5] otel_0.2.0        textshaping_1.0.4 jsonlite_2.0.0    listenv_0.10.0   
#>  [9] htmltools_0.5.9   ragg_1.5.0        sass_0.4.10       rmarkdown_2.30   
#> [13] grid_4.4.0        evaluate_1.0.5    jquerylib_0.1.4   fastmap_1.2.0    
#> [17] yaml_2.3.12       lifecycle_1.0.5   compiler_4.4.0    codetools_0.2-20 
#> [21] fs_1.6.6          Rcpp_1.1.1        htmlwidgets_1.6.4 future_1.69.0    
#> [25] lattice_0.22-7    systemfonts_1.3.1 digest_0.6.39     R6_2.6.1         
#> [29] parallelly_1.46.1 parallel_4.4.0    bslib_0.9.0       Matrix_1.7-4     
#> [33] tools_4.4.0       globals_0.18.0    pkgdown_2.1.3     cachem_1.1.0     
#> [37] desc_1.4.3

Zaoqu Liu

2026-01-26