Reference-Based Inference
Zaoqu Liu
2026-01-26
Source: vignettes/reference-based.Rmd
Introduction
Reference-based inference is a FastCCCR feature, developed by Zaoqu Liu, that speeds up cell-cell communication (CCC) analysis by reusing reference panels pre-computed from large-scale atlas datasets. This approach is particularly useful for:
- Consistency: Comparing query samples against a common reference
- Speed: Rapid inference without recomputing null distributions
- Biological interpretation: Identifying differential interactions
Conceptual Overview
The Reference-Based Approach
┌─────────────────────────────────────────────────────────────────┐
│                    Reference-Based Inference                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐                                            │
│  │   Atlas Data    │──────► Build Reference                     │
│  │  (Large-scale)  │        • Pre-compute null distributions    │
│  └─────────────────┘        • Store gene statistics             │
│                             • Record cell type info             │
│           │                                                     │
│           ▼                                                     │
│  ┌─────────────────┐                                            │
│  │    Reference    │                                            │
│  │      Panel      │◄────────────────────────────────────────┐  │
│  └─────────────────┘                                         │  │
│           │                                                  │  │
│           ▼                                                  │  │
│  ┌─────────────────┐       ┌─────────────────────────────┐   │  │
│  │   Query Data    │──────►│         Infer Query         │───┘  │
│  │  (Your sample)  │       │ • Fast inference            │      │
│  └─────────────────┘       │ • Comparison with reference │      │
│                            └─────────────────────────────┘      │
│                                           │                     │
│                                           ▼                     │
│                                ┌─────────────────────┐          │
│                                │       Results       │          │
│                                │ • Significant CCC   │          │
│                                │ • Up/Down vs Ref    │          │
│                                └─────────────────────┘          │
└─────────────────────────────────────────────────────────────────┘
Reference Panel Structure
A reference panel contains the following pre-computed information:
| File | Content |
|---|---|
| `config.toml` | Reference metadata and settings |
| `basic_info_dict.pkl` | Gene-level statistics (mean, SD, PMF) |
| `ref_gene_pmf_dict.pkl` | Pre-computed sum distributions for n = 1..99 |
| `ref_mean_counts.pkl` | Cluster mean expression |
| `ref_percents.pkl` | Cluster expression percentages |
| `complex_table.pkl` | Protein complex composition |
| `interactions.pkl` | Ligand-receptor interactions |
| `ref_hk.txt` | Housekeeping gene expression for calibration |
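Before running inference, it can help to confirm that a reference directory actually contains every file listed above. The following is a minimal sketch that assumes the panel is a flat directory holding exactly these file names; `check_reference_panel()` is a hypothetical helper written for this example, not a FastCCCR function.

```r
# Hypothetical helper (not part of FastCCCR): report which of the
# expected panel files are present in a reference directory.
expected_files <- c(
  "config.toml", "basic_info_dict.pkl", "ref_gene_pmf_dict.pkl",
  "ref_mean_counts.pkl", "ref_percents.pkl", "complex_table.pkl",
  "interactions.pkl", "ref_hk.txt"
)

check_reference_panel <- function(ref_dir, files = expected_files) {
  present <- file.exists(file.path(ref_dir, files))
  names(present) <- files
  present
}

# Demo on a temporary directory that contains only two of the files
demo_dir <- file.path(tempdir(), "demo_reference")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("config.toml", "ref_hk.txt"))))

status <- check_reference_panel(demo_dir)
missing_files <- names(status)[!status]
print(missing_files)
```

An empty `missing_files` vector would indicate that the directory has the full set of files from the table.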
Building a Reference Panel
Workflow
# Build reference from atlas Seurat object
library(FastCCCR)

build_reference(
  seurat_obj = atlas_seurat,
  reference_name = "my_tissue_atlas",
  database = "CPDBv5.0.0",
  celltype_col = "cell_type",
  save_path = "./reference/",
  min_percentile = 0.1,
  min_genes_per_cell = 50L
)
Key Steps
- Quality Control: Filter cells with fewer than `min_genes_per_cell` expressed genes
- Digitization: Rank-based transformation of expression values
- Null Distribution: Pre-compute distributions for cluster sizes 1-99
- Statistics Storage: Save mean counts, percentages, and PMFs
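The null-distribution step can be pictured as repeated convolution of a gene's probability mass function (PMF). The sketch below rests on an assumption, namely that the distribution for a cluster of n cells is the distribution of the sum of n independent draws from the gene's digitized-expression PMF. The functions `convolve_pmf()` and `sum_pmfs()` are hypothetical names used for illustration and are not taken from FastCCCR.

```r
# Sketch only: PMF of the sum of n independent draws from one gene's
# digitized-expression PMF, built by repeated convolution.
# pmf[i] is assumed to be P(value = i - 1), i.e. support 0, 1, 2, ...
convolve_pmf <- function(p, q) {
  out <- numeric(length(p) + length(q) - 1L)
  for (i in seq_along(p)) {
    idx <- i + seq_along(q) - 1L
    out[idx] <- out[idx] + p[i] * q
  }
  out
}

# Returns a list whose n-th element is the PMF of the sum over n cells
sum_pmfs <- function(pmf, max_n = 99L) {
  res <- vector("list", max_n)
  res[[1]] <- pmf
  for (n in seq_len(max_n)[-1]) {
    res[[n]] <- convolve_pmf(res[[n - 1L]], pmf)
  }
  res
}

toy_pmf <- c(0.5, 0.3, 0.2)          # toy gene taking values 0, 1, 2
null_pmfs <- sum_pmfs(toy_pmf, max_n = 5L)
print(round(null_pmfs[[2]], 2))      # distribution of the sum over 2 cells
```

Computing these once per gene and storing them is what would let later queries skip the expensive part of the null model.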
Digitization Transform
FastCCCR uses a rank-based digitization to normalize expression across datasets:
# Simulate digitization process
set.seed(42)
raw_expr <- rlnorm(100, meanlog = 2, sdlog = 1)

# Digitize into quantile-based bins (49 break points give levels 1-49)
n_bins <- 50L
bins <- quantile(raw_expr, probs = seq(0, 1, length.out = n_bins - 1), type = 1)
bins <- unique(bins)
digits <- findInterval(raw_expr, bins, left.open = FALSE)

cat("Original expression range:", round(range(raw_expr), 2), "\n")
#> Original expression range: 0.37 72.72
cat("Digitized range:", range(digits), "\n")
#> Digitized range: 1 49
cat("Digitized distribution:\n")
#> Digitized distribution:
print(table(digits)[1:10])
#> digits
#>  1  2  3  4  5  6  7  8  9 10
#>  2  2  2  2  2  2  2  2  2  2
Inference with Reference
Basic Usage
# Infer CCC for query data using reference
results <- infer_query(
  seurat_obj = query_seurat,
  reference_path = "./reference/my_tissue_atlas",
  database = "CPDBv5.0.0",
  celltype_col = "cell_type"
)
Cell Type Mapping
When query cell-type labels do not exactly match those in the reference, supply a mapping via `celltype_mapping`:
results <- infer_query(
  seurat_obj = query_seurat,
  reference_path = "./reference/my_tissue_atlas",
  database = "CPDBv5.0.0",
  celltype_col = "cell_type",
  celltype_mapping = list(
    "Ref_Tcell" = "Query_CD8T",
    "Ref_Tcell" = "Query_CD4T",  # Multiple query types → one reference type
    "Ref_Bcell" = "Query_Bcell",
    "Ref_Macro" = "Query_Macrophage"
  )
)
Calibration Factor (k)
Purpose
The calibration factor k adjusts for technical variability between the reference and query datasets. It is calculated from housekeeping gene expression, which the reference panel stores in `ref_hk.txt`.
Interpretation
| k value | Interpretation |
|---|---|
| k > 5 | High concordance between datasets |
| 3 < k < 5 | Moderate concordance |
| k < 3 | Low concordance, wider confidence intervals |
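The cut-offs in the table can be applied programmatically when screening several query datasets. This is a minimal sketch; `classify_k()` is a hypothetical helper, and because the table leaves k = 3 and k = 5 unassigned, the example places both in the moderate band as an assumption.

```r
# Sketch: label calibration factors using the cut-offs from the table.
# Assumption: the boundary values k == 3 and k == 5 count as "moderate".
classify_k <- function(k) {
  ifelse(k > 5, "high",
         ifelse(k < 3, "low", "moderate"))
}

k_labels <- classify_k(c(2, 4, 10))
print(k_labels)
```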
# Demonstrate threshold adjustment
base_threshold <- 0.15
k_values <- c(2, 3, 5, 10)
for (k in k_values) {
  upper <- base_threshold + base_threshold / k * 1.96
  lower <- base_threshold - base_threshold / k * 1.96
  cat(sprintf("k = %2d: threshold range [%.4f, %.4f]\n", k, lower, upper))
}
#> k =  2: threshold range [0.0030, 0.2970]
#> k =  3: threshold range [0.0520, 0.2480]
#> k =  5: threshold range [0.0912, 0.2088]
#> k = 10: threshold range [0.1206, 0.1794]
Decision Logic
The inference uses a multi-level decision tree:
                    ┌──────────────────┐
                    │  Observed Score  │
                    │       (IS)       │
                    └────────┬─────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│  IS > upper   │    │  lower < IS   │    │  IS < lower   │
│               │    │    < upper    │    │               │
└───────┬───────┘    └───────┬───────┘    └───────┬───────┘
        │                    │                    │
        ▼                    │                    ▼
┌───────────────┐            │            ┌───────────────┐
│  SIGNIFICANT  │            │            │NOT SIGNIFICANT│
│ (est_by_ref)  │            │            │ (est_by_ref)  │
└───────────────┘            │            └───────────────┘
                             ▼
                 ┌───────────────────────┐
                 │    Check Reference    │
                 │     Significance      │
                 └───────────┬───────────┘
                             │
           ┌─────────────────┴─────────────────┐
           │                                   │
           ▼                                   ▼
   ┌───────────────┐                   ┌───────────────┐
   │ Ref Sig = TRUE│                   │Ref Sig = FALSE│
   └───────┬───────┘                   └───────┬───────┘
           │                                   │
           ▼                                   ▼
  Compare L & R with                  Compare L & R with
   reference bounds                    reference bounds
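The first level of this tree can be sketched in a few lines of R. The example only resolves the two outer branches; the middle branch is returned as a flag, because the comparison of ligand and receptor against reference bounds is not specified in enough detail here to reproduce. `first_level_call()` is a hypothetical function, and treating scores exactly equal to a bound as falling in the middle branch is an assumption.

```r
# Sketch of the first level of the decision tree (not FastCCCR's code).
# Scores equal to a bound are sent to the middle branch (assumption).
first_level_call <- function(is_score, lower, upper) {
  if (is_score > upper) {
    "significant (est_by_ref)"
  } else if (is_score < lower) {
    "not significant (est_by_ref)"
  } else {
    "check reference significance"
  }
}

# Threshold range for a base threshold of 0.15 and k = 5
lower <- 0.15 - 0.15 / 5 * 1.96
upper <- 0.15 + 0.15 / 5 * 1.96

calls <- sapply(c(0.30, 0.15, 0.05), first_level_call,
                lower = lower, upper = upper)
print(calls)
```

Only interactions whose score falls inside the threshold range need the second, more detailed comparison against the reference.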
Output Interpretation
Result Columns
| Column | Description |
|---|---|
| `pair` | Cell type pair (sender\|receiver) |
| `LRI_ID` | Ligand-receptor interaction ID |
| `ligand`, `receptor` | Gene symbols |
| `comm_score` | Interaction score in query |
| `null_score` | Background interaction score |
| `threshold_range` | Confidence interval for threshold |
| `is_significant` | Final significance call |
| `in_reference` | Whether pair exists in reference |
| `ref_significant` | Reference significance status |
| `comparison` | Differential status (Up/Down/Both_Sig/Both_NS) |
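As an illustration of how these columns might be used downstream, the sketch below builds a toy table with a subset of the column names above, splits `pair` into sender and receiver, and keeps interactions gained relative to the reference. The values are invented for the example, and a plain `data.frame` stands in for the actual output object.

```r
# Toy results table (invented values, subset of the documented columns)
toy <- data.frame(
  pair = c("Macro|Tcell", "Bcell|Tcell", "Tcell|Macro"),
  LRI_ID = c("LRI_1", "LRI_2", "LRI_3"),
  is_significant = c(TRUE, FALSE, TRUE),
  comparison = c("Up", "Both_NS", "Both_Sig"),
  stringsAsFactors = FALSE
)

# Split "sender|receiver" on the literal pipe character
parts <- strsplit(toy$pair, "|", fixed = TRUE)
toy$sender <- vapply(parts, `[`, character(1), 1L)
toy$receiver <- vapply(parts, `[`, character(1), 2L)

# Tabulate differential categories, then keep significant "Up" calls
print(table(toy$comparison))
gained <- toy[toy$is_significant & toy$comparison == "Up", ]
print(gained[, c("sender", "receiver", "LRI_ID")])
```

`fixed = TRUE` matters here: without it, `strsplit()` treats `|` as regex alternation and splits the string into single characters.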
Use Cases
1. Disease vs. Normal Comparison
# Build reference from healthy tissue atlas
build_reference(
  seurat_obj = healthy_atlas,
  reference_name = "healthy_tissue",
  ...
)

# Analyze disease samples
disease_results <- infer_query(
  seurat_obj = disease_sample,
  reference_path = "./reference/healthy_tissue",
  ...
)

# Find gained interactions
gained <- disease_results$results[comparison == "Up"]
Best Practices
- Reference Quality: Use high-quality atlas data with sufficient cells per type
- Cell Type Matching: Ensure consistent cell type annotations
- Calibration Check: Monitor the `k` value for dataset compatibility
- Multiple References: Consider building tissue-specific references
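For the first point, a quick count of cells per type can flag under-represented populations before a reference is built. This is a sketch on simulated labels; the cut-off `min_cells = 100` is an arbitrary illustrative value, not a FastCCCR default. With a Seurat object, the labels would typically come from the metadata column passed as `celltype_col`.

```r
# Sketch: flag cell types with few cells before building a reference.
# min_cells is an arbitrary illustrative cut-off (assumption).
cell_types <- rep(c("Tcell", "Bcell", "Macro"), times = c(500, 120, 40))
counts <- table(cell_types)

min_cells <- 100
sparse_types <- names(counts)[counts < min_cells]
print(sparse_types)
```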
Session Information
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.18.0 FastCCCR_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] cli_3.6.5 knitr_1.51 rlang_1.1.7 xfun_0.56
#> [5] otel_0.2.0 textshaping_1.0.4 jsonlite_2.0.0 listenv_0.10.0
#> [9] htmltools_0.5.9 ragg_1.5.0 sass_0.4.10 rmarkdown_2.30
#> [13] grid_4.4.0 evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
#> [17] yaml_2.3.12 lifecycle_1.0.5 compiler_4.4.0 codetools_0.2-20
#> [21] fs_1.6.6 Rcpp_1.1.1 htmlwidgets_1.6.4 future_1.69.0
#> [25] lattice_0.22-7 systemfonts_1.3.1 digest_0.6.39 R6_2.6.1
#> [29] parallelly_1.46.1 parallel_4.4.0 bslib_0.9.0 Matrix_1.7-4
#> [33] tools_4.4.0 globals_0.18.0 pkgdown_2.1.3 cachem_1.1.0
#> [37] desc_1.4.3