Projects single-cell gene expression data onto a reference set of gene expression programs (GEPs) by solving a non-negative least squares (NNLS) problem for each cell, with the reference spectra held fixed. This enables cell type annotation and state characterization based on established program definitions.
Usage
CellProgramMapper(
query,
reference = "TCAT.V1",
assay = NULL,
layer = "counts",
return_unnormalized = FALSE,
method = c("cd", "active_set"),
max_iter = 1000L,
tol = 1e-08,
n_workers = 1L,
cache_dir = NULL,
verbose = TRUE
)
Arguments
- query
Query data. Accepts multiple input formats:
Seurat object (V4 or V5)
Matrix or dgCMatrix (cells as rows, genes as columns)
data.frame (cells as rows, genes as columns)
File path (.h5ad, .mtx.gz, .tsv, .txt)
- reference
Reference spectra. Can be:
Name of a pre-built reference (e.g., "TCAT.V1")
Path to a reference file (.tsv, .txt)
- assay
For Seurat objects, which assay to use (default: active assay)
- layer
For Seurat objects, which layer/slot to extract (default: "counts")
- return_unnormalized
Logical. If TRUE, also return raw usage values. Default is FALSE, returning only normalized usage.
- method
NNLS solver algorithm:
"cd" - Coordinate descent (default, generally faster)
"active_set" - Lawson-Hanson active set method
- max_iter
Maximum iterations for NNLS solver (default: 1000)
- tol
Convergence tolerance for NNLS solver (default: 1e-8)
- n_workers
Number of parallel workers for large datasets (default: 1)
- cache_dir
Directory for caching downloaded references
- verbose
Logical. Print progress messages (default: TRUE)
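Matrix and data.frame inputs to query are expected with cells as rows and genes as columns, whereas count matrices are often stored the other way round (genes as rows). The following minimal sketch, using a made-up toy matrix, shows the transpose needed before passing such data as query; the object names are illustrative only.

```r
# Toy counts in the common genes x cells layout (values are made up)
counts_gc <- matrix(
  c(1, 0, 3,
    0, 2, 5),
  nrow = 3,
  dimnames = list(c("CD3E", "CD8A", "GZMB"), c("cell1", "cell2"))
)

# Transpose to the cells x genes layout described for `query`
counts_cg <- t(counts_gc)

dim(counts_cg)
colnames(counts_cg)
```

The transposed matrix could then be passed as the query argument, for example CellProgramMapper(counts_cg, reference = "TCAT.V1").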
Value
A CellProgramMapperResult object containing:
- usage
Raw usage matrix (cells x programs)
- usage_norm
Normalized usage matrix (rows sum to 1)
- scores
Computed add-on scores (if defined in reference)
- overlap_genes
Genes used for mapping
- ref_name
Reference name
- n_cells
Number of cells processed
- n_programs
Number of programs in reference
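As an illustration of how usage and usage_norm relate, the sketch below normalizes a toy raw usage matrix so that each row sums to 1 and picks the highest-usage program per cell. It is not the package's internal code: the toy values, the zero-row guard, and the top-program step are assumptions made for the example.

```r
# Toy raw usage matrix (cells x programs); values are made up
usage <- matrix(
  c(2, 0, 1,
    0, 0, 3,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:3), paste0("GEP", 1:3))
)

# Divide each row by its total so that rows sum to 1.
# Assumption: rows with zero total usage are left as all zeros.
totals <- rowSums(usage)
totals[totals == 0] <- 1
usage_norm <- usage / totals

# Highest-usage program per cell (ties resolved to the first program)
top_program <- colnames(usage_norm)[max.col(usage_norm, ties.method = "first")]
```

With a real result object, the same top-program lookup could be applied to result$usage_norm.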
Details
The algorithm projects each cell's expression profile onto the reference gene expression programs by solving a non-negative least squares problem:
$$\min_{w_i \geq 0} ||x_i - H^T w_i||_2^2$$
where \(x_i\) is the scaled expression vector for cell \(i\), \(H\) is the reference spectra matrix, and \(w_i\) is the usage vector to be estimated.
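The objective above can be minimized one coordinate at a time. The sketch below is a minimal base R illustration of cyclic coordinate descent for a single cell, not the package's solver: the function name nnls_cd is hypothetical, and it assumes H is stored as programs x genes so that \(H^T w_i\) has one entry per gene.

```r
# Hypothetical helper (not part of the package): NNLS by cyclic coordinate descent.
# H: programs x genes matrix (assumed layout); x: expression vector of one cell.
nnls_cd <- function(H, x, max_iter = 1000L, tol = 1e-8) {
  gram <- tcrossprod(H)        # H %*% t(H), programs x programs
  b <- as.vector(H %*% x)      # H x, one entry per program
  w <- numeric(nrow(H))        # start from zero usage
  for (iter in seq_len(max_iter)) {
    max_change <- 0
    for (k in seq_along(w)) {
      if (gram[k, k] <= 0) next              # skip all-zero programs
      grad_k <- sum(gram[k, ] * w) - b[k]    # half the partial derivative w.r.t. w[k]
      w_new <- max(0, w[k] - grad_k / gram[k, k])  # exact 1-D minimizer, clipped at 0
      max_change <- max(max_change, abs(w_new - w[k]))
      w[k] <- w_new
    }
    if (max_change < tol) break              # stop when a full sweep barely moves w
  }
  w
}

# Toy check: a noiseless cell built from known non-negative usages
set.seed(1)
H <- matrix(runif(3 * 20), nrow = 3)   # 3 programs x 20 genes
w_true <- c(2, 0, 1)
x <- as.vector(t(H) %*% w_true)
w_hat <- nnls_cd(H, x)
```

The max_iter and tol arguments here mirror the roles described for the function's arguments of the same name, but the exact stopping rule used by the package may differ.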
Input data is preprocessed by:
Subsetting to genes present in both query and reference
Scaling each gene by its standard deviation (without centering)
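The two preprocessing steps can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: prepare_query is a hypothetical helper, the spectra are assumed to be stored as programs x genes, and genes with zero standard deviation are left unscaled to avoid division by zero.

```r
# Hypothetical helper (not part of the package).
# query: cells x genes matrix; spectra: programs x genes matrix (assumed layout).
prepare_query <- function(query, spectra) {
  # Step 1: keep only genes present in both query and reference
  genes <- intersect(colnames(query), colnames(spectra))
  q <- as.matrix(query[, genes, drop = FALSE])

  # Step 2: divide each gene by its standard deviation, without centering
  gene_sd <- apply(q, 2, sd)
  gene_sd[gene_sd == 0] <- 1   # assumption: constant genes are left unscaled
  q_scaled <- sweep(q, 2, gene_sd, "/")

  list(query = q_scaled, spectra = spectra[, genes, drop = FALSE], genes = genes)
}

# Toy data (values are made up)
query <- matrix(
  c(0, 2, 4, 1, 3,
    5, 0, 1, 2, 2,
    1, 1, 0, 3, 7,
    2, 2, 2, 2, 2),
  nrow = 5,
  dimnames = list(paste0("cell", 1:5), c("CD3E", "CD8A", "GZMB", "XIST"))
)
spectra <- matrix(
  c(0.5, 0.1,
    0.2, 0.9,
    0.3, 0.4),
  nrow = 2,
  dimnames = list(c("GEP1", "GEP2"), c("CD8A", "GZMB", "FOXP3"))
)

prep <- prepare_query(query, spectra)
```

Because the data are scaled but not centered, non-negative counts stay non-negative, which keeps the problem suitable for NNLS.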
See also
available_references for listing pre-built references
add_results_to_seurat for Seurat integration
BuildConsensusReference for building custom references
Examples
if (FALSE) { # \dontrun{
# With Seurat object
result <- CellProgramMapper(seurat_obj, reference = "TCAT.V1")
# With matrix
result <- CellProgramMapper(counts_matrix, reference = "path/to/ref.tsv")
# Access results
head(result$usage_norm)
head(result$scores)
# Add to Seurat object
seurat_obj <- add_results_to_seurat(seurat_obj, result)
} # }