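Projects single-cell gene expression data onto a reference set of gene expression programs (GEPs) by solving a non-negative least squares (NNLS) problem per cell, with the reference spectra held fixed (the projection step of non-negative matrix factorization). The resulting per-cell program usages support cell type annotation and cell state characterization based on established program definitions.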

Usage

CellProgramMapper(
  query,
  reference = "TCAT.V1",
  assay = NULL,
  layer = "counts",
  return_unnormalized = FALSE,
  method = c("cd", "active_set"),
  max_iter = 1000L,
  tol = 1e-08,
  n_workers = 1L,
  cache_dir = NULL,
  verbose = TRUE
)

Arguments

query

Query data. Accepts multiple input formats:

  • Seurat object (V4 or V5)

  • Matrix or dgCMatrix (cells as rows, genes as columns)

  • data.frame (cells as rows, genes as columns)

  • File path (.h5ad, .mtx.gz, .tsv, .txt)
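Many single-cell tools store matrices as genes x cells, whereas the matrix inputs listed above are cells x genes. A minimal sketch of reorienting a sparse matrix before passing it as `query`, assuming the Matrix package is installed; the object names and toy values are illustrative only:

```r
library(Matrix)

# Toy genes x cells sparse matrix (the orientation many tools use).
genes_by_cells <- sparseMatrix(
  i = c(1, 2, 3, 1),
  j = c(1, 1, 2, 3),
  x = c(5, 1, 2, 7),
  dims = c(3, 3),
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3))
)

# Transpose so that cells are rows and genes are columns.
cells_by_genes <- t(genes_by_cells)

dim(cells_by_genes)
rownames(cells_by_genes)
```

The transposed object remains a sparse dgCMatrix, so no dense copy is made.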

reference

Reference spectra. Can be:

  • Name of a pre-built reference (e.g., "TCAT.V1")

  • Path to a reference file (.tsv, .txt)

assay

For Seurat objects, the assay to use. The default, NULL, uses the active assay.

layer

For Seurat objects, which layer/slot to extract (default: "counts")

return_unnormalized

Logical. If TRUE, also return raw usage values. Default is FALSE, returning only normalized usage.

method

NNLS solver algorithm:

  • "cd" - Coordinate descent (default, generally faster)

  • "active_set" - Lawson-Hanson active set method

max_iter

Maximum iterations for NNLS solver (default: 1000)

tol

Convergence tolerance for NNLS solver (default: 1e-8)

n_workers

Number of parallel workers for large datasets (default: 1)

cache_dir

Directory for caching downloaded references

verbose

Logical. Print progress messages (default: TRUE)

Value

A CellProgramMapperResult object containing:

usage

Raw usage matrix (cells x programs)

usage_norm

Normalized usage matrix (rows sum to 1)

scores

Computed add-on scores (if defined in reference)

overlap_genes

Genes used for mapping

ref_name

Reference name

n_cells

Number of cells processed

n_programs

Number of programs in reference
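The relationship between `usage` and `usage_norm` described above can be illustrated with a small sketch. This is not the package's internal code: the program names are hypothetical, and the handling of all-zero rows is an assumption.

```r
# Toy raw usage matrix: 2 cells x 3 programs (hypothetical names).
usage <- rbind(cell1 = c(2, 1, 1),
               cell2 = c(0, 3, 1))
colnames(usage) <- c("GEP1", "GEP2", "GEP3")

# Divide each row by its total so that every row sums to 1.
row_totals <- rowSums(usage)
row_totals[row_totals == 0] <- 1  # assumption: leave all-zero cells as zeros
usage_norm <- usage / row_totals

rowSums(usage_norm)
```

After normalization each row can be read as the fraction of a cell's total usage attributed to each program.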

Details

The algorithm projects each cell's expression profile onto the reference gene expression programs by solving a non-negative least squares problem:

$$\min_{w_i \geq 0} ||x_i - H^T w_i||_2^2$$

where \(x_i\) is the scaled expression vector for cell \(i\), \(H\) is the reference spectra matrix, and \(w_i\) is the usage vector to be estimated.
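To make the objective concrete, here is a minimal coordinate-descent sketch for a single cell. It illustrates the general technique named by `method = "cd"` and is not the package's solver; `nnls_cd` is a hypothetical helper, and \(H\) is assumed to be stored as programs x genes so that \(H^T w_i\) has one entry per gene.

```r
# Hypothetical helper: solve min_{w >= 0} ||x - t(H) %*% w||^2 for one cell.
nnls_cd <- function(H, x, max_iter = 1000L, tol = 1e-8) {
  G <- H %*% t(H)           # programs x programs Gram matrix
  b <- as.vector(H %*% x)   # one entry per program
  w <- numeric(nrow(H))     # start from all-zero usage
  for (iter in seq_len(max_iter)) {
    max_delta <- 0
    for (j in seq_along(w)) {
      if (G[j, j] <= 0) next              # skip an all-zero program
      grad_j <- sum(G[j, ] * w) - b[j]    # half the partial derivative w.r.t. w_j
      w_new <- max(0, w[j] - grad_j / G[j, j])  # exact 1-D minimizer, clipped at 0
      max_delta <- max(max_delta, abs(w_new - w[j]))
      w[j] <- w_new
    }
    if (max_delta < tol) break            # stop once no coordinate moves
  }
  w
}

# Toy reference: 2 programs over 3 genes, and a cell built from known usage.
H <- rbind(c(1, 0, 1),
           c(0, 1, 1))
w_true <- c(2, 0.5)
x <- as.vector(t(H) %*% w_true)

w_hat <- nnls_cd(H, x)
```

Because the toy cell is an exact non-negative combination of the two programs, `w_hat` should recover `w_true` up to the convergence tolerance.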

Input data is preprocessed by:

  1. Subsetting to genes present in both query and reference

  2. Scaling each gene by its standard deviation (without centering)
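The two preprocessing steps above can be sketched in base R as follows. `prepare_query` is a hypothetical helper, not a function exported by the package, and both the use of the sample standard deviation and the treatment of zero-variance genes are assumptions.

```r
# Hypothetical helper: subset to shared genes, then scale without centering.
prepare_query <- function(query, ref_genes) {
  # 1. Keep only genes present in both the query columns and the reference.
  overlap <- intersect(colnames(query), ref_genes)
  x <- as.matrix(query[, overlap, drop = FALSE])

  # 2. Divide each gene (column) by its standard deviation; no mean is subtracted.
  gene_sd <- apply(x, 2, stats::sd)
  gene_sd[gene_sd == 0] <- 1  # assumption: leave constant genes unscaled
  sweep(x, 2, gene_sd, "/")
}

# Toy cells x genes count matrix and a reference gene list.
set.seed(1)
query <- matrix(rpois(40, lambda = 5), nrow = 8,
                dimnames = list(paste0("cell", 1:8), paste0("gene", 1:5)))
ref_genes <- c("gene2", "gene3", "gene5", "gene9")

scaled <- prepare_query(query, ref_genes)
colnames(scaled)
```

Skipping the centering step keeps every scaled value non-negative, which matches the non-negativity constraint of the least squares problem.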

See also

available_references for listing pre-built references, add_results_to_seurat for Seurat integration, and BuildConsensusReference for building custom references.

Examples

if (FALSE) { # \dontrun{
# With Seurat object
result <- CellProgramMapper(seurat_obj, reference = "TCAT.V1")

# With a matrix (cells as rows, genes as columns) and a custom reference file
result <- CellProgramMapper(counts_matrix, reference = "path/to/ref.tsv")

# Access results
head(result$usage_norm)
head(result$scores)

# Add to Seurat object
seurat_obj <- add_results_to_seurat(seurat_obj, result)
} # }