Map Single Cells to Reference Gene Expression Programs

Projects single-cell gene expression data onto a fixed reference set of gene expression programs (GEPs) by non-negative least squares (NNLS), yielding a per-cell usage value for each program. This enables cell type annotation and state characterization based on established program definitions.

Usage

CellProgramMapper(
  query,
  reference = "TCAT.V1",
  assay = NULL,
  layer = "counts",
  return_unnormalized = FALSE,
  method = c("cd", "active_set"),
  max_iter = 1000L,
  tol = 1e-08,
  n_workers = 1L,
  cache_dir = NULL,
  verbose = TRUE
)

Arguments

query

Query data. Accepts multiple input formats:

Seurat object (V4 or V5)
Matrix or dgCMatrix (cells as rows, genes as columns)
data.frame (cells as rows, genes as columns)
File path (.h5ad, .mtx.gz, .tsv, .txt)

reference

Reference spectra. Can be:

Name of a pre-built reference (e.g., "TCAT.V1")
Path to a reference file (.tsv, .txt)

assay

For Seurat objects, which assay to use (default: active assay)

layer

For Seurat objects, which layer/slot to extract (default: "counts")

return_unnormalized

Logical. If TRUE, also return raw usage values. Default is FALSE, returning only normalized usage.

method

NNLS solver algorithm:

"cd" - Coordinate descent (default, generally faster)
"active_set" - Lawson-Hanson active set method

max_iter

Maximum iterations for NNLS solver (default: 1000)

tol

Convergence tolerance for NNLS solver (default: 1e-8)

n_workers

Number of parallel workers for large datasets (default: 1)

cache_dir

Directory for caching downloaded references

verbose

Logical. Print progress messages (default: TRUE)

Value

A CellProgramMapperResult object containing:

usage: Raw usage matrix (cells x programs), included when return_unnormalized = TRUE
usage_norm: Normalized usage matrix (rows sum to 1)
scores: Computed add-on scores (if defined in reference)
overlap_genes: Genes used for mapping
ref_name: Reference name
n_cells: Number of cells processed
n_programs: Number of programs in reference
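
The relationship between usage and usage_norm can be illustrated with a small base R sketch. The matrix values and program names below are made up, and the guard for all-zero rows is an assumption, since the source does not say how the function treats cells with no usage.

```r
# Hypothetical raw usage matrix (cells x programs); values are illustrative only.
usage <- rbind(cell1 = c(GEP1 = 0.5, GEP2 = 1.5),
               cell2 = c(GEP1 = 0,   GEP2 = 0))

# Divide each row by its total so that rows sum to 1.
# Assumption: rows that are all zero are left as zeros rather than producing NaN.
totals <- rowSums(usage)
usage_norm <- usage / ifelse(totals > 0, totals, 1)

rowSums(usage_norm)
```

Each row of usage_norm can then be read as the fraction of a cell's total usage assigned to each program.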

Details

The algorithm projects each cell's expression profile onto the reference gene expression programs by solving a non-negative least squares problem:

$$\min_{w_i \geq 0} ||x_i - H^T w_i||_2^2$$

where $x_i$ is the scaled expression vector for cell $i$ (one entry per overlapping gene), $H$ is the reference spectra matrix (programs x genes), and $w_i$ is the non-negative usage vector to be estimated (one entry per program).
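
To make the objective concrete, here is a minimal base R sketch of a coordinate-descent NNLS solver for a single cell. It is not the package's implementation: the function name nnls_cd is hypothetical, H is assumed to be stored as programs x genes, and the stopping rule (largest coordinate change below tol) is one common choice among several.

```r
# Coordinate-descent NNLS for one cell: minimise ||x - t(H) %*% w||^2 subject to w >= 0.
# H: programs x genes spectra matrix; x: numeric vector of length ncol(H).
nnls_cd <- function(H, x, max_iter = 1000L, tol = 1e-8) {
  G <- H %*% t(H)            # programs x programs Gram matrix
  b <- as.vector(H %*% x)    # inner product of the cell with each program
  w <- numeric(nrow(H))      # start from zero usage
  for (iter in seq_len(max_iter)) {
    max_change <- 0
    for (k in seq_along(w)) {
      if (G[k, k] <= 0) next                      # skip all-zero programs
      grad_k <- sum(G[k, ] * w) - b[k]            # gradient along coordinate k
      w_new <- max(0, w[k] - grad_k / G[k, k])    # exact 1-D minimiser, clipped at 0
      max_change <- max(max_change, abs(w_new - w[k]))
      w[k] <- w_new
    }
    if (max_change < tol) break                   # converged
  }
  w
}

# Toy data: 2 programs over 4 genes, and a cell built from known usages.
H <- rbind(c(1, 0, 2, 0),
           c(0, 1, 0, 3))
x <- as.vector(t(H) %*% c(0.5, 2))
w_hat <- nnls_cd(H, x)
w_hat
```

Because the toy cell is an exact non-negative combination of the two programs, the solver recovers the usages it was built from. The max_iter and tol arguments here play the same role as the identically named arguments of CellProgramMapper.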

Input data is preprocessed by:

Subsetting to genes present in both query and reference
Scaling each gene by its standard deviation (without centering)
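
The two preprocessing steps can be sketched in base R as follows. The toy matrices and gene names are made up, and two details are assumptions not stated in the source: the use of the sample standard deviation (as computed by sd()) and the guard that leaves constant genes unscaled.

```r
# Toy query (cells x genes) and reference spectra (programs x genes).
query <- matrix(c(1, 0, 3,
                  2, 4, 0,
                  0, 2, 6,
                  5, 0, 3),
                nrow = 3,
                dimnames = list(paste0("cell", 1:3), c("A", "B", "C", "D")))
reference <- matrix(c(1, 0,
                      0, 1,
                      2, 1),
                    nrow = 2,
                    dimnames = list(c("GEP1", "GEP2"), c("B", "C", "E")))

# Step 1: keep only genes present in both query and reference.
overlap <- intersect(colnames(query), colnames(reference))
q <- query[, overlap, drop = FALSE]
ref <- reference[, overlap, drop = FALSE]

# Step 2: divide each gene by its standard deviation, without centering.
gene_sd <- apply(q, 2, sd)
gene_sd[gene_sd == 0] <- 1   # assumption: leave constant genes unscaled
q_scaled <- sweep(q, 2, gene_sd, "/")
```

Skipping the centering step keeps zeros at zero and all values non-negative, which the non-negative least squares fit relies on.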

Examples

if (FALSE) { # \dontrun{
# With Seurat object
result <- CellProgramMapper(seurat_obj, reference = "TCAT.V1")

# With matrix
result <- CellProgramMapper(counts_matrix, reference = "path/to/ref.tsv")

# Access results
head(result$usage_norm)
head(result$scores)

# Add to Seurat object
seurat_obj <- add_results_to_seurat(seurat_obj, result)
} # }
