scPAS: A tool for identifying Phenotype-Associated cell Subpopulations from single-cell sequencing data by integrating bulk data

Usage

scPAS(
  bulk_dataset,
  sc_dataset,
  phenotype,
  assay = "RNA",
  tag = NULL,
  nfeature = NULL,
  do_imputation = TRUE,
  imputation_method = c("KNN", "ALRA"),
  alpha = NULL,
  network_class = c("SC", "bulk"),
  independent = TRUE,
  family = c("gaussian", "binomial", "cox"),
  permutation_times = 2000,
  FDR.threshold = 0.05,
  n_cores = 1
)

Arguments

bulk_dataset

Matrix. Bulk expression matrix of related disease. Each row represents a gene and each column represents a sample. The input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs.
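
As an illustration of the expected input scale, the sketch below converts a small raw count matrix into log2-CPM values using base R only. The matrix `counts`, its gene and sample names, and the pseudocount of 1 are assumptions made for this example; they are not requirements of scPAS.

```r
# Hypothetical raw counts: 3 genes (rows) x 2 samples (columns).
counts <- matrix(
  c(10, 0, 90,
    200, 300, 500),
  nrow = 3,
  dimnames = list(c("GeneA", "GeneB", "GeneC"), c("Sample1", "Sample2"))
)

# Counts per million: divide each column by its library size, then scale.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Log-transform with a pseudocount of 1 (an assumed choice) to avoid log2(0).
bulk_dataset <- log2(cpm + 1)
```

The result keeps genes in rows and samples in columns, which is the orientation described above.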

sc_dataset

Matrix or Seurat object. Single-cell RNA-seq expression matrix of related disease. Each row represents a gene and each column represents a cell. A Seurat object that contains the preprocessed data and constructed network is preferred. Otherwise, the raw count expression matrix is processed using Seurat's default parameters and a cell-cell similarity network is constructed from it. See run_Seurat for details.

phenotype

Phenotype annotation of each bulk sample. It can be a continuous dependent variable, binary group indicator vector, or clinical survival data:

Continuous dependent variable. Should be a quantitative vector for family = gaussian.
Binary group indicator vector. Should be either a 0-1 encoded vector or a factor with two levels for family = binomial.
Clinical survival data. Should be a two-column matrix with columns named 'time' and 'status'. The latter is a binary variable, with '1' indicating an event (e.g. recurrence of cancer or death) and '0' indicating right censoring. The function Surv() in package survival produces such a matrix.
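
The three phenotype formats listed above could be built as in the following sketch. All values and group names are made up for illustration, and the survival matrix is assembled with base R's cbind() so that the example has no package dependencies.

```r
# Continuous phenotype for family = "gaussian" (values are made up).
phenotype_gaussian <- c(2.3, 5.1, 3.8, 4.4)

# Binary phenotype for family = "binomial": a 0-1 vector or a two-level factor.
phenotype_binomial <- c(0, 1, 1, 0)
phenotype_factor <- factor(c("Normal", "Tumor", "Tumor", "Normal"))

# Survival phenotype for family = "cox": two columns named 'time' and 'status'.
# status: 1 = event, 0 = right censored.
phenotype_cox <- cbind(
  time = c(120, 340, 95, 410),
  status = c(1, 0, 1, 0)
)
```

Each phenotype needs one entry (or one row) per bulk sample, so these examples assume a bulk_dataset with four columns.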

assay

Character. Name of the assay to use from the Seurat object (default: "RNA").

tag

Names for each phenotypic group. Used for logistic regressions only.

nfeature

Numeric. The number of features to select as top variable features in sc_dataset. The top variable features are intersected with the features of bulk_dataset. Default is NULL, in which case all features are used.
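
The intersection step can be pictured with the minimal sketch below. It only illustrates the idea with base R's intersect(); it is not the package's internal code, and the gene names and the variable `sc_variable_genes` are hypothetical.

```r
# Hypothetical gene names; in practice these would come from rownames()
# of the bulk matrix and from the selected variable features.
bulk_genes <- c("TP53", "EGFR", "MYC", "GAPDH")
sc_variable_genes <- c("MYC", "CD3E", "TP53")

# Genes present in both sets, in the order they appear in bulk_genes.
common_genes <- intersect(bulk_genes, sc_variable_genes)
```

A small nfeature can leave few shared genes, so checking length(common_genes) before modelling is a reasonable precaution.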

do_imputation

Logical. Whether to perform imputation on single-cell data (default: TRUE).

imputation_method

Character. Name of the imputation method to use, either "KNN" or "ALRA".

alpha

Numeric. Parameter used to balance the effect of the l1 norm and the network-based penalties. It can be a number or a searching vector. If alpha = NULL, a default searching vector is used. The range of alpha is in [0,1]. A larger alpha lays more emphasis on the l1 norm.

network_class

Character. The source of the feature-feature similarity network, either "SC" (the default) or "bulk".

independent

Logical. Whether the background distribution of risk scores is constructed independently for each cell (default: TRUE).

family

Character. Response type for the regression model. It depends on the type of the given phenotype and can be family = gaussian for linear regression, family = binomial for classification, or family = cox for Cox regression.

permutation_times

Integer. Number of permutation iterations for statistical significance testing (default: 2000). Higher values increase accuracy but also computation time. Recommended: 1000-5000. For faster testing, use 500-1000.

FDR.threshold

Numeric. FDR value threshold for identifying phenotype-associated cells. The default is 0.05.
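
To show how an FDR cutoff separates cells, here is a minimal sketch in base R. It assumes Benjamini-Hochberg adjustment via p.adjust(), which the documentation above does not confirm, and it uses hypothetical labels ("positive", "negative", "background") split by the sign of a made-up normalized score. The labels that scPAS actually writes may differ.

```r
# Made-up p-values and normalized risk scores for five cells.
p_values <- c(0.001, 0.20, 0.004, 0.60, 0.04)
nrs <- c(2.9, 0.4, -3.1, 0.1, 1.8)

# Benjamini-Hochberg adjustment (an assumption for this sketch).
fdr <- p.adjust(p_values, method = "BH")

# Hypothetical labelling: cells passing the cutoff are split by score sign.
FDR.threshold <- 0.05
label <- ifelse(
  fdr >= FDR.threshold, "background",
  ifelse(nrs > 0, "positive", "negative")
)
```

Lowering FDR.threshold moves more cells into the background group.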

n_cores

Integer. Number of CPU cores to use for parallel permutation test (default: 1 for sequential processing). Setting n_cores > 1 enables parallel computing which can significantly speed up the analysis (2-4x faster with 4 cores). Requires 'future' and 'future.apply' packages.
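
One cautious way to choose n_cores is sketched below. It checks that the two packages named above are installed and caps the worker count at the cores actually available; the cap of 4 is an arbitrary value chosen for this example.

```r
# Only request parallel workers if both required packages are installed.
if (requireNamespace("future", quietly = TRUE) &&
    requireNamespace("future.apply", quietly = TRUE)) {
  # Use at most 4 workers, and never more than the cores available.
  n_cores <- min(4L, future::availableCores())
} else {
  # Fall back to sequential processing.
  n_cores <- 1L
}
```

The resulting value can then be supplied as the n_cores argument of scPAS().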

Value

This function returns a Seurat object with the following components added:

scPAS_para: A list containing the final model parameters, added to misc.
PAS result: A data frame containing risk scores (scPAS_RS), normalized risk scores (scPAS_NRS), p-values (scPAS_Pvalue), adjusted p-values (scPAS_FDR), and cell classification labels (scPAS), added to meta.data.