scPAS : Single-Cell Phenotype-Associated Subpopulation identifier

Introduction

This scPAS R package is a bioinformatics tool designed to identify phenotype-associated cell subpopulations in single-cell data by integrating bulk data. This approach provides both quantitative and qualitative estimates of the association between cells and phenotypes. The core function of this package is scPAS. It requires three input data: single-cell RNA-seq data, bulk-seq data, and the phenotype of patients in bulk samples.

Installation

The scPAS is developed under R (version >= 4.0.5). scPAS R package can be installed from Github using devtools:

devtools::install_github("aiminXie/scPAS")

If it cannot be installed automatically, install the following dependencies:

install.packages('Seurat')
install.packages('Rcpp')
install.packages('Matrix')

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("preprocessCore")

library(scPAS)
library(Matrix)
library(Seurat)

Apply scPAS with Cox regression

In this example, we used TCGA-LIHC bulk samples with survival information to identify survival-related cell subpopulations within the LIHC scRNA-seq data.

Load data

The data used in this tutorial is stored at: https://drive.google.com/drive/folders/1Qgdx7YWSkJJ-ZjYHFt9tD33ZC85vQk8o?usp=sharing

# Data available at: https://drive.google.com/drive/folders/1Qgdx7YWSkJJ-ZjYHFt9tD33ZC85vQk8o
load(file = 'data_survival.rda')
dim(sc_dataset)

head(bulk_dataset[,1:10])
head(phenotype)

Prepare the scRNA-seq data

Single-cell data needs to be processed using the Seurat pipeline to create a Seurat object suitable for visualization. To avoid the impact of extremely sparsely expressed genes on the model, they need to be removed beforehand.

keep_genes <- rownames(sc_dataset)[rowSums(sc_dataset@assays$RNA@counts>1) >= 20]
sc_dataset <- subset(sc_dataset,features = keep_genes)

sc_dataset <- run_Seurat(sc_dataset,verbose = FALSE)

library(RColorBrewer)
qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector = unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
set.seed(seed = 1234)
cellType_col <- sample(col_vector, 7)
names(cellType_col) <- levels(sc_dataset$celltype)

UMAP_celltype <- DimPlot(sc_dataset, reduction ="umap",group.by="celltype",label =T, cols = cellType_col)
UMAP_celltype

Prepare the bulk data and phenotype

The bulk data required by scPAS is a matrix. Similarly, genes that are not expressed in the majority of samples in bulk data will also be removed.

bulk_dataset <- as.matrix(bulk_dataset)
class(bulk_dataset)
dim(bulk_dataset)
bulk_dataset = bulk_dataset[apply(bulk_dataset, 1, function(x)sum(x==0) < 0.25 *ncol(bulk_dataset)),]
dim(bulk_dataset)

The clinical survival phenotype requires a two-column matrix with columns named ‘time’ and ‘status’. The first column contains the survival time. The second column is a binary variable, with ‘1’ indicating event (e.g.recurrence of cancer or death), and ‘0’ indicating right-censored.

head(phenotype)

The phenotypic data must match the samples of the bulk expression profile.

all(rownames(phenotype)==colnames(bulk_dataset))

Run scPAS without imputation

The input provided above is survival data; therefore, scPAS needs to fit a Cox regression model (family = ‘cox’).
do_imputation = FALSE , without imputation.
nfeature = 3000 indicates that the top 3000 highly variable genes are selected for model training.
network_class = ‘SC’ indicates gene-gene similarity networks derived from single-cell data.

time.start <- Sys.time()

sc_dataset <- scPAS(bulk_dataset = bulk_dataset,sc_dataset = sc_dataset,assay = 'RNA',phenotype = phenotype,do_imputation = FALSE, nfeature = 3000,alpha = 0.01,network_class = 'SC',family = 'cox')
time.end <- Sys.time()

time.end-time.start

scPAS will return a Seurat object, and the predicted results are added to the metadata.

head(sc_dataset@meta.data)

The scPAS provides both quantitative (scPAS_NRS) and qualitative (scPAS) prediction results:

library(ggplot2)
UMAP_scPAS <- DimPlot(sc_dataset, reduction = 'umap',
                         group.by = 'scPAS',raster=FALSE,
                         cols = c("0"='grey',"scPAS+"='indianred1',"scPAS-"='royalblue'),
                         pt.size = 0.001, order = c("scPAS-","scPAS+")) + ggtitle(label = NULL) + labs(col = "scPAS")
UMAP_RS <- FeaturePlot(object = sc_dataset,raster=FALSE,features = 'scPAS_NRS',max.cutoff = 2,min.cutoff = -2) + scale_colour_gradientn(colours = rev(brewer.pal(n = 10, name = "RdBu")))+
  ggtitle(label = NULL) + labs(col = "Risk score")

UMAP_celltype | UMAP_RS | UMAP_scPAS

Run scPAS with imputation

do_imputation = TRUE , implement single-cell data imputation using the default KNN method.

time.start <- Sys.time()

sc_dataset <- scPAS(bulk_dataset = bulk_dataset,sc_dataset = sc_dataset,assay = 'RNA',phenotype = phenotype,do_imputation = TRUE, nfeature = 3000,alpha = 0.01,network_class = 'SC',family = 'cox')
time.end <- Sys.time()

time.end-time.start

A new assay (imputation) has been created to store the imputed expression profiles.

sc_dataset

Using imputed expression profiles for prediction, scPAS will capture more cells.

library(ggplot2)
UMAP_scPAS <- DimPlot(sc_dataset, reduction = 'umap',
                         group.by = 'scPAS',raster=FALSE,
                         cols = c("0"='grey',"scPAS+"='indianred1',"scPAS-"='royalblue'),
                         pt.size = 0.001, order = c("scPAS-","scPAS+")) + ggtitle(label = NULL) + labs(col = "scPAS")
UMAP_RS <- FeaturePlot(object = sc_dataset,raster=FALSE,features = 'scPAS_NRS',max.cutoff = 2,min.cutoff = -2) + scale_colour_gradientn(colours = rev(brewer.pal(n = 10, name = "RdBu")))+
  ggtitle(label = NULL) + labs(col = "Risk score")

UMAP_celltype | UMAP_RS | UMAP_scPAS

Other imputation methods can also be used. Users can manually create a new assay (e.g., imputation) to store the imputed expression profiles. By changing the assay parameter in the scPAS function to the corresponding assay name (e.g., assay = ‘imputation’), scPAS will use the imputed expression profiles for prediction.

Apply the trained model to independent bulk data

The returned Seurat object not only contains the prediction results for each single cell but also provides the trained model.

names(sc_dataset@misc$scPAS_para)
head(sort(sc_dataset@misc$scPAS_para$Coefs))
tail(sort(sc_dataset@misc$scPAS_para$Coefs))

Applying this model to other independent bulk data can enable the prediction of prognostic risk for bulk samples.

load(file = '../data/LIRI_data.Rdata')
LIRI_exp[1:6,1:6]
head(LIRI_sur)

Use the scPAS.prediction function to perform predictions on independent data.

LIRI_pre_res <- scPAS.prediction(model = sc_dataset,test.data = LIRI_exp)
head(LIRI_pre_res)

Survival analysis shows that applying this model to the ICGC-LIRI dataset can effectively predict patient survival outcomes.

library("survival")
library("survminer")
LIRI_pre_res$status <- LIRI_sur$status
LIRI_pre_res$time <- LIRI_sur$time

fit <- survfit(Surv(time, status) ~ scPAS, data = LIRI_pre_res)
survplot_LIRI <- ggsurvplot(
  fit, title='LIRI',
  data = LIRI_pre_res,
  size = 1,                
  conf.int = F,         
  pval.method = TRUE,
  pval = TRUE,            
  risk.table = F,       
  risk.table.height = 0.25,
  xlab ='Days',
  ylab='Survival probability',
  legend=c(0.7,0.2),
  legend.title=''
)

survplot_LIRI$plot

Apply the trained model to spatial transcriptomics data

ST_data <- readRDS(file = '../data/HCC_ST_L2.RDS')

SP1 <- SpatialDimPlot(ST_data, label = F, repel = T, label.size = 4,group.by = 'seurat_clusters',alpha = 0,
                      pt.size.factor = 1.6) + theme(legend.position = 'None')
SP2 <- SpatialDimPlot(ST_data, label = T,stroke = NA, repel = T, label.size = 4,group.by = 'seurat_clusters',alpha = 0.9,
                      pt.size.factor = 1.35) + theme(legend.position = "bottom")
SP1 | SP2

Directly applying the trained model to spatial transcriptomics data can reveal the spatial localization of prognostic-related cells

ST_data <- scPAS.prediction(model = sc_dataset,test.data = ST_data,assay = 'Spatial',do_imputation = TRUE,independent = TRUE)
head(ST_data)

The results show that cells associated with poor survival are primarily located in regions enriched with malignant liver cancer cells.


p3 <- SpatialFeaturePlot(ST_data, features = "scPAS_NRS",stroke=NA,pt.size.factor = 1.27,min.cutoff = -2,max.cutoff = 2)  + 
  ggtitle(label = NULL) + theme(legend.position = "top") + scale_fill_gradientn(name='Risk score',colours = Seurat:::SpatialColors(n = 100))

p4 <- SpatialDimPlot(ST_data,group.by = 'histology',cols = c("malignant"='indianred1',"non-malignant"= NA), stroke=NA,
                     alpha = c(1,0.1),pt.size.factor = 1.27) + ggtitle(label = NULL) + labs(col = "pathologists") + theme(legend.position = "top")

p5 <- SpatialDimPlot(ST_data,group.by = 'scPAS',cols = c("0"='grey',"scPAS+"='indianred1',"scPAS-"='royalblue'), stroke=NA,
                     alpha = c(1,0.1),pt.size.factor = 1.27) + ggtitle(label = NULL) + labs(col = "scPAS") + theme(legend.position = "top")

p4 | p3 | p5

Information about the current R session

sessionInfo()

Aimin Xie

2024-08-25