Preprocess simulated bulk data for model training. This includes log transformation, scaling, and gene filtering based on variance and on overlap with the prediction data.

Usage

ProcessTrainingData(
  simulation,
  prediction_data = NULL,
  var_cutoff = 0.1,
  scaling = c("log_min_max", "log_zscore", "none"),
  verbose = TRUE
)

Arguments

simulation

A TorchDeconSimulation object from SimulateBulk, or a list/matrix of bulk counts.

prediction_data

Matrix or data frame of bulk expression data for prediction (genes in rows, samples in columns). Used to find common genes.

var_cutoff

Numeric. Filter out genes with variance below this threshold. Default is 0.1.

scaling

Character. Scaling method to use. One of "log_min_max" (default), "log_zscore", or "none".

verbose

Logical. Print progress messages. Default is TRUE.

Value

A list containing:

X

Processed expression matrix (samples x genes), ready for training

Y

Cell type fractions matrix (samples x cell types)

genes

Character vector of genes used (signature genes)

celltypes

Character vector of cell type names

scaling

Scaling method used

scaling_params

Parameters for scaling (for applying to new data)

Details

The preprocessing pipeline:

  1. Find common genes between training and prediction data

  2. Filter genes by variance threshold

  3. Apply log2(x + 1) transformation

  4. Apply sample-wise min-max scaling (or z-score, per the scaling argument; "none" skips this step)
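The four steps above can be sketched in base R as follows. This is an illustrative reconstruction on toy matrices, not the function's actual internals; the toy objects train_counts and pred_counts stand in for the simulation counts and prediction_data.

```r
# Toy gene-by-sample count matrices (hypothetical stand-ins)
set.seed(1)
train_counts <- matrix(rpois(40, 10), nrow = 8,
                       dimnames = list(paste0("g", 1:8), paste0("s", 1:5)))
pred_counts  <- matrix(rpois(30, 10), nrow = 6,
                       dimnames = list(paste0("g", 3:8), paste0("p", 1:5)))

# 1. Find common genes between training and prediction data
common <- intersect(rownames(train_counts), rownames(pred_counts))
X <- train_counts[common, , drop = FALSE]

# 2. Filter genes by variance threshold (var_cutoff = 0.1)
X <- X[apply(X, 1, var) >= 0.1, , drop = FALSE]

# 3. Apply log2(x + 1) transformation
X <- log2(X + 1)

# 4. Sample-wise min-max scaling (each sample rescaled to [0, 1]),
#    then transpose so the result is samples x genes, as in the Value section
X <- apply(X, 2, function(s) (s - min(s)) / (max(s) - min(s)))
X <- t(X)
```

Note that step 4 operates per sample (column-wise before the transpose), so each sample's expression values span [0, 1] independently; the min/max (or mean/sd for "log_zscore") would be the values stored in scaling_params for reuse on new data.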

Examples

if (FALSE) { # \dontrun{
# Basic processing
processed <- ProcessTrainingData(simulation, prediction_data = bulk_data)

# Custom variance cutoff
processed <- ProcessTrainingData(
  simulation,
  prediction_data = bulk_data,
  var_cutoff = 0.05,
  scaling = "log_min_max"
)
} # }