Preprocess simulated bulk data for model training. This includes log transformation, scaling, and gene filtering based on variance and on overlap with the prediction data.

Usage

ProcessTrainingData(
  simulation,
  prediction_data = NULL,
  var_cutoff = 0.1,
  scaling = c("log_min_max", "log_zscore", "none"),
  verbose = TRUE
)

Arguments

simulation

A TorchDeconSimulation object from SimulateBulk, or a list/matrix of bulk counts.

prediction_data

Matrix or data frame of bulk expression data for prediction (genes in rows, samples in columns). Used to find common genes.

var_cutoff

Numeric. Filter out genes with variance below this threshold. Default is 0.1.

scaling

Character. Scaling method to use. One of "log_min_max" (default), "log_zscore", or "none".

verbose

Logical. Print progress messages. Default is TRUE.

Value

A list containing:

X

Processed expression matrix (samples x genes), ready for training

Y

Cell type fractions matrix (samples x cell types)

genes

Character vector of genes used (signature genes)

celltypes

Character vector of cell type names

scaling

Scaling method used

scaling_params

Parameters for scaling (for applying to new data)

Details

The preprocessing pipeline:

  1. Find common genes between training and prediction data

  2. Filter genes by variance threshold

  3. Apply log2(x + 1) transformation

  4. Apply sample-wise min-max scaling (or z-score, per the scaling argument; "none" skips this step)
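The four steps above can be sketched in base R as follows. This is an illustrative reconstruction on toy matrices, not the function's actual internals; the toy objects train_counts and pred_counts stand in for the simulation counts and prediction_data.

```r
# Toy gene-by-sample count matrices (hypothetical stand-ins)
set.seed(1)
train_counts <- matrix(rpois(40, 10), nrow = 8,
                       dimnames = list(paste0("g", 1:8), paste0("s", 1:5)))
pred_counts  <- matrix(rpois(30, 10), nrow = 6,
                       dimnames = list(paste0("g", 3:8), paste0("p", 1:5)))

# 1. Find common genes between training and prediction data
common <- intersect(rownames(train_counts), rownames(pred_counts))
X <- train_counts[common, , drop = FALSE]

# 2. Filter genes by variance threshold (var_cutoff = 0.1)
X <- X[apply(X, 1, var) >= 0.1, , drop = FALSE]

# 3. Apply log2(x + 1) transformation
X <- log2(X + 1)

# 4. Sample-wise min-max scaling (each sample rescaled to [0, 1]),
#    then transpose so the result is samples x genes, as in the Value section
X <- apply(X, 2, function(s) (s - min(s)) / (max(s) - min(s)))
X <- t(X)
```

Note that step 4 operates per sample (column-wise before the transpose), so each sample's expression values span [0, 1] independently; the min/max (or mean/sd for "log_zscore") would be the values stored in scaling_params for reuse on new data.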

Examples

if (FALSE) { # \dontrun{
# Basic processing
processed <- ProcessTrainingData(simulation, prediction_data = bulk_data)

# Custom variance cutoff
processed <- ProcessTrainingData(
  simulation,
  prediction_data = bulk_data,
  var_cutoff = 0.05,
  scaling = "log_min_max"
)
} # }