Preprocess simulated bulk data for model training. This includes log transformation, scaling, and gene filtering based on variance and intersection with prediction data.
Usage
ProcessTrainingData(
  simulation,
  prediction_data = NULL,
  var_cutoff = 0.1,
  scaling = c("log_min_max", "log_zscore", "none"),
  verbose = TRUE
)
Arguments
- simulation
A TorchDeconSimulation object from SimulateBulk, or a list/matrix of bulk counts.
- prediction_data
Matrix or data frame of bulk expression data for prediction (genes in rows, samples in columns). Used to find the genes shared with the training data. Default is NULL.
- var_cutoff
Numeric. Filter out genes with variance below this threshold. Default is 0.1.
- scaling
Character. Scaling method to use. One of "log_min_max" (default), "log_zscore", or "none".
- verbose
Logical. Print progress messages. Default is TRUE.
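The vector default of scaling follows a common R idiom in which match.arg() resolves the argument to a single choice. The sketch below illustrates that idiom only; pick_scaling is a hypothetical stand-in, and whether ProcessTrainingData uses match.arg() internally is an assumption.

```r
# Hypothetical stand-in for illustration; not the package's code.
pick_scaling <- function(scaling = c("log_min_max", "log_zscore", "none")) {
  # match.arg() returns the first choice when the argument is left at its
  # default and raises an error for a value outside the listed choices.
  match.arg(scaling)
}

pick_scaling()              # "log_min_max"
pick_scaling("log_zscore")  # "log_zscore"
```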
Value
A list containing:
- X
Processed expression matrix (samples x genes), ready for training
- Y
Cell type fractions matrix (samples x cell types)
- genes
Character vector of genes used (signature genes)
- celltypes
Character vector of cell type names
- scaling
Scaling method used
- scaling_params
Parameters for scaling (for applying to new data)
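Examples
The steps named in the description (variance filtering, intersection with the prediction genes, log transformation, scaling) can be sketched in base R. This is a minimal illustration, not the package's implementation: sketch_preprocess is a hypothetical helper, and the order of the steps, the use of log2(x + 1), computing variance on the raw counts, and scaling each sample to [0, 1] are all assumptions.

```r
# Hypothetical helper for illustration; not part of the package.
# X: samples x genes matrix of simulated bulk counts (gene names as colnames).
# prediction_data: genes x samples matrix (gene names as rownames), or NULL.
sketch_preprocess <- function(X, prediction_data = NULL, var_cutoff = 0.1) {
  # Assumption: variance is computed per gene on the raw counts.
  gene_var <- apply(X, 2, var)
  keep <- colnames(X)[gene_var >= var_cutoff]

  # Restrict to genes that also appear in the prediction data.
  if (!is.null(prediction_data)) {
    keep <- intersect(keep, rownames(prediction_data))
  }
  X <- X[, keep, drop = FALSE]

  # Assumption: log2(x + 1) transform.
  X <- log2(X + 1)

  # Assumption: min-max scaling within each sample (row).
  rng <- t(apply(X, 1, range))
  span <- rng[, 2] - rng[, 1]
  span[span == 0] <- 1  # avoid division by zero for constant samples
  X <- (X - rng[, 1]) / span

  list(X = X, genes = keep)
}

# Toy data: g4 has zero variance, g3 is absent from the prediction data.
counts <- matrix(
  c(10, 20, 30,  0, 4, 8,  5, 5, 6,  7, 7, 7),
  nrow = 3,
  dimnames = list(c("s1", "s2", "s3"), c("g1", "g2", "g3", "g4"))
)
pred <- matrix(
  1, nrow = 4, ncol = 2,
  dimnames = list(c("g1", "g2", "g4", "g5"), c("p1", "p2"))
)

out <- sketch_preprocess(counts, pred, var_cutoff = 0.1)
out$genes  # "g1" "g2"
dim(out$X) # 3 samples x 2 genes
```

The returned X follows the samples x genes orientation described under Value, while the prediction data keeps genes in rows, so the intersection is taken between colnames of the training matrix and rownames of the prediction matrix.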