Best Practices and Troubleshooting
Zaoqu Liu
2026-01-23
Best Practices
This vignette provides recommendations for optimal use of MultiK and solutions to common issues.
1. Data Preprocessing
1.1 Quality Control
Before running MultiK, ensure your data has been properly quality-controlled:
library(Seurat)
# Standard QC metrics
seu <- PercentageFeatureSet(seu, pattern = "^MT-", col.name = "percent.mt")
# Filter cells
seu <- subset(
  seu,
  subset = nFeature_RNA > 200 &
    nFeature_RNA < 5000 &
    percent.mt < 20
)
1.2 Recommended Preprocessing
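MultiK handles normalization internally, but you can normalize and select variable features beforehand to keep preprocessing consistent with the rest of your workflow: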
# If you want to pre-process
seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu, nfeatures = 2000)
# Pass to MultiK - it will scale and run PCA
result <- MultiK(seu, reps = 100)
2. Parameter Selection
2.1 Number of Repetitions (reps)
| Dataset Size | Recommended reps |
|---|---|
| < 5,000 cells | 100-150 |
| 5,000-20,000 cells | 100 |
| > 20,000 cells | 50-100 |
Rule of thumb: More repetitions = more stable results, but longer runtime.
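The table above can be encoded as a small lookup so that scripts pick a starting value consistently. This is a minimal sketch in base R; `recommend_reps()` is a hypothetical helper written for this vignette, not a MultiK function, and the boundaries simply mirror the table.

```r
# Hypothetical helper (not part of MultiK): map a cell count to the
# range of `reps` suggested in the table above.
recommend_reps <- function(n_cells) {
  stopifnot(is.numeric(n_cells), length(n_cells) == 1, n_cells > 0)
  if (n_cells < 5000) {
    c(min = 100, max = 150)
  } else if (n_cells <= 20000) {
    c(min = 100, max = 100)
  } else {
    c(min = 50, max = 100)
  }
}

recommend_reps(3000)
recommend_reps(12000)
recommend_reps(50000)
```

With a Seurat object, `recommend_reps(ncol(seu))[["min"]]` would give a value that can be passed as `reps`.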
3. Computational Considerations
3.1 Parallel Processing
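`parallel::detectCores()` can return `NA` when the core count cannot be determined, and on shared machines it may report more cores than you are allowed to use. A guarded helper is one way to handle both cases. The sketch below is an assumption of this vignette, not part of MultiK; `pick_cores()`, `max_cores`, and `reserve` are hypothetical names.

```r
# Hypothetical helper (not part of MultiK): choose a worker count that
# leaves `reserve` cores free and never exceeds a user-supplied cap.
pick_cores <- function(max_cores = Inf, reserve = 1L) {
  available <- parallel::detectCores()
  if (is.na(available)) {
    available <- 1L  # fall back to a single core when the count is unknown
  }
  as.integer(max(1L, min(available - reserve, max_cores)))
}

n_cores <- pick_cores(max_cores = 8)
n_cores
```

The result can then be supplied as the `cores` argument in the calls shown in this section.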
# Check available cores
parallel::detectCores()
# Use all but one core
result <- MultiK(seu, cores = parallel::detectCores() - 1)
# For HPC/cluster environments with limited memory per core
result <- MultiK(seu, cores = 8)
3.2 Memory Management
For large datasets:
# Reduce features to save memory
seu <- FindVariableFeatures(seu, nfeatures = 1500)
# Reduce PCA dimensions
result <- MultiK(seu, nPC = 20)
# Reduce reps if memory-constrained
result <- MultiK(seu, reps = 50)
4. Interpreting Results
4.1 Clear Optimal K
Ideal scenario:
- Single peak in K frequency distribution
- Low rPAC at that K
- Pareto-optimal point stands out
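To make "Pareto-optimal" concrete: a candidate K is Pareto-optimal when no other K is at least as frequent and at least as stable (lower rPAC) while being strictly better on one of the two. The sketch below applies that rule to made-up numbers in base R. It illustrates the criterion only and is not MultiK's internal code; `is_pareto_optimal()` and the `candidates` table are hypothetical.

```r
# Made-up values for illustration only; real ones come from a MultiK run.
candidates <- data.frame(
  k         = c(2, 3, 4, 5, 6),
  frequency = c(10, 45, 12, 30, 5),
  rpac      = c(0.30, 0.02, 0.20, 0.03, 0.25)
)

# A point is dominated if another point has frequency >= and rPAC <=,
# with at least one of the two strictly better.
is_pareto_optimal <- function(frequency, rpac) {
  n <- length(frequency)
  vapply(seq_len(n), function(i) {
    dominated_by <- frequency >= frequency[i] & rpac <= rpac[i] &
      (frequency > frequency[i] | rpac < rpac[i])
    !any(dominated_by)
  }, logical(1))
}

candidates$pareto <- is_pareto_optimal(candidates$frequency, candidates$rpac)
candidates[candidates$pareto, ]
```

In this toy table only one row survives, which corresponds to the clear-cut case. If several rows were flagged, that would be the multiple-candidate situation.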
# Result is straightforward
optK <- result$optimal_k
clusters <- getClusters(seu, optK = optK)
4.2 Multiple Candidate K Values
When multiple K values appear Pareto-optimal:
# Consider biological context
# Lower K: Major cell types
# Higher K: Subtypes/states
# Examine both
clusters_low <- getClusters(seu, optK = 3)
clusters_high <- getClusters(seu, optK = 5)
# Use SigClust to help decide
pval_low <- CalcSigClust(seu, clusters_low$clusters[, 1])
pval_high <- CalcSigClust(seu, clusters_high$clusters[, 1])
4.3 Hierarchical Relationships
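One simple way to check whether a higher-K solution refines a lower-K one is to cross-tabulate the two label vectors: if every higher-K cluster falls entirely inside a single lower-K cluster, the solutions are nested. The sketch below uses made-up label vectors in base R; `labels_low` and `labels_high` are hypothetical stand-ins for the assignments obtained at each K.

```r
# Made-up label vectors for illustration; in practice these would be the
# cluster assignments obtained at the lower and higher K.
labels_low  <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
labels_high <- c(1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5)

# Rows: higher-K clusters; columns: lower-K clusters.
overlap <- table(high = labels_high, low = labels_low)
overlap

# A higher-K cluster is nested if all of its cells sit in one lower-K cluster.
nested <- apply(overlap, 1, function(row) sum(row > 0) == 1)
all(nested)
```

Rows with counts spread over several columns would indicate that the higher-K solution reshuffles cells rather than splitting existing groups.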
# If K=5 is optimal but K=3 also looks good
# Check if 5 clusters = 3 major + 2 subtypes
# Run at both K values
PlotSigClust(seu, clusters_low$clusters[, 1], pval_low)
PlotSigClust(seu, clusters_high$clusters[, 1], pval_high)
5. Troubleshooting
5.2 High PAC for All K
Cause: Data may have continuous structure rather than discrete clusters.
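To see why continuous structure inflates PAC, recall that PAC (proportion of ambiguous clustering) is the share of cell pairs whose consensus value is neither close to 0 nor close to 1. The base R sketch below computes it on two toy consensus matrices. The 0.1 and 0.9 cutoffs are commonly used values and are an assumption here; this is not MultiK's own implementation.

```r
# Illustrative PAC calculation on toy consensus matrices (not MultiK's code).
# Cutoffs of 0.1 and 0.9 are assumed, not taken from MultiK.
pac <- function(consensus, lower = 0.1, upper = 0.9) {
  values <- consensus[upper.tri(consensus)]
  mean(values > lower & values <= upper)
}

# Discrete structure: pairs co-cluster either always or never.
crisp <- matrix(0, 6, 6)
crisp[1:3, 1:3] <- 1
crisp[4:6, 4:6] <- 1

# Continuous structure: every pair co-clusters about half the time.
fuzzy <- matrix(0.5, 6, 6)
diag(fuzzy) <- 1

pac(crisp)
pac(fuzzy)
```

The first matrix has no ambiguous pairs, while in the second every pair is ambiguous, which is the pattern expected when cells lie along a gradient.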
Solutions:
# Check for batch effects (assumes a "batch" column in the metadata)
DimPlot(seu, group.by = "batch")
# Consider trajectory analysis instead
# Or accept that data may have transitional populations
5.3 SigClust Returns NA
Cause: Too few cells in a cluster.
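Before changing K, it can help to confirm which clusters are actually small. The base R sketch below counts cells per cluster and flags those under a threshold. The labels are made up, and `min_cells` is a hypothetical cutoff, not a value documented by MultiK or SigClust.

```r
# Made-up cluster labels for illustration.
cluster_labels <- c(rep("A", 120), rep("B", 80), rep("C", 4))

# Hypothetical threshold; choose one appropriate for your data.
min_cells <- 10

cluster_sizes <- table(cluster_labels)
small_clusters <- names(cluster_sizes)[as.vector(cluster_sizes) < min_cells]
small_clusters
```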
Solutions:
# Use lower K to get larger clusters
clusters <- getClusters(seu, optK = result$optimal_k - 1)
# Or increase nsim
pval <- CalcSigClust(seu, clusters$clusters[, 1], nsim = 500)
5.4 Long Runtime
Solutions:
# Reduce resolution granularity
result <- MultiK(seu, resolution = seq(0.1, 2, 0.1))
# Use more cores
result <- MultiK(seu, cores = parallel::detectCores())
# Reduce reps (minimum ~50 for stability)
result <- MultiK(seu, reps = 50)
# Subsample large datasets first (set a seed so the subset is reproducible)
set.seed(42)
seu_sub <- seu[, sample(ncol(seu), min(10000, ncol(seu)))]
result <- MultiK(seu_sub, reps = 100)
6. Validation Strategies
6.1 Biological Validation
# Check known markers
FeaturePlot(seu, features = c("CD3D", "CD14", "MS4A1"))
# Compare to reference
# (if you have annotated reference data)
7. Reporting Guidelines
When publishing results using MultiK, report:
- Parameters used: reps, pSample, resolution range, nPC
- Optimal K selected and rationale
- Diagnostic plots (Figure)
- SigClust p-values for cluster validation
- MultiK version:
packageVersion("MultiK")
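One way to keep these details together is to store the run settings in a list next to the R version and save them alongside the results. The sketch below uses base R only. The values mirror the example methods text, except `nPC`, which is a placeholder; `run_params` and `params_file` are hypothetical names.

```r
# Hypothetical record of run settings; replace the values with the ones
# you actually used.
run_params <- list(
  reps       = 100,
  pSample    = 0.8,
  resolution = seq(0.05, 2, by = 0.05),
  nPC        = 30  # placeholder value, not a documented default
)

# Save the settings together with the R version, then read them back.
params_file <- tempfile(fileext = ".rds")
saveRDS(list(params = run_params, r_version = R.version.string), params_file)
restored <- readRDS(params_file)
str(restored$params)
```

In a real analysis the file would go in the project directory rather than a temporary location, and the output of `packageVersion("MultiK")` could be added to the same list.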
Example Methods Text
“Optimal cluster number was determined using the MultiK algorithm (v1.0.0, Liu 2025). We performed 100 subsampling iterations with 80% cell sampling, testing resolution parameters from 0.05 to 2.0 in increments of 0.05. The optimal K was selected based on the Pareto frontier of frequency and stability (rPAC). Cluster significance was validated using pairwise SigClust tests with 100 simulations.”