Best Practices

This vignette provides recommendations for optimal use of MultiK and solutions to common issues.

1. Data Preprocessing

1.1 Quality Control

Before running MultiK, ensure your data has been properly quality-controlled:

library(Seurat)

# Standard QC metrics
seu <- PercentageFeatureSet(seu, pattern = "^MT-", col.name = "percent.mt")

# Filter cells
seu <- subset(seu, subset = 
  nFeature_RNA > 200 & 
  nFeature_RNA < 5000 & 
  percent.mt < 20
)

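MultiK handles normalization internally, but you can normalize and select variable features yourself to keep the object consistent with the rest of your analysis: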

# If you want to pre-process
seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu, nfeatures = 2000)

# Pass to MultiK - it will scale and run PCA
result <- MultiK(seu, reps = 100)

2. Parameter Selection

2.1 Number of Repetitions (reps)

Dataset size          Recommended reps
< 5,000 cells         100-150
5,000-20,000 cells    100
> 20,000 cells        50-100

Rule of thumb: more repetitions give more stable results at the cost of longer runtime.
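The size-based recommendations can be encoded as a small lookup. The helper below, `suggest_reps()`, is a hypothetical convenience function and not part of MultiK; it simply returns the recommended range for a given number of cells.

```r
# Hypothetical helper (not part of MultiK): map dataset size to the
# recommended range of reps listed above.
suggest_reps <- function(n_cells) {
  stopifnot(is.numeric(n_cells), length(n_cells) == 1, n_cells > 0)
  if (n_cells < 5000) {
    c(min = 100, max = 150)
  } else if (n_cells <= 20000) {
    c(min = 100, max = 100)
  } else {
    c(min = 50, max = 100)
  }
}

suggest_reps(3000)   # min 100, max 150
suggest_reps(50000)  # min 50, max 100
```

Starting at the lower end of the range and increasing reps only if the selected K changes between runs keeps runtime down.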

2.2 Subsampling Proportion (pSample)

# Default: 80% of cells
result <- MultiK(seu, pSample = 0.8)

# For very small datasets (< 500 cells), consider a higher proportion
result <- MultiK(seu, pSample = 0.9)

# For large datasets, 80% is usually sufficient
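To see what a given `pSample` means in absolute numbers, the sketch below computes how many cells are drawn and held out per subsample. It assumes the subsample size is rounded to the nearest cell, which may not match MultiK's internal rounding exactly; `subsample_sizes()` is a hypothetical helper.

```r
# Cells drawn and held out per subsample for a given pSample.
# Assumption: the subsample size is rounded to the nearest cell.
subsample_sizes <- function(n_cells, pSample = 0.8) {
  stopifnot(pSample > 0, pSample <= 1)
  drawn <- round(n_cells * pSample)
  c(drawn = drawn, held_out = n_cells - drawn)
}

subsample_sizes(400, 0.8)  # 320 drawn, 80 held out
subsample_sizes(400, 0.9)  # 360 drawn, 40 held out
```

With only a few hundred cells, a higher proportion leaves more cells in each subsample, so small populations are less likely to be missed.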

2.3 Resolution Range

# Default covers most use cases
result <- MultiK(seu, resolution = seq(0.05, 2, 0.05))

# If you expect few clusters (< 5)
result <- MultiK(seu, resolution = seq(0.05, 1, 0.05))

# If you expect many clusters (> 15)
result <- MultiK(seu, resolution = seq(0.1, 3, 0.1))
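The resolution grid also drives the workload. The sketch below counts the resolution values in each grid and multiplies by `reps`, assuming that every subsample is clustered once per resolution value (an assumption about MultiK's internals, not a documented guarantee).

```r
reps <- 100  # number of subsampling repetitions

# The three grids used in the examples above
grids <- list(
  default       = seq(0.05, 2, 0.05),
  few_clusters  = seq(0.05, 1, 0.05),
  many_clusters = seq(0.1, 3, 0.1)
)

# Number of resolution values per grid and the implied clustering runs
n_res <- sapply(grids, length)
data.frame(n_resolutions = n_res, total_runs = reps * n_res)
```

Under that assumption, halving the grid roughly halves the clustering work, which is why a coarser or narrower grid is an effective way to cut runtime.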

2.4 PCA Dimensions (nPC)

# Inspect the elbow plot (requires a PCA reduction on seu, e.g. from RunPCA)
ElbowPlot(seu, ndims = 50)

# Choose nPC near the elbow, or enough PCs to capture ~80-90% of the variance
result <- MultiK(seu, nPC = 30)  # Default is often good
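The variance-based rule can be made concrete with the per-PC standard deviations. The sketch below uses made-up values; for a Seurat object they could come from `Stdev(seu, reduction = "pca")`. Note that the share is computed relative to the PCs that were calculated, not the total variance of the data, and `pick_npc()` is a hypothetical helper.

```r
# Toy standard deviations for 10 PCs (illustrative values only).
pc_sdev <- c(6, 4, 3, 2, 1.5, 1, 1, 0.8, 0.6, 0.5)

# Share of variance per PC and its running total
var_explained <- pc_sdev^2 / sum(pc_sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of PCs whose cumulative share reaches the target
pick_npc <- function(cum_var, target = 0.85) which(cum_var >= target)[1]

pick_npc(cum_var, 0.85)  # 3
```

It is worth comparing the result with the visual elbow; when the two disagree, trying both values of nPC shows whether the selected K is sensitive to the choice.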

3. Computational Considerations

3.1 Parallel Processing

# Check available cores
parallel::detectCores()

# Use all but one core (detectCores() can return NA, so guard the result)
n_cores <- max(1, parallel::detectCores() - 1, na.rm = TRUE)
result <- MultiK(seu, cores = n_cores)

# For HPC/cluster environments with limited memory per core
result <- MultiK(seu, cores = 8)
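A small wrapper can combine both ideas: leave one core free, and cap the worker count when memory per core is limited. `safe_cores()` below is a hypothetical helper, not part of MultiK.

```r
# Hypothetical helper: pick a worker count that is never NA or below 1.
safe_cores <- function(reserve = 1, cap = Inf) {
  n <- parallel::detectCores()
  if (is.na(n)) n <- 1L  # detectCores() may return NA on some systems
  as.integer(max(1, min(n - reserve, cap)))
}

safe_cores()         # all but one core, at least 1
safe_cores(cap = 8)  # never more than 8 workers
```

On shared clusters, `detectCores()` reports the cores of the machine rather than those allocated to the job, so an explicit `cap` matching the scheduler allocation is the safer choice.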

3.2 Memory Management

For large datasets:

# Reduce features to save memory
seu <- FindVariableFeatures(seu, nfeatures = 1500)

# Reduce PCA dimensions
result <- MultiK(seu, nPC = 20)

# Reduce reps if memory-constrained
result <- MultiK(seu, reps = 50)
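A rough way to judge whether a dataset will fit in memory is to size one dense cell-by-cell matrix. The sketch assumes consensus matrices are stored densely as doubles; MultiK's internal representation may differ, so treat the numbers as an upper-bound style estimate rather than a measurement.

```r
# Size of one dense n x n matrix of doubles (8 bytes per entry), in GiB.
# Assumption: consensus matrices are stored densely.
dense_matrix_gib <- function(n_cells) 8 * n_cells^2 / 1024^3

round(dense_matrix_gib(c(2000, 10000, 50000)), 2)  # 0.03 0.75 18.63
```

Because this term grows with the square of the cell count, reducing features or PCs does not shrink it; for very large datasets, subsampling cells has the larger effect.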

3.3 Runtime Estimates

Cells     Reps    Cores    Approximate time
2,000     100     4        5-10 min
10,000    100     8        20-40 min
50,000    50      16       1-2 hours
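These figures depend heavily on hardware, so timing a short pilot run on your own machine gives a better estimate. The sketch below extrapolates from a pilot under the idealised assumption that runtime scales linearly with reps and inversely with cores; `estimate_minutes()` is a hypothetical helper and the pilot timing is an invented example.

```r
# Extrapolate a full run from a timed pilot, assuming runtime scales
# linearly with reps and inversely with cores (an idealisation).
estimate_minutes <- function(pilot_seconds, pilot_reps, pilot_cores,
                             target_reps, target_cores) {
  core_seconds_per_rep <- pilot_seconds * pilot_cores / pilot_reps
  core_seconds_per_rep * target_reps / target_cores / 60
}

# Example: a 10-rep pilot on 4 cores took 45 seconds
estimate_minutes(45, 10, 4, target_reps = 100, target_cores = 8)  # 3.75
```

Parallel overhead usually makes real scaling worse than this, so the estimate is best read as a lower bound.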

4. Interpreting Results

4.1 Clear Optimal K

Ideal scenario:

  - Single peak in the K frequency distribution
  - Low rPAC at that K
  - Pareto-optimal point stands out

# Result is straightforward
optK <- result$optimal_k
clusters <- getClusters(seu, optK = optK)
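To make "Pareto-optimal" concrete: a K is on the frontier if no other K is at least as frequent and at least as stable, and strictly better on one of the two. The sketch below applies that definition to a toy table with invented numbers; it is an illustration of the concept, not MultiK's own selection code.

```r
# Toy summary per candidate K (illustrative numbers, not real output)
summary_k <- data.frame(
  k    = c(2, 3, 4, 5, 6),
  freq = c(10, 35, 12, 30, 5),            # how often each K occurred
  rpac = c(0.20, 0.05, 0.15, 0.03, 0.25)  # lower means more stable
)

# A K is Pareto-optimal if no other K has freq >= and rpac <= its own,
# with at least one of the two strictly better.
is_pareto <- function(freq, rpac) {
  vapply(seq_along(freq), function(i) {
    dominated <- freq >= freq[i] & rpac <= rpac[i] &
      (freq > freq[i] | rpac < rpac[i])
    !any(dominated)
  }, logical(1))
}

summary_k$k[is_pareto(summary_k$freq, summary_k$rpac)]  # 3 5
```

In this toy case K = 3 is the most frequent and K = 5 the most stable, so both sit on the frontier and neither stands out on its own.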

4.2 Multiple Candidate K Values

When multiple K values appear Pareto-optimal:

# Consider biological context
# Lower K: Major cell types
# Higher K: Subtypes/states

# Examine both
clusters_low <- getClusters(seu, optK = 3)
clusters_high <- getClusters(seu, optK = 5)

# Use SigClust to help decide
pval_low <- CalcSigClust(seu, clusters_low$clusters[, 1])
pval_high <- CalcSigClust(seu, clusters_high$clusters[, 1])

4.3 Hierarchical Relationships

# If K=5 is optimal but K=3 also looks good
# Check if 5 clusters = 3 major + 2 subtypes

# Plot the SigClust results at both K values (objects from Section 4.2)
PlotSigClust(seu, clusters_low$clusters[, 1], pval_low)
PlotSigClust(seu, clusters_high$clusters[, 1], pval_high)

5. Troubleshooting

5.1 “No valid consensus matrices found”

Cause: All clustering runs produced the same K.

Solutions:

# Expand resolution range
result <- MultiK(seu, resolution = seq(0.01, 3, 0.02))

# Increase reps
result <- MultiK(seu, reps = 200)

5.2 High PAC for All K

Cause: Data may have continuous structure rather than discrete clusters.

Solutions:

# Check for batch effects
DimPlot(seu, group.by = "batch")

# Consider trajectory analysis instead
# Or accept that data may have transitional populations
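PAC (proportion of ambiguous clustering) is commonly defined as the fraction of cell pairs whose consensus value lies between a lower and an upper threshold. The sketch below uses 0.1 and 0.9 as thresholds, which are assumed values here, and shows why continuous structure drives PAC up; MultiK's own calculation may differ in detail.

```r
# PAC for a symmetric consensus matrix with entries in [0, 1].
# Thresholds x1 and x2 are assumed values; MultiK's own may differ.
pac <- function(consensus, x1 = 0.1, x2 = 0.9) {
  v <- consensus[lower.tri(consensus)]  # count each cell pair once
  mean(v > x1 & v < x2)
}

# Two clean blocks: pairs are always together (1) or never (0)
clean <- matrix(0, 4, 4)
clean[1:2, 1:2] <- 1
clean[3:4, 3:4] <- 1

# Continuous structure: every pair co-clusters about half the time
blurry <- matrix(0.5, 4, 4)
diag(blurry) <- 1

pac(clean)   # 0
pac(blurry)  # 1
```

When cells along a gradient are assigned to different clusters from one subsample to the next, their consensus values fall in the ambiguous range at every K, which is the pattern described above.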

5.3 SigClust Returns NA

Cause: Too few cells in a cluster.

Solutions:

# Use lower K to get larger clusters
clusters <- getClusters(seu, optK = result$optimal_k - 1)

# Or increase nsim
pval <- CalcSigClust(seu, clusters$clusters[, 1], nsim = 500)

5.4 Long Runtime

Solutions:

# Reduce resolution granularity
result <- MultiK(seu, resolution = seq(0.1, 2, 0.1))

# Use more cores
result <- MultiK(seu, cores = parallel::detectCores())

# Reduce reps (minimum ~50 for stability)
result <- MultiK(seu, reps = 50)

# Subsample large datasets first (seed set for reproducibility)
set.seed(123)
seu_sub <- seu[, sample(ncol(seu), min(10000, ncol(seu)))]
result <- MultiK(seu_sub, reps = 100)

6. Validation Strategies

6.1 Biological Validation

# Check known markers
FeaturePlot(seu, features = c("CD3D", "CD14", "MS4A1"))

# Compare to reference
# (if you have annotated reference data)

6.2 Technical Validation

# Repeat MultiK with different seeds (a stability check rather than a true bootstrap)
set.seed(123)
results <- lapply(1:10, function(i) {
  MultiK(seu, reps = 100, seed = i)$optimal_k
})

# Check consistency
table(unlist(results))
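The frequency table can be reduced to two numbers: the most common K and the share of runs that agree with it. The values below are invented to show the calculation; in practice `k_runs` would be `unlist(results)`.

```r
# Toy optimal-K values from 10 repeated runs (illustrative only)
k_runs <- c(5, 5, 4, 5, 5, 5, 6, 5, 5, 4)

tab <- table(k_runs)
modal_k <- as.integer(names(tab)[which.max(tab)])  # most frequent K
agreement <- max(tab) / length(k_runs)             # share of runs agreeing

modal_k    # 5
agreement  # 0.7
```

Low agreement suggests that more reps or a closer look at neighbouring K values is warranted before settling on a single answer.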

6.3 Cross-Validation

# Split cells into two random halves and check consistency
set.seed(123)
idx <- sample(ncol(seu), floor(ncol(seu) / 2))
seu1 <- seu[, idx]
seu2 <- seu[, -idx]

result1 <- MultiK(seu1, reps = 100)
result2 <- MultiK(seu2, reps = 100)

# Compare optimal K
c(result1$optimal_k, result2$optimal_k)

7. Reporting Guidelines

When publishing results using MultiK, report:

  1. Parameters used: reps, pSample, resolution range, nPC
  2. Optimal K selected and rationale
  3. Diagnostic plots (as a main or supplementary figure)
  4. SigClust p-values for cluster validation
  5. MultiK version: packageVersion("MultiK")

Example Methods Text

“Optimal cluster number was determined using the MultiK algorithm (v1.0.0, Liu 2025). We performed 100 subsampling iterations with 80% cell sampling, testing resolution parameters from 0.05 to 2.0 in increments of 0.05. The optimal K was selected based on the Pareto frontier of frequency and stability (rPAC). Cluster significance was validated using pairwise SigClust tests with 100 simulations.”
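To keep the reported parameters in sync with the ones actually used, part of the methods text can be generated from the same values passed to MultiK. `methods_sentence()` below is a hypothetical helper written for illustration; it assumes an evenly spaced resolution grid.

```r
# Hypothetical helper: fill the methods sentence from the parameters used.
# Assumes `resolution` is an evenly spaced, increasing grid.
methods_sentence <- function(reps, pSample, resolution) {
  sprintf(
    paste(
      "We performed %d subsampling iterations with %d%% cell sampling,",
      "testing resolution parameters from %s to %s in increments of %s."
    ),
    as.integer(reps),
    as.integer(round(100 * pSample)),
    format(min(resolution)),
    format(max(resolution)),
    format(round(resolution[2] - resolution[1], 10))
  )
}

methods_sentence(100, 0.8, seq(0.05, 2, 0.05))
```

Storing the parameters in variables and passing them both to MultiK and to a helper like this avoids copy-and-paste mismatches between the analysis and the manuscript.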

Author

Zaoqu Liu, PhD