Cross-Species Analysis
Zaoqu Liu
2026-01-23
Source: vignettes/species-conversion.Rmd
Introduction
The iTALK ligand-receptor database uses human gene symbols (e.g., TGFB1, VEGFA). This creates a challenge when analyzing data from other species like mouse, where gene symbols follow different conventions (e.g., Tgfb1, Vegfa).
This vignette describes iTALK’s automatic cross-species conversion system, which enables seamless analysis of non-human data through ortholog mapping via Ensembl BioMart.
Species Detection
Gene Naming Conventions
Different species follow distinct gene naming patterns:
| Species | Convention | Examples |
|---|---|---|
| Human | ALL UPPERCASE | TGFB1, VEGFA, CD8A |
| Mouse | Title Case | Tgfb1, Vegfa, Cd8a |
| Rat | Title Case | Tgfb1, Vegfa, Cd8a |
Automatic Detection
library(iTALK)
# Human genes
human_result <- detect_species(c("TGFB1", "VEGFA", "IL6", "TNF", "CD8A"))
cat("Human detection:\n")
#> Human detection:
cat(" Species:", human_result$species, "\n")
#> Species: Homo_sapiens
cat(" Confidence:", round(human_result$confidence * 100, 1), "%\n")
#> Confidence: 100 %
cat(" Method:", human_result$method, "\n\n")
#> Method: uppercase_pattern
# Mouse genes
mouse_result <- detect_species(c("Tgfb1", "Vegfa", "Il6", "Tnf", "Cd8a"))
cat("Mouse detection:\n")
#> Mouse detection:
cat(" Species:", mouse_result$species, "\n")
#> Species: Mus_musculus
cat(" Confidence:", round(mouse_result$confidence * 100, 1), "%\n")
#> Confidence: 100 %
cat(" Method:", mouse_result$method, "\n\n")
#> Method: titlecase_pattern
# Mixed (ambiguous)
mixed_result <- detect_species(c("TGFB1", "Vegfa", "IL6", "Tnf"))
cat("Mixed detection:\n")
#> Mixed detection:
cat(" Species:", mixed_result$species, "\n")
#> Species: unknown
cat(" Confidence:", round(mixed_result$confidence * 100, 1), "%\n")
#> Confidence: 50 %
Detection Algorithm
┌─────────────────────────────────────┐
│  Input Gene List                    │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│  Sample up to 100 unique genes      │
│  Filter: length ≥ 3, contains A-Za-z│
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│  Pattern Matching                   │
│  Human: ^[A-Z0-9]+$                 │
│  Mouse: ^[A-Z][a-z0-9]+[A-Za-z0-9]*$│
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│  Calculate Proportions              │
│  prop_human = n_human / n_total     │
│  prop_mouse = n_mouse / n_total     │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│  Threshold Check (default: 70%)     │
│  if prop_human ≥ 0.7 → Homo_sapiens │
│  if prop_mouse ≥ 0.7 → Mus_musculus │
│  else → unknown                     │
└─────────────────────────────────────┘
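The flow above can be approximated in a few lines of base R. This is a minimal sketch that follows the diagram, not iTALK's actual source: the function name `detect_species_sketch()` is hypothetical, and both the `NA` method and the use of the larger proportion as the confidence in the "unknown" case are assumptions chosen to match the example output shown earlier.

```r
# Hypothetical re-implementation of the diagrammed steps (not iTALK's code).
detect_species_sketch <- function(genes, threshold = 0.7, max_genes = 100) {
  genes <- unique(as.character(genes))
  genes <- genes[!is.na(genes)]
  # Filter: at least 3 characters and at least one letter
  genes <- genes[nchar(genes) >= 3 & grepl("[A-Za-z]", genes)]
  if (length(genes) == 0) {
    return(list(species = "unknown", confidence = 0, method = NA_character_))
  }
  # Sample up to max_genes unique genes
  if (length(genes) > max_genes) genes <- sample(genes, max_genes)

  # Proportion of symbols matching each naming pattern
  prop_human <- mean(grepl("^[A-Z0-9]+$", genes, perl = TRUE))
  prop_mouse <- mean(grepl("^[A-Z][a-z0-9]+[A-Za-z0-9]*$", genes, perl = TRUE))

  # Threshold check, human first, as in the diagram
  if (prop_human >= threshold) {
    list(species = "Homo_sapiens", confidence = prop_human,
         method = "uppercase_pattern")
  } else if (prop_mouse >= threshold) {
    list(species = "Mus_musculus", confidence = prop_mouse,
         method = "titlecase_pattern")
  } else {
    # Assumption: report the larger proportion as the confidence
    list(species = "unknown", confidence = max(prop_human, prop_mouse),
         method = NA_character_)
  }
}

# Mixed input: half the symbols match each pattern, so neither reaches 0.7
mixed <- detect_species_sketch(c("TGFB1", "Vegfa", "IL6", "Tnf"))
mixed$species     # "unknown"
mixed$confidence  # 0.5
```

Because the two patterns rely only on letter case, this kind of check cannot separate mouse from rat, which share the Title Case convention listed in the table above.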
Ortholog Mapping via BioMart
How It Works
When mouse genes are detected, iTALK queries Ensembl BioMart to retrieve ortholog mappings:
# Manual conversion example
conversion <- convert_species_biomart(
genes = c("Tgfb1", "Vegfa", "Ctnnb1", "Cd8a", "Ptprc"),
from_species = "Mus_musculus",
to_species = "Homo_sapiens",
ensembl_version = 103, # Fixed version for reproducibility
cache = TRUE
)
# View mapping results
conversion$mapping
#> from_gene to_gene
#> 1 Tgfb1 TGFB1
#> 2 Vegfa VEGFA
#> 3 Ctnnb1 CTNNB1
#> 4 Cd8a CD8A
#> 5 Ptprc PTPRC
# Statistics
conversion$stats
#> $n_input: 5
#> $n_mapped: 5
#> $mapping_rate: 1.0
Automatic Conversion in FindLR
Seamless Workflow
When convert_species = TRUE (the default), FindLR() automatically handles species conversion:
# Mouse data - automatic conversion
mouse_genes <- rawParse(mouse_data, top_genes = 50)
lr_pairs <- FindLR(
data_1 = mouse_genes,
datatype = "mean count",
comm_type = "cytokine",
convert_species = TRUE # Default
)
# Console output:
# Detected species: Mus_musculus (95.2%)
# Converting mouse genes to human orthologs...
# Mapping complete: 847/1000 genes mapped (84.7%)
Disabling Auto-Conversion
For human data or when conversion is not desired:
lr_pairs <- FindLR(
data_1 = human_genes,
datatype = "mean count",
comm_type = "cytokine",
convert_species = FALSE
)
Mapping Rates and Considerations
Typical Mapping Rates
| Conversion | Mapping Rate | Notes |
|---|---|---|
| Mouse → Human | 85-95% | Most comprehensive |
| Rat → Human | 80-90% | Good coverage |
| Other mammals | 70-85% | Variable |
One-to-Many Mappings
Some genes have multiple orthologs. iTALK handles these by:
- Keeping all mappings in the conversion result
- Using aggregation (mean/sum/max) for expression matrices
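To make the aggregation step concrete, here is a rough base R sketch of how rows that end up under the same target symbol could be collapsed. It is an illustration under stated assumptions, not iTALK's implementation: `aggregate_by_ortholog()`, the toy matrix, and the placeholder gene names (`MouseA`, `HUMAN1`, and so on) are hypothetical. Only the `from_gene`/`to_gene` columns and the mean/sum/max options come from this vignette.

```r
# Hypothetical helper: expand a matrix along a mapping, then collapse duplicates.
aggregate_by_ortholog <- function(expr_matrix, gene_mapping,
                                  handle_duplicates = c("mean", "sum", "max")) {
  handle_duplicates <- match.arg(handle_duplicates)
  # Keep only mapping rows whose source gene is present in the matrix
  keep <- gene_mapping$from_gene %in% rownames(expr_matrix)
  gene_mapping <- gene_mapping[keep, , drop = FALSE]
  # One row per (from_gene, to_gene) pair; one-to-many genes are repeated
  expanded <- expr_matrix[gene_mapping$from_gene, , drop = FALSE]
  groups <- gene_mapping$to_gene

  sums <- rowsum(expanded, groups)  # rows are sorted by target symbol
  switch(handle_duplicates,
    sum  = sums,
    mean = sums / as.vector(table(groups)[rownames(sums)]),
    max  = do.call(rbind, lapply(
      split(seq_along(groups), groups),
      function(idx) apply(expanded[idx, , drop = FALSE], 2, max)
    ))
  )
}

# Toy data with placeholder names: genes in rows, samples in columns
toy_expr <- matrix(
  c(1, 2,
    3, 4,
    5, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("MouseA", "MouseB", "MouseC"), c("s1", "s2"))
)
toy_mapping <- data.frame(
  from_gene = c("MouseA", "MouseB", "MouseC", "MouseC"),
  to_gene   = c("HUMAN1", "HUMAN1", "HUMAN2", "HUMAN3"),
  stringsAsFactors = FALSE
)

aggregate_by_ortholog(toy_expr, toy_mapping, "mean")
```

In the toy data, `MouseA` and `MouseB` collapse into a single `HUMAN1` row, while `MouseC` is copied to both `HUMAN2` and `HUMAN3`. Within iTALK, the corresponding call is: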
# Convert expression matrix with one-to-many handling
converted <- convert_expression_matrix(
expr_matrix = mouse_expr,
gene_mapping = conversion$mapping,
handle_duplicates = "mean" # Options: "mean", "sum", "max"
)
Advanced Configuration
Using Different Ensembl Versions
# Use specific version for reproducibility
conversion <- convert_species_biomart(
genes = mouse_genes,
from_species = "Mus_musculus",
ensembl_version = 103 # Or "current_release" for latest
)
Mirror Selection
For faster access from different regions:
conversion <- convert_species_biomart(
genes = mouse_genes,
from_species = "Mus_musculus",
mirror = "uswest" # Options: "www", "uswest", "useast", "asia"
)
SSL Configuration
For environments with SSL certificate issues:
# Disable SSL verification (use with caution)
Sys.setenv(BIOMART_SSL_VERIFY = "0")
# Then run conversion
conversion <- convert_species_biomart(genes = mouse_genes, ...)
Performance Benchmarks
| Operation | Genes | Time | Memory |
|---|---|---|---|
| Species detection | 1000 | < 0.1s | < 1 MB |
| BioMart query (first) | 1000 | ~15s | ~10 MB |
| BioMart query (cached) | 1000 | < 1s | < 1 MB |
| Full FindLR with conversion | 1000 | ~20s | ~15 MB |
Troubleshooting
Common Issues
1. BioMart connection timeout
# Increase retry attempts
conversion <- convert_species_biomart(
genes = mouse_genes,
from_species = "Mus_musculus",
max_tries = 10
)
2. Low mapping rate
- Check for non-standard gene symbols
- Verify species detection is correct
- Some genes may be species-specific
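When the mapping rate is low, listing the genes that failed to map is a useful first step. The sketch below uses toy inputs and assumes only that the mapping table has the `from_gene` and `to_gene` columns shown earlier. The input symbols, the `diagnosis` table, and the two heuristics (stray whitespace, Ensembl-style IDs) are illustrative choices rather than iTALK features.

```r
# Toy inputs; in practice `genes` is your input vector and `mapping`
# would be conversion$mapping from convert_species_biomart().
genes <- c("Tgfb1", "Vegfa", "Gm12345", "tgfb1 ", "ENSMUSG00000000001")
mapping <- data.frame(
  from_gene = c("Tgfb1", "Vegfa"),
  to_gene   = c("TGFB1", "VEGFA"),
  stringsAsFactors = FALSE
)

# Fraction of input genes that received at least one ortholog
mapping_rate <- mean(genes %in% mapping$from_gene)

# Genes without a mapping, with simple flags for common formatting problems
unmapped <- setdiff(genes, mapping$from_gene)
diagnosis <- data.frame(
  gene = unmapped,
  has_whitespace = grepl("^[[:space:]]|[[:space:]]$", unmapped),
  looks_like_ensembl_id = grepl("^ENS[A-Z]*G[0-9]+$", unmapped),
  stringsAsFactors = FALSE
)

mapping_rate
diagnosis
```

Genes flagged by neither heuristic are candidates for the species-specific case and usually need manual inspection.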
3. Cache issues
# Clear cache directory
unlink("~/.Rcache", recursive = TRUE)
Summary
Key points about cross-species analysis in iTALK:
- Automatic - Species detection and conversion happen transparently
- Accurate - Uses Ensembl BioMart for validated ortholog mappings
- Efficient - Intelligent caching minimizes redundant queries
- Flexible - Supports manual control when needed
Session Info
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] iTALK_0.1.1
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 RcppArmadillo_15.2.3-1
#> [3] jsonlite_2.0.0 shape_1.4.6.1
#> [5] magrittr_2.0.4 modeltools_0.2-24
#> [7] farver_2.1.2 rmarkdown_2.30
#> [9] GlobalOptions_0.1.3 fs_1.6.6
#> [11] zlibbioc_1.52.0 ragg_1.5.0
#> [13] vctrs_0.7.0 Cairo_1.7-0
#> [15] fastICA_1.2-7 scde_2.34.0
#> [17] progress_1.2.3 htmltools_0.5.9
#> [19] S4Arrays_1.6.0 curl_7.0.0
#> [21] SparseArray_1.6.2 sass_0.4.10
#> [23] bslib_0.9.0 HSMMSingleCell_1.26.0
#> [25] htmlwidgets_1.6.4 desc_1.4.3
#> [27] plyr_1.8.9 sandwich_3.1-1
#> [29] zoo_1.8-15 cachem_1.1.0
#> [31] igraph_2.2.1 lifecycle_1.0.5
#> [33] pkgconfig_2.0.3 Matrix_1.7-4
#> [35] R6_2.6.1 fastmap_1.2.0
#> [37] GenomeInfoDbData_1.2.13 MatrixGenerics_1.18.1
#> [39] digest_0.6.39 numDeriv_2016.8-1.1
#> [41] pcaMethods_1.98.0 colorspace_2.1-2
#> [43] miscTools_0.6-28 S4Vectors_0.44.0
#> [45] DESeq2_1.46.0 irlba_2.3.5.1
#> [47] textshaping_1.0.4 GenomicRanges_1.58.0
#> [49] extRemes_2.2-1 RMTstat_0.3.1
#> [51] mgcv_1.9-3 httr_1.4.7
#> [53] abind_1.4-8 compiler_4.4.0
#> [55] brew_1.0-10 S7_0.2.1
#> [57] BiocParallel_1.40.2 viridis_0.6.5
#> [59] MASS_7.3-65 quantreg_6.1
#> [61] MAST_1.32.0 DelayedArray_0.32.0
#> [63] rjson_0.2.23 tools_4.4.0
#> [65] otel_0.2.0 DDRTree_0.1.5
#> [67] nnet_7.3-20 glue_1.8.0
#> [69] nlme_3.1-168 grid_4.4.0
#> [71] Rtsne_0.17 cluster_2.1.8.1
#> [73] reshape2_1.4.5 generics_0.1.4
#> [75] gtable_0.3.6 monocle_2.34.0
#> [77] tidyr_1.3.2 hms_1.1.4
#> [79] data.table_1.18.0 flexmix_2.3-20
#> [81] XVector_0.46.0 BiocGenerics_0.52.0
#> [83] RANN_2.6.2 pillar_1.11.1
#> [85] stringr_1.6.0 Lmoments_1.3-2
#> [87] limma_3.62.2 circlize_0.4.17
#> [89] splines_4.4.0 dplyr_1.1.4
#> [91] Rook_1.2 lattice_0.22-7
#> [93] survival_3.8-3 SparseM_1.84-2
#> [95] gamlss.data_6.0-7 tidyselect_1.2.1
#> [97] SingleCellExperiment_1.28.1 locfit_1.5-9.12
#> [99] pbapply_1.7-4 randomcoloR_1.1.0.1
#> [101] knitr_1.51 gridExtra_2.3
#> [103] V8_8.0.1 IRanges_2.40.1
#> [105] edgeR_4.4.2 SummarizedExperiment_1.36.0
#> [107] stats4_4.4.0 xfun_0.56
#> [109] Biobase_2.66.0 statmod_1.5.1
#> [111] matrixStats_1.5.0 pheatmap_1.0.13
#> [113] leidenbase_0.1.36 stringi_1.8.7
#> [115] VGAM_1.1-14 UCSC.utils_1.2.0
#> [117] statnet.common_4.13.0 yaml_2.3.12
#> [119] evaluate_1.0.5 codetools_0.2-20
#> [121] bbmle_1.0.25.1 DEsingle_1.26.0
#> [123] tibble_3.3.1 cli_3.6.5
#> [125] systemfonts_1.3.1 jquerylib_0.1.4
#> [127] network_1.19.0 dichromat_2.0-0.1
#> [129] pscl_1.5.9 Rcpp_1.1.1
#> [131] GenomeInfoDb_1.42.3 coda_0.19-4.1
#> [133] bdsmatrix_1.3-7 parallel_4.4.0
#> [135] MatrixModels_0.5-4 pkgdown_2.1.3
#> [137] ggplot2_4.0.1 prettyunits_1.2.0
#> [139] gamlss.dist_6.1-1 viridisLite_0.4.2
#> [141] mvtnorm_1.3-3 slam_0.1-55
#> [143] scales_1.4.0 gamlss_5.5-0
#> [145] purrr_1.2.1 crayon_1.5.3
#> [147] combinat_0.0-8 distillery_1.2-2
#> [149] maxLik_1.5-2.1 rlang_1.1.7