Introduction

The iTALK ligand-receptor database uses human gene symbols (e.g., TGFB1, VEGFA). This creates a challenge when analyzing data from other species like mouse, where gene symbols follow different conventions (e.g., Tgfb1, Vegfa).

This vignette describes iTALK’s automatic cross-species conversion system, which enables seamless analysis of non-human data through ortholog mapping via Ensembl BioMart.

Species Detection

Gene Naming Conventions

Different species follow distinct gene naming patterns:

Species   Convention      Examples
Human     ALL UPPERCASE   TGFB1, VEGFA, CD8A
Mouse     Title Case      Tgfb1, Vegfa, Cd8a
Rat       Title Case      Tgfb1, Vegfa, Cd8a

Automatic Detection

library(iTALK)

# Human genes
human_result <- detect_species(c("TGFB1", "VEGFA", "IL6", "TNF", "CD8A"))
cat("Human detection:\n")
#> Human detection:
cat("  Species:", human_result$species, "\n")
#>   Species: Homo_sapiens
cat("  Confidence:", round(human_result$confidence * 100, 1), "%\n")
#>   Confidence: 100 %
cat("  Method:", human_result$method, "\n\n")
#>   Method: uppercase_pattern

# Mouse genes
mouse_result <- detect_species(c("Tgfb1", "Vegfa", "Il6", "Tnf", "Cd8a"))
cat("Mouse detection:\n")
#> Mouse detection:
cat("  Species:", mouse_result$species, "\n")
#>   Species: Mus_musculus
cat("  Confidence:", round(mouse_result$confidence * 100, 1), "%\n")
#>   Confidence: 100 %
cat("  Method:", mouse_result$method, "\n\n")
#>   Method: titlecase_pattern

# Mixed (ambiguous)
mixed_result <- detect_species(c("TGFB1", "Vegfa", "IL6", "Tnf"))
cat("Mixed detection:\n")
#> Mixed detection:
cat("  Species:", mixed_result$species, "\n")
#>   Species: unknown
cat("  Confidence:", round(mixed_result$confidence * 100, 1), "%\n")
#>   Confidence: 50 %

Detection Algorithm

┌─────────────────────────────────────┐
│         Input Gene List             │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│  Sample up to 100 unique genes      │
│  Filter: length ≥ 3, contains A-Za-z│
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│     Pattern Matching                │
│  Human: ^[A-Z0-9]+$                 │
│  Mouse: ^[A-Z][a-z0-9]+[A-Za-z0-9]*$│
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│     Calculate Proportions           │
│  prop_human = n_human / n_total     │
│  prop_mouse = n_mouse / n_total     │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│     Threshold Check (default: 70%)  │
│  if prop_human ≥ 0.7 → Homo_sapiens │
│  if prop_mouse ≥ 0.7 → Mus_musculus │
│  else → unknown                     │
└─────────────────────────────────────┘
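
The flow above can be sketched in a few lines of base R. This is a minimal illustration that assumes the regular expressions and the 70% threshold shown in the diagram; sketch_detect_species() is a hypothetical name, not the package's implementation. Filtering before sampling, and reporting the larger of the two proportions as the confidence, are both assumptions. Because the check relies on letter case alone, title-case symbols are reported as Mus_musculus, so it cannot separate mouse from rat.

```r
# Hypothetical re-implementation of the detection flow (not iTALK's own code).
sketch_detect_species <- function(genes, threshold = 0.7, max_genes = 100) {
  genes <- unique(as.character(genes))
  genes <- genes[!is.na(genes)]
  # Filter: at least 3 characters and at least one letter
  genes <- genes[nchar(genes) >= 3 & grepl("[A-Za-z]", genes, perl = TRUE)]
  # Sample at most `max_genes` symbols (assumed to happen after filtering)
  if (length(genes) > max_genes) genes <- sample(genes, max_genes)
  n_total <- length(genes)
  if (n_total == 0) {
    return(list(species = "unknown", confidence = 0))
  }
  # Patterns copied from the diagram
  is_human <- grepl("^[A-Z0-9]+$", genes, perl = TRUE)
  is_mouse <- grepl("^[A-Z][a-z0-9]+[A-Za-z0-9]*$", genes, perl = TRUE)
  prop_human <- sum(is_human) / n_total
  prop_mouse <- sum(is_mouse) / n_total
  species <- if (prop_human >= threshold) {
    "Homo_sapiens"
  } else if (prop_mouse >= threshold) {
    "Mus_musculus"
  } else {
    "unknown"
  }
  list(species = species, confidence = max(prop_human, prop_mouse))
}

# Two upper-case and two title-case symbols: neither proportion reaches 0.7
mixed <- sketch_detect_species(c("TGFB1", "Vegfa", "IL6", "Tnf"))
mixed$species
```

Lowering threshold makes the call more permissive. With a value of 0.5, the mixed list would be classified as Homo_sapiens, because the human proportion is checked first.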

Ortholog Mapping via BioMart

How It Works

When mouse genes are detected, iTALK queries Ensembl BioMart to retrieve ortholog mappings:

# Manual conversion example
conversion <- convert_species_biomart(
  genes = c("Tgfb1", "Vegfa", "Ctnnb1", "Cd8a", "Ptprc"),
  from_species = "Mus_musculus",
  to_species = "Homo_sapiens",
  ensembl_version = 103,  # Fixed version for reproducibility
  cache = TRUE
)

# View mapping results
conversion$mapping
#>   from_gene to_gene
#> 1    Tgfb1   TGFB1
#> 2    Vegfa   VEGFA
#> 3   Ctnnb1  CTNNB1
#> 4     Cd8a    CD8A
#> 5    Ptprc   PTPRC

# Statistics
conversion$stats
#> $n_input: 5
#> $n_mapped: 5
#> $mapping_rate: 1.0

BioMart Query Details

The query filters the mouse dataset by gene symbol and retrieves the associated gene name of each human ortholog:

Dataset: mmusculus_gene_ensembl
Filter: external_gene_name (mouse symbols)
Attribute: hsapiens_homolog_associated_gene_name
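
For readers who want to reproduce the lookup outside iTALK, the query details above correspond roughly to the following biomaRt call. This is a sketch under assumptions: query_orthologs_sketch() and clean_mapping() are hypothetical names, the from_gene/to_gene column names follow the mapping output shown earlier, and treating empty or missing human symbols as "no ortholog" is an assumption about how the result should be cleaned. Running the query requires the biomaRt package and network access.

```r
# Hypothetical helper: drop rows without a human symbol and duplicate pairs.
clean_mapping <- function(mapping) {
  has_target <- !is.na(mapping$to_gene) & nzchar(mapping$to_gene)
  unique(mapping[has_target, , drop = FALSE])
}

# Hypothetical wrapper around the BioMart query described above.
query_orthologs_sketch <- function(mouse_symbols, ensembl_version = 103) {
  mart <- biomaRt::useEnsembl(
    biomart = "ensembl",
    dataset = "mmusculus_gene_ensembl",
    version = ensembl_version
  )
  res <- biomaRt::getBM(
    attributes = c("external_gene_name",
                   "hsapiens_homolog_associated_gene_name"),
    filters = "external_gene_name",
    values = mouse_symbols,
    mart = mart
  )
  mapping <- data.frame(
    from_gene = res$external_gene_name,
    to_gene = res$hsapiens_homolog_associated_gene_name,
    stringsAsFactors = FALSE
  )
  clean_mapping(mapping)
}

# Usage (needs network access):
# query_orthologs_sketch(c("Tgfb1", "Vegfa", "Cd8a"))
```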

Caching System

To avoid repeated BioMart queries, results are cached locally:

Cache location: ~/.Rcache/
Cache key: hash(genes) + species + ensembl_version
Cache format: R.cache RDS files

First query: ~15 seconds (network dependent)
Cached query: < 1 second
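
The caching idea can be illustrated with base R alone. The sketch below does not use R.cache and is not iTALK's implementation; make_cache_key() and cached_lookup() are hypothetical helpers, and sorting and de-duplicating the genes before hashing (so that gene order does not change the key) is an assumption.

```r
# Build a key from the gene set, species pair, and Ensembl version.
# Hypothetical helper; iTALK's real key construction may differ.
make_cache_key <- function(genes, from_species, to_species, ensembl_version) {
  key_string <- paste(
    c(sort(unique(genes)), from_species, to_species, ensembl_version),
    collapse = "|"
  )
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(key_string, tmp)
  unname(tools::md5sum(tmp))  # md5sum() hashes files, hence the temp file
}

# Return the cached result if present; otherwise compute and store it.
cached_lookup <- function(key, compute,
                          cache_dir = file.path(tempdir(), "ortholog_cache")) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

key <- make_cache_key(c("Tgfb1", "Vegfa"), "Mus_musculus", "Homo_sapiens", 103)
# The slow query would go inside `compute`; a constant stands in for it here.
mapping <- cached_lookup(key, function() c(Tgfb1 = "TGFB1", Vegfa = "VEGFA"))
```

Including the Ensembl version in the key means that changing the release triggers a fresh query instead of reusing mappings from another release.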

Automatic Conversion in FindLR

Seamless Workflow

When convert_species = TRUE (default), FindLR() automatically handles species conversion:

# Mouse data - automatic conversion
mouse_genes <- rawParse(mouse_data, top_genes = 50)

lr_pairs <- FindLR(
  data_1 = mouse_genes,
  datatype = "mean count",
  comm_type = "cytokine",
  convert_species = TRUE  # Default
)

# Console output:
# Detected species: Mus_musculus (95.2%)
# Converting mouse genes to human orthologs...
# Mapping complete: 847/1000 genes mapped (84.7%)

Disabling Auto-Conversion

For human data or when conversion is not desired:

lr_pairs <- FindLR(
  data_1 = human_genes,
  datatype = "mean count",
  comm_type = "cytokine",
  convert_species = FALSE
)

Mapping Rates and Considerations

Typical Mapping Rates

Conversion      Mapping Rate   Notes
Mouse → Human   85-95%         Most comprehensive
Rat → Human     80-90%         Good coverage
Other mammals   70-85%         Variable

One-to-Many Mappings

Some genes have multiple orthologs. iTALK handles these by:

  1. Keeping all mappings in the conversion result
  2. Using aggregation (mean/sum/max) for expression matrices

# Convert expression matrix with one-to-many handling
converted <- convert_expression_matrix(
  expr_matrix = mouse_expr,
  gene_mapping = conversion$mapping,
  handle_duplicates = "mean"  # Options: "mean", "sum", "max"
)
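
To make the aggregation step concrete, here is a minimal base-R sketch of what collapsing rows by ortholog could look like. It is not the package's convert_expression_matrix(); aggregate_by_ortholog() and the toy gene names are hypothetical, and the from_gene/to_gene column names follow the mapping output shown earlier. A source gene with several orthologs contributes its row to each of them, and rows that land on the same target symbol are combined with the chosen function.

```r
# Hypothetical helper: collapse an expression matrix onto target gene symbols.
aggregate_by_ortholog <- function(expr_matrix, mapping,
                                  how = c("mean", "sum", "max")) {
  how <- match.arg(how)
  fun <- switch(how, mean = mean, sum = sum, max = max)
  # Keep only mappings whose source gene is present in the matrix
  keep <- mapping[mapping$from_gene %in% rownames(expr_matrix), , drop = FALSE]
  # One row per mapping pair (source rows repeat for multi-ortholog genes)
  expanded <- expr_matrix[as.character(keep$from_gene), , drop = FALSE]
  groups <- split(seq_len(nrow(keep)), as.character(keep$to_gene))
  # Aggregate every group of rows column by column
  do.call(rbind, lapply(groups, function(idx) {
    apply(expanded[idx, , drop = FALSE], 2, fun)
  }))
}

# Toy data with placeholder symbols (not real genes)
toy_expr <- matrix(
  c(1, 3, 5,
    2, 4, 6),
  nrow = 3,
  dimnames = list(c("Gene_a", "Gene_b", "Gene_c"), c("s1", "s2"))
)
toy_map <- data.frame(
  from_gene = c("Gene_a", "Gene_b", "Gene_c", "Gene_c"),
  to_gene = c("HA", "HA", "HC1", "HC2"),
  stringsAsFactors = FALSE
)
aggregated <- aggregate_by_ortholog(toy_expr, toy_map, how = "mean")
```

In the toy data, Gene_a and Gene_b share the target HA, so their rows are averaged, while Gene_c is copied to both HC1 and HC2.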

Unmapped Genes

Genes without orthologs are:

  • Listed in conversion$unmapped
  • Excluded from downstream analysis
  • Logged in console messages

# Check unmapped genes
length(conversion$unmapped)
head(conversion$unmapped)
# Typically includes: pseudogenes, species-specific genes, novel transcripts
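
The unmapped list and the mapping rate can be derived from a mapping table with set operations. The sketch below assumes the rate is the number of mapped input genes divided by the number of unique input genes, which matches the "847/1000 genes mapped (84.7%)" style of the console message; summarise_mapping() and "FakeGene1" are hypothetical.

```r
# Hypothetical helper: summarise a mapping table against the input genes.
summarise_mapping <- function(input_genes, mapping) {
  input_genes <- unique(input_genes)
  mapped <- intersect(input_genes, mapping$from_gene)
  list(
    n_input = length(input_genes),
    n_mapped = length(mapped),
    mapping_rate = length(mapped) / length(input_genes),
    unmapped = setdiff(input_genes, mapping$from_gene)
  )
}

toy_mapping <- data.frame(
  from_gene = c("Tgfb1", "Vegfa"),
  to_gene = c("TGFB1", "VEGFA"),
  stringsAsFactors = FALSE
)
# "FakeGene1" is a placeholder for a symbol without an ortholog
stats <- summarise_mapping(c("Tgfb1", "Vegfa", "FakeGene1"), toy_mapping)
stats$unmapped
```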

Advanced Configuration

Using Different Ensembl Versions

# Use specific version for reproducibility
conversion <- convert_species_biomart(
  genes = mouse_genes,
  from_species = "Mus_musculus",
  ensembl_version = 103  # Or "current_release" for latest
)

Mirror Selection

For faster access from different regions:

conversion <- convert_species_biomart(
  genes = mouse_genes,
  from_species = "Mus_musculus",
  mirror = "uswest"  # Options: "www", "uswest", "useast", "asia"
)

SSL Configuration

For environments with SSL certificate issues:

# Disable SSL verification (use with caution)
Sys.setenv(BIOMART_SSL_VERIFY = "0")

# Then run conversion
conversion <- convert_species_biomart(
  genes = mouse_genes,
  from_species = "Mus_musculus"
)

Performance Benchmarks

Performance benchmarks (typical workstation)
Operation                     Genes   Time      Memory
Species detection             1000    < 0.1 s   < 1 MB
BioMart query (first)         1000    ~15 s     ~10 MB
BioMart query (cached)        1000    < 1 s     < 1 MB
Full FindLR with conversion   1000    ~20 s     ~15 MB

Troubleshooting

Common Issues

1. BioMart connection timeout

# Increase retry attempts
conversion <- convert_species_biomart(
  genes = mouse_genes,
  from_species = "Mus_musculus",
  max_tries = 10
)
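
Retrying a flaky network call follows a common pattern, sketched below in base R. This is an illustration only: with_retries() is a hypothetical wrapper, max_tries simply mirrors the argument name used above, and the linear wait between attempts is an assumption, not a description of how iTALK retries internally.

```r
# Hypothetical retry wrapper around any function that may fail transiently.
with_retries <- function(fn, max_tries = 5, wait_seconds = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_tries, conditionMessage(last_error)))
    # Wait a little longer after each failure (linear backoff)
    if (attempt < max_tries) Sys.sleep(wait_seconds * attempt)
  }
  stop("All ", max_tries, " attempts failed. Last error: ",
       conditionMessage(last_error))
}

# Simulated flaky call: fails twice, then succeeds
n_calls <- 0
flaky <- function() {
  n_calls <<- n_calls + 1
  if (n_calls < 3) stop("simulated timeout")
  "ok"
}
outcome <- with_retries(flaky, max_tries = 5, wait_seconds = 0)
```

A real call would pass the conversion as the function to retry, for example function() convert_species_biomart(genes = mouse_genes, from_species = "Mus_musculus").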

2. Low mapping rate

  • Check for non-standard gene symbols
  • Verify that species detection is correct
  • Some genes may be species-specific

3. Cache issues

# Clear cache directory
unlink("~/.Rcache", recursive = TRUE)

Summary

Key points about cross-species analysis in iTALK:

  1. Automatic - Species detection and conversion happen transparently
  2. Accurate - Uses Ensembl BioMart for validated ortholog mappings
  3. Efficient - Intelligent caching minimizes redundant queries
  4. Flexible - Supports manual control when needed

Session Info

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] iTALK_0.1.1
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          RcppArmadillo_15.2.3-1     
#>   [3] jsonlite_2.0.0              shape_1.4.6.1              
#>   [5] magrittr_2.0.4              modeltools_0.2-24          
#>   [7] farver_2.1.2                rmarkdown_2.30             
#>   [9] GlobalOptions_0.1.3         fs_1.6.6                   
#>  [11] zlibbioc_1.52.0             ragg_1.5.0                 
#>  [13] vctrs_0.7.0                 Cairo_1.7-0                
#>  [15] fastICA_1.2-7               scde_2.34.0                
#>  [17] progress_1.2.3              htmltools_0.5.9            
#>  [19] S4Arrays_1.6.0              curl_7.0.0                 
#>  [21] SparseArray_1.6.2           sass_0.4.10                
#>  [23] bslib_0.9.0                 HSMMSingleCell_1.26.0      
#>  [25] htmlwidgets_1.6.4           desc_1.4.3                 
#>  [27] plyr_1.8.9                  sandwich_3.1-1             
#>  [29] zoo_1.8-15                  cachem_1.1.0               
#>  [31] igraph_2.2.1                lifecycle_1.0.5            
#>  [33] pkgconfig_2.0.3             Matrix_1.7-4               
#>  [35] R6_2.6.1                    fastmap_1.2.0              
#>  [37] GenomeInfoDbData_1.2.13     MatrixGenerics_1.18.1      
#>  [39] digest_0.6.39               numDeriv_2016.8-1.1        
#>  [41] pcaMethods_1.98.0           colorspace_2.1-2           
#>  [43] miscTools_0.6-28            S4Vectors_0.44.0           
#>  [45] DESeq2_1.46.0               irlba_2.3.5.1              
#>  [47] textshaping_1.0.4           GenomicRanges_1.58.0       
#>  [49] extRemes_2.2-1              RMTstat_0.3.1              
#>  [51] mgcv_1.9-3                  httr_1.4.7                 
#>  [53] abind_1.4-8                 compiler_4.4.0             
#>  [55] brew_1.0-10                 S7_0.2.1                   
#>  [57] BiocParallel_1.40.2         viridis_0.6.5              
#>  [59] MASS_7.3-65                 quantreg_6.1               
#>  [61] MAST_1.32.0                 DelayedArray_0.32.0        
#>  [63] rjson_0.2.23                tools_4.4.0                
#>  [65] otel_0.2.0                  DDRTree_0.1.5              
#>  [67] nnet_7.3-20                 glue_1.8.0                 
#>  [69] nlme_3.1-168                grid_4.4.0                 
#>  [71] Rtsne_0.17                  cluster_2.1.8.1            
#>  [73] reshape2_1.4.5              generics_0.1.4             
#>  [75] gtable_0.3.6                monocle_2.34.0             
#>  [77] tidyr_1.3.2                 hms_1.1.4                  
#>  [79] data.table_1.18.0           flexmix_2.3-20             
#>  [81] XVector_0.46.0              BiocGenerics_0.52.0        
#>  [83] RANN_2.6.2                  pillar_1.11.1              
#>  [85] stringr_1.6.0               Lmoments_1.3-2             
#>  [87] limma_3.62.2                circlize_0.4.17            
#>  [89] splines_4.4.0               dplyr_1.1.4                
#>  [91] Rook_1.2                    lattice_0.22-7             
#>  [93] survival_3.8-3              SparseM_1.84-2             
#>  [95] gamlss.data_6.0-7           tidyselect_1.2.1           
#>  [97] SingleCellExperiment_1.28.1 locfit_1.5-9.12            
#>  [99] pbapply_1.7-4               randomcoloR_1.1.0.1        
#> [101] knitr_1.51                  gridExtra_2.3              
#> [103] V8_8.0.1                    IRanges_2.40.1             
#> [105] edgeR_4.4.2                 SummarizedExperiment_1.36.0
#> [107] stats4_4.4.0                xfun_0.56                  
#> [109] Biobase_2.66.0              statmod_1.5.1              
#> [111] matrixStats_1.5.0           pheatmap_1.0.13            
#> [113] leidenbase_0.1.36           stringi_1.8.7              
#> [115] VGAM_1.1-14                 UCSC.utils_1.2.0           
#> [117] statnet.common_4.13.0       yaml_2.3.12                
#> [119] evaluate_1.0.5              codetools_0.2-20           
#> [121] bbmle_1.0.25.1              DEsingle_1.26.0            
#> [123] tibble_3.3.1                cli_3.6.5                  
#> [125] systemfonts_1.3.1           jquerylib_0.1.4            
#> [127] network_1.19.0              dichromat_2.0-0.1          
#> [129] pscl_1.5.9                  Rcpp_1.1.1                 
#> [131] GenomeInfoDb_1.42.3         coda_0.19-4.1              
#> [133] bdsmatrix_1.3-7             parallel_4.4.0             
#> [135] MatrixModels_0.5-4          pkgdown_2.1.3              
#> [137] ggplot2_4.0.1               prettyunits_1.2.0          
#> [139] gamlss.dist_6.1-1           viridisLite_0.4.2          
#> [141] mvtnorm_1.3-3               slam_0.1-55                
#> [143] scales_1.4.0                gamlss_5.5-0               
#> [145] purrr_1.2.1                 crayon_1.5.3               
#> [147] combinat_0.0-8              distillery_1.2-2           
#> [149] maxLik_1.5-2.1              rlang_1.1.7