Compare CDR3s from sequencing to known CDR3s in the vdjdb

Matching CDR3s from sequencing with CDR3s of known antigen specificity may help to narrow potential antigen specificities of one's sequenced TCRs. Export a tsv of all or selected entries here https://vdjdb.cdr3.net/search. Provide that tsv as data frame (vdjdb). From vdjdb and tcrs distinct rows of tr_cols and cdr3_col are compared by levensthein distance and if selected (PAM30_similarity) by amino acid similarity. Returned data frame may be used to join original columns from vdjdb and tcrs followed by plotting with ggplot2 (see example). Original column names (vdj_tr_col, tcr_tr_col and vdj_cdr3_col, tcr_cdr3_col) are kept in the returned data frame to allow easy subsequent joining of vdjdb and tcrs. It requires though that these columns have different names in vdjdb and tcrs.

vdjdb_hits(
  vdjdb,
  tcrs,
  vdj_tr_col = "Gene",
  tcr_tr_col = "chain",
  vdj_cdr3_col = "CDR3",
  tcr_cdr3_col = "CDR3_aa_cr",
  TRAxTRB = F,
  max_lvdist = 3,
  PAM30_similarity = F,
  mapply_fun = mapply,
  lapply_fun = lapply,
  ...
)

Arguments

vdjdb	data frame of TCR data from vdjdb; vdj_cdr3_col and vdj_tr_col must be present; if not provided system.file("extdata", "vdjdb.tsv.rds", package = "igsc") is used (table downloaded 06/12/2021)
tcrs	data frame of TCR data from sequencing, tcr_cdr3_col and tcr_tr_col must be present; e.g. c_long from clonotype prep
vdj_tr_col	column which codes the TCR chain in the vdjdb reference data frame (should contain TRA and/or TRB only)
tcr_tr_col	column which codes the TCR chain in the tcrs data frame (should contain TRA and/or TRB only)
vdj_cdr3_col	column which codes the CDR3s in the vdjdb reference data frame
tcr_cdr3_col	column which codes the CDR3s in the tcrs data frame
TRAxTRB	logical, apart from comparing TRA vs TRA and TRB vs TRB between tcrs and vdjdb also check for TRA vs TRB?
max_lvdist	maximum levensthein distance between CDR3s for return data frame (also filtering before PAM30 similarity calculation)
PAM30_similarity	logical whether to calculate the similarity (pairwise alignment) between CDR3s based on PAM30 substitution matrix
mapply_fun	mapply function for PAM30_similarity, suggested are mapply, pbapply::pbmapply, parallel::mcmapply
lapply_fun	lapply function for levensthein distance calculation, suggested are lapply, pblapply::pblapply, parallel::mclapply
...	arguments passed to mapply_fun and lapply_fun, most relevant: mc.cores (e.g parallel::detectCores())

Value

data frame of matched CDR3s from tcrs and vdjdb, column lv indicates the levensthein between CDR3 from vdjdb and tcrs

Examples

if (FALSE) {
# tcrs data frame with sequencing data from TCRs has to be created (cl_long is appropriate)
matches <- vdjdb_hits(vdjdb, tcrs, mapply_fun = parallel::mcmapply, mc.cores = parallel::detectCores(), nthread = parallel::detectCores())
matches_append <- matches %>% dplyr::left_join(tcrs) %>% dplyr::left_join(vdjdb)

df <-
df %>%
dplyr::distinct(CDR3, CDR3_aa_cr, lv, Gene, chain, cl_name, patient, Epitope, Epitope.gene, Epitope.species, Score) %>%
tidyr::drop_na() %>%
dplyr::filter(cl_name %in% c("Madalyn", "Mikaylla", "Morgan")) %>%
dplyr::filter(lv < 3) %>%
dplyr::filter(Score > 0)

# use jitter for many hits (when beeswarms become overlapping)
ggplot(df, aes(x = Score, y = lv, color = Epitope.species)) +
 ggbeeswarm::geom_beeswarm(groupOnX = F, cex = 4, size = 2.5) +
 #geom_jitter(width = 0.15, height = 0.15, size = 3) +
 theme_bw() +
 theme(panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), strip.background = element_rect(fill = "white"), text = element_text(family = "Courier")) +
 scale_x_continuous(breaks = fcexpr::int_breaks(n = 3), limits = c(0.8,3.2)) +
 scale_y_reverse(breaks = fcexpr::int_breaks(n = 3)) +
 guides(fill = guide_legend(override.aes = list(size = 5)), color = guide_legend(override.aes = list(size = 5))) +
 facet_grid(rows = vars(chain), cols = vars(patient, cl_name))
}