Part 2: Antigen Selection Pipeline

2.2.2. Removing Accessory Proteins

Algorithm Card

Input: Files inside all_proteins/

Output: core.fasta - Proteins from the reference genome that are found in the core clusters.

Brief Summary: The R script clusters proteins together to establish a soft core set of proteins. Accessory proteins are removed from the reference genome FASTA, leaving proteins conserved across samples.

Input: Files inside all_proteins/ Output: core.fasta - Proteins from the reference genome that are found in the core clusters. Brief Summary: The R script will cluster proteins together to establish a soft core set of proteins. The other - accessory - proteins are removed from the reference genome FASTA, leaving only proteins that are conserved across samples. Criteria: Cluster proteins with coverage >= 0.8 and identity >= 0.5; eliminate those present in less than 99% of the genomes

One of the main characteristics of a good antigen is that it shows up in most bacterial strains. Currently, we have data on all proteins that show up in numerous A. baumannii strains. A lot of the proteins are accessory - they’re only found in smaller clusters, and can often be associated with functions that are not essential. Since we’re hunting for core proteins, it makes sense to get rid of proteins that certainly don’t meet the criteria early on.

But, as is often the case with biology, there’s a catch: proteins may not 100% match. As we said before, proteins are just a sequence of amino acids. However, if one amino acid changes to another with similar properties, the protein may still retain its function. It’s the same for multiple changes, or even additions or deletions. To make things easier, we’ll look for antigens in the reference proteome of A. baumannii, which is ‘GCF_009035845.1’ - you can find this by searching our bacteria of interest on the NCBI website.

This step will use the Pangenome Analysis Toolkit to cluster proteins together based on similarity and then only keep proteins that show up in over 95% of genomes. Before getting into more details, a few words about the setup, which is not straightforward to get right. PATO uses R, which can be installed on an Ubuntu machine as follows:

sudo apt install r-base

To install dependencies, you’ll want to run the following commands in an R console (obtained by simply executing the command ‘R’):

setRepositories(ind = c(1, 2))
install.packages(c('ggplot2', 'threejs', 'igraph', 'uwot', 'tidyr', 'dplyr', 'dtplyr', 'data.table', 'phangorn', 'randomcoloR', 'mclust', 'dbscan', 'ape', 'parallelDist', 'foreach', 'doParallel', 'manipulateWidget', 'stringdist', 'Biostrings', 'microseq'))
install.packages(c('XVector', 'GenomeInfoDb'))
BiocManager::install(c("Biostrings", "microseq"))
devtools::install_github("irycisBioinfo/PATO", build_vignettes = TRUE)

These commands will install PATO and its dependencies in the correct order. You may now run the script below, which should take a few minutes to execute:

step1_remove_accessory_proteins.r

The first lines load all the proteins from the ‘all_proteins’ directory. The MMseqs2 utility is then used to cluster proteins together. Matching criteria is defined by two parameters, identity and coverage. Identity represents the percentage of amino acids that are the same between two sequences, while coverage specifies the percentage of the length of a protein that is represented in alignment with the other. These parameters stem from the different ways proteins change between strains, as explained above: amino acids might change, but may be inserted or deleted as well.

With the clusters at hand, the rest of the script simply identifies proteins from the reference genome that belong to clusters where over 95% of the strains are represented. The 5% gives us wiggle room in case any genomes from NCBI have incomplete protein annotations, which is rare but sometimes happens. Also note that the clustering criteria are very loose - only proteins that are certainly accessory to the genome are removed at this step. Don’t worry - we’ll check the presence of antigen candidates across all strains again towards the end of the pipeline.

The output proteins will be written to ‘core.fasta’, which will contain around 2727 proteins. This is a big reduction from the 3670 proteins found in the reference genome.

Optional Challenge: Use the properties of the FASTA format to find a way to count the number of proteins in a file by using ‘grep.’