Part 2: Antigen Selection Pipeline

2.4. Manual Review

The pipeline should generate enough recommendations that many candidates can be tested, but not so many that manual review becomes impractical. My estimate would be 10-50, so the current number of candidates (21) sits comfortably within that range.

In the literature, this part of the result processing varies widely. A good starting point is to translate the protein IDs, which NCBI gives as RefSeq IDs, to UniProtKB IDs. In other words, the proteins should be identified on UniProt, which offers much more information per entry, from known or inferred location to similar proteins. Known location is very important, as it's much more reliable than a location predicted heuristically by a tool. With this information, we can start building a spreadsheet like the one below:

[spreadsheet Vaccine Candidates V2] [I wonder how to best embed this in the final interactive website]
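As a rough illustration of the ID translation step, here is a minimal sketch that merges a RefSeq-to-UniProtKB mapping into the candidate rows. It assumes the mapping was exported beforehand as a tab-separated file with `From` and `Entry` columns (the column names are an assumption about the export format), and all IDs and function names in the example are made up for illustration.

```python
import csv
import io


def load_id_mapping(tsv_text):
    """Return {refseq_id: [uniprot_accession, ...]} from a mapping TSV.

    Assumes a header row with 'From' (RefSeq ID) and 'Entry' (UniProtKB
    accession) columns; adjust the names if the export differs.
    """
    mapping = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        mapping.setdefault(row["From"], []).append(row["Entry"])
    return mapping


def annotate_candidates(refseq_ids, mapping):
    """Build one spreadsheet row per candidate.

    Candidates without a mapping are kept and flagged, so they can be
    looked up by hand instead of silently disappearing.
    """
    rows = []
    for refseq_id in refseq_ids:
        accessions = mapping.get(refseq_id, [])
        rows.append({
            "refseq_id": refseq_id,
            "uniprot_ids": ";".join(accessions),
            "needs_manual_lookup": not accessions,
        })
    return rows


# Made-up IDs, for illustration only.
example_tsv = (
    "From\tEntry\n"
    "WP_000000001.1\tA0A000AAA0\n"
    "WP_000000001.1\tA0A000AAA1\n"
    "WP_000000002.1\tA0A000BBB0\n"
)
mapping = load_id_mapping(example_tsv)
rows = annotate_candidates(
    ["WP_000000001.1", "WP_000000002.1", "WP_000000003.1"], mapping
)
for row in rows:
    print(row)
```

One RefSeq ID can map to several UniProtKB entries, which is why the sketch keeps a list per ID rather than a single accession.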

DeepTMHMM is another useful tool to have in our arsenal. Using deep learning, it predicts topology at the amino acid level, telling us which parts of the sequence are likely to be outside the cell, in the membrane, or inside. BETA predictions are particularly relevant, as they indicate transmembrane beta-barrel proteins, which are typical of the outer membrane of Gram-negative bacteria - meaning some parts of the antigen will be embedded in the membrane or face the periplasm, while other parts will be displayed on the outside.
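To make the per-residue output easier to review, the label string can be collapsed into segments and the outward-facing stretches listed. The sketch below assumes a one-letter-per-residue label string using the alphabet S, I, O, M, B, P (signal peptide, inside, outside, alpha-helical membrane, beta-strand membrane, periplasm); this alphabet, the function names, and the example string are assumptions and should be checked against the actual DeepTMHMM output files.

```python
from itertools import groupby

# Assumed label alphabet for the per-residue topology string.
LABELS = {
    "S": "signal peptide",
    "I": "inside",
    "O": "outside",
    "M": "alpha-helical membrane",
    "B": "beta-strand membrane",
    "P": "periplasm",
}


def topology_segments(labels):
    """Collapse per-residue labels into (label, start, end) runs.

    Positions are 1-based and inclusive, matching how residues are
    usually numbered in a protein sequence.
    """
    segments = []
    position = 1
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position, position + length - 1))
        position += length
    return segments


def exposed_regions(labels, min_length=5):
    """Return (start, end) of 'outside' runs at least min_length long."""
    return [
        (start, end)
        for label, start, end in topology_segments(labels)
        if label == "O" and end - start + 1 >= min_length
    ]


# Made-up beta-barrel-like topology: strands (B) alternating between
# periplasmic turns (P) and extracellular loops (O).
example = (
    "P" * 4 + "B" * 5 + "O" * 8 + "B" * 5 + "P" * 3
    + "B" * 5 + "O" * 3 + "B" * 5 + "P" * 2
)
print(topology_segments(example))
print(exposed_regions(example))                # loops of 5+ residues
print(exposed_regions(example, min_length=3))  # include short loops
```

The `min_length` cut-off is an arbitrary illustration; very short loops may be too small to matter, but where to draw the line is a judgment call for the manual review.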

A candidate's ability to act as a good antigen is also vital. The most commonly used tools to quantify this are VaxiJen 2, which returns an antigenicity score from 0 to 1 given a sequence, and VaxiJen 3, which gives a categorical answer (immunogen/non-immunogen) along with a probability. For all my pipelines, the latter probability was 66% or 100%, suggesting that the pipeline and VaxiJen 3 may share some of the same criteria for selecting antigens.

The most unconventional step of the pipeline is the removal of accessory proteins, which the current pipeline does in a way not seen in previous literature. As such, it makes sense for the manual review to ‘look back’ and check how well conserved the proposed antigens are across all strains. This can be done using DIAMOND, this time running each genome against a database built from the candidates, which is more efficient. To paint a complete picture, we can check for matches at 99% identity, as well as at 95%, 90%, 80%, 50%, and 10%.

step5_start.sh

While the output of this step may be a bit chaotic to read, it’s fairly easy to track progress via an additional script:

step5_progress.sh
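Once the DIAMOND runs have finished, the conservation check described above can be summarised per candidate and per identity threshold. The following is a minimal sketch, not the scripts referenced above: it assumes one tabular output per genome in DIAMOND's default BLAST-style column order (query ID, subject ID, percent identity, ...), with the subject being the candidate because the database was built from the candidates. The function names and example IDs are hypothetical.

```python
THRESHOLDS = (99, 95, 90, 80, 50, 10)


def best_identity_per_candidate(tabular_text):
    """Return {candidate_id: best percent identity} for one genome.

    Assumes tab-separated rows with the candidate in column 2 and the
    percent identity in column 3.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        candidate, pident = fields[1], float(fields[2])
        if pident > best.get(candidate, 0.0):
            best[candidate] = pident
    return best


def conservation_table(outputs_by_genome, candidates, thresholds=THRESHOLDS):
    """Fraction of genomes with a hit at or above each identity threshold.

    Returns {candidate: {threshold: fraction}}.
    """
    if not outputs_by_genome:
        raise ValueError("no genome outputs given")
    counts = {c: {t: 0 for t in thresholds} for c in candidates}
    for text in outputs_by_genome.values():
        best = best_identity_per_candidate(text)
        for c in candidates:
            for t in thresholds:
                if best.get(c, 0.0) >= t:
                    counts[c][t] += 1
    total = len(outputs_by_genome)
    return {c: {t: counts[c][t] / total for t in thresholds} for c in candidates}


# Made-up outputs for two genomes (only the first three columns shown).
outputs = {
    "genome_1": "prot_1\tcandA\t99.5\nprot_2\tcandB\t85.0\n",
    "genome_2": "prot_9\tcandA\t96.0\n",
}
table = conservation_table(outputs, ["candA", "candB"])
for candidate, fractions in table.items():
    print(candidate, fractions)
```

This sketch looks at identity only; a short, highly identical fragment would count as a match, so alignment coverage may also need to be taken into account before drawing conclusions.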

Armed with all the data, suggestions can be made. It’s normal for a few proteins to ‘slip past’ the pipeline’s checks and make it to the manual review stage without being good candidates. This can happen because the tools are not 100% accurate and no step of the pipeline checks for actual localization. These proteins are quick and easy to exclude. There are also some candidates that are easy to recommend for lab testing - they meet all criteria and score well on benchmarks (including conservation across strains). It’s really exciting when some of the proposed candidates have not been previously studied but show high potential - which is the case for the last 3 proteins produced by our pipeline!

There is, however, a third category: proteins with mixed signals. These don’t look like excellent candidates, but may turn out to be. I put these in the ‘needs more research’ category - a more comprehensive review of the existing literature around them (or their general protein class) may be needed before making a call. Then, depending on the data and on any constraint on the number of final candidates to be recommended, these may or may not be tested in a lab.

To check the quality of the pipeline, it makes sense to look at processes with similar goals (proposing A. baumannii vaccine candidates) in the literature and compare results. While it’s exciting that the pipeline found new candidates, missing the ones proposed by the literature without good justification would be a concern. This is, however, not the case here, as 4 out of the 8 candidates we would’ve recommended for lab analysis have been suggested by other works in the literature as well: FKBP-type peptidyl-prolyl cis-trans isomerase; multidrug efflux RND transporter outer membrane channel subunit AdeK; multidrug efflux RND transporter periplasmic adaptor subunit AdeI; and type IV pilus biogenesis stability protein.

Another study reported a different outer membrane protein, OmpA, rather than Omp38.