Part 2: Antigen Selection Pipeline

2.2.3. Homologous Protein Removal

Algorithm Card

Input: core.fasta

Output: non-homologous.fasta - Proteins from ‘core.fasta’ that did not generate hits in the two DIAMOND runs (i.e., they are non-homologous according to the criteria we set)

Brief Summary: Download the human proteome and the gut flora database and convert each of them to a DIAMOND database, then remove the proteins from ‘core.fasta’ that are too similar to those in the human proteome (identity >= 40%; coverage >= 50%; e-value <= 0.005) or to those in the gut flora database (identity >= 80%; coverage >= 50%; e-value <= 0.005).

Remember that the purpose of a vaccine is to make your body attack a specific antigen. Our dataset, however, certainly contains some proteins that we don’t want our antibodies to bind to: the ones that are too similar to proteins produced by human cells or by bacteria of the gut microbiome.

The NCBI website can be used to find the reference human genome, GCF_000001405.40, from which the human proteome is taken. The gut flora proteome can be found in the Gut Flora DataBase project. To compare the proteins we have in ‘core.fasta’ against these two datasets, we’ll use DIAMOND, a faster alternative to the Basic Local Alignment Search Tool (BLAST); it can be downloaded from here.

step2.sh

The script above downloads the two proteomes if they don’t already exist and turns each of them into a DIAMOND database, which is required for matching our potential candidates against each dataset. The last command writes to ‘human.tab’ the matches between proteins from ‘core.fasta’ and human proteins that have at least 40% identity and at least 50% coverage. These thresholds are taken from similar analyses in the literature. The output will look like this:

yakuhito@fury-catstation:~/projects/capstone$ head -n 5 human.tab 
WP_000016932.1    NP_001244955.1    46.7    3.04e-38    134    336
WP_000025985.1    NP_000117.1    56.9    2.69e-105    312    800
WP_000035781.1    XP_024309946.1    49.3    4.01e-204    586    1511
WP_000043046.1    NP_954699.1    40.3    3.69e-48    160    404
WP_000045496.1    NP_000166.2    43.9    2.74e-152    450    1157
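For readers who want a feel for the commands involved, here is a rough sketch, written as Python that only assembles the `diamond` argument lists. It is not the contents of step2.sh. The `makedb` and `blastp` subcommands and the flags used do exist in DIAMOND, but the file name `human.faa`, the database name `human`, the helper names, the choice of `--query-cover` for the coverage cutoff, and the list of output columns are all assumptions made for illustration.

```python
import shlex


def makedb_cmd(proteome_fasta, db_name):
    """Build the argv that turns a protein FASTA file into a DIAMOND database."""
    return ["diamond", "makedb", "--in", proteome_fasta, "--db", db_name]


def blastp_cmd(query_fasta, db_name, out_tab, min_identity, min_coverage, max_evalue):
    """Build the argv for a protein search that reports only hits passing the cutoffs."""
    return [
        "diamond", "blastp",
        "--query", query_fasta,
        "--db", db_name,
        "--out", out_tab,
        "--id", str(min_identity),              # minimum percent identity
        "--query-cover", str(min_coverage),     # assumption: coverage measured on the query
        "--evalue", str(max_evalue),            # maximum e-value to report
        # Assumed tabular columns; the real script may request different ones.
        "--outfmt", "6", "qseqid", "sseqid", "pident", "evalue", "bitscore", "score",
    ]


# Hypothetical file and database names, with the human thresholds from the text.
human_db = makedb_cmd("human.faa", "human")
human_search = blastp_cmd("core.fasta", "human", "human.tab", 40, 50, 0.005)

for cmd in (human_db, human_search):
    print(shlex.join(cmd))
```

The lists can be passed to `subprocess.run(cmd, check=True)` to actually run the search, provided `diamond` is on the PATH.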

To remove matches from the ‘core.fasta’ file, we’ll use a simple utility script that will also be useful later, filter_fasta.py:

filter_fasta.py

The role of the utility script may be inferred from its help text:

python3 filter_fasta.py [include|exclude] input.fasta source.tab output.fasta
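To make the include/exclude behaviour concrete, here is a minimal sketch of how such a filter could work. It is a guess based on the help text, not the actual filter_fasta.py. It assumes that the sequence IDs to match sit in the first column of `source.tab`, and that a FASTA record's ID is the first whitespace-delimited token after `>`; the function names are hypothetical.

```python
def read_ids(tab_path):
    """Collect the first whitespace-separated field of every non-empty line."""
    ids = set()
    with open(tab_path) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                ids.add(fields[0])
    return ids


def filter_fasta(mode, fasta_path, tab_path, out_path):
    """Copy records whose ID is listed ('include') or not listed ('exclude').

    Returns the number of records written to out_path.
    """
    if mode not in ("include", "exclude"):
        raise ValueError("mode must be 'include' or 'exclude'")
    ids = read_ids(tab_path)
    written = 0
    keep = False  # whether the record currently being read should be copied
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                header = line[1:].split()
                record_id = header[0] if header else ""
                # Keep listed records in 'include' mode, unlisted ones in 'exclude' mode.
                keep = (record_id in ids) == (mode == "include")
                if keep:
                    written += 1
            if keep:
                dst.write(line)
    return written
```

Under these assumptions, `filter_fasta("exclude", "core.fasta", "human.tab", "out.fasta")` would write every protein from ‘core.fasta’ that has no hit listed in ‘human.tab’.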

Armed with the utility script, the remainder of this step consists of just three more commands: the first removes the human matches from ‘core.fasta’, the second runs the same search against the gut flora database, and the third removes those matches as well. Running the step for the first time takes a few minutes at most.
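As a rough outline, the three remaining commands might be assembled as follows. The intermediate file `no-human.fasta`, the hits file `gut.tab`, and the database name `gut` are hypothetical names chosen for illustration, and using `--query-cover` for the coverage cutoff is an assumption. The gut flora thresholds (identity 80%, coverage 50%, e-value 0.005) are the ones given in the algorithm card.

```python
import shlex


def remaining_commands(intermediate="no-human.fasta", gut_db="gut", gut_hits="gut.tab"):
    """Return the three commands as argv lists, in the order they would run."""
    return [
        # 1. Drop every protein that matched the human proteome.
        ["python3", "filter_fasta.py", "exclude", "core.fasta", "human.tab", intermediate],
        # 2. Search the survivors against the gut flora database (stricter identity).
        ["diamond", "blastp", "--query", intermediate, "--db", gut_db, "--out", gut_hits,
         "--id", "80", "--query-cover", "50", "--evalue", "0.005",
         "--outfmt", "6", "qseqid", "sseqid", "pident", "evalue", "bitscore", "score"],
        # 3. Drop every protein that matched the gut flora database.
        ["python3", "filter_fasta.py", "exclude", intermediate, gut_hits, "non-homologous.fasta"],
    ]


for cmd in remaining_commands():
    print(shlex.join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` in turn; the order matters because each command reads the file written by the one before it.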

This step leaves us with 2511 proteins inside ‘non-homologous.fasta’.