
Part 2: Antigen Selection Pipeline

2.2.5. Essential proteins and virulence factors

Algorithm Card

Input: localized.fasta

Output: virulence.fasta - Proteins from localized.fasta that have adequate matches in both VFDB (Virulence Factor Database) and DEG (Database of Essential Genes)

Brief Summary: Filter remaining proteins based on expected virulence and essentiality.


The last step in the pipeline, which is intended to return a small number of candidates that we can review manually, is to keep only the remaining proteins that are likely to be both essential and virulence factors. This can be done with methods similar to those used in [Homologous Protein Removal], this time searching against the Database of Essential Genes (DEG) and the Virulence Factor Database (VFDB):

step4.sh
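As a rough illustration of the selection logic only (this is not the contents of step4.sh), the Python sketch below assumes the two DIAMOND searches against VFDB and DEG have already been run with tabular output (`--outfmt 6`) and that the match thresholds were applied at search time. The function names `hit_queries`, `read_fasta`, and `select_candidates` are hypothetical, as is the assumption that the query ID in the results equals the FASTA header up to the first whitespace.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def hit_queries(tabular_lines: Iterable[str]) -> Set[str]:
    """Collect query IDs (first column) from tabular search output."""
    queries = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        queries.add(line.split("\t")[0])
    return queries


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_candidates(
    fasta_lines: Iterable[str],
    vfdb_hits: Iterable[str],
    deg_hits: Iterable[str],
) -> List[Tuple[str, str]]:
    """Keep only proteins that have at least one hit in BOTH result tables."""
    keep = hit_queries(vfdb_hits) & hit_queries(deg_hits)
    selected = []
    for header, seq in read_fasta(fasta_lines):
        tokens = header.split()
        # Assumption: the reported query ID is the first token of the header.
        if tokens and tokens[0] in keep:
            selected.append((header, seq))
    return selected
```

Taking the intersection of the two sets of query IDs is what enforces the "both VFDB and DEG" condition from the algorithm card; a protein with a hit in only one database is dropped.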

Note that some DEG entries contain the text “Not available” in place of a valid amino acid sequence. To address that, the script below processes all entries and removes those that do not contain a full sequence, so that DIAMOND can read the final file.

step4_filter_deg.py
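A minimal sketch of this kind of cleanup is shown below. It is not the contents of step4_filter_deg.py: the function names (`read_fasta`, `is_valid_protein`, `filter_deg`, `to_fasta`) and the accepted residue alphabet are assumptions made for illustration and may need adjusting to the actual DEG download.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumed alphabet: the 20 standard amino acids plus the ambiguity and
# extension codes B, J, O, U, X, Z. Adjust as needed for the real data.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_valid_protein(seq: str) -> bool:
    """Reject empty sequences, 'Not available' placeholders, and stray characters."""
    if not seq:
        return False
    # Explicit placeholder check, with or without the space.
    if "notavailable" in seq.lower().replace(" ", ""):
        return False
    return set(seq.upper()) <= VALID_RESIDUES


def filter_deg(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return only the entries that carry a usable amino acid sequence."""
    return [(h, s) for h, s in read_fasta(lines) if is_valid_protein(s)]


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Serialize records back to FASTA text, wrapping sequences at `width`."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""
```

The placeholder is checked explicitly rather than relying on the alphabet test alone, because the letters in "Not available" all happen to be valid one-letter residue codes once the space is removed.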

The match parameters are once again taken from similar pipelines in the literature. The resulting ‘virulence.fasta’ file contains 21 candidate proteins, which need to be analyzed manually.