Part 2: Antigen Selection Pipeline
2.2.1. Gathering Data
Algorithm Card
Input: Eligibility criteria for genomes (e.g., exclude unusual assemblies)
Output: Folder (all_proteins/) containing a FASTA file with protein sequences for every genome matching eligibility criteria
Brief Summary: Uses NCBI to download genomes and applies appropriate post-processing to make the data easily used by the next steps
Throughout these steps, we’ll get to know the characteristics of good antigens. The antigens we’ll deal with here are all proteins, the molecules that make up most of a cell’s machinery. Proteins are used for signaling, sensing, facilitating reactions, and much more. To start searching, we first need data about all the proteins that show up in different strains of A. baumannii. We’ll get it from public data that labs around the world have uploaded to the National Center for Biotechnology Information (NCBI), which thankfully offers command-line tools (CLIs) that make the process easy to run from your terminal.
The commands take a few minutes at most to run. At the time of writing this guide, 913 genomes match the criteria specified in the command. We want only genomes of our target bacterium that are annotated (meaning labeled), normal (not atypical), and released within the past 20 years. More importantly, we want protein data: we’re not interested in the genetic sequence itself or other data, only in the known proteins in each genome.
The script below not only downloads all the genomes matching the criteria but also processes them, as the file structure from NCBI can be a bit tricky to work with. Namely, every proteome (the collection of proteins from a genome) is in a file called ‘protein.faa’ inside ‘ncbi_dataset/data/[genome_id]’, so all the files share the same name. Naming each file after its genome will make everything much easier in the future. Moreover, the find command replaces the rare occurrences of an amino acid code that not all of the tools we’re going to use understand, so everything runs smoothly moving forward.
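As a rough illustration of the post-processing described above (this is not the guide's shell script, only a Python sketch of the same idea), the function below copies each ‘protein.faa’ to a file named after its genome and replaces J with L in sequence lines only. The function name collect_proteomes is hypothetical, and the sketch assumes the download has already been unpacked into ‘ncbi_dataset/data/’.

```python
from pathlib import Path


def collect_proteomes(dataset_dir: Path, out_dir: Path) -> int:
    """Copy each <genome_id>/protein.faa to out_dir/<genome_id>.fasta.

    Sequence lines have 'J' replaced by 'L'; header lines are left as-is.
    Returns the number of proteome files written.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for faa in sorted(dataset_dir.glob("*/protein.faa")):
        genome_id = faa.parent.name  # the folder name is the genome ID
        cleaned = []
        for line in faa.read_text().splitlines():
            if line.startswith(">"):
                # Header: may legitimately contain a 'J' (e.g. a protein name)
                cleaned.append(line)
            else:
                # Sequence: swap the ambiguous code J for leucine (L)
                cleaned.append(line.replace("J", "L"))
        (out_dir / f"{genome_id}.fasta").write_text("\n".join(cleaned) + "\n")
        written += 1
    return written
```

Calling collect_proteomes(Path("ncbi_dataset/data"), Path("all_proteins")) would produce the folder layout the next steps expect. Restricting the replacement to sequence lines is a deliberate choice in this sketch, since a blanket replacement would also rewrite any J that appears in a protein's name.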
step0_download_all_proteins.sh
If we open a file such as ‘all_proteins/GCF_000018445.1.fasta’ to see what it contains, the FASTA file format becomes quite clear:
>WP_000002861.1 MULTISPECIES: acyclic terpene utilization AtuA family protein [Acinetobacter]
MANNQQDDHRVVKIGCASGFWGDTNTAAFQLVHLTDINYLVFDYLSEITMSIMAKAKMVEPKHGYALDFVSRVMAPLLKK
IAEKKIKVISNAGGVNPLACRDALQKIIKEYGLDLKVAVVLGDDLLPKHEQLKSQNIQEMFSGEALPEQVASSNAYLGAV
AIRDALDLGADIVITGRVVDSAVVLAPLLHEYQWPLDDYDKLAQGSLAGHVIECGAQCTGGNFTDWQLVQGFDNMGFPVV
EVSEDGSFVVTKPQGTGGLVSTATVAEQIVYEIGNPQAYLLPDVIADFSHVHLEQVGEHRVRVTGAKGQAPTTQYKVSAT
YPDGYRVLVSFLIAGREAPQKAQVIADAILTKCERVLAMRSVPPFSEKSVEILGIESTYGDHAQTLNSREVVVKIAVKHM
FKEACMFFASEIAQASTGMAPALAGIVGGRPKASPVIKLFSFLIDKNQVNVEIDFDGQRHAVEIPQGVSTEQLLTLTAGE
NAVYQGDEIEVPLIEIAHARSGDKGNHSNIGVIARKADYLPWIRAALTEQSVASYMQHVLDAEKGRVIRYELPGLNALNF
MLENALGGGGVASLRIDPQGKAFAQQLLDMPVKVPAHLLEK
>WP_000003114.1 MULTISPECIES: alpha/beta hydrolase [Acinetobacter]
MSEQIFIQGPVGKIELFVDRPEGEIKGFAVVCHPHPLQGGTPQHKVPALLTQIFNEYGCIVYRPSFRGLGGSEGVHDEGH
GETEDILAVIEHVRKLHAGLPFYAGGFSFGSHVLAKCHAQLSPELQPIQLILCGLPTATVVGLRHYKTPEIQGDILLIHG
EQDDITLLSDAIEWAKPQKHPITILPGANHFFTGYLKQLRQIITRFIIMK
>WP_000003220.1 MULTISPECIES: hypothetical protein [Acinetobacter]
MSEQKIIDLIKASQAVIKNELLPQSGSQKYNLLMLMRSLEILQAYILQKDTCTLHRSGILQDYFSFPIKDIDEATQLFIS
DIREGKQSDQTFETLKALNLEELKITEPKVANHG
>WP_000003406.1 MULTISPECIES: sulfurtransferase TusA [Gammaproteobacteria]
MSEQPISPTVQLNTRGLRCPEPVMMLHQAIRKAKSGDVVEVLATDPSTSWDIPKFCMHLGHELLLKEEVLDEQNHKEYRY
LVQKG
While the text might seem daunting at first, its structure is fairly simple. Lines starting with ‘>’ are header lines that name the protein that follows. As you can see above, every header starts with an identifier, followed by a plain-text name of the protein. The lines after the header hold the protein’s amino acid sequence, wrapped at 80 characters per line, so the last line of each record is usually shorter. Proteins are built out of a sequence of just 20 building blocks called amino acids, and the order of those amino acids is encoded by the mRNA (more on this in Part 3!). This also explains the last find command: some protein sequences from NCBI (less than 1%) contain ‘J’s, which most programs do not accept as a valid amino acid code. J is an ambiguity code meaning “leucine or isoleucine”; without getting into much detail, the Js can be treated as leucine (L).
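To make the format concrete, here is a minimal sketch of a FASTA reader in Python, together with a helper that reports any letters outside the 20 standard amino acids (such as J). The names parse_fasta and nonstandard_residues are hypothetical; established libraries offer more robust parsers, and this version only illustrates the header/sequence structure described above.

```python
from typing import Dict, Iterable, Set

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard codes


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record's identifier (first word of its header) to its sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split(maxsplit=1)
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are wrapped; join them back into one string
            records[current_id] += line
    return records


def nonstandard_residues(sequence: str) -> Set[str]:
    """Return the letters in a sequence that are not standard amino acid codes."""
    return set(sequence) - STANDARD_AMINO_ACIDS
```

For example, opening one of the files in ‘all_proteins/’ and passing the file object to parse_fasta gives a dictionary keyed by identifiers such as WP_000003220.1, and nonstandard_residues can then be used to check whether any sequence still contains a J.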
The data currently available on your hard drive can be thought of as a haystack. The antigens we’re looking for are needles. Thankfully, today’s bioinformatics tools work like metal detectors and magnets: they let us learn a lot from protein sequences alone. By repeatedly removing proteins that will not make good antigens, we will be left with just a few proteins that might. This will allow us to take a closer look at each of them and ultimately propose candidates for lab testing.
One last point before moving to the first filtering step: pay attention to the order of the filtering operations as well. We’ll go from computationally inexpensive tests that can be performed on a lot of data to more computationally intensive tasks, and ultimately to tasks that require manual intervention or a lot of human consideration. This is a feature of the pipeline: it filters out as many proteins as possible with the least amount of effort, which minimizes both the time it takes to run and our own work.
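The cheap-first ordering can be sketched as follows. This is only an illustration under stated assumptions: the function run_filters, the idea of attaching a relative cost number to each filter, and the example filters in the usage note are all hypothetical and are not the pipeline's actual steps.

```python
from typing import Callable, Dict, List, Tuple

# (name, relative cost, function returning True if the protein is kept)
Filter = Tuple[str, int, Callable[[str], bool]]


def run_filters(proteins: Dict[str, str], filters: List[Filter]) -> Dict[str, str]:
    """Apply filters from cheapest to most expensive.

    Each filter only sees the proteins that survived the cheaper ones,
    so the expensive checks run on as little data as possible.
    """
    remaining = dict(proteins)
    for name, _cost, keep in sorted(filters, key=lambda f: f[1]):
        remaining = {pid: seq for pid, seq in remaining.items() if keep(seq)}
        print(f"{name}: {len(remaining)} proteins left")
    return remaining
```

A call might pass a quick length check with cost 1 and a slow, placeholder prediction step with cost 10; whatever order they are listed in, the length check runs first and the slow step only processes the survivors.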