Part 3: Sequence Optimization
3.5. LinearFold & LinearDesign
The design space for the mRNA encoding of even a small protein is very, very large. For the SARS-CoV-2 spike protein, which is 1273 amino acids long, there are roughly 2.4e632 synonymous mRNA sequences - more than the number of atoms in the observable universe.. So, how does LinearDesign, the state-of-the-art tool in mRNA sequence optimization, propose a sequence with almost optimal MFECAI for the spike protein in just 11 minutes?
LinearDesign’s main innovation is viewing mRNA optimization as a classical problem from computational linguistics: finding the most likely sentence among many similar alternatives. Just like an old speech-to-text program searches a lattice of possible word sequences with constraints (the recorded sound and grammar rules), LinearDesign searches the immense design space constrained by the amino-acid sequence (only corresponding synonymous codons can be used), as well as RNA folding energy rules (from the Turner group in 2004).
The foundation of LinearDesign comes from LinearFold, which achieved the first linear-time RNA folding algorithm after more than 40 years of cubic-time algorithms (Zuker). LinearFold uses a specific sequence of mRNA bases while computing the resulting structure; LinearDesign, on the other hand, chooses codons while folding.
The easiest way to think about the search is through the lens of decision trees. The final base sequence is built codon by codon. At each codon position i, we can imagine a decision node (drawn as a square), where we can choose any of the synonymous codons that encode the required amino acid i (each with its own branch). Each choice leads to partial mRNA sequences, which can be used to characterize chance nodes (drawn as circles) with associated intermediate MFECAI scores (probability would correspond to the partition function, a concept only calculated and used in another ‘branch’ of LinearFold called LinearPartition). Terminal leaves are complete mRNA sequences, each with a corresponding MFECAI score.
Assuming each amino acid has 3 synonymous codons (close to the real average of 3.05), a simple algorithm traversing the whole tree would need ~3^1273 ~= 1e632 branches that need to be evaluated, matching the number mentioned at the start of this section.
Because the number of possibilities is so high, the algorithm cannot traverse the whole tree naively. LinearDesign minimizes runtime by employing three main strategies. First, it minimizes memory by using pruning, eliminating the worst states at each layer of the tree and keeping track only of the top b results. The hyperparameter b is called beam width, as this pruning approach can be called beam pruning or beam search in this context. Second, the algorithm expands the information a state contains by merging similar states, as many partial foldings have identical base-pair stacks at the 3’ end. In other words, identical subtrees are merged into the same node. Third, the algorithm constructs the solution left-to-right, but remembers the beams for each layer. This is similar to dynamic programming - once the optimal terminal leaf is found, the algorithm can perform an efficient ‘rollback’ to reconstruct the sequence corresponding to that leaf. This leads to much lower storage use, as states at a given depth only contain the most recent change instead of the sequence up until that point, with their storage being constant instead of increasing with depth.
As such, LinearDesign employs a rational approach to risk, using expected value to get near-optimal solutions for MFECAI. This comes in contrast with maximin, which would imply a pessimistic view and choose the codon that guarantees the best-worst case MFECAI, and maximax, which would imply an optimistic view and choose the codon that would result in the highest possible MFECAI without regard to the worst-case scenario.
To summarize, through a first-principled approach and clever optimizations, LinearDesign can optimize sequences previously considered computationally unreachable. It’s a testament to what computer scientists looking at biology problems can do - and the fact that it can lengthen the shelf-life of vaccines while making them cheaper to produce makes this particular algorithm so much more exciting to explore.