Part 3: Sequence Optimization
3.2. Nussinov and Zuker
As previously discussed, mRNA molecules may have different 3D configurations - different shapes - based on their sequence. The ability to predict the shape of mRNA molecules - and therefore important properties - computationally is based on mRNA folding.
First, it’s worth noting that mRNA folding has a distant relative - seemingly the same at the surface, but vastly different in reality. For over 50 years, protein folding has been a big challenge in biology. While protein structures - like mRNA structures - can be determined experimentally through techniques such as X-ray crystallography and cryogenic electron microscopy, the methods are expensive and time-consuming, making them unable to keep pace with the rapid accumulation of new protein sequences. As such, finding a way to predict structures computationally instead of experimentally has been a topic of research with huge potential. Naturally, the first approaches to this problem were ab initio - starting from first principles, simulating the folding process, or trying to find the folding configuration that is most stable as reflected by its low free energy. Computing the free energy is then taken as a molecular dynamics problem, where force fields are used to describe the potential energy of atoms in proteins as a result of their spatial arrangement. The modeling techniques have been under development for decades, and while the field is advancing, simulations still yield significant errors compared to experimentally-determined free energies. Moreover, as the systems get more complex or the proteins get bigger, the errors compound exponentially - ‘good’ results can only be obtained for small proteins with relatively short folding times. Unfortunately, the ab initio strategy also bears a high computational cost, with one group simulating one millisecond of folding for a 36 amino-acid protein (really short) over 2 months on a machine with 256 processors. A more extreme example is the Anton platform, which employs a supercomputer containing customized hardware in the form of application-specific integrated circuits (ASICs) to speed up molecular dynamics simulations. Even assuming enough computing power will be reached to process longer sequences, the ‘rules’ governing current simulations of protein folding are not entirely clear. For example, one of the leading software programs using molecular dynamics, Rosetta, uses an energy function with over 140 different terms, most of which are knowledge-based (parameters determined from data obtained experimentally).
Major advances were made in protein folding when AlphaFold was announced in 2018. Using artificial intelligence-based methods, AlphaFold was able to beat the other computational prediction methods available at the time. AlphaFold2, announced in 2021, marked another significant leap by being the first end-to-end model that could predict protein structures at the atomic level with an accuracy comparable to experimental methods.
Many consider that the field of mRNA folding started with a paper by Nussinov et al. (1978), which managed to predict mRNA structure with low but relevant accuracy. The algorithm does that by computing the most stable conformation, where stability is recognized as being directly proportional to the number of base pairs. In other words, the more base pairs there are, the better the stability of the molecule is. Using dynamic programming, Nussinov’s method has a complexity of O(n^3), managing to fold sequences hundreds of base pairs long using the computing resources available at the time. While the predictions were not excellent, they were good enough that other scientists started building on top of the method.
Ultimately, Zuker & Stiegler (1981) built on the work and arrived at much better predictions by recognizing that structures other than base pairs contribute to the free energy of a molecule. By simulating all the possible base pairs and determining the minimum free energy (MFE) structure, one can infer the most likely conformation of mRNA with some precision. More importantly, MFE is highly correlated with lab measurements of molecule stability. Zuker’s algorithm is still used in mRNA software today, but is limited to mRNA molecules only thousands of bases long, as it inherits its predecessor’s complexity of O(n^3).
Before diving deeper into folding, it’s important to recognize that both Nussinov et al. (1978) and Zuker & Stiegler (1981) made a big assumption to make their algorithms more efficient, namely that there will be no pseudoknots. In essence, this rule can be restated as: if you draw the mRNA with base pairings, no two pairing lines will go over each other. While this led to many advances in the field, in practice, a relevant number of base pairs are pseudo knots, which constitutes a relevant source of errors in MFE predictions.