Part 3: Sequence Optimization
3.4. The Codon Adaptation Index
Minimum free energy is not the only variable that can be optimized. After looking at which codons are used to code for amino acids, scientists noticed that some of them have higher translational efficiency than others. In other words, certain codons in mRNA molecules get translated more quickly than others, even if they are synonymous (i.e., code for the same amino acid). The Codon Adaptation Index (CAI) quantifies this ‘bias’ in translation for a given coding sequence.
To build this score, scientists used a reference set of highly expressed genes - genes that get transcribed and translated very often. For a given codon, its relative adaptiveness is its frequency in the reference set divided by the frequency of the most common synonymous codon for the same amino acid, so the preferred codon for each amino acid gets a value of 1. The CAI of an mRNA sequence is then just the geometric mean of the relative adaptiveness of all its codons.
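The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two-amino-acid codon table, the toy reference counts, and the function names `relative_adaptiveness` and `cai` are all made up for the example, and codons with a zero weight are simply rejected rather than handled with a pseudocount.

```python
import math
from collections import Counter

# Hypothetical toy codon table covering only two amino acids (Lys, Phe).
SYNONYMOUS_CODONS = {
    "K": ["AAA", "AAG"],
    "F": ["UUU", "UUC"],
}

def relative_adaptiveness(reference_codons, synonymous_codons=SYNONYMOUS_CODONS):
    """Weight of each codon: its count divided by the count of the most
    frequent synonymous codon in the reference set."""
    counts = Counter(reference_codons)
    weights = {}
    for codons in synonymous_codons.values():
        top = max(counts[c] for c in codons)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in codons:
            weights[c] = counts[c] / top
    return weights

def cai(sequence, weights):
    """Geometric mean of the codon weights of an mRNA coding sequence."""
    if len(sequence) == 0 or len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    for c in codons:
        if weights.get(c, 0.0) <= 0.0:
            raise ValueError(f"no usable weight for codon {c}")
    # Geometric mean computed in log space to avoid underflow on long sequences.
    return math.exp(sum(math.log(weights[c]) for c in codons) / len(codons))

# Made-up reference set: codons pooled from "highly expressed genes".
reference = ["AAG"] * 8 + ["AAA"] * 2 + ["UUC"] * 6 + ["UUU"] * 3
weights = relative_adaptiveness(reference)

print(cai("AAGUUC", weights))  # only preferred codons
print(cai("AAAUUU", weights))  # only rare codons
```

With these toy counts, a sequence built only from the preferred codons scores 1, while swapping in the rarer synonyms lowers the score even though the encoded peptide is unchanged.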
An ‘optimal’ codon usage for the host results in a score close to 1 and means the synonymous codon choice leads to as much translation as possible. Values near 0 indicate poor adaptation and slower translation. It’s quite impressive that CAI has very high predictive power, given that the formula says nothing about the underlying mechanism behind the speed boost; its only assumption is that each codon is translated independently.
Interestingly, we know that high CAI increases protein yield mainly by speeding up ribosomal elongation and reducing stalling and frameshifting. In other words, it makes the process of translation as efficient as it can be, which explains why it works. That said, CAI alone is not a good enough optimization metric, as it often produces unstructured, AU-rich mRNA molecules that degrade quickly.
The solution is to come up with an optimization goal that unifies MFE, which allows the mRNA to ‘stick around’ for longer (get degraded more slowly), with CAI, which allows the mRNA to produce more protein. This is exactly what Zhang et al. (2023) did by scaling a CAI term to the sequence length and subtracting it from MFE, obtaining a single value that needs to be minimized. In other words, solutions to the problem should have an MFE that is as low as possible (lower energy leads to slower degradation) and a CAI that is as high as possible.
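MFECAI(r) = MFE(r) - (|r|/3) * lambda * log CAI(r)
Here |r|/3 is the number of codons in the sequence r. The hyperparameter lambda controls how much weight is given to MFE vs. CAI. A value of 0 means we’re minimizing MFE alone, while larger and larger values of lambda prioritize maximizing CAI. Since CAI is at most 1, its logarithm is never positive, so a poorly adapted sequence raises the objective. Taking the logarithm and multiplying by the number of codons turns the geometric mean into a sum over codons, so both terms grow with sequence length and are comparable in scale.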
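To make the trade-off concrete, here is a minimal sketch of the combined objective. It assumes the CAI term enters through its logarithm (as the logarithmic scaling mentioned above suggests), uses the natural logarithm as an arbitrary choice of base, and takes the MFE and CAI values as precomputed inputs. The function name `mfe_cai_objective` and the numbers in the usage lines are illustrative, not taken from the paper.

```python
import math

def mfe_cai_objective(mfe, cai, num_nucleotides, lam):
    """Combined score to minimize: MFE minus a length-scaled log-CAI term.

    mfe             -- minimum free energy of the sequence (e.g. in kcal/mol)
    cai             -- Codon Adaptation Index, in the interval (0, 1]
    num_nucleotides -- length |r| of the coding sequence
    lam             -- weight lambda; 0 reduces the score to MFE alone
    """
    if not 0.0 < cai <= 1.0:
        raise ValueError("CAI must lie in (0, 1]")
    if num_nucleotides % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    num_codons = num_nucleotides // 3
    # log(cai) <= 0, so subtracting it adds a penalty for poor codon usage.
    return mfe - num_codons * lam * math.log(cai)

# Illustrative values: a 90-nt sequence with an MFE of -30.
perfect = mfe_cai_objective(-30.0, 1.0, 90, lam=2.0)   # no CAI penalty
mfe_only = mfe_cai_objective(-30.0, 0.5, 90, lam=0.0)  # lambda = 0 ignores CAI
penalized = mfe_cai_objective(-30.0, 0.5, 90, lam=1.0)
print(perfect, mfe_only, penalized)
```

A sequence with perfect codon usage, or any sequence scored with lambda set to 0, keeps its MFE as the score, while a lower CAI pushes the score upward in proportion to lambda and the number of codons.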