Why is the Genetic Code Degenerate?

Thomas Schneider

National Institutes of Health, Frederick, Maryland, USA

The genetic code is said to be degenerate because in messenger RNA there are 64 triplets of the four nucleotide bases, the codons, but these translate to only 20 common amino acids. The degree of degeneracy can be understood using information theory. Selecting one amino acid for insertion into a protein takes log₂ 20 ≈ 4.3 bits of information, but the coding potential in the mRNA is log₂ 64 = 6 bits. Dividing these to form a unitless measure of the code degeneracy gives the code efficiency, 4.3/6 ≈ 72%.
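The arithmetic in this paragraph can be reproduced in a few lines. The following is a minimal sketch in Python; the variable names are illustrative and are not taken from the original work.

```python
import math

N_CODONS = 4 ** 3       # 64 triplets built from four nucleotide bases
N_AMINO_ACIDS = 20      # common amino acids

# Information needed to select one amino acid, assuming all 20 are equally likely.
bits_needed = math.log2(N_AMINO_ACIDS)

# Coding potential of one codon.
bits_available = math.log2(N_CODONS)

# Unitless ratio of information used to information available.
efficiency = bits_needed / bits_available

print(f"bits needed:    {bits_needed:.2f}")
print(f"bits available: {bits_available:.2f}")
print(f"efficiency:     {efficiency:.1%}")
```

Note that this first estimate treats the 20 amino acids as equiprobable, which is why it lands slightly above the refined value discussed next.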

Surprisingly, similar efficiencies are found for protein binding sites on DNA, photosensitive proteins, and motility systems (TDS in preparation). Efficiencies near 70% can be explained by the requirement that choices be made precisely, with minimal errors, to create distinct biological states such as the selection of particular amino acids. As Shannon showed for communications, this is possible by using a high-dimensional coding space. Refinement of the computation using over a billion amino acids from the UniRef database gives an efficiency of 0.6949, which is significantly higher than the theoretical maximum, ln 2 = 0.6931. Using basic information theory, this discrepancy was used to predict that the error rate of translation is < 1×10⁻³ errors per amino acid, which fits measured rates of 5×10⁻⁵ to 3×10⁻³. Conversely, taking the average error rate to be exactly 1×10⁻³, the theory fits the data to about 4 decimal places. The theory not only correctly predicts the error rate of translation from amino acid frequencies but also explains why, and to exactly what degree, the genetic code is degenerate.
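The abstract does not spell out how the refined efficiency is computed. One plausible reading, stated here as an assumption, is that the uniform log₂ 20 is replaced by the Shannon entropy of the observed amino acid frequencies, still divided by the 6 bits per codon. The sketch below follows that reading. The function names and the short toy sequence are hypothetical and stand in for real UniRef counts; only the figures 0.6949 and ln 2 come from the text.

```python
import math
from collections import Counter

BITS_PER_CODON = math.log2(64)  # 6 bits of coding potential per codon

def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution given by symbol counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)

def code_efficiency(counts):
    """Assumed definition: amino acid entropy divided by codon capacity."""
    return shannon_entropy_bits(counts) / BITS_PER_CODON

# Hypothetical toy sequence, NOT real data; real input would be UniRef-scale counts.
toy_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
toy_eff = code_efficiency(Counter(toy_protein))

# Equal counts for all 20 amino acids recover the simple log2(20)/6 estimate.
uniform_eff = code_efficiency({aa: 1 for aa in "ACDEFGHIKLMNPQRSTVWY"})

# Figures quoted in the abstract.
reported = 0.6949
bound = math.log(2)                       # ln 2, the stated theoretical maximum
implied_bits = reported * BITS_PER_CODON  # entropy implied by the reported value
excess = reported - bound                 # the discrepancy tied to the error rate

print(f"toy efficiency:      {toy_eff:.4f}")
print(f"uniform efficiency:  {uniform_eff:.4f}")
print(f"implied entropy:     {implied_bits:.4f} bits")
print(f"excess over ln 2:    {excess:.4f}")
```

Under this assumed definition, unequal amino acid frequencies lower the entropy below log₂ 20, which would account for the refined value falling below the initial 72% estimate. The step from the excess over ln 2 to an error rate is not given in the abstract and is not attempted here.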