On the Genetic Code
We’re often told that DNA is "genetic information." But how exactly does a molecule like DNA communicate information? Francis Crick—who discovered the structure of DNA along with Rosalind Franklin, James Watson, and Maurice Wilkins—had the same question. In this 1963 review, Crick lays out his hypotheses and analyzes the evidence in support of an organized system of information within DNA that can be converted into other information—a genetic code. He concludes with a wondrous inquiry: Do all living things share the same language of life?
Deductions about the general nature of the code are drawn from results of biochemical experimentation.
The author is affiliated with the Medical Research Council Laboratory of Molecular Biology, Cambridge, England. This article is adapted from the lecture which he delivered in Stockholm, Sweden, 11 December 1962, on receiving the Nobel prize in medicine and physiology, a prize which he shared with James D. Watson and M. H. F. Wilkins. It is published with the permission of the Nobel Foundation. It will also be included in the complete volume of Nobel lectures in English which is published yearly by the Elsevier Publishing Company, Amsterdam and New York.
It now seems certain that the amino acid sequence of any protein is determined by the sequence of bases in some region of a particular nucleic acid molecule. Twenty different kinds of amino acid are commonly found in protein, and four main kinds of base occur in nucleic acid. The genetic code describes the way in which a sequence of 20 or more things is determined by a sequence of four things of a different type.
It is hardly necessary to stress the biological importance of the problem. It seems likely that most if not all of the genetic information in any organism is carried by nucleic acid—usually by DNA, although certain small viruses use RNA as their genetic material. It is probable that much of this information is used to determine the amino acid sequence of the proteins of that organism. (Whether the genetic information has any other major function we do not yet know). This idea is expressed by the classic slogan of Beadle, "one gene—one enzyme," or, in the more sophisticated but cumbersome terminology of today, "one cistron—one polypeptide chain."
It is one of the more striking generalizations of biochemistry, one which surprisingly is hardly ever mentioned in the biochemical textbooks, that the 20 amino acids and the four bases are, with minor reservations, the same throughout nature. As far as I am aware, the presently accepted set of 20 amino acids was first drawn up by Watson and myself in the summer of 1953 in response to a letter of Gamow's.
Here I shall not deal with the intimate technical details of the problem, if only for the reason that I have recently written such a review (1) which will appear shortly. Nor shall I deal with the biochemical details of messenger RNA and protein synthesis. Rather, I shall ask certain general questions about the genetic code and ask how far we can now answer them.
Let us assume that the genetic code is a simple one and ask how many bases code for one amino acid. This coding can hardly be done by a pair of bases, as from four different things we can only form 4 × 4 (= 16) different pairs, whereas we need at least 20 and probably one or two more to act as spaces or for other purposes. However, triplets of bases would give us 64 possibilities. It is convenient to have a word for a set of bases which codes one amino acid, and I shall use the word codon for this.
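The counting argument above can be checked by direct enumeration. The following is a minimal Python sketch added for illustration; the names `BASES` and `count_codons` are assumptions of this sketch and do not come from the lecture.

```python
from itertools import product

BASES = "UCAG"  # the four RNA bases (DNA has T in place of U)

def count_codons(length):
    """Count the distinct ordered sequences of `length` bases."""
    return sum(1 for _ in product(BASES, repeat=length))

# Pairs fall short of the 20 amino acids; triplets are more than enough.
for n in (1, 2, 3):
    print(n, count_codons(n))
```

With four bases, the count is simply 4 raised to the codon length, so pairs give 16 possibilities and triplets give 64.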
This brings us to our first question. Do codons overlap? In other words, as we read along the genetic message do we find a base which is a member of two or more codons? It now seems fairly certain that codons do not overlap. If they did, the change of a single base, due to mutation, should alter two or more (adjacent) amino acids, whereas the typical change is to a single amino acid, both in the case of the "spontaneous" mutations, such as occur in the abnormal human hemoglobins, and in chemically induced mutations, such as those produced by the action of nitrous acid and other chemicals on tobacco mosaic virus (2). In all probability, therefore, codons do not overlap.
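The reasoning about overlap can be made concrete by counting how many codons contain a given base. The sketch below is illustrative only. It assumes a triplet code and, for the overlapping case, a fully overlapping scheme in which a codon begins at every base; the function name `codons_touching` and the 12-base message length are hypothetical choices.

```python
def codons_touching(position, length, overlapping, size=3):
    """Return the start indices of all codons that contain the base at `position`.

    `length` is the number of bases in the message. In the overlapping case a
    codon is assumed to start at every base; otherwise codons are read end to
    end with no shared bases.
    """
    step = 1 if overlapping else size
    starts = range(0, length - size + 1, step)
    return [s for s in starts if s <= position < s + size]

# A change at base 5 of a 12-base message:
print(codons_touching(5, 12, overlapping=False))  # one codon affected
print(codons_touching(5, 12, overlapping=True))   # three codons affected
```

Under the overlapping assumption a single base change would be expected to disturb several adjacent amino acids, which is not what the mutation data cited above show.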
This leads us to the next problem. How is the base sequence divided into codons? There is nothing in the backbone of the nucleic acid, which is perfectly regular, to show us how to group the bases into codons. If, for example, all the codons are triplets, then in addition to the correct reading of the message there are two incorrect readings which we shall obtain if we do not start the grouping into sets of three at the right place. My colleagues and I (3) have recently obtained experimental evidence that each section of the genetic message is indeed read from a fixed point, probably from one end. This fits in very well with the experimental evidence, most clearly shown in the work of Dintzis (4), that the amino acids are assembled into the polypeptide chain in a linear order, starting at the amino end of the chain.
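The three possible groupings of a triplet message can be illustrated with a short sketch. The message used here is an arbitrary repeating sequence chosen for readability, not data from the experiments, and the function name `read_triplets` is an assumption of this sketch.

```python
def read_triplets(message, start=0):
    """Split `message` into consecutive triplets, beginning at index `start`.

    Bases left over at the end that do not fill a triplet are ignored.
    """
    return [message[i:i + 3] for i in range(start, len(message) - 2, 3)]

MESSAGE = "CATCATCATCAT"  # hypothetical message, for illustration only

# One correct reading and two incorrect ones, depending on the starting point.
for start in (0, 1, 2):
    print(start, read_triplets(MESSAGE, start))
```

Only the reading that starts at the right place recovers the intended codons; the other two starting points give a different set of triplets throughout.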
Size of the Codon
This leads us to the next general question: the size of the codon. How many bases are there in any one codon? The experiments to which I have just referred (3) strongly suggest that all (or almost all) codons consist of a triplet of bases, though a small multiple of 3, such as 6 or 9, is not completely ruled out by our data. We were led to this conclusion by the study of mutations in the A and B cistrons of the rII locus of bacteriophage T4. These mutations are believed to be due to the addition or subtraction of one or more bases from the genetic message. They are typically produced by acridines, and cannot be reversed by mutagens which merely change one base into another. Moreover, these mutations almost always render the gene completely inactive, rather than partly so.
By testing such mutants in pairs we can assign them all, without exception, to one of two classes which we call plus and minus. For simplicity one can think of the plus class as having one extra base at some point or other in the genetic message and of the minus class as having one base too few. The crucial experiment is to put together, by genetic recombination, three mutants of the same type into one gene. That is, either (+ with + with +) or (- with - with -). Whereas a single + or a pair of them (+ with +) makes the gene completely inactive, a set of three, suitably chosen, has some activity. Detailed examination of these results shows that they are exactly what we should expect if the message were read in triplets, starting from one end.
We are sometimes asked what the result would be if we put four +'s in one gene. To answer this my colleagues have recently put together not merely four but six +'s. Such a combination is active, as expected on the basis of our theory, although sets of four or five of them are not. We have also gone a long way toward explaining the production of "minutes," as they are called, that is, combinations in which the gene is working at very low efficiency. Our detailed results fit the hypothesis that in some cases when the mechanism comes to a triplet which does not stand for an amino acid (called a "nonsense" triplet) it very occasionally makes a slip and reads, say, only two bases instead of the usual three. These results also enable us to tie down the direction of reading of the genetic message, which in this case is from left to right, as the rII region is conventionally drawn. A final proof of our ideas can only be obtained through detailed studies on the alterations produced in the amino acid sequence of a protein by mutations of the type discussed here.
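The behavior of combined plus mutants follows from simple bookkeeping on the reading frame, and can be sketched as below. This is a simplified illustration, not a model of the rII experiments: the message is a hypothetical repeating sequence, each plus mutation is taken to be exactly one inserted base, and "activity" is reduced to whether the end of the message is read in the original triplets again. The function names are assumptions of the sketch.

```python
def read_triplets(message, start=0):
    """Split `message` into consecutive triplets, beginning at index `start`."""
    return [message[i:i + 3] for i in range(start, len(message) - 2, 3)]

def add_plus_mutations(message, count, extra_base="G"):
    """Insert `count` single extra bases, spaced three original bases apart."""
    insert_before = {3 * (k + 1) for k in range(count)}
    out = []
    for i, base in enumerate(message):
        if i in insert_before:
            out.append(extra_base)  # the added base of a plus mutant
        out.append(base)
    return "".join(out)

def tail_in_frame(original, mutant, n=2):
    """True if the last `n` complete triplets of both messages are identical."""
    return read_triplets(original)[-n:] == read_triplets(mutant)[-n:]

WILD_TYPE = "CAT" * 8  # hypothetical message, read from the left-hand end

results = {
    k: tail_in_frame(WILD_TYPE, add_plus_mutations(WILD_TYPE, k))
    for k in range(1, 7)
}
print(results)
```

In this sketch the tail of the message returns to the correct phase only when the number of added bases is a multiple of three, which matches the pattern reported above for one, two, three, four, five, and six +'s. The sketch ignores what the misread stretch between the mutations contains, which is why the text speaks of sets that are "suitably chosen."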
One further conclusion of a general nature is suggested by our results. It appears that the number of nonsense triplets is rather low, since we only occasionally come across them. However, this conclusion is less secure than our other deductions about the general nature of the genetic code.
It has not yet been shown directly that the genetic message is colinear with its product: that is, that one end of the gene codes for the amino end of the polypeptide chain and the other for the carboxyl end, and that as one proceeds along the gene one comes in turn to the codons in between in the linear order in which the amino acids are found in the polypeptide chain. This seems highly likely, especially as it has been shown that in several systems mutations affecting the same amino acid are extremely near together on the genetic map. The experimental proof of the colinearity of a gene and the polypeptide chain it produces may be confidently expected within the next year or so.
There is one further general question about the genetic code which we can ask at this point. Is the code universal, that is, the same in all organisms? Preliminary evidence suggests that it may well be. For example, something very like rabbit hemoglobin can be synthesized in a cell-free system of which part comes from rabbit reticulocytes and part from Escherichia coli (5). It is not very probable that this would be the case if the code were very different in these two organisms. However, as we shall see, it is now possible to test the universality of the code by more direct experiments.
Attack on the Genetic Code
In a cell in which DNA is the genetic material, it is believed that DNA itself does not control protein synthesis directly, but that the base sequence of the DNA (probably of only one of its chains) is copied onto RNA, and that this special RNA then acts as the genetic messenger and directs the actual process of joining up the amino acids into polypeptide chains. The breakthrough in the coding problem has come from the discovery, made by Nirenberg and Matthaei (6), that one can use synthetic RNA for this purpose. In particular, they found that polyuridylic acid, an RNA in which every base is uracil, would promote the synthesis of polyphenylalanine when added to a cell-free system already known to synthesize polypeptide chains. Thus, one codon for phenylalanine appears to be the sequence UUU (where U stands for uracil; in the same way we use A, G, and C for adenine, guanine, and cytosine, respectively). This discovery has opened the way to a rapid, although somewhat confused, attack on the genetic code.
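The inference from polyuridylic acid to the codon UUU can be written out as a very small lookup. The sketch below assumes a non-overlapping triplet code read from one end, as argued earlier; the table holds only the single assignment reported in the text, and the names `TENTATIVE_CODONS` and `translate` are illustrative.

```python
# Only the assignment reported in the text is included; all other triplets
# are treated as unknown rather than guessed.
TENTATIVE_CODONS = {"UUU": "phenylalanine"}

def translate(message):
    """Read `message` in consecutive triplets and look each one up."""
    return [
        TENTATIVE_CODONS.get(message[i:i + 3], "unknown")
        for i in range(0, len(message) - 2, 3)
    ]

poly_u = "U" * 12  # a short stretch of polyuridylic acid
print(translate(poly_u))
```

Because every triplet in poly U is UUU whatever the starting point, the reading frame does not matter for this particular message, which is part of what makes the experiment so direct.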
It would not be appropriate to review this work in detail here. I have discussed critically the earlier work in the review mentioned (1), but such is the pace of work in this field that more recent experiments have already made the discussion out of date, to some extent. However, some general conclusions can safely be drawn.
The technique mainly used so far, both by Nirenberg and his colleagues (6) and by Ochoa and his group (7), has been to synthesize enzymatically "random" polymers of two or three of the four bases. For example, use of a polynucleotide [which I shall call poly (U,C)], having about equal amounts of uracil and cytosine in (presumably) random order, will increase the incorporation of the amino acids phenylalanine, serine, leucine, and proline, and possibly threonine. By using polymers of different composition and assuming a triplet code one can deduce limited information about the composition of certain triplets.
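The kind of deduction available from random polymers can be sketched by computing how often each triplet composition is expected to occur. The code below is an illustration under stated assumptions: the bases are taken to be in truly random order, the code is taken to be a triplet code, and the function name `triplet_composition_frequencies` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def triplet_composition_frequencies(base_fractions):
    """Expected frequency of each triplet composition in a random polymer.

    `base_fractions` maps each base to its fraction in the polymer. The order
    of bases within a triplet is ignored, since random polymers cannot
    reveal it.
    """
    freq = Counter()
    for triplet in product(base_fractions, repeat=3):
        p = Fraction(1)
        for base in triplet:
            p *= base_fractions[base]
        freq["".join(sorted(triplet))] += p
    return dict(freq)

# Poly (U,C) with about equal amounts of uracil and cytosine.
poly_uc = {"U": Fraction(1, 2), "C": Fraction(1, 2)}
for composition, p in triplet_composition_frequencies(poly_uc).items():
    print(composition, p)
```

Changing the proportions of the bases shifts these expected frequencies, and comparing the shift with the observed incorporation of each amino acid is what allows a tentative composition, though not an order, to be assigned to a triplet.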
From such work it appears that, with minor reservations, each polynucleotide incorporates a characteristic set of amino acids. Moreover, the four bases appear quite distinct in their effects. A comparison between the triplets tentatively deduced by these methods with the changes in amino acid sequence produced by mutation shows a fair measure of agreement. Moreover, the incorporation requires the same components that are needed for protein synthesis and is inhibited by the same inhibitors. Thus, the system is most unlikely to be a complete artifact and is very probably closely related to genuine protein synthesis.
As to the actual triplets so far proposed, it was first thought that possibly every triplet had to include uracil, but this was neither plausible on theoretical grounds nor supported by the experimental evidence. The first direct evidence that this was not so was obtained by my colleagues Bretscher and Grunberg-Manago (8), who showed that a poly (C,A) would stimulate the incorporation of several amino acids. Recently other workers (9, 10) have reported further evidence of this sort for other polynucleotides not containing uracil. It now seems very likely that many of the 64 triplets, possibly most of them, may code one amino acid or another, and that in general several distinct triplets may code one amino acid. In particular, a very elegant experiment (11) suggests that both (UUC) and (UUG) code leucine (the parentheses imply that the order within the triplets is not yet known). This general idea is supported by several indirect lines of evidence which cannot be presented in detail here. Unfortunately it makes the unambiguous determination of triplets by these methods much more difficult than would be the case if there were only one triplet for each amino acid. Moreover, it is not possible, by using polynucleotides of "random" sequence, to determine the order of bases in a triplet. A start has been made to construct polynucleotides whose exact sequence is known at one end, but the results obtained so far are suggestive rather than conclusive (12). It seems likely, however, from this and other (unpublished) evidence, that the amino end of the polypeptide chain corresponds to the "right-hand" end of the polynucleotide chain—that is, the one with the 2',3' hydroxyls on the sugar.
It seems virtually certain that a single chain of RNA can act as messenger RNA, since poly U is a single chain without secondary structure. If poly A is added to poly U to form a double or triple helix, the combination is inactive. Moreover, there is preliminary evidence (9) which suggests that secondary structure within a polynucleotide inhibits the power to stimulate protein synthesis.
It has yet to be shown by direct biochemical methods, as opposed to the indirect genetic evidence mentioned earlier, that the code is indeed a triplet code.
Attempts have been made, from a study of the changes produced by mutation, to obtain the relative order of the bases within various triplets, but my own view is that such attempts are premature until there are more extensive and more reliable data on the composition of the triplets.
Evidence presented by several groups (8, 9, 11) suggests that poly U stimulates the incorporation of both phenylalanine and a lesser amount of leucine. The meaning of this observation is unclear, but it raises the unfortunate possibility of ambiguous triplets, that is, triplets which may code more than one amino acid. However, one would certainly expect such triplets to be in a minority.
Origin of the Grouping
It seems likely, then, that most of the 64 possible triplets will be grouped into 20 groups. The balance of evidence, both from the cell-free system and from the study of mutation, suggests that this grouping does not occur at random, and that triplets coding the same amino acid may well be rather similar. This raises the main theoretical problem now outstanding. Can this grouping be deduced from theoretical postulates? Unfortunately, it is not difficult to see how the grouping might have arisen at an extremely early stage in evolution by random mutations, so that the particular code we have may perhaps be the result of a series of historical accidents. This point is of more than abstract interest. If the code does indeed have some logical foundation, then it is legitimate to consider all the evidence, both good and bad, in any attempt to deduce it. This is not true if the codons have no simple logical connection. In that case, it makes little sense to guess a codon; the important thing is to provide enough evidence to prove each codon independently. It is not yet clear what evidence can safely be accepted as establishing a codon. What is clear is that most of the experimental evidence so far presented falls short of proof in almost all cases.
In spite of the uncertainty of many of the experimental data, there are certain codes which have been suggested in the past which we can now reject with some degree of confidence.
1) Comma-less triplet codes. All such codes are unlikely, not only because of the genetic evidence but also because of the detailed results from the cell-free system.
2) Two-letter or three-letter codes—for example, a code in which A is equivalent to C, and G to U. As already stated, the results from the cell-free system rule out all such codes.
3) The combination triplet code. In this code all permutations of a given combination code the same amino acid. The experimental results can only be made to fit such a code by very special pleading.
4) Complementary codes. There are several classes of these. Consider a certain triplet in relation to the triplet which is complementary to it on the other chain of the double helix. The second triplet may be considered as being read either in the same direction as the first or in the opposite direction. Thus, if the first triplet is UCC, we consider it in relation to either AGG or (reading in the opposite direction) GGA.
It has been suggested that if a triplet stands for an amino acid its complement must necessarily stand for the same amino acid, or, alternatively in another class of codes, that its complement will stand for no amino acid—that is, will be nonsense.
It has recently been shown by Ochoa's group that poly A stimulates the incorporation of lysine (10). Thus, presumably AAA codes lysine. However, since UUU codes phenylalanine, these facts rule out all the foregoing proposed codes. It is also found that poly (U,G) incorporates quite different amino acids from poly (A,C). Similarly, poly (U,C) differs from poly (A,G) (9, 10). Thus, there is little chance that any of the theories of this class will prove correct. Moreover they are all, in my opinion, unlikely for general theoretical reasons.
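The argument against complementary codes can be followed step by step in a short sketch. It uses only the two assignments mentioned in the text (UUU for phenylalanine and AAA for lysine) and the usual RNA base pairing of A with U and G with C; the names `COMPLEMENT`, `complement`, and `REPORTED` are assumptions of the sketch.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def complement(triplet, reverse=False):
    """Return the complementary triplet, read in the same or opposite direction."""
    comp = "".join(COMPLEMENT[base] for base in triplet)
    return comp[::-1] if reverse else comp

# The example from the text: UCC in relation to AGG or GGA.
print(complement("UCC"), complement("UCC", reverse=True))

# Tentative assignments reported in the text.
REPORTED = {"UUU": "phenylalanine", "AAA": "lysine"}

for triplet, amino_acid in REPORTED.items():
    for reverse in (False, True):
        partner = complement(triplet, reverse)
        print(triplet, amino_acid, "| complement:", partner,
              REPORTED.get(partner, "not reported"))
```

Since AAA and UUU are complements of each other in either reading direction, and each appears to code a different amino acid, neither the rule that a complement codes the same amino acid nor the rule that a complement is nonsense can hold for this pair.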
A start has already been made on investigations of the role of the same polynucleotides in cell-free systems from different species, to see if the code is the same in all organisms. Eventually it should be relatively easy to discover in this way whether the code is universal and, if it is not, how it differs from organism to organism. The preliminary results presented so far disclose no clear difference, with respect to the code, between E. coli and mammals, and this is encouraging (10, 13).
At the present time, therefore, the genetic code appears to have the following general properties.
1) Most, if not all, codons consist of three (adjacent) bases.
2) Adjacent codons do not overlap.
3) The message is read in the correct groups of three by starting at some fixed point.
4) The code sequence in the gene is colinear with the amino acid sequence, the polypeptide chain being synthesized sequentially from the amino end.
5) In general, more than one triplet codes each amino acid.
6) It is possible that some triplets may code more than one amino acid—that is, they may be ambiguous.
7) Triplets which code the same amino acid are probably rather similar.
8) It is not known whether there is any general rule in accordance with which such codons are grouped together, or whether the grouping is mainly the result of historical accident.
9) The number of triplets which do not code an amino acid is probably small.
10) Certain codes proposed earlier—such as comma-less codes, two- or three-letter codes, the combination code, and various transposable codes—are all unlikely to be correct.
11) The code is probably much the same in different organisms. It may be the same in all organisms, but this is not yet known.
Finally, one should add that in spite of the great complexity of protein synthesis and in spite of the considerable technical difficulties in synthesizing polynucleotides with defined sequences, it is not unreasonable to hope that all these points will be clarified in the near future, and that the genetic code will be completely established on a sound experimental basis within the next few years.
1. F. H. C. Crick, in Progress in Nucleic Acid Research, J. N. Davidson and Waldo E. Cohn, Eds. (Academic Press, New York, in press).
2. H. G. Wittmann, Z. Vererbungslehre, in press; A. Tsugita, J. Mol. Biol. 5, 284, 293 (1962).
3. F. H. C. Crick, L. Barnett, S. Brenner, R. J. Watts-Tobin, Nature 192, 1227 (1961).
4. M. A. Naughton and H. M. Dintzis, Proc. Natl. Acad. Sci. U.S. 48, 1822 (1962).
5. G. von Ehrenstein and F. Lipmann, ibid. 47, 941 (1961).
6. J. H. Matthaei and M. W. Nirenberg, ibid. 47, 1580 (1961); M. W. Nirenberg and J. H. Matthaei, ibid. 47, 1588 (1961); M. W. Nirenberg, J. H. Matthaei, O. W. Jones, ibid. 48, 104 (1962); J. H. Matthaei, O. W. Jones, R. G. Martin, M. W. Nirenberg, ibid. 48, 666 (1962).
7. P. Lengyel, J. F. Speyer, S. Ochoa, ibid. 47, 1936 (1961); J. F. Speyer, P. Lengyel, C. Basilio, S. Ochoa, ibid. 48, 63 (1962); P. Lengyel, J. F. Speyer, C. Basilio, S. Ochoa, ibid. 48, 282 (1962); J. F. Speyer, P. Lengyel, C. Basilio, S. Ochoa, ibid. 48, 441 (1962); C. Basilio, A. J. Wahba, P. Lengyel, J. F. Speyer, S. Ochoa, ibid. 48, 631 (1962).
8. M. S. Bretscher and M. Grunberg-Manago, Nature 195, 283 (1962).
9. O. W. Jones and M. W. Nirenberg, Proc. Natl. Acad. Sci. U.S., in press.
10. R. S. Gardner, A. J. Wahba, C. Basilio, R. S. Miller, P. Lengyel, J. F. Speyer, ibid., in press.
11. B. Weisblum, S. Benzer, R. W. Holley, ibid. 48, 1449 (1962).
12. A. J. Wahba, C. Basilio, J. F. Speyer, P. Lengyel, R. S. Miller, S. Ochoa, ibid. 48, 1683 (1962).
13. H. R. V. Arnstein, R. A. Cox, J. A. Hunt, Nature 194, 1042 (1962); E. S. Maxwell, Proc. Natl. Acad. Sci. U.S. 48, 1639 (1962); I. B. Weinstein and A. N. Schechter, ibid. 48, 1686 (1962).