Stop! Codon time!


Editor's Introduction

Stop codon reassignments in the wild

annotated by

A central idea in genetics is that all microbes follow the same genetic code when producing proteins.  What happens when organisms start to use this universal vocabulary in a different way?  The authors of this paper explore this question and the resulting implications it has for the field of synthetic biology.  

Paper Details

Original title
Stop codon reassignments in the wild
H. James Tripp
Original publication date
Vol. 344 no. 6186 pp. 909-913
Issue name


The canonical genetic code is assumed to be deeply conserved across all domains of life with very few exceptions. By scanning 5.6 trillion base pairs of metagenomic data for stop codon reassignment events, we detected recoding in a substantial fraction of the >1700 environmental samples examined. We observed extensive opal and amber stop codon reassignments in bacteriophages and of opal in bacteria. Our data indicate that bacteriophages can infect hosts with a different genetic code and demonstrate phage-host antagonism based on code differences. The abundance and diversity of genetic codes present in environmental organisms should be considered in the design of engineered organisms with altered genetic codes in order to preclude the exchange of genetic information with naturally occurring species.


Since the discovery of the genetic code and protein translation mechanisms (1), a limited number of variations of the standard assignment between unique base triplets (codons) and their encoded amino acids and translational stop signals have been found in bacteria and phages (2–7). Given the apparent ubiquity of the canonical genetic code, the design of genomically recoded organisms with noncanonical codes has been suggested as a means to prevent horizontal gene transfer between laboratory and environmental organisms (8–10). It is also predicted that genomically recoded organisms are immune to infection by viruses, under the assumption that phages and their hosts must share a common genetic code (6). This paradigm is supported by the observation of increased resistance of genomically recoded bacteria to phages with a canonical code (9). Despite these assumptions and accompanying lines of evidence, it remains unclear whether differential and noncanonical codon usage represents an absolute barrier to phage infection and genetic exchange between organisms.

Our knowledge of the diversity of genetic codes and their use by viruses and their hosts is primarily derived from the analysis of cultivated organisms. This is due to our limited access to genome sequences from uncultivated organisms, which are estimated to account for 99% of prokaryotes (11). Advances in single-cell sequencing and metagenome assembly technologies have enabled the reconstruction of genomes of uncultivated bacterial and archaeal lineages (12–14) and the discovery of a previously unknown reassignment of TGA opal stop codons to glycine (4, 5, 14). These initial findings suggest that large-scale systematic studies of uncultivated microorganisms and viruses may reveal the extent and modes of divergence from the canonical genetic code operating in nature.

To explore alternative genetic codes, we carried out a systematic analysis of stop codon reassignments from the canonical TAG amber, TGA opal, and TAA ochre codons in assembled metagenomes and metatranscriptomes from environmental and host-associated samples, single-cell genomes of uncultivated bacteria and archaea, and a collection of viral sequences (Fig. 1A) (15). All sequence data were obtained from the Integrated Microbial Genomes (IMG) database (16). This global collection of sequences comprised 1776 samples from 145 studies, including 750 samples obtained from 17 human body sites (fig. S1) (17). In total, 5.6 terabases of sequence data, including 450 Gb of contiguous sequences (contigs) greater than 1 kb, were analyzed. All samples were classified into human-associated, other host–associated, soil, marine, or freshwater environments according to their metadata (15, 18).


Fig. 1.  Recoded DNA sequences identified worldwide. (A) Workflow used to identify contigs that contain stop codon reassignment. (B) Map showing the locations of 82 environmental samples around the globe together with nine sample sites (derived from 212 samples) of the human body for which recoded sequences have been identified.

Fig 1A Questions 

Question:  How much alternative genetic coding of uncultivated bacteria and phage genomes can be detected in all the metagenome data sets found in JGI’s IMG database?

Question: What is the overall distribution of the three types of stop codon reassignments in the data set?

Fig 1A Methods

1.  Assemble short reads into longer contigs, making detection of alternative genetic coding easier.

2. Use gene prediction software under assumption of standard coding and recoding. Presume recoding when gene prediction improves under assumption of recoding.

3. Confirm recoding predictions with manual quality assurance and application of automated filtering thresholds to remove false positives.

Fig 1A Answers

Question:  How much alternative genetic coding of uncultivated bacteria and phage genomes can be detected in all the metagenome data sets found in JGI’s IMG database?

Answer: After sequence assembly, detection of recoding, filtration, and quality control, 198 Mb of recoded sequences were found in 5.6 Tb of raw data.

Question: What is the overall distribution of the three types of stop codon reassignments in the data set?

Answer: Most of the reassignments are opal (69%), few are amber (7%), and the rest are ochre (24%).

Fig 1B Question

Question: What is the distribution of the three reassigned stop codons with respect to geography and location within the human body?

Fig 1B Answers

Question: What is the distribution of the three reassigned stop codons with respect to geography and location within the human body?

Answer: In most environmental samples, which came from the Americas and Western Europe, opal recoding predominated over ochre and amber.  In the few samples where amber or ochre recoding was detected, the amber and ochre recoding rate was high. In the human body, no ochre recoding was found.

Inference: For reasons unknown, alternative genetic coding patterns vary by habitat. 

We used a statistic of increased coding potential under alternate genetic codes as calculated by ab initio gene finder Prodigal (19), which was selected for its low rate of false-positive predictions (15). Contigs showing significantly higher coding potential when annotated with modified translation tables were forwarded to filtering and quality control to confirm stop codon reassignment through multiple sequence alignments to known homologs from the National Center for Biotechnology Information protein database (Fig. 1A) (15).
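The idea of comparing coding potential under alternate genetic codes can be illustrated with a deliberately simplified sketch. It does not reproduce Prodigal's scoring. As a stand-in proxy, it measures the longest open reading frame (ATG to first in-frame stop, on either strand) under the standard stop set and again with one stop codon treated as a sense codon. The function names and the toy contig are hypothetical.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string (non-ACGT characters pass through)."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf(seq, stops):
    """Length in bases of the longest ATG-initiated ORF over all six frames.

    An ORF that runs off the end of the contig is counted up to the last
    complete codon, since metagenomic contigs often truncate genes.
    """
    seq = seq.upper()
    best = 0
    for strand in (seq, revcomp(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None:
                    if codon == "ATG":
                        start = i
                elif codon in stops:
                    best = max(best, i - start)
                    start = None
            if start is not None:
                tail = len(strand) - start
                best = max(best, tail - tail % 3)
    return best


def coding_gain(seq, reassigned_stop):
    """Extra ORF length gained when one stop codon is read as a sense codon."""
    alt_stops = STANDARD_STOPS - {reassigned_stop}
    return longest_orf(seq, alt_stops) - longest_orf(seq, STANDARD_STOPS)


# Toy contig: one gene interrupted by an in-frame TGA, terminated by TAA.
toy = "ATG" + "GCT" * 5 + "TGA" + "GCT" * 5 + "TAA"
print(coding_gain(toy, "TGA"))  # gain if opal is reassigned
print(coding_gain(toy, "TAG"))  # gain if amber is reassigned
```

A positive gain for exactly one stop codon is the kind of signal that would send a contig on to filtering and alignment-based confirmation. A real pipeline would need a statistical threshold rather than a raw length difference.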

By applying this approach to 450 Gb of contigs larger than 1 kb in size, we identified 31,415 contigs with evidence of stop codon reassignment, adding up to a total of 198 Mb of recoded DNA (Fig. 1A). No recoding was observed in the metatranscriptome data. Varying ratios of reassigned to total contigs were observed in samples from terrestrial and aquatic environments and from human mouth, throat, and stool microbiomes (Fig. 1B and fig. S1). The greatest reassignment ratio was in a groundwater sample from a sulfidic aquifer, where 10.4% of all the assembled contigs displayed evidence that one of the three stop codons had been reassigned (table S2). High ratios of contig recoding were also detected in human oral microbiome samples (table S2).

Reassignment of all three stop codons was found but with different preferences by domain and habitat (Fig. 2). We observed distinct patterns of stop codon reassignment in the three domains of life, with bacteria showing only opal reassignments, ochre reassignments restricted to eukaryotes, and archaea devoid of codon reassignments (15). Among viruses, we found both amber and opal reassignments. These observations are restricted to DNA viruses only because of the scarcity of sequence information for RNA viruses of bacteria and archaea (Fig. 2) (15). Metagenomes of human body sites showed a high rate of reassignments compared with most other sampling sites. Only 10% of all contigs examined in this study originated from human body sites, but they represented 51% of all contigs with codon reassignments. The majority of the remaining stop codon reassignments were found in freshwater environments (44%), representing 13% (56.0 Gb) of all examined metagenomes. In contrast, marine samples contributed only 4% of recoded sequence, although they represent 48% (211.6 Gb) of the total data set (15). This suggests that codon reassignments are more abundant in freshwater than in marine samples.
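The habitat comparison above is a ratio argument: a habitat is enriched for recoding when its share of recoded sequence exceeds its share of the examined data. The short calculation below uses only the percentages quoted in the text. Note that the text mixes units (contig counts for human body sites, sequence volume for freshwater and marine samples), so the ratios are indicative only. The dictionary layout and function name are illustrative.

```python
# Percent shares as quoted in the text.
shares = {
    "human-associated": {"examined": 10, "recoded": 51},
    "freshwater": {"examined": 13, "recoded": 44},
    "marine": {"examined": 48, "recoded": 4},
}


def enrichment(examined_pct, recoded_pct):
    """Ratio above 1 means the habitat is over-represented among recoded data."""
    return recoded_pct / examined_pct


ratios = {
    habitat: round(enrichment(v["examined"], v["recoded"]), 2)
    for habitat, v in shares.items()
}

for habitat, ratio in ratios.items():
    print(f"{habitat}: {ratio}")
```

On these figures freshwater is over-represented roughly 3.4-fold and marine under-represented roughly 12-fold, a difference of about 40-fold between the two aquatic habitats.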


Fig. 2.  Stop codon reassignment by taxonomy and habitat.  Relative abundance of amber, ochre, and opal stop codon reassignments among bacteria, eukaryotes, and viruses in metagenomes of different habitats. For the sake of clarity, contig sets with less than 1 Mb total length for each combination of domain, stop codon, and habitat were excluded; see fig. S5, A and B (15).

Question being asked

How does reassignment of the three stop codons vary by domain of life and habitat?


Bacterial contigs showed reassignment in all habitats, but only for the opal stop codon. Eukaryotic contigs showed reassignment only for ochre and only in freshwater habitats. Viral contigs showed reassignment only for opal and amber, and only in human samples.


Inference: For reasons unknown, stop codon reassignment varies by domain of life.

Among bacteria, previous reports of recoding were restricted to the reassignment of the opal stop codon (3, 5, 20). Despite our extensive sampling of bacterial sequences, we also observed reassignment exclusively for opal codons. Opal reassignments to Trp have been previously observed in Mollicutes (20) and Candidatus Hodgkinia cicadicola (3), and opal reassignments to Gly have been observed in uncultivated representatives of candidate phyla SR1 and Gracilibacteria (4, 5). Our extensive survey suggests that opal reassignment in bacteria is likely limited to the same specific lineages (Fig. 3). The multiple SR1 and Gracilibacteria sequences enabled us to explore the evolutionary origin of stop codon reassignment in these closely related, uncultivated bacterial lineages. A maximum likelihood phylogenetic tree revealed that opal reassignment occurred in the last common ancestor of these sister lineages after its separation from the Peregrines (PERs) group and before the divergence of SR1 and Gracilibacteria (Fig. 3). The same phylogenetic analysis performed for the opal reassigned members of the class Mollicutes indicates a single reassignment event within the last common ancestor of the Mycoplasmatales and Entomoplasmatales.
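To make the two bacterial reassignments concrete, the sketch below builds the standard codon table and derives variants in which TGA (opal) encodes Trp or Gly. These correspond to what NCBI lists as translation tables 4 and 25 as far as the stop codon change is concerned. The helper names and the example sequence are illustrative and not taken from the study.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in NCBI order (third base varies fastest, bases T, C, A, G).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def recode(table, codon, amino_acid):
    """Return a copy of a codon table with one codon reassigned."""
    new_table = dict(table)
    new_table[codon] = amino_acid
    return new_table


def translate(dna, table):
    """Translate frame 0 of a DNA string; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


OPAL_TO_TRP = recode(STANDARD, "TGA", "W")  # as in Mollicutes
OPAL_TO_GLY = recode(STANDARD, "TGA", "G")  # as in SR1 and Gracilibacteria

gene = "ATGTGAGGTTAA"
print(translate(gene, STANDARD))     # premature stop under the standard code
print(translate(gene, OPAL_TO_TRP))  # TGA read as Trp
print(translate(gene, OPAL_TO_GLY))  # TGA read as Gly
```

The same gene therefore yields a truncated product in a standard-code host and two different full-length products in the two recoded lineages, which is why alignment to known homologs can distinguish the Trp and Gly reassignments.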


Fig. 3.  A maximum likelihood phylogenetic tree of bacterial stop codon reassigned sequences, based on concatenated alignments of protein-coding marker genes. The arrow at the root of the tree points to the outgroup (Terrabacteria). The tree shows the recoded taxonomic groups Mycoplasmatales and Entomoplasmatales (opal to Trp), SR1 (opal to Gly), and Gracilibacteria (opal to Gly) along with non-recoded reference phyla. The highly reduced alpha-proteobacterial Candidatus Hodgkinia cicadicola genome was not included. The red circles denote two reassignment events. PVC, Planctomycetes, Verrucomicrobia, and Chlamydiae; FCB, Fibrobacteres, Chlorobi, and Bacteroidetes. Sequences published in (4, 5, 14).


What is the evolutionary origin of the opal stop codon reassignment in bacteria?


Opal stop codon reassignment originated twice: once in the ancestor of two classes of culturable Mollicutes, and once in the ancestor of unculturable SR1 and Gracilibacteria members. One reassignment was to tryptophan and the other was to glycine.


Inference: Stop codon reassignments can emerge in different phylogenetic groups and at different times, and can assign different amino acids to the same stop codon.

Although the average GC content of the entire data set was 55%, recoded bacterial sequences had an average GC content of 32%, consistent with previous studies (21). In recoded contigs, we observed a shift to low-GC synonymous codons and/or a shift to low-GC nonsynonymous codons for amino acids with similar chemical properties, supporting the hypothesis that changes in the genomic GC content correlate with and may drive reassignment of stop codons (figs. S3 and S4). In addition, low-GC organisms used the ochre stop codon (TAA) to a higher extent (84% in organisms with GC content <32% versus 41% in organisms with GC content >64%) than amber (TAG) and opal (TGA). In extreme cases of low GC content, nearly all genes terminate in an ochre stop codon.
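The two quantities discussed here, GC content and the share of genes ending in each stop codon, are simple to compute. The following minimal sketch assumes coding sequences are supplied as plain strings that include their terminating codon. The function names and toy genes are illustrative.

```python
from collections import Counter

STOPS = {"ochre": "TAA", "amber": "TAG", "opal": "TGA"}


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


def stop_usage(genes):
    """Fraction of genes terminating in each of the three stop codons."""
    last_codons = Counter(gene.upper()[-3:] for gene in genes)
    total = sum(last_codons[codon] for codon in STOPS.values())
    if total == 0:
        return {name: 0.0 for name in STOPS}
    return {name: last_codons[codon] / total for name, codon in STOPS.items()}


# Toy low-GC gene set: three genes end in TAA, one in TGA.
genes = ["ATGAAATAA", "ATGTTTTAA", "ATGAATTAA", "ATGGCTTGA"]
print(round(gc_content("".join(genes)), 2))
print(stop_usage(genes))
```

Applied per genome and binned by GC content, a calculation of this kind would give the sort of ochre usage comparison quoted above. The toy data are chosen only to show a low-GC gene set dominated by TAA.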

Although our pipeline for alternative genetic code detection was initially developed for prokaryotic genes, it can identify eukaryotic sequences with reassigned stop codons. In agreement with previous reports (2), among eukaryotic sequences we observed opal reassignment in mitochondrial sequences and ochre and amber reassignment in nuclear sequences. Nuclear sequences with stop codon reassignments appear to belong to representatives of Ciliophora (table S3), and ochre reassignments were found exclusively in freshwater samples (Fig. 2).

We identified 19 complete and nearly complete DNA phages with amber stop codon reassignments in 177 out of 784 human microbiome samples (15). Previous reports of alternative genetic codes in DNA viruses are limited to opal reassignment to Trp in Mycoplasma phages (7, 22). In our study, we identified two phages with opal reassignment to Gly, a phage with amber reassignment to Ser, and 14 phages with amber reassignment to Gln (table S8), a code previously observed only in nuclear genes of eukaryotes (23). These reassignments are supported by protein alignments (figs. S6 to S8), as well as by the presence of amber-recognizing Gln-tRNACUA and Ser-tRNACUA in the phage contigs (figs. S9 and S10). We infer from the genome structure of recoded DNA phages that they belong to the order Caudovirales. None of the amber-reassigned phage sequences was embedded into recognizable bacterial sequences, and no integrases were detected in the phage genomes, suggesting that these DNA phages are lytic.

Because phages largely depend on the translational machinery of their hosts, it has been suggested that they must use the same genetic code (6, 9). Evidence supporting the matching usage of genetic codes between an opal-reassigned phage and its host was obtained by looking for footprints of phage infections in phage-derived spacers of the CRISPR (clustered regularly interspaced short palindromic repeat) adaptive immune system of bacteria (24). Out of 26 unique spacers found on CRISPR-harboring contigs with opal reassignment, two spacers had an exact full-length match in the sequence of opal-recoded phages (table S9). Alignment of protein-coding genes on both contigs confirmed that they have opal to Gly reassignment.

The observation of amber reassignments in phages raises questions about the genetic code of their target hosts, given the apparent absence of amber-recoded bacterial genomes from environments in which amber-recoded phages were present (Fig. 2). This raises the possibility that genetic code differences between phages and hosts do not constitute an obligate barrier to phage infection. By analyzing 29,017 spacers found in CRISPR elements from 553 human oral and stool samples, we identified five spacers (each 33 to 37 bp long) that were identical to sequences from three different amber-recoded phage genomes (table S9). The contigs containing the CRISPR spacers also included bacterial genes with the full complement of canonical stop codons. The identified bacterial genes were nearly identical to genes from two Prevotella strains that were isolated from human airways and subgingival plaque and shown to have a standard genetic code. These data suggest that amber-reassigned phages can infect hosts with different genetic codes, in this case the standard code.
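The spacer evidence rests on exact, full-length matches between CRISPR spacers and phage sequences. A minimal sketch of such a search is shown below. It checks both strands because a spacer may be recorded in either orientation. The identifiers and sequences are invented for illustration and are far shorter than the 33 to 37 bp spacers reported, and a search over tens of thousands of spacers would normally use an indexed aligner rather than substring scans.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def spacer_hits(spacers, phage_genomes):
    """List (spacer_id, phage_id, strand) for exact full-length matches."""
    hits = []
    for spacer_id, spacer in spacers.items():
        forward = spacer.upper()
        reverse = revcomp(forward)
        for phage_id, genome in phage_genomes.items():
            genome = genome.upper()
            if forward in genome:
                hits.append((spacer_id, phage_id, "+"))
            elif reverse in genome:
                hits.append((spacer_id, phage_id, "-"))
    return hits


# Hypothetical toy data.
phage_genomes = {"phage_A": "TTGACCGTAGGCTAACGTTAGCCATGCA"}
spacers = {
    "sp1": "CGTAGGCTAACG",  # matches the forward strand
    "sp2": "TGCATGGC",      # matches the reverse strand
    "sp3": "AAAAAAAAAAAA",  # no match
}
print(spacer_hits(spacers, phage_genomes))
```

A hit links a phage to the bacterium carrying the CRISPR array. The genetic code of that bacterium is then inferred separately from the genes on the same contig.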

To gain further insight into mechanisms that may enable amber-recoded phages to infect hosts with different genetic codes, we examined the assembled genomes of amber-recoded phages. In several of these phage genomes, we identified genes for peptide chain release factor 2 (RF-2), which terminates translation at ochre and opal stop codons. Sequenced isolate phages lack genes for release factors, apparently harnessing the host-encoded release factors. The presence of RF-2 in a phage genome suggests that the phage may infect a host lacking RF-2, a hallmark of opal-reassigned bacterial genomes (3, 25). Consistent with this possibility, the human oral cavity environments where amber-recoded RF-2–containing phages were detected lack amber-recoded bacteria but are enriched for opal-recoded bacteria. A further atypical feature noted in the genome of one of these phages (phage 2) is a bimodal pattern of amber reassignment across the genome (Fig. 4A). Initial annotation of this phage genome suggests that it is a lytic phage broadly related to T4 (fig. S11), in which amber has been reassigned to code for Gln.


Fig. 4.  Phage infections across genetic code boundaries.  (A) Genome of phage 2. The phage genome is broadly divided into two domains with strong bias in codon utilization as well as strand preference. (B) Model of infection of opal-recoded hosts by amber > Gln-recoded phages.


How do amber-reassigned phages infect bacteria with different genetic codes?

Fig 4A

Amber-reassignment in phages is sometimes limited to one part of the phage genome.

Observation: The distribution of reassigned amber codons across the genome showed distinct regions. There is a low-amber (LA) part of the genome with infrequent in-frame amber codons that encodes proteins involved in early-stage infection, and a high-amber (HA) domain containing frequent in-frame amber codons that encodes genes involved in late-stage infection. The LA domain contains a gene that encodes release factor 2 (RF-2). The HA domain contains a gene that encodes a tRNA for reassigning amber to glutamine (Gln).

Inference: Early expression of RF-2 would interfere with host expression of its opal recoded genes, which cannot tolerate RF-2. Then, later expression of the phage tRNA for amber reassignment would enable the expression of amber recoded genes in the phage.  

Fig 4B

Hypothetical model of an amber-reassigned phage infecting an opal-reassigned host.

Model: During initial phage infection, LA domain genes are expressed, including RF-2, which interferes with translation of opal-reassigned host genes. The host’s RF-1 does not interfere with expression of phage LA domain genes during early infection, because they do not use amber stop codons. The phage’s interference with host RF-1 expression eventually depletes the host’s RF-1 supply. Then, late expression of phage Gln-tRNACUA allows expression of the phage’s HA domain genes, because there is no host RF-1 to interfere with phage Gln-tRNACUA. This explains how amber-reassigned phages can infect an opal-recoded host.

This phage also contains a noncanonical Gln-tRNACUA, but closer examination of amber distribution across its genome reveals two large domains with distinct gene content and codon usage. The low-amber (LA) domain contains genes often found in early-stage phage infection, such as DNA polymerase. The LA domain also contains the RF-2 gene required for normal translation of amber-recoded genes. Open reading frames in the LA domain are almost entirely devoid of in-frame amber codons and instead rely nearly exclusively on canonical glutamine codons to encode glutamine (Fig. 4A). In contrast, the high-amber (HA) domain with frequent in-frame amber codons contains genes often associated with late stages of phage infection, such as packaging and assembly components (e.g., predicted tail fiber protein, minor tail protein, tail tape measure protein, or tail-associated lysozyme) (Fig. 4A). This distinct codon utilization, combined with the presence of RF-2 and a Gln-tRNACUA in the amber-recoded phage, suggests that the amber-recoded phage actively interferes with the translation of its presumed opal-recoded host through a proposed mode of phage-host antagonism (Fig. 4B). In this model, upon initial phage infection abundant host-derived RF-1 (the release factor that terminates peptide chain elongation at amber codons) interferes with the translation of amber-containing phage HA domain genes, so they are initially not expressed. In contrast, critical amber-free phage LA domain genes can be normally translated. Next, phage-derived expression of RF-2 increasingly interferes with translation of opal-recoded host genes. Last, the simultaneous depletion in host-derived RF-1 and the increasing availability of phage-derived Gln-tRNACUA enable the efficient production of assembly and packaging proteins from the phage HA-domain.
Although direct in vivo observations of such processes remain to be established, this evidence supports a mechanism of phage-host antagonism in which the host’s viability is undermined by the phage through the targeted codon-based disruption of the translation of the host’s genetic code.
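The LA/HA distinction can be expressed as a per-gene statistic: of the positions that would be read as glutamine, what share is written as in-frame TAG rather than the canonical CAA or CAG? The sketch below computes that share under the simplifying assumption that every internal in-frame TAG encodes Gln. The function names, the 0.5 threshold, and the toy genes are hypothetical and not taken from the study.

```python
GLN_CODONS = ("CAA", "CAG")
AMBER = "TAG"


def internal_codons(cds):
    """In-frame codons of a coding sequence, excluding the terminating codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return codons[:-1]


def amber_share_of_gln(cds):
    """Share of presumed Gln positions encoded by TAG instead of CAA/CAG."""
    codons = internal_codons(cds)
    amber = codons.count(AMBER)
    canonical = sum(codons.count(codon) for codon in GLN_CODONS)
    total = amber + canonical
    return amber / total if total else 0.0


def label_domain(cds, threshold=0.5):
    """Label a gene high-amber (HA) or low-amber (LA); threshold is arbitrary."""
    return "HA" if amber_share_of_gln(cds) >= threshold else "LA"


early_gene = "ATGCAACAGGCTTAA"  # Gln written as CAA and CAG
late_gene = "ATGTAGGCTTAGTAA"   # Gln written as in-frame TAG
for name, cds in (("early", early_gene), ("late", late_gene)):
    print(name, amber_share_of_gln(cds), label_domain(cds))
```

Plotting this share gene by gene along the genome would be one way to visualize a bimodal pattern like the one described for phage 2.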

This survey of environmental sequence data revealed the abundance and diversity of stop codon reassignments in prokaryotes and phages. Several lines of evidence suggest that phages are not obligated to adapt to the codon usage of their hosts and that phages can exploit differences in codon usage to manipulate their hosts. Recently, genomically recoded organisms were created in an attempt to isolate the organism’s genetic information from horizontal transfer to natural organisms and viruses (9). The diversity and abundance of recoding among uncultured environmental microbes and their phages suggests that even synthetic genomically recoded organisms (9) may not be immune to the exchange of genetic information with microbes and phages that populate many ecosystems.

Supplementary Materials

Materials and Methods

Supplementary Text

Figs. S1 to S12

Tables S1 to S11

References (26–53)

References and Notes

  1. M. Nirenberg, P. Leder, M. Bernfield, R. Brimacombe, J. Trupin, F. Rottman, C. O’Neal, RNA codewords and protein synthesis, VII. On the general nature of the RNA code. Proc. Natl. Acad. Sci. U.S.A. 53, 1161–1168 (1965).

  2. R. D. Knight, S. J. Freeland, L. F. Landweber, Rewiring the keyboard: Evolvability of the genetic code. Nat. Rev. Genet. 2, 49–58 (2001).

  3. J. P. McCutcheon, B. R. McDonald, N. A. Moran, Origin of an alternative genetic code in the extremely small and GC-rich genome of a bacterial symbiont. PLOS Genet. 5, e1000565 (2009).

  4. J. H. Campbell, P. O’Donoghue, A. G. Campbell, P. Schwientek, A. Sczyrba, T. Woyke, D. Söll, M. Podar, UGA is an additional glycine codon in uncultured SR1 bacteria from the human microbiota. Proc. Natl. Acad. Sci. U.S.A. 110, 5540–5545 (2013).

  5. C. Rinke, P. Schwientek, A. Sczyrba, N. N. Ivanova, I. J. Anderson, J. F. Cheng, A. Darling, S. Malfatti, B. K. Swan, E. A. Gies, J. A. Dodsworth, B. P. Hedlund, G. Tsiamis, S. M. Sievert, W. T. Liu, J. A. Eisen, S. J. Hallam, N. C. Kyrpides, R. Stepanauskas, E. M. Rubin, P. Hugenholtz, T. Woyke, Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).

  6. L. A. Shackelton, E. C. Holmes, The role of alternative genetic codes in viral evolution and emergence. J. Theor. Biol. 254, 128–134 (2008).

  7. L. L. Voelker, K. Dybvig, Sequence analysis of the Mycoplasma arthritidis bacteriophage MAV1 genome identifies the putative virulence factor. Gene 233, 101–107 (1999).

  8. M. J. Lajoie, S. Kosuri, J. A. Mosberg, C. J. Gregg, D. Zhang, G. M. Church, Probing the limits of genetic recoding in essential genes. Science 342, 361–363 (2013).

  9. M. J. Lajoie, A. J. Rovner, D. B. Goodman, H. R. Aerni, A. D. Haimovich, G. Kuznetsov, J. A. Mercer, H. H. Wang, P. A. Carr, J. A. Mosberg, N. Rohland, P. G. Schultz, J. M. Jacobson, J. Rinehart, G. M. Church, F. J. Isaacs, Genomically recoded organisms expand biological functions. Science 342, 357–360 (2013).

  10. J. S. Dymond, S. M. Richardson, C. E. Coombes, T. Babatz, H. Muller, N. Annaluru, W. J. Blake, J. W. Schwerzmann, J. Dai, D. L. Lindstrom, A. C. Boeke, D. E. Gottschling, S. Chandrasegaran, J. S. Bader, J. D. Boeke, Synthetic chromosome arms function in yeast and generate phenotypic diversity by design. Nature 477, 471–476 (2011).

  11. M. S. Rappé, S. J. Giovannoni, The uncultured microbial majority. Annu. Rev. Microbiol. 57, 369–394 (2003).

  12. M. Albertsen, P. Hugenholtz, A. Skarshewski, K. L. Nielsen, G. W. Tyson, P. H. Nielsen, Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 31, 533–538 (2013).

  13. T. Woyke, G. Xie, A. Copeland, J. M. González, C. Han, H. Kiss, J. H. Saw, P. Senin, C. Yang, S. Chatterji, J. F. Cheng, J. A. Eisen, M. E. Sieracki, R. Stepanauskas, Assembling the marine metagenome, one cell at a time. PLOS ONE 4, e5299 (2009).

  14. K. C. Wrighton, B. C. Thomas, I. Sharon, C. S. Miller, C. J. Castelle, N. C. VerBerkmoes, M. J. Wilkins, R. L. Hettich, M. S. Lipton, K. H. Williams, P. E. Long, J. F. Banfield, Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 337, 1661–1665 (2012).

  15. Materials and methods are available as supplementary materials on Science Online.

  16. V. M. Markowitz, I. M. Chen, K. Chu, E. Szeto, K. Palaniappan, Y. Grechkin, A. Ratner, B. Jacob, A. Pati, M. Huntemann, K. Liolios, I. Pagani, I. Anderson, K. Mavromatis, N. N. Ivanova, N. C. Kyrpides, IMG/M: The integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 40, D123–D129 (2012).

  17. Human Microbiome Project Consortium, A framework for human microbiome research. Nature 486, 215–221 (2012).

  18. N. Ivanova, S. G. Tringe, K. Liolios, W. T. Liu, N. Morrison, P. Hugenholtz, N. C. Kyrpides, A call for standardized classification of metagenome projects. Environ. Microbiol. 12, 1803–1805 (2010).

  19. D. Hyatt, G. L. Chen, P. F. Locascio, M. L. Land, F. W. Larimer, L. J. Hauser, Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).

  20. J. M. Bové, Molecular features of mollicutes. Clin. Infect. Dis. 17 (suppl. 1), S10–S31 (1993).

  21. S. Osawa, T. H. Jukes, K. Watanabe, A. Muto, Recent evidence for evolution of the genetic code. Microbiol. Rev. 56, 229–264 (1992).

  22. H. Tu, L. L. Voelker, X. Shen, K. Dybvig, Complete nucleotide sequence of the mycoplasma virus P1 genome. Plasmid 45, 122–126 (2001).

  23. S. U. Schneider, M. B. Leible, X. P. Yang, Strong homology between the small subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase of two species of Acetabularia and the occurrence of unusual codon usage. Mol. Gen. Genet. 218, 445–452 (1989).

  24. L. A. Marraffini, E. J. Sontheimer, CRISPR interference: RNA-directed adaptive immunity in bacteria and archaea. Nat. Rev. Genet. 11, 181–190 (2010).

  25. Y. Inagaki, Y. Bessho, S. Osawa, Lack of peptide-release activity responding to codon UGA in Mycoplasma capricolum. Nucleic Acids Res. 21, 1335–1338 (1993).

  26. C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, T. L. Madden, BLAST+: Architecture and applications. BMC Bioinformatics 10, 421 (2009).

  27. S. Van Dongen, “A Cluster Algorithm for Graphs” Technical Report INS-R0010, (National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, 2000).

  28. R. C. Edgar, Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).

  29. R. C. Edgar, MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).

  30. S. R. Eddy, A new generation of homology search tools based on probabilistic inference. Genome Inform. 23, 205–211 (2009).

  31. D. H. Haft, J. D. Selengut, R. A. Richter, D. Harkins, M. K. Basu, E. Beck, TIGRFAMs and Genome Properties in 2013. Nucleic Acids Res. 41, D387–D395 (2013).

  32. M. N. Price, P. S. Dehal, A. P. Arkin, FastTree 2—approximately maximum-likelihood trees for large alignments. PLOS ONE 5, e9490 (2010).

  33. W. Ludwig, O. Strunk, R. Westram, L. Richter, H. Meier, A. Yadhukumar, T. Buchner, S. Lai, G. Steppi, W. Jobb, I. Förster, S. Brettske, A. W. Gerber, O. Ginhart, S. Gross, S. Grumann, R. Hermann, A. Jost, T. König, R. Liss, M. Lüssmann, B. May, B. Nonhoff, R. Reichel, A. Strehlow, N. Stamatakis, A. Stuckmann, M. Vilbig, T. Lenke, A. Ludwig, K. H. Bode, Schleifer, ARB: A software environment for sequence data. Nucleic Acids Res. 32, 1363–1371 (2004).

  34. I. Letunic, P. Bork, Interactive Tree Of Life v2: Online annotation and display of phylogenetic trees made easy. Nucleic Acids Res. 39 (suppl.), W475–W478 (2011).

  35. W. J. Kent, BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).

  36. F. Sievers, D. G. Higgins, Clustal Omega, accurate alignment of very large numbers of sequences. Methods Mol. Biol. 1079, 105–116 (2014).

  37. M. Waterhouse, J. B. Procter, D. M. Martin, M. Clamp, G. J. Barton, Jalview Version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics 25, 1189–1191 (2009).

  38. E. P. Nawrocki, S. R. Eddy, Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).

  39. P. Schattner, A. N. Brooks, T. M. Lowe, The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33, W686–W689 (2005).

  40. D. A. Sorescu, M. Möhl, M. Mann, R. Backofen, S. Will, CARNA—alignment of RNA structure ensembles. Nucleic Acids Res. 40, W49–W53 (2012).

  41. R. C. Edgar, PILER-CR: Fast and accurate identification of CRISPR repeats. BMC Bioinformatics 8, 18 (2007).

  42. C. Bland, T. L. Ramsey, F. Sabree, M. Lowe, K. Brown, N. C. Kyrpides, P. Hugenholtz, CRISPR recognition tool (CRT): A tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8, 209 (2007).

  43. D. J. Young, C. D. Edgar, J. Murphy, J. Fredebohm, E. S. Poole, W. P. Tate, Bioinformatic, structural, and functional analyses support release factor-like MTRF1 as a protein able to decode nonstandard stop codons beginning with adenine in vertebrate mitochondria. RNA 16, 1146–1155 (2010).

  44. K. Tamura, D. Peterson, N. Peterson, G. Stecher, M. Nei, S. Kumar, MEGA5: Molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. 28, 2731–2739 (2011).

  45. F. E. Dewhirst, T. Chen, J. Izard, B. J. Paster, A. C. Tanner, W. H. Yu, A. Lakshmanan, W. G. Wade, The human oral microbiome. J. Bacteriol. 192, 5002–5017 (2010).

  46. T. Takeshita, N. Suzuki, Y. Nakano, M. Yasui, M. Yoneda, Y. Shimazaki, T. Hirofuji, Y. Yamashita, Discrimination of the oral microbiota associated with high hydrogen sulfide and methyl mercaptan production. Sci. Rep. 2, 215 (2012).

  47. W. Zhu, A. Lomsadze, M. Borodovsky, Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 38, e132 (2010).

  48. A. L. Delcher, K. A. Bratke, E. C. Powers, S. L. Salzberg, Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23, 673–679 (2007).

  49. A. Pati, N. N. Ivanova, N. Mikhailova, G. Ovchinnikova, S. D. Hooper, A. Lykidis, N. C. Kyrpides, GenePRIMP: A gene prediction improvement pipeline for prokaryotic genomes. Nat. Methods 7, 455–457 (2010).

  50. M. J. Bibb, P. R. Findlay, M. W. Johnson, The relationship between base composition and codon usage in bacterial genes and its use for the simple and reliable identification of protein-coding sequences. Gene 30, 157–166 (1984).

  51. K. Mavromatis, N. Ivanova, K. Barry, H. Shapiro, E. Goltsman, A. C. McHardy, I. Rigoutsos, A. Salamov, F. Korzeniewski, M. Land, A. Lapidus, I. Grigoriev, P. Richardson, P. Hugenholtz, N. C. Kyrpides, Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat. Methods 4, 495–500 (2007).

  52. H. B. Nielsen et al., Dependencies among metagenomic species, viruses, plasmids and units of genetic variation, Sequences Submitted to GenBank (2013); see

  53. K. Holmfeldt, N. Solonenko, M. Shah, K. Corrier, L. Riemann, N. C. Verberkmoes, M. B. Sullivan, Twelve previously unknown phage genera are ubiquitous in global oceans. Proc. Natl. Acad. Sci. U.S.A. 110, 12798–12803 (2013).

Acknowledgments: We thank the DOE JGI production sequencing, IMG, and Genomes OnLine Database teams for their support and J. Kim, A. Tadmor, and A. Nord for reviewing the manuscript. The work conducted by the DOE JGI was supported in part by the Office of Science of DOE under contract DE-AC02-05CH11231. Supporting data can be accessed through and can be downloaded from