16SDB7

16SDB7 is a database of RNA 16S sequences developed by Era7 Bioinformatics. The RNA 16S sequences are curated automatically through a procedure that extracts the information from selected databases in RNAcentral. Our database is open and is listed in the Featured use cases section of the RNAcentral home page.


Input

The RNA 16S database extracts all of its input information from RNAcentral, "a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types" [1].

The raw input is downloaded from the FTP server managed by RNAcentral. For a given version, the input for the process consists of the following two files:

Sequences

This file is a gzipped FASTA file that contains all the sequences of the RNAcentral database, each identified by a Unique RNA Sequence identifier (URS). Every sequence in RNAcentral is distinct, so no two URSs share the same sequence.

It can be publicly accessed from here.
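As an illustration of how such a file can be consumed, the following is a minimal sketch of a streaming reader for a gzipped FASTA file. The function names and the choice to take the first whitespace-delimited token of each header as the identifier are assumptions made for this example; they are not taken from the 16SDB7 code.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    identifier = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier is the first token of the header line.
            parts = line[1:].split()
            identifier = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def read_gzipped_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Stream records from a gzipped FASTA file without loading it whole."""
    with gzip.open(path, "rt") as handle:
        yield from parse_fasta(handle)
```

Streaming record by record keeps memory use flat, which matters for a file that holds every sequence in RNAcentral.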

Mappings

This file is a gzipped TSV file that links every URS with all of its submissions to the input databases, including the type of sequence and its origin.

This file can be publicly accessed from here.
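A sketch of how the mappings could be grouped by URS is shown below. The column order used here (URS, source database, external identifier, taxon identifier, RNA type, description) is an assumption made for illustration and should be checked against the actual file; the names Annotation and group_annotations are likewise hypothetical.

```python
import csv
import gzip
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Annotation(NamedTuple):
    database: str
    external_id: str
    taxon_id: str
    rna_type: str
    description: str


def group_annotations(lines: Iterable[str]) -> Dict[str, List[Annotation]]:
    """Group tab-separated mapping lines by their URS (first column).

    Assumed column order: URS, database, external id, taxon id,
    RNA type, description (the last one optional).
    """
    grouped = defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 5:
            continue  # skip empty or malformed lines
        urs, database, external_id, taxon_id, rna_type = row[:5]
        description = row[5] if len(row) > 5 else ""
        grouped[urs].append(
            Annotation(database, external_id, taxon_id, rna_type, description)
        )
    return dict(grouped)


def read_gzipped_mappings(path: str) -> Dict[str, List[Annotation]]:
    """Read a gzipped TSV mappings file into a URS -> annotations dictionary."""
    with gzip.open(path, "rt", newline="") as handle:
        return group_annotations(handle)
```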

 

Data Generation

After each release of RNAcentral, we trigger a procedure that automatically generates a new version of our own database.

The code that automates this procedure, together with all its documentation, is available under the GNU Affero General Public License v3.0 in this GitHub repository.

Although RNAcentral gathers sequences from dozens of expert databases (the full list is available on the RNAcentral Expert Databases page), we only collect sequences coming from five selected databases.

Each RNAcentral sequence (from any of these five databases) is inserted into our own database if we can detect that it is part of the 16S gene of some organism. We assume this is true if the sequence fulfills all of these requirements:

  1. The sequence has at least one annotation in the mappings file specifying that its RNA type is compatible with the 16S gene. This is done by parsing the type of RNA in the annotations line, accepting only those that equal rRNA.
  2. The sequence has at least one annotation whose description contains the string 16S or the string 16s.
  3. The sequence length is compatible with the expected length of the 16S gene. We accept sequences with lengths in the interval [1300, 1700].
  4. The sequence contains only unambiguous nucleotides; i.e., it is formed exclusively by letters in the set {A, T, C, G, U}, dropping any sequence that contains any other character.
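The four requirements above can be sketched as a single predicate. This is a minimal illustration, not the code from the repository: the function name, the representation of annotations as (RNA type, description) pairs, and the assumption that sequences are stored in upper case are all choices made for this example. Requirements 1 and 2 are each checked against any annotation, so they may be satisfied by different annotations of the same sequence.

```python
from typing import Iterable, Tuple

MIN_LENGTH = 1300  # lower bound of the accepted length interval
MAX_LENGTH = 1700  # upper bound of the accepted length interval
UNAMBIGUOUS = frozenset("ATCGU")  # assumption: upper-case letters only


def is_16s_candidate(sequence: str, annotations: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the sequence meets all four requirements.

    `annotations` holds (rna_type, description) pairs for the sequence.
    """
    annotations = list(annotations)
    # 1. At least one annotation has an RNA type equal to rRNA.
    has_rrna_type = any(rna_type == "rRNA" for rna_type, _ in annotations)
    # 2. At least one annotation description contains "16S" or "16s".
    mentions_16s = any(
        "16S" in description or "16s" in description
        for _, description in annotations
    )
    # 3. The length lies in the closed interval [1300, 1700].
    length_ok = MIN_LENGTH <= len(sequence) <= MAX_LENGTH
    # 4. Only unambiguous nucleotide letters are present.
    unambiguous = set(sequence) <= UNAMBIGUOUS
    return has_rrna_type and mentions_16s and length_ok and unambiguous
```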

 

Database Summary

The following table summarizes versions 9.0 and 10.0 of our RNA 16S database, which correspond to releases 9 and 10 of RNAcentral, respectively. The data collected are:


Database     Number of sequences   Number of species
16S v9.0     1580817               150975
16S v10.0    1604086               153668

Number of sequences: the total number of entries in the database.

Number of species: the number of different species that have some sequence assigned to them or to one of their descendants in the taxonomic tree.

 

Number of sequences per rank

Each RNAcentral entry is assigned to the taxonomic node that is the lowest common ancestor of all the assignments that the entry has. The following table shows the level of the taxonomic tree (rank) to which the reference sequences included in 16SDB7 are assigned in each version of the database.


Rank           Seqs assigned to (in 16SDB7 v9.0)   Seqs assigned to (in 16SDB7 v10.0)
Phylum         1                                   1
Family         14                                  14
Genus          8                                   8
SpeciesGroup   20                                  20
Species        1549711                             1572877
Subspecies     7145                                7608
Varietas       104                                 112
Forma          9                                   9
NoRank *1      28052                               28126

*1: Assignments to strain level are included in "NoRank" counts
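The lowest common ancestor assignment described in this section can be sketched as follows. This is a generic illustration rather than the implementation used to build 16SDB7: it assumes the taxonomy is available as a child-to-parent dictionary (a hypothetical structure for this example) in which the root either has no entry or points to itself.

```python
from typing import Dict, Iterable, List


def lineage(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Return the path from the root of the tree down to `taxon`."""
    path = [taxon]
    # Climb until a node has no parent entry or is its own parent (the root).
    while path[-1] in parent and parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    path.reverse()
    return path


def lowest_common_ancestor(taxa: Iterable[str], parent: Dict[str, str]) -> str:
    """Return the deepest node shared by the lineages of all given taxa."""
    lineages = [lineage(taxon, parent) for taxon in taxa]
    if not lineages:
        raise ValueError("at least one taxon is required")
    common = None
    # Walk the lineages in parallel from the root; stop at the first split.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        common = level[0]
    if common is None:
        raise ValueError("the taxa do not share a common root")
    return common
```

Under this scheme, an entry whose assignments all point to the same species stays at that species, while an entry assigned to two species of the same genus moves up to the genus.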