The habitats of a microbe can provide pertinent information for understanding the physiology, evolution, and interaction of the microbe within microbial communities.
Therefore, identifying where a species exists is the first step in the study of microbiology.
Generally, 16S rRNA sequences are used for the taxonomical assignment of microbes and thus are the basic information for identifying the microbes in an environment.
However, current 16S rRNA databases lack environmental information and are biased toward culturable microbes.
Fortunately, metagenomic studies, which focus on the entire microbial community rather than one species, have produced millions of nucleotide sequences from the microbial communities of various environments.
These sequences contain abundant 16S rRNA sequences of culturable and unculturable microbes from different environments.
Therefore, these 16S rRNA sequences can be used as a reference source for microbial species from all over the world.
To search for possible habitats of microbes, we have collected 16S rRNA sequences generated by Roche 454 platforms to create MetaMetaDB.
By comparing one or more query 16S rRNA sequences with the representative 16S rRNA sequences from different environments using BLAST, MetaMetaDB calculates the Microbial Habitability Index (MHI) for each environment to infer how likely the microbe(s) are to be found there.
Q & A
=== Please note that the figures and statistics below were generated from version 1 of the dataset (see the archive). ===
Q1. How do I use MetaMetaDB?
* Paste or upload intact 16S rRNA sequences in FASTA format. The sequences shown in the figure are the example sequences.
* Click "Execute!" and MetaMetaDB will run a BLAST search against 16S rRNA sequences from diverse environments.
Q2. How does MetaMetaDB find the habitats of my microbes?
After you input query 16S rRNA sequences, MetaMetaDB first runs a BLAST search against 1,241,213 representative 16S rRNA sequences, resulting in a hit list as follows:
(The result is for one of the example sequences, from Helicobacter pylori 26695.)
MetaMetaDB then regenerates the list of hits sorted by identity, which by our definition is the number of aligned nucleotides divided by the sum of the full length of the hit sequence and the number of gaps in the hit sequence.
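The identity definition above can be sketched in a few lines of Python. This is only an illustration, not MetaMetaDB's code: the function name, the tuple layout, and the example hits are hypothetical, and "aligned nucleotides" is taken here to be a count supplied by the caller.

```python
def hit_identity(aligned: int, hit_length: int, hit_gaps: int) -> float:
    """Identity = aligned nucleotides / (full hit length + gaps in the hit)."""
    denominator = hit_length + hit_gaps
    if denominator <= 0:
        raise ValueError("hit length plus gaps must be positive")
    return aligned / denominator


# Hypothetical hits: (hit id, aligned nucleotides, hit length, gaps in hit)
hits = [
    ("hit_a", 380, 400, 0),
    ("hit_b", 396, 400, 2),
    ("hit_c", 300, 400, 5),
]

# Re-sort the hit list by this identity, highest first.
ranked = sorted(hits, key=lambda h: hit_identity(h[1], h[2], h[3]), reverse=True)
```

Because the denominator uses the full length of the hit rather than the alignment length, a short local alignment to a long database sequence scores low even when the aligned region matches well.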
According to the sorted list, MetaMetaDB counts the number of hits above 97%, 95%, 90%, 85%, and 80% identity, and calculates the Microbial Habitability Index (MHI) for each environmental category e with an identity threshold c by the following formula:

$$\mathrm{MHI}(e, c) = \frac{H(e, c)\,\log\bigl(N / R(e)\bigr)}{\sum_{e'} H(e', c)\,\log\bigl(N / R(e')\bigr)}$$

where H(e, c) is the number of hits that are marked by e and above identity c, N is the total number of sequences in the database (that is, 1,241,213), and R(e) is the total number of sequences marked by e. MHIs are weighted by tf-idf (term frequency-inverse document frequency).
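As a rough sketch of the calculation, the Python below assumes the tf-idf-weighted form in which each environment's hit count is multiplied by log(N / R(e)) and the results are normalized to sum to 1 across environments. The function name, the input layout, and the toy numbers are hypothetical, and treating "above identity c" as "greater than or equal to c" is also an assumption.

```python
import math
from collections import Counter

# Identity thresholds named in the text.
THRESHOLDS = (0.97, 0.95, 0.90, 0.85, 0.80)


def mhi(hits, env_sizes, total, threshold):
    """Return {environment: MHI} for one identity threshold.

    hits      -- iterable of (environment, identity) pairs for the query
    env_sizes -- {environment: number of database sequences marked by it}
    total     -- total number of sequences in the database
    """
    # Count hits per environment at or above the threshold (assumed >=).
    counts = Counter(env for env, ident in hits if ident >= threshold)
    # Weight each count by the idf-like term log(N / R(e)).
    weighted = {e: h * math.log(total / env_sizes[e]) for e, h in counts.items()}
    norm = sum(weighted.values())
    if norm == 0:
        return {}
    return {e: w / norm for e, w in weighted.items()}


# Toy database of 1,000 sequences (hypothetical numbers).
env_sizes = {"gut": 100, "soil": 500, "marine": 400}
hits = [("gut", 0.98), ("gut", 0.96), ("soil", 0.96), ("marine", 0.82)]
by_threshold = {c: mhi(hits, env_sizes, 1000, c) for c in THRESHOLDS}
```

The logarithmic weight down-weights environments that contribute many sequences to the database, so a heavily sampled environment does not dominate simply because it has more sequences to hit.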
Q3. What does the output mean?
The results for each query sequence are listed and can be clicked to expand, as in the example below:
In the figure, the y-axis shows the Microbial Habitability Indices (MHI), which are listed below and marked by different colors.
The columns show a summary of the MHIs calculated from the BLAST hits that are above the identity shown below each column, as described in Q2.
Environments whose MHIs are less than 1% in a column are summed and labeled "other."
All the MHIs are listed at the bottom of the figure and can be viewed via the "Statistic" link.
For example, the microbial habitability based on hits above 95% identity in gut-associated environments was 81.93%.
The identity thresholds of 97%, 95%, 90%, 85%, and 80% approximately correspond to the taxonomic levels of species, genus, family, order, and class, respectively.
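The column summary described above, including the "other" grouping, might be reproduced along these lines. This is a minimal sketch: the function name and the MHI values are hypothetical, and only the 1% cutoff and the threshold-to-rank correspondence come from the text.

```python
# Approximate taxonomic level for each identity threshold, as stated in the text.
TAXONOMIC_LEVEL = {
    0.97: "species",
    0.95: "genus",
    0.90: "family",
    0.85: "order",
    0.80: "class",
}


def summarize_column(mhis, cutoff=0.01):
    """Keep environments at or above the cutoff; sum the rest into 'other'."""
    shown = {env: value for env, value in mhis.items() if value >= cutoff}
    other = sum(value for value in mhis.values() if value < cutoff)
    if other > 0:
        shown["other"] = other
    return shown


# Hypothetical MHIs for one column (fractions that sum to 1).
column = {"gut": 0.82, "soil": 0.17, "marine": 0.006, "air": 0.004}
summary = summarize_column(column)
```

In this sketch "marine" and "air" fall below 1% and are merged, while "gut" and "soil" are shown individually.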
Q4. I have several 16S rRNA sequences and I want to treat them as one group. What should I do?
* After entering the sequences, check the box marked in red.
The output is similar to that shown in Q3.
The Microbial Habitability Indices (MHI) are calculated by pooling the hits from all the query sequences.
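One plausible reading of the grouped mode is that hit counts per environment are simply added across all query sequences before the MHI is computed. The sketch below illustrates that reading only. Whether MetaMetaDB de-duplicates database sequences hit by more than one query is not stated, so this version counts every hit. All names and data are hypothetical.

```python
from collections import Counter


def pooled_hit_counts(per_query_hits, threshold):
    """Sum per-environment hit counts over every query in the group.

    per_query_hits -- {query id: [(environment, identity), ...]}
    """
    counts = Counter()
    for hits in per_query_hits.values():
        counts.update(env for env, ident in hits if ident >= threshold)
    return counts


# Two hypothetical queries treated as one group.
group = {
    "query_1": [("gut", 0.98), ("soil", 0.91)],
    "query_2": [("gut", 0.96), ("gut", 0.89)],
}
pooled_95 = pooled_hit_counts(group, 0.95)
pooled_85 = pooled_hit_counts(group, 0.85)
```

The pooled counts would then take the place of the single-query hit counts in the MHI calculation described in Q2.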
Q5. On the Download page, there are environments such as "marine", "soil", and "gut." What are these terms?
When uploaded to the sequence archives, each metagenome was assigned a "scientific name" indicating its "taxonomy," which can be found in NCBI Taxonomy.
Since there are no unified terms to describe sampling locations, we used these "scientific names" to classify metagenomes by the environments they come from.
Q6. How were the 16S rRNA sequences generated?
The flowchart is shown below, as in the article.
We collected metagenomic datasets from the DDBJ Sequence Read Archive (DRA). Only those whose study type is "metagenomics" and whose platform is "LS454" were selected.
Then we marked each read with the environment (i.e., the scientific name, as explained in Q5) of its metagenome.
We removed low-quality nucleotides, adaptors, ambiguous sequences, homopolymers, and duplicates, discarding sequences shorter than 200 bp at each step.
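Part of this filtering can be sketched in Python as below. This is an illustration under stated assumptions, not the pipeline actually used: quality trimming and adaptor removal are omitted, "ambiguous" is taken to mean any character other than A, C, G, or T, and the homopolymer cutoff is a hypothetical value because the text does not give one. Only the 200 bp minimum comes from the text.

```python
import re

MIN_LENGTH = 200      # minimum read length stated in the text
MAX_HOMOPOLYMER = 8   # hypothetical cutoff; not given in the text


def keep_read(seq, max_homopolymer=MAX_HOMOPOLYMER):
    """Return True if a read passes the length, ambiguity, and homopolymer checks."""
    seq = seq.upper()
    if len(seq) < MIN_LENGTH:
        return False
    if re.search(r"[^ACGT]", seq):  # any ambiguous base
        return False
    # Reject runs of one base longer than max_homopolymer.
    if re.search(r"(.)\1{%d,}" % max_homopolymer, seq):
        return False
    return True


def filter_reads(reads):
    """Keep the first copy of each passing read; reads are (name, sequence) pairs."""
    seen = set()
    kept = []
    for name, seq in reads:
        key = seq.upper()
        if keep_read(key) and key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept
```

For example, `filter_reads([("r1", "ACGT" * 50), ("r2", "ACGT" * 49)])` keeps only `r1`, since `r2` is 196 nt long.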
To separate 16S rRNA sequences from other sequences in the metagenomic datasets, we used SortMeRNA with reference 16S rRNA sequences from Silva.
Then we used UCHIME to remove chimeric sequences.
The number of 16S rRNA sequences was 5,137,512 after 16S rRNA sequence prediction and chimera removal.
Finally, we clustered the 16S rRNA sequences with CD-HIT at 97% identity, generating 1,241,213 representative 16S rRNA sequences, which are described on the Download page.
Q7. I want to know the precise places where my microbes can be found. Why is that impossible?
During the process of generating MetaMetaDB (see Q6), all the 16S rRNA reads from the same environments were clustered.
The sequences that were clustered together are not necessarily from the same sample (i.e., the same place), and therefore it is not possible to know exactly where the 16S rRNA sequences are from.