Metagenomics uses short-read DNA sequencing to measure the microbes in a sample. Typically, short-read DNA comes in pairs of sequences which are 100-200 base pairs long each. These reads are long enough to be matched to a catalog of genes or genomes allowing researchers to estimate the taxonomic composition of their sample. Unlike amplicon sequencing (which only sequences a particular marker gene) metagenomics allows more specific classification and accurate estimates of the genetic composition of a sample.
Though powerful short read DNA is expensive. Even with state of the art Illumina machines it can cost upward of several hundred dollars just for a few million reads. So the question naturally arises how many reads do you actually need?
There's no simple answer to this question in metagenomics: it will always depend on the parameters of your experiment and what you're trying to detect. If cost were not a factor there would never be a downside to more reads. However, since resources are constrained in the real world we can estimate how many reads we need to accurately detect low abundance taxa in our sample.
We can model the chance of detecting a low abundance microbe in our samples based on a given read depth. Let @@0@@ be the proportion of DNA unique to taxa @@1@@ in our sample. We can then model the number of reads we need to draw @@2@@ before drawing @@3@@ reads from taxa @@4@@ as a negative binomial distribution.
If we assume we must detect two reads from a given taxa to detect that taxa we can say.
Using the CDF we can state the probability that @@8@@ is less than a specified number of reads @@9@@.
The proportion of DNA which is uniquely assignable to a specific taxa varies by taxa rank and by clade. However, the typical definition specifies that a species has at least 3\% unique DNA. We can use this to state @@11@@ in terms of the abundance of taxa @@12@@, @@13@@.
We can simulate this process to get an idea of the results. in the code below we simulate 4 taxa at abundances of 1 in 10,000; 1 in 1,000; 1 in 100; and 1 in 10. We then measure how many reads we need to select from an imaginary infinitely deep sample before we would get a certain number of reads that could be specifically assigned to the species.
With 5,000,000 reads we will nearly always detect species at 1 part in 10,000, even if we require a relatively large number of reads as our minumum threshold of detection.
However, detection does not mean that we will accurately quantify the abundance of each taxa. Particularly for low abundance taxa we run the risk of misestimating the actual abundance. We can quantify this risk by simulating how many reads we draw from taxa of known abundances. The most important factor here is the expected number of reads from a taxa which is given by the unique proportion of the species, the relative abundance of the species in the sample, and the total number of reads in the sample.
Once we have about 100 expected reads our observed abundance starts to line up with the actual abundance. With 10 expected reads or fewer the error becomes extreme.
There is also an issue with dropout. If a species is low abundance we may miss it completely. We can simulate the chance of missing a taxa of aparticular relative abundance.
If we set 2 reads as the minimum detection threshold A 5M read sample will be missing very few of its (1 / 10,000) abundance species but a 1M read sample will be missing ~20% of these taxa.
We can take this further and plot the distance between simulated samples that are known to have the same actual composition.
This is a little fuzzier because real microbiomes are not even abundance (this is a worst case) but the rough takeaway is that two 5M read microbiomes within an L1 distance of 0.01 cannot be distinguished.
To effectively detect taxa that occur at one part in 10,000 in a sample should have 5,000,000 reads or more. For low abundance taxa, analyses should distinguish between abundances that can merely be detected and abundances that can be reliably estimated.