When we study microbial communities, we typically sequence DNA from a sample and figure
out the relative proportions of different microbes. Usually we find hundreds or thousands of
different ones, but let's look at a simpler table as an example. We have three samples
and four detected species.
In samples A and B, we have detected species 1, 2, and 3, but not 4. Based on which species
are present, we could say that samples A and B have the same microbial community
composition. But if we look closer, we see that in A, species 1 makes up 90% of the sample,
while in B, species 2 makes up 90%. So samples A and B are perhaps not that similar after all.
If we compare samples B and C, we see that the two samples don't contain the same species. Species 4 is missing from B and species 3 is missing from C. But we also see that the species that makes up 90% of each sample is the same in both (species 2).
How should we compare the community composition in the samples? Are A and B more similar to each other than B and C, or vice versa?
It turns out there is a very nice, systematic way of assessing the importance of relative abundance for diversity and dissimilarity metrics. It is based on something called effective numbers, or Hill numbers, which make it possible to tune the weight we give to relative abundance values. I think Hill numbers should be used much more in microbial ecology.
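To make the idea concrete, here is a minimal Python sketch of how the diversity order q changes the comparison between the three samples. Several things in it are assumptions for illustration only: the exact abundances (the text only gives the 90% figures), the function names `hill_number` and `hill_dissimilarity` (they are not the qdiv API), and the dissimilarity definition, which uses multiplicative partitioning of gamma diversity into alpha and beta with equal sample weights and reports beta minus 1 for a pair of samples.

```python
import math

def normalize(abundances):
    """Scale abundances so that they sum to 1."""
    total = sum(abundances)
    return [x / total for x in abundances]

def hill_number(abundances, q):
    """Hill number (effective number of species) of diversity order q."""
    p = [x for x in normalize(abundances) if x > 0]
    if q == 1:
        # q = 1 is defined as the limit: the exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1 / (1 - q))

def hill_dissimilarity(sample1, sample2, q):
    """Pairwise dissimilarity of order q, computed as beta diversity minus 1.

    Assumes equal sample weights; 0 means identical, 1 means nothing shared.
    """
    p1, p2 = normalize(sample1), normalize(sample2)
    pooled = [(a + b) / 2 for a, b in zip(p1, p2)]
    gamma = hill_number(pooled, q)
    if q == 1:
        mean_entropy = -0.5 * sum(x * math.log(x) for x in p1 + p2 if x > 0)
        alpha = math.exp(mean_entropy)
    else:
        mean_sum = 0.5 * sum(x ** q for x in p1 + p2 if x > 0)
        alpha = mean_sum ** (1 / (1 - q))
    return gamma / alpha - 1

# Hypothetical relative abundances for species 1-4 (assumed values that
# only match the "90%" and presence/absence statements in the text).
sample_a = [0.90, 0.05, 0.05, 0.00]
sample_b = [0.05, 0.90, 0.05, 0.00]
sample_c = [0.05, 0.90, 0.00, 0.05]

for q in (0, 1, 2):
    d_ab = hill_dissimilarity(sample_a, sample_b, q)
    d_bc = hill_dissimilarity(sample_b, sample_c, q)
    print(f"q={q}: A vs B = {d_ab:.2f}, B vs C = {d_bc:.2f}")
```

With these assumed abundances, q = 0 (presence/absence only) treats A and B as identical and B and C as different, while q = 1 and q = 2 (which give more weight to abundant species) reverse the ranking: B and C come out as much more similar than A and B.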
I have developed a Python package, qdiv, for calculating and visualizing diversity within the Hill number framework.
In the article "Hill-based dissimilarity indices and null models for analysis of microbial community assembly", published in the journal Microbiome, we describe how qdiv can be used to analyze microbial data sets and argue for the use of the Hill number framework in microbial ecology.